[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Fri Oct 21 10:41:39 CEST 2011

questions below...

On Wednesday, 19 October 2011 12:14:30 Joe Armstrong wrote:
> cookbook # 1 - draft 1
> 
> <aside>
>  We're going to write a cookbook.
> 
>  This will be free (in an electronic version, PDF, epub)
>  And you will be able to buy a paper version (POD)
> 
>  The development model is
> 
>   - a few authors
>   - many reviewers (you are the reviewers)
>     the reviewers report errors/suggest changes
>     the authors make the changes
> 
>  The POD version we hope will generate some income
>  this will be split according to the contributions. Authors
>  will be paid as will reviewers whose suggestions are incorporated.
> 
>  Payment (if we make a profit) will be in direct relation to the size
> of the contribution
> 
>  Expensive things, like professional proofreading, will be paid for by
>  sponsorship, crowdsourcing, or some other form of financing.
> 
>  To start the ball rolling I have some text below.
> 
>  Please comment on this text. If your comments are accepted one day you
> might get paid :-)
> 
>  Note: 1) By commenting you are implicitly agreeing that if your comments
> are accepted into the final text then you will be subject to the
> licensing conditions of that text. The text will always be free and
> open source.
> 
> </aside>
> 
> Cookbook Question:
> 
> I have often seen the words "UTF-8 string" used in sentences like
> "Java has UTF-8 strings". What does this mean when applied to Erlang?
> 
> ----------------------------------------------------------------------
> 
> Answer:
> 
> In Erlang, strings are syntactic sugar for "lists of integers".
> 
> Imagine the string "10(Euro)" - (Euro) is the glyph representing the
> Euro currency symbol.
> 
> The term "UTF-8 string" representing "10(Euro)" in Erlang could
> mean one of two things:
> 
>    Either a) [49,48,8364]           (i.e. a list of three Unicode code points)
>    Or     b) [49,48,226,130,172]    (i.e. the UTF-8 encoding of those characters)
> 
> So the words "UTF-8 string" might mean a) or might mean b).
> 
> Erlang folks have always said "unicode/UTF-8 is easy in Erlang
> since strings are just lists of integers" - by this we mean that
> Erlang programs should always manipulate strings using the type a)
> interpretation. *All* library functions assume the type a) encoding.
> 
> The type b) interpretation only has meaning when you write data to a
> file etc. and should be as invisible to the user as possible (but when
> things go wrong and you get the wrong character printed you need to
> understand the difference)
> 
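As a rough illustration of the a)/b) distinction quoted above, the sketch below builds both lists for the same three-character text and compares their lengths. The module name euro_repr and its function names are made up for this example; unicode:characters_to_binary/1 is the only library call it relies on.

```erlang
-module(euro_repr).
-export([codepoints/0, utf8_bytes/0, lengths/0]).

%% Representation a): one integer per Unicode code point.
%% 8364 is 16#20AC, the Euro sign.
codepoints() ->
    [49, 48, 8364].

%% Representation b): the UTF-8 encoding of the same text, as a list of bytes.
utf8_bytes() ->
    binary_to_list(unicode:characters_to_binary(codepoints())).

%% Returns {NumberOfCodePoints, NumberOfBytes} for the same text.
lengths() ->
    {length(codepoints()), length(utf8_bytes())}.
```

Under these assumptions euro_repr:lengths() gives three code points against five bytes, which is why length/1 only counts characters when the list is of type a).
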
> Question 1) How can we get Unicode characters into a list item?
>             Or: what does a string literal look like?
> 
>    > X = "10\x{20ac}".
> 
>    [49,48,8364]
> 
>    This is not described in my book since the change came after the
>    book was published (is it in the other Erlang books yet?)
> 
> Question 2) How can we convert between representations a) and b) above?
> 
>    Easy - though one has to dig in the documentation a bit.
> 
>    > B = unicode:characters_to_binary(X, unicode, utf8).
> 
>    <<49,48,226,130,172>>
> 
>    > unicode:characters_to_list(B).
> 
>    [49,48,8364]
> 
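One thing the quoted shell session does not show is that both conversion functions can return a tuple instead of a result when the input is invalid or cut short. The following is only a sketch of how one might wrap them; the module name utf8_conv, the function names, and the error atoms are invented for illustration.

```erlang
-module(utf8_conv).
-export([encode/1, decode/1]).

%% Code point list (type a) -> UTF-8 binary (type b).
encode(Chars) ->
    case unicode:characters_to_binary(Chars, unicode, utf8) of
        Bin when is_binary(Bin)        -> {ok, Bin};
        {error, _Encoded, _Rest}       -> {error, invalid_codepoint};
        {incomplete, _Encoded, _Rest}  -> {error, incomplete}
    end.

%% UTF-8 binary (type b) -> code point list (type a).
decode(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars)      -> {ok, Chars};
        {error, _Decoded, _Rest}       -> {error, invalid_utf8};
        {incomplete, _Decoded, _Rest}  -> {error, truncated_utf8}
    end.
```

With this wrapper, utf8_conv:decode(<<49,48,226,130>>) (the Euro sign with its last byte missing) reports a truncated input rather than returning a list, so a caller reading from a file or socket can decide whether to wait for more bytes.
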
> Question 3) Can I write "10(Euro)" in an editor which supports
> unicode/UTF-8 and does the erlang tool chain support this?

I would have said no! (Last year I posted a mail complaining about my Spanish
messages getting garbled when I used ñ, ó, á, etc.)

But right now it works!

I've just added some national characters to one of my strings, and they seem
to survive the compilation step:

			...io:format("Procesando ó ñ ü fichero ~s / ~s ~.16b ~n",
[filename:dirname(Path),filename:basename(Path),Digest]),....

This outputs:

Procesando ó ñ ü fichero /home/sinosuke / .bash_history 
84a45c9c62121aec0d1860534377577a 
Procesando ó ñ ü fichero /home/sinosuke / .xim.template 
93d3a1252fe3069130b1cece05fd6d44 
Procesando ó ñ ü fichero /home/sinosuke / .xinitrc.template 
d3f5ce074afc0ef61c89d1d08c582457 

(I'm using Kate on openSUSE 11.4 x64 and Erlang/OTP R14B04 (erts-5.8.5), and
my sources are in UTF-8.)

The same sources opened in ISO-8859-1 mode show:

			io:format("Procesando Ã³ Ã± Ã¼ fichero ~s / ~s ~.16b ~n",
[filename:dirname(Path),filename:basename(Path),Digest]),

Those Ã³ Ã± Ã¼ are the infamous Unicode code points, aren't they?
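Strictly speaking, what the ISO-8859-1 view shows is not the code points themselves but the two UTF-8 bytes of each character, each byte displayed as a separate Latin-1 character. A small sketch that reproduces the effect (the module name mojibake and the function as_latin1/1 are made up for this example):

```erlang
-module(mojibake).
-export([as_latin1/1]).

%% Encode a list of code points as UTF-8, then misread every byte
%% of the result as one Latin-1 character.
as_latin1(Chars) ->
    Utf8 = unicode:characters_to_binary(Chars),
    unicode:characters_to_list(Utf8, latin1).
```

For example, mojibake:as_latin1([16#F3]) takes the single code point for ó and returns the two integers 195 and 179, which Latin-1 renders as Ã and ³.
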

I hope this helps with this question.

Regards, Angel

> 
> Will "erlc foo.erl" automatically detect that foo.erl is unicode
> encoded and do the right thing when scanning and tokenising strings?
> 
>    Answer: I don't know.
> 
> Question 4)  Can string literals be improved on?
> 
> I hope so -- in HTML I can say (I hope) &euro;
> 
> I'd like to say:
> 
>       X = "10&euro;" in Erlang
> 
>       People who know far more about this than I do can tell me if this
> is OK
> 
> 
> ----------------------------------------------------------------------