[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Fri Oct 21 21:28:39 CEST 2011

On Fri, Oct 21, 2011 at 2:35 PM, Richard Carlsson
<carlsson.richard@REDACTED> wrote:
> On 10/21/2011 10:41 AM, Angel J. Alvarez Miguel wrote:
>>
>> (I'm using Kate on openSUSE 11.4 x64 and Erlang/OTP R14B04 (erts-5.8.5),
>> and my sources are in UTF-8)
>
> No, don't make this mistake. To the Erlang compiler, your sources are in
> Latin-1, plain and simple. As far as the compiler knows, you have actually
> written "Ã³ Ã± Ã¼" and nothing else. When you print the string with
> io:format, you are printing the Latin-1 text "Ã³ Ã± Ã¼" (the bytes [195,
> 179, 32, 195, 177, 32, 195, 188]) to the standard output. That your console
> re-interprets these bytes as "ó ñ ü" just means that you have managed to
> fool the system for this particular use case.
>
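
A minimal sketch of the point above, assuming only the standard `unicode`
module: the same eight bytes yield different character lists depending on
which encoding is used to read them. The module and function names are
made up for illustration.

```erlang
-module(latin1_vs_utf8).
-export([as_latin1/1, as_utf8/1]).

%% Read every byte as one Latin-1 character. This mirrors how the
%% compiler is described as treating the source file.
as_latin1(Bytes) when is_list(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), latin1).

%% Decode the same bytes as UTF-8. This mirrors what a UTF-8 console
%% does with the bytes that io:format writes out.
as_utf8(Bytes) when is_list(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).
```

With Bytes = [195,179,32,195,177,32,195,188], as_latin1(Bytes) returns the
list unchanged (the text "Ã³ Ã± Ã¼"), while as_utf8(Bytes) returns
[243,32,241,32,252], i.e. "ó ñ ü".
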
> (By the way, those characters are already in the Latin-1 charset, so you
> don't *need* UTF-8 at all unless you have some additional characters you
> want to use that are above 255 in Unicode.)
>
> If/when Erlang supports other encodings in source code (this will probably
> require adding a compiler flag for specifying the input encoding), a string
> literal such as "ᚱ" should be equivalent to [5809], not [225,154,177], just
> like your "óñü" should be equivalent to [243,241,252] (which is what you
> would have got if your editor had been set to Latin-1 to begin with).
>
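
To make the two numbers for the rune concrete, here is a small sketch that
converts between a list of code points and its UTF-8 byte sequence. It
assumes only the standard `unicode` module; the module and function names
are hypothetical.

```erlang
-module(rune_literal).
-export([to_utf8_bytes/1, from_utf8_bytes/1]).

%% Encode a list of Unicode code points as a list of UTF-8 bytes.
to_utf8_bytes(CodePoints) when is_list(CodePoints) ->
    binary_to_list(unicode:characters_to_binary(CodePoints, unicode, utf8)).

%% Decode a list of UTF-8 bytes back into a list of code points.
from_utf8_bytes(Bytes) when is_list(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).
```

Here to_utf8_bytes([5809]) gives [225,154,177] and
from_utf8_bytes([225,154,177]) gives [5809]. The first list is the string's
on-disk representation in a UTF-8 file; the second is the value the literal
is argued to denote.
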
> One can think about it like this: taking an existing, working, Latin-1
> source file, converting it to UTF-8 (or any other encoding), and compiling
> it with a flag that informs the compiler what the input encoding is, should
> not change the semantics of the program in any respect whatsoever compared
> to compiling the original source file. Thus, a string literal that today
> contains "ß" ([223]) in a plain old Latin-1 encoded Erlang source file must
> *always* mean [223] no matter what you change the input encoding to.
>
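
The invariance argument can be sketched as follows. The function
literal_value/2 is a hypothetical stand-in for "the compiler reading the
bytes of a string literal with a declared input encoding"; no such compiler
flag is implied to exist. It uses only `unicode:characters_to_list/2`.

```erlang
-module(encoding_invariance).
-export([literal_value/2]).

%% Hypothetical model of reading a literal's source bytes under a
%% declared input encoding. Returns the list of code points.
literal_value(SourceBytes, Encoding)
  when is_binary(SourceBytes), (Encoding =:= latin1 orelse Encoding =:= utf8) ->
    unicode:characters_to_list(SourceBytes, Encoding).
```

The character "ß" is the single byte 223 in a Latin-1 file and the two
bytes 195,159 in a UTF-8 file. literal_value(<<223>>, latin1) and
literal_value(<<195,159>>, utf8) both return [223], which is the property
being asked for. Reading the UTF-8 bytes as Latin-1,
literal_value(<<195,159>>, latin1), returns [195,159] instead.
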
>>> Will "erlc foo.erl" automatically detect that foo.erl is unicode
>>> encoded and do the right thing when scanning and tokenising strings?
>
> No. Erlang source code is (currently) Latin-1 by definition. No matter what
> your editor thinks it is using, the compiler will interpret the bytes as
> Latin-1.

I hate to say this - but just about the only thing XML got right was
the declaration

   <?xml version="1.0" encoding="UTF-8" standalone="no" ?>

Should we have

   -erlang("1.0","UTF-8","no"). :-)

as the first line :-)

((I have argued in vain for years for a version declaration, to allow for
incompatible changes to the syntax.))

/Joe


>
>   /ᚱᛁᚴᚼᛅᚱᛏ