[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Tue Oct 25 06:26:26 CEST 2011

>> Cookbook Question:
>> 
>> I have often seen the words "UTF-8 string" used in sentences like
>> "Java has UTF-8 strings". What does this mean when applied to Erlang?

Minor note: it means very little when applied to Java and less of that
is actually true.

 - Java *source* code, including strings, may be encoded in various
   ways, including UTF-8.
 - The String *class* in Java is *NOT* based on UTF-8 but on UTF-16.
 - There is no reason in principle why Java _couldn't_ have a UTF-8
   string type, but there doesn't happen to be one in java.lang.*
   or java.util.*   (I rolled my own Latin1 string class for some
   tasks.)
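
To make the UTF-8 versus UTF-16 distinction concrete, here is a minimal
sketch in Erlang (not Java) that counts how many UTF-8 bytes and how many
16-bit UTF-16 code units one codepoint needs. The module name euro_units
and the function sizes/1 are made up for illustration; the only library
call is unicode:characters_to_binary/3.

```erlang
-module(euro_units).
-export([sizes/1]).

%% Return {Utf8Bytes, Utf16CodeUnits} for a single codepoint.
%% Hypothetical helper, not part of OTP.
sizes(Codepoint) ->
    Utf8  = unicode:characters_to_binary([Codepoint], unicode, utf8),
    Utf16 = unicode:characters_to_binary([Codepoint], unicode, utf16),
    %% A UTF-16 code unit is two bytes wide.
    {byte_size(Utf8), byte_size(Utf16) div 2}.
```

With this, euro_units:sizes(16#20AC) gives {3,1} for the Euro sign, while
a codepoint outside the Basic Multilingual Plane such as 16#1D11E gives
{4,2}: a UTF-16 based string type needs two 16-bit units for it.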

>> In Erlang strings are syntactic sugar for "lists of integers"

In Java strings are syntactic sugar for "slices of arrays of 16-bit integers".

>> 
>> Imagine the string "10(Euro)" - (Euro) is the glyph representing the
>> Euro currency symbol.
>> 
>> The term "UTF-8 string" representing "10(Euro)" in Erlang could
>> mean one of two things:
>> 
>>   Either a) [49,48,8364]   (i.e. it's a list of three Unicode integers)

For "integer" read "codepoint".
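
A small sketch of what that means in practice: each element of the list
is one codepoint, whatever number of bytes an encoding would later spend
on it. The module and function names here are only illustrative.

```erlang
-module(euro_codepoints).
-export([demo/0]).

%% Build "10" followed by U+20AC EURO SIGN as a list of codepoints.
demo() ->
    S = [$1, $0, 16#20AC],
    true = (S =:= [49,48,8364]),   % same list, written in decimal
    3 = length(S),                 % three codepoints, so three elements
    S.
```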

Dealing with Unicode *codepoints* in Erlang is tolerably straightforward;
the difficulties are
 - dealing with external *encodings*
 - dealing with the *semantics* of Unicode, like the way that
   different sequences of codepoints may represent the same
   sequence of characters, so that list equality and string equality
   are arguably different things.
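
Both difficulties can be sketched in a few lines. The module name
utf8_pitfalls and its three functions are hypothetical helpers. The
conversions use unicode:characters_to_binary/3 and
unicode:characters_to_list/2; the comparison assumes a release that has
unicode:characters_to_nfc_list/1 (OTP 20 or later, as far as I know), so
treat that part as an assumption about your installation.

```erlang
-module(utf8_pitfalls).
-export([to_utf8/1, from_utf8/1, same_text/2]).

%% Codepoint list -> UTF-8 encoded binary (an external encoding).
to_utf8(Codepoints) ->
    unicode:characters_to_binary(Codepoints, unicode, utf8).

%% UTF-8 encoded binary -> codepoint list.
from_utf8(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, utf8).

%% Compare two codepoint lists after NFC normalisation, so that
%% canonically equivalent sequences compare equal.
%% Assumes unicode:characters_to_nfc_list/1 is available.
same_text(A, B) ->
    unicode:characters_to_nfc_list(A) =:= unicode:characters_to_nfc_list(B).
```

For example, to_utf8([49,48,8364]) gives <<49,48,226,130,172>>: three
codepoints but five bytes. On the semantics side, [233] (precomposed
e-acute) and [101,769] (e followed by a combining acute accent) are
different lists, yet same_text([233], [101,769]) is true.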