[erlang-questions] cookbook entry #1 - unicode/UTF-8 strings

Richard O'Keefe ok@REDACTED
Tue Oct 25 06:26:26 CEST 2011


>> Cookbook Question:
>> 
>> I have often seen the words "UTF-8 string" used in sentences like
>> "Java has UTF-8 strings". What does this mean when applied to Erlang?

Minor note: it means very little when applied to Java, and even less
of it is actually true.

 - Java *source* code, including strings, may be encoded in various
   ways, including UTF-8.
 - The String *class* in Java is *NOT* based on UTF-8 but on UTF-16.
 - There is no reason in principle why Java _couldn't_ have a UTF-8
   string type, but there doesn't happen to be one in java.lang.*
   or java.util.*   (I rolled my own Latin1 string class for some
   tasks.)

>> In Erlang strings are syntactic sugar for "lists of integers"

In Java strings are syntactic sugar for "slices of arrays of 16-bit integers".

>> 
>> Imagine the string "10(Euro)" - (Euro) is the glyph representing the
>> Euro currency symbol.
>> 
>> The term "UTF-8 string" representing "10(Euro)" in Erlang could
>> mean one of two things:
>> 
>>   Either a) [49,48,8364]   (i.e. it's a list of three Unicode
>>   integers)

For "integer" read "codepoint".
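
To make the codepoint/encoding distinction concrete, here is a minimal
sketch. The module name euro_demo and its function names are made up for
illustration; the only library calls are unicode:characters_to_binary/3
and unicode:characters_to_list/2 from the standard unicode module.

```erlang
-module(euro_demo).
-export([codepoints/0, utf8_bytes/0, roundtrip/0]).

%% "10" followed by the Euro sign, as a list of Unicode codepoints.
%% This is the same list as [49,48,8364].
codepoints() ->
    [$1, $0, 16#20AC].

%% The same text in the UTF-8 external encoding, held in a binary.
%% The single codepoint 8364 becomes the three bytes 226,130,172.
utf8_bytes() ->
    unicode:characters_to_binary(codepoints(), unicode, utf8).

%% Decoding the UTF-8 binary recovers the original codepoint list.
roundtrip() ->
    unicode:characters_to_list(utf8_bytes(), utf8) =:= codepoints().
```

The list has three elements and the binary has five bytes, yet both
stand for the same three characters; only the second is "UTF-8" in any
meaningful sense.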

Dealing with Unicode *codepoints* in Erlang is tolerably straightforward;
the difficulties are
 - dealing with external *encodings*
 - dealing with the *semantics* of Unicode, like the way that
   different sequences of codepoints may represent the same
   sequence of characters, so that list equality and string equality
   are arguably different things.
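
The second difficulty can be sketched as follows, assuming an OTP
release recent enough to provide unicode:characters_to_nfc_list/1
(OTP 20 or later; it was not available when this message was written).
The module name nfc_demo and its function names are hypothetical.

```erlang
-module(nfc_demo).
-export([precomposed/0, decomposed/0, same_list/0, same_text/0]).

%% U+00E9, LATIN SMALL LETTER E WITH ACUTE, as one codepoint.
precomposed() ->
    [16#E9].

%% The same character as "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed() ->
    [$e, 16#301].

%% Plain list equality compares codepoints, so the two forms differ.
same_list() ->
    precomposed() =:= decomposed().

%% Normalizing both sides to NFC first compares them as text.
same_text() ->
    unicode:characters_to_nfc_list(precomposed()) =:=
        unicode:characters_to_nfc_list(decomposed()).
```

Here same_list() is false while same_text() is true, which is the sense
in which list equality and string equality can come apart.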



