[erlang-bugs] Unicode bug in io:format

Sun Nov 27 23:35:06 CET 2011

On 11-22 14:02, Erik Søe Sørensen wrote:
> On 22-11-2011 13:11, eurekafag wrote:
> >Many thanks for this thorough research! However I have two things
> >to mention. Setting or getting encoding introduces noticeable
> >delay in launching without -noinput, but with it it starts just as
> >fast as usual. Pretty strange.
> Yes, I noticed that too; the delay is so long that there is probably
> a timeout somewhere.
> 
> >And another a bit illogical issue: to print UTF-8 strings one
> >should NOT set binary type /utf8. This works fine with encoding
> >set: io:format("~ts~n", [<<"Тестовая строка">>]).
> >This fails in both noinput-cases with encoding set:
> >io:format("~ts~n", [<<"Тестовая строка"/utf8>>]).
> Remember that still, *source files are always interpreted as latin-1*.
> 
> From http://www.erlang.org/doc/apps/stdlib/unicode_usage.html :
> 
>    It is convenient to be able to write a list of Unicode characters in
>    the string syntax. However, the language specifies strings as being
>    in the ISO-latin-1 character set which the compiler tool chain as
>    well as many other tools expect.
> 
>    Also the source code is (for now) still expected to be written using
>    the ISO-latin-1 character set, why Unicode characters beyond that
>    range cannot be entered in string literals.
> 
> Which means that the "/utf8" modifier will always do a latin1->utf8
> encoding.
> So, yes, if you ensure that your source files are UTF-8 encoded, you
> can use the string literals as they are, and expect them to be
> UTF-8.

This is an undocumented 'feature' of how erl_scan handles binary strings:
it basically consumes bytes from the input until it finds the
closing " (modulo handling of \ as an escape character).
However, it is not supported.

It doesn't work for normal list strings either.
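
To illustrate the double encoding discussed above, here is a minimal sketch.
It assumes the literal's source bytes were already UTF-8 but were read as
latin-1, and that /utf8 then re-encodes each byte as if it were one latin-1
character. The module and function names are made up for this example, and
the bytes are written out numerically so the result does not depend on the
encoding of the file itself.

```erlang
-module(double_encoding_demo).
-export([double_encode/1, demo/0]).

%% Treat every byte of Bin as one latin-1 character and re-encode it as
%% UTF-8. This mimics (it is not the compiler's actual code path) what
%% happens to UTF-8 source bytes that go through a latin1->utf8 conversion.
double_encode(Bin) when is_binary(Bin) ->
    unicode:characters_to_binary(Bin, latin1, utf8).

demo() ->
    %% UTF-8 bytes of the two Cyrillic letters U+0422 and U+0435.
    Raw = <<208,162,208,181>>,
    Double = double_encode(Raw),
    %% Raw is still valid UTF-8; Double is twice as long and prints as
    %% garbage on a UTF-8 terminal.
    {byte_size(Raw), byte_size(Double)}.
```

Calling double_encoding_demo:demo() should show the size doubling, which is
why the plain literal prints correctly with ~ts and the /utf8 one does not.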

I'm trying to solve this by introducing proper UTF-8 support:

https://github.com/baryluk/otp/compare/master...source_code_encoding_in_compiler_and_epp

This patch makes string literals and character constants ($ą, $ę, $ó)
support UTF-8. Unfortunately, binary string literals are broken
after this patch. This is due to how binaries are constructed;
it is, however, easily fixable (it will just need to re-encode the
Unicode codepoints back to UTF-8 when parsing binaries).

The patch is still WIP because a few minor things must be smoothed out
(recursive inclusion of files, BOM detection, and cooperation with the
-compile() directive).

> 
> >I guess it's because of double encoding (by explicitly defined
> >encoding and that suffix) but I was confused at first. It's better
> >not to set encoding but declare it in binary strings like they do
> >in Python prepending strings with 'u' literal, which doesn't work
> >in Erlang for all cases.
> Well, for the u"..." syntax, Python also needs to know the encoding
> of the source file. Unlike Erlang, however, Python can be told what
> the encoding is (and can recognize Unicode files which begin with a
> BOM character).

Yes, I am also working on BOM detection in this patch.

With the patch you can use, for example,
compile:file("somemodule.erl", [{encoding, utf8}])

For now the default encoding is 'latin1'. One can also explicitly
say {encoding, default_encoding}, which again will be 'latin1'
but in the future could use the BOM to detect each file's encoding properly.
One can force detection using {encoding, detect_unicode_encoding};
however, it does not yet work correctly, especially with UTF-16 and UTF-32
files, but it will be fixed quickly.
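
As a rough sketch of what BOM-based detection can look like (this is not the
code from the patch), the existing unicode:bom_to_encoding/1 function from
stdlib already maps a leading BOM to an encoding. The module name below is
made up for the example, and it assumes the beginning of the file has already
been read into a binary.

```erlang
-module(bom_detect_demo).
-export([detect/1]).

%% Returns {Encoding, BomLength} for the beginning of a file's contents.
%% unicode:bom_to_encoding/1 reports {latin1, 0} when no BOM is present,
%% which happens to match the current default encoding.
detect(Bin) when is_binary(Bin) ->
    unicode:bom_to_encoding(Bin).
```

For a binary starting with the bytes EF BB BF this reports utf8 with a BOM
length of 3; the caller would then skip that many bytes before scanning.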

I do this mainly because I want to easily write web pages
in Erlang, and do not want to be forced to use external files
and lookups to other files / databases.
I'm OK with external files for things like l10n/i18n, but
using .erl files plus a function call, instead of .po files and a _()
equivalent, will give me better performance and a simpler toolchain.
Even without considering l10n/i18n, there can still be some
characters (mathematical symbols, punctuation, ellipsis, etc.) that I
would like to have and see literally in the code, rather than using HTML
entities or other things that are not nice on the eyes.

Regards,
Witek

-- 
Witold Baryluk