[erlang-bugs] eunit_surefire doesn't ensure proper UTF-8 encoding

Siri Hansen erlangsiri@REDACTED
Fri Jan 31 14:28:29 CET 2014


Thanks for the report - I have written a ticket for this. A contribution
will of course speed up the handling... :)
/siri@REDACTED


2013-11-15 Samuel <samuelrivas@REDACTED>:

> We have seen this in the past, but we fixed it in our own surefire
> rebar plugin. If I remember right, the problem is not in
> eunit_surefire.erl but in eunit itself. Below is some information I dug
> out of my email (unfortunately I never found time to produce a proper
> patch for this):
>
> ------
> I am pretty sure the patch
>
> https://github.com/richcarl/eunit/commit/9f505f1b8881f44c1e5d37df005533b2af6d6a7e
> does not solve the right problem.
>
> As far as I can understand, the output is already a binary when
> it reaches the eunit_surefire code, which means that it is already
> encoded. The patch seems to work because the encoding happened to be
> latin1 (by coincidence), so re-encoding to UTF-8 works.
>
> The root issue seems to be in eunit_proc, which ignores the encoding of
> the io_requests; buffer_to_binary then just calls list_to_binary.
>
> The patch seems to work because it does the right thing for codepoints
> between 128 and 255, as they are the same as their latin1 encoding.
> Thus they get properly encoded to utf-8 when writing the xml
> file, but this will probably fail if the binary passed to eunit_surefire
> were properly encoded in utf-8.
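
The double-encoding risk described above can be illustrated with a minimal sketch. The module name `encoding_demo` is hypothetical and this is not eunit code; it only uses `unicode:characters_to_binary/1,3` from OTP to show what happens when bytes are treated as Latin-1 and converted to UTF-8, first for a Latin-1 binary and then for a binary that is already UTF-8.

```erlang
-module(encoding_demo).
-export([demo/0]).

%% Illustration only: U+00E9 (233) encoded two ways, then both
%% binaries converted "from latin1 to utf8".
demo() ->
    %% One byte per code point, i.e. effectively Latin-1: <<233>>
    Latin1Bytes = list_to_binary([233]),
    %% Proper UTF-8 encoding of the same code point: <<195,169>>
    Utf8Bytes = unicode:characters_to_binary([233]),
    %% Latin-1 input: the conversion yields valid UTF-8.
    FromLatin1 = unicode:characters_to_binary(Latin1Bytes, latin1, utf8),
    %% UTF-8 input misread as Latin-1: every byte is encoded again.
    DoubleEncoded = unicode:characters_to_binary(Utf8Bytes, latin1, utf8),
    {Latin1Bytes, Utf8Bytes, FromLatin1, DoubleEncoded}.
```

Calling `encoding_demo:demo()` should return `{<<233>>, <<195,169>>, <<195,169>>, <<195,131,194,169>>}`: the last element is the doubly encoded form that would end up in the report if the input were already UTF-8.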
>
> There is a bigger issue, though: eunit_proc will crash if
> any test outputs a codepoint higher than 255. I think I have a proper
> fix for that, but I haven't had the time to test it thoroughly yet.
> Once that is fixed, the surefire report must be written raw again, as the
> binaries should be utf8 encoded already.
>
> The following patch makes it work again, but it is a hack, as it assumes the
> strings to be unicode in list form and utf8 in binary form
> (which I guess is true in the current OTP implementation):
>
> -buffer_to_binary(Buf) -> list_to_binary(lists:reverse(Buf)).
> +buffer_to_binary(Buf) -> unicode:characters_to_binary(lists:reverse(Buf)).
>
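
To make the difference between the two lines of the patch concrete, here is a small standalone sketch. The module `buffer_demo` and its function names are hypothetical; the two function bodies are copied from the diff above, and the buffer contents are taken from the crash report quoted below. As the patch author notes, the new version assumes integers in the list are Unicode code points and any binaries are UTF-8, which is the default input encoding of `unicode:characters_to_binary/1`.

```erlang
-module(buffer_demo).
-export([old_buffer_to_binary/1, new_buffer_to_binary/1, demo/0]).

%% Body before the patch: fails for any integer above 255.
old_buffer_to_binary(Buf) -> list_to_binary(lists:reverse(Buf)).

%% Body after the patch: encodes code points as UTF-8.
new_buffer_to_binary(Buf) -> unicode:characters_to_binary(lists:reverse(Buf)).

demo() ->
    %% The buffer is kept in reverse order; reversed it becomes
    %% [[1013],"\n"], the argument seen in the badarg report.
    Buf = ["\n", [1013]],
    Old = try old_buffer_to_binary(Buf) catch error:badarg -> badarg end,
    New = new_buffer_to_binary(Buf),
    {Old, New}.
```

`buffer_demo:demo()` should return `{badarg, <<207,181,10>>}`: the old body raises `badarg` on code point 1013, while the new one produces its two-byte UTF-8 encoding followed by the newline.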
>
> As an example, the attached suite causes this when run:
>
> > eunit_unicode_crash:test().
>
> =ERROR REPORT==== 27-Aug-2012::14:26:49 ===
> Error in process <0.78.0> with exit value:
>
> {badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},{eunit_proc,buffer_to_binary,1,[{file,"eunit_proc.erl"},{line,276}]},{eunit_proc,group_leader_loop,3,[{file,"eunit_proc.erl"},{line,600}]}]}
>
> eunit_unicode_crash: unicode_test (module
> 'eunit_unicode_crash')...*skipped*
> undefined
> *unexpected termination of test process*
> ::{badarg,[{erlang,list_to_binary,[[[[1013],"\n"]]],[]},
>            {eunit_proc,buffer_to_binary,1,
>                        [{file,"eunit_proc.erl"},{line,276}]},
>            {eunit_proc,group_leader_loop,3,
>                        [{file,"eunit_proc.erl"},{line,600}]}]}
>
> On 13 November 2013 13:38, Magnus Henoch <magnus@REDACTED> wrote:
> > Compile the following module and run eunit_xml_encoding_bug:doit() from
> > an Erlang shell:
> >
> > -module(eunit_xml_encoding_bug).
> >
> > -compile(export_all).
> >
> > -include_lib("eunit/include/eunit.hrl").
> >
> > doit() ->
> >     eunit:test(?MODULE, [{report, {eunit_surefire,[]}}]).
> >
> > my_test_() ->
> >     ?_test(io:format([128,10])).
> >
> > This creates a file called TEST-eunit_xml_encoding_bug.xml which claims
> > to be in UTF-8 (its first line is '<?xml version="1.0" encoding="UTF-8" ?>')
> > but contains an improperly encoded character.  Most XML tools will
> > refuse to do anything with such an XML file.  For example xmllint says:
> >
> > $ xmllint /tmp/TEST-eunit_xml_encoding_bug.xml
> > /tmp/TEST-eunit_xml_encoding_bug.xml:4: parser error : Input is not proper UTF-8, indicate encoding !
> >
> > And opening the file in Firefox yields:
> >
> > XML Parsing Error: not well-formed
> > Location: file:///tmp/TEST-eunit_xml_encoding_bug.xml
> > Line Number 4, Column 17:
> >
> > I came across this problem when running a Quickcheck property inside
> > Eunit.  The Quickcheck property would output random binary data with
> > io:format("~p"), and sometimes that would end up being high bytes which
> > were valid Latin-1 but invalid UTF-8.
> >
> > As eunit_surefire declares its output files to be in UTF-8 encoding, I
> > think it should check that the contents of <system-out> etc are properly
> > encoded, and if not do something about it, e.g. convert from Latin-1 to
> > UTF-8 or insert replacement characters (U+FFFD).
> >
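
The check suggested above (validate the contents, and fall back to a Latin-1 to UTF-8 conversion if they are not valid UTF-8) could be sketched roughly as follows. The module `utf8_guard` and the function `ensure_utf8/1` are hypothetical names and not part of eunit_surefire; the sketch relies only on `unicode:characters_to_binary/3`, which returns a binary on success and an `{error, ...}` or `{incomplete, ...}` tuple otherwise. The replacement-character alternative is not shown.

```erlang
-module(utf8_guard).
-export([ensure_utf8/1]).

%% Return Bin unchanged if it is valid UTF-8; otherwise assume the
%% bytes are Latin-1 and convert them to UTF-8. The Latin-1 fallback
%% is an assumption about the data, not something eunit guarantees.
ensure_utf8(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, utf8) of
        Valid when is_binary(Valid) ->
            Valid;
        _ErrorOrIncomplete ->
            unicode:characters_to_binary(Bin, latin1, utf8)
    end.
```

For the bytes produced by the test above, `utf8_guard:ensure_utf8(<<128,10>>)` should give `<<194,128,10>>`, while an already valid binary such as `<<207,181,10>>` is returned as is. One design caveat: a Latin-1 byte sequence that happens to be valid UTF-8 would be passed through unchanged, so the guess can be wrong in rare cases.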
> > Regards,
> > Magnus
> > _______________________________________________
> > erlang-bugs mailing list
> > erlang-bugs@REDACTED
> > http://erlang.org/mailman/listinfo/erlang-bugs
>
>
>
> --
> Samuel
>