[erlang-questions] string:lexemes/2 - an old man's rant

Richard O'Keefe raoknz@REDACTED
Wed May 8 02:39:55 CEST 2019


Let's look at the documentation for tokens/2:

http://erlang.org/doc/man/string.html#tokens-2

The first thing I notice is that we are told *that*
the function is obsolete but not *why* it is, and
that's important.

The second thing I notice is that we are told
to use lexemes/2 instead, but we are not told *how*
to do that.  An example showing an old call and its
new equivalent would do wonders.
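
Something like this, perhaps (my own guess at the equivalence,
for plain ASCII input; the manual does not actually say):
  tokens("/etc:/usr/bin:/bin", ":")  => ["/etc","/usr/bin","/bin"]
  lexemes("/etc:/usr/bin:/bin", ":") => ["/etc","/usr/bin","/bin"]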

The third thing I notice is the reason that the
second thing matters.  Consider the following
examples:
  tokens("aaa", "x") => ["aaa"]
  tokens("aa", "x")  => ["aa"]
  tokens("a", "x")   => ["a"]
so by continuity we expect
  tokens("", "x")    => [""]
BUT the result is actually [].  True, the
description says that the result is a list
of non-empty strings, but I don't really see
why that is so important that our natural
expectation that tokens(S, [X]) => [S]
whenever S is *any* string not containing X
should be violated, and if it is, then I
would definitely expect an exception.
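
As far as I can tell, lexemes/2 gives exactly the same surprise,
so switching functions does not restore the expectation:
  lexemes("", "x")   => []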

The fourth thing I notice is that the treatment
of multi-element separator lists is odd.  I have
had occasion to use separators with more than
one code-point, and for Unicode that could be
essential.  I have also had occasion to
split at C1, then at C2, then at C3, then at C4, ...
I've also had occasion to split on one separator
and then split the pieces into smaller pieces,
so multiple levels of splitting.  (Think of
/etc/passwd for a simple example.)  But the only
time I ever want multiple *alternative* separators
is when asking for white-space separation, and
*that* is when I want non-empty pieces.  It is
also the only time I ever want separators coalesced.
Given a string like "||x|yy||w" and the separator
"|", I've always wanted ["","","x","yy","","w"]
as the answer.  But there's a particular point
here:  which of us knows off-hand just what all
the Zs, Zl, and Zp characters of Unicode actually
are?  It would make a *lot* of sense to have
   tokens(String) -> list of non-empty pieces
   tokens(String, Sep) -> list of possibly empty
     pieces separated by the non-empty substring Sep.
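
The nearest thing I can find to that second signature today is
split/3, which (as I read its documentation) does keep empty
pieces and does treat the separator as a single substring:
  split("||x|yy||w", "|", all) => ["","","x","yy","","w"]
but the obsolescence note on tokens/2 points at lexemes/2, not
at split/3.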

The fifth thing I notice is that there is no
specification of what happens if SeparatorList is
empty.
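
As best I can tell from reading the source (which is precisely
what I should not have to do), the answer seems to be
  tokens("abc", "") => ["abc"]
  tokens("", "")    => []
but the manual leaves me guessing.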

All things considered, this is a function I am never
going to use, because it is less work to write my own
than to try to figure out this documentation.  And I
had to look at the code to figure some of it out.
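
By "write my own" I mean something like the sketch below: split S
on a non-empty substring Sep, keeping empty pieces.  It is
untested, works on flat lists of characters rather than full
chardata, and my_tokens is just a name I made up.

  my_tokens(S, Sep) ->
      my_tokens(S, Sep, [], []).

  %% Piece is the current piece (reversed); Pieces holds the
  %% finished pieces (also in reverse order).
  my_tokens([], _Sep, Piece, Pieces) ->
      lists:reverse([lists:reverse(Piece) | Pieces]);
  my_tokens(S, Sep, Piece, Pieces) ->
      case lists:prefix(Sep, S) of
          true ->
              Rest = lists:nthtail(length(Sep), S),
              my_tokens(Rest, Sep, [], [lists:reverse(Piece) | Pieces]);
          false ->
              [C | Rest] = S,
              my_tokens(Rest, Sep, [C | Piece], Pieces)
      end.

This gives my_tokens("", "|") => [""] and
my_tokens("||x|yy||w", "|") => ["","","x","yy","","w"], which is
what I keep wanting.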

I get seriously confused by some of the code in
string.erl.  We find
%% Fetch first grapheme cluster ..
next_grapheme(CD) -> ..
Which is it?  Grapheme or grapheme cluster?  These
are *different* (but overlapping) things!  And
where is the locale argument so that the function
knows what a "user-perceived character" actually *is*?
How come an empty list counts as a grapheme_cluster()?

What if I have something like
"foo:bar::uggle::zoosh" and I want to split it at
"::" but NOT at ":"?  "::" is not a grapheme cluster,'
so it looks like neither of these functions will help
me.
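
The nearest workaround I can see is, once again, split/3,
assuming I am reading its documentation correctly:
  split("foo:bar::uggle::zoosh", "::", all)
    => ["foo:bar","uggle","zoosh"]
but nothing in the lexemes/2 page sends me there.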

Writing good documentation is HARD.  At dear departed
Quintus, we started with a full-time technical writer
and expanded to three, nearly as many as developers.

The *name* 'lexemes' is arguably the *least* confusing
thing in the documentation.  If it were called z3k_u4y/2
that would increase my confusion very little.