Author: Patrik Nyblom <pan(at)erlang(dot)org>,
            Fredrik Svahn <Fredrik(dot)Svahn(at)gmail>
    Status: Draft
    Type: Standards Track
    Created: 29-Sep-2010
    Erlang-Version: OTP_R14B
    Post-History:
    Replaces: 9
****
EEP 35: Binary string module(s)
----

Abstract
========

This EEP contains developed suggestions regarding the module ``binary_string``
first suggested in [EEP 9][]. The module name has, however, been changed to ``bstring``.

[EEP 9][] suggests several modules and is partially superseded by later
EEPs (i.e. [EEP 11][] and [EEP 31][]), while still containing valuable suggestions not
yet implemented. This last remaining module suggested in [EEP 9][] will therefore
appear in this separate EEP. This is done in agreement with
the original author of [EEP 9][].

The module ``bstring`` is suggested to contain functions for
convenient manipulation of textual data stored in binaries,
i.e. binary strings. It somewhat resembles the ``string`` module
(which is list oriented), but is not to be viewed simply as a
``string`` module for binaries.

The suggested module handles binaries in both of the
standard character encodings of Erlang, namely ISO-Latin-1 and UTF-8.

Motivation
==========

Text strings are traditionally represented as lists of integers in
Erlang. While this is convenient and more or less built into the
syntax of the language (i.e. "ABC" is syntactic sugar for [$A,$B,$C]),
a more compact representation is often desired. Also, in some
circumstances binaries can be more efficient to manipulate in terms of
algorithm complexity than lists are (especially in the fixed character
width case of ISO-Latin-1).

More modules have been added to the standard libraries lately to aid
the usage of binaries for text strings, both as representing
ISO-Latin-1 characters and Unicode strings encoded in UTF-8. Most
notably the ``re`` library, but also the ``unicode`` module are fairly
new additions to ``stdlib`` which will make life easier for the
programmer when it comes to manipulating binary encoded strings. Also
a module for fast searching and replacing in byte oriented binaries is
present (the module ``binary``), but no traditional string manipulation module is
yet in the libraries. To ease use of binary encoded strings, such a module is
needed.

Rationale
=========

The module ``string`` for text oriented operations on lists has been
present in the standard libraries for so long that most programmers
don't remember a time when it wasn't there. It is said to originally
be a merge of two different string modules, written and designed by
two different programmers with possibly slightly different goals and
definitely slightly different views on function naming. While
sometimes criticized for duplicated functionality and inconsistent
function naming, among other things, the module has remained useful
throughout the entire lifespan of Erlang/OTP. The string
representation used has also withstood the evolution of Unicode.

It is worth noting that the only functions in the ``string`` module
that actually are language or region dependent are later additions to
the module. Those functions (like ``to_upper``, ``to_lower``, ``to_integer`` and
``to_float``), or their binary equivalents, are not part of the module
interface I suggest for ``bstring`` for the simple reason that they
need language support not yet present in Erlang. A future EEP might
suggest such language support (i.e. some kind of "locale" support), but
that is future work not covered by this EEP.

So, however criticized, the string module is very useful for
manipulating lists, and the same functionality for binary strings is
desirable. While a lot of the functionality will be similar, there are
some major issues to consider when implementing a module for
manipulating strings encoded in binaries:

- Unicode - Binaries can have different encodings. A character encoded
  as UTF-8 might take more than one (up to four) byte positions, and
  even the same character can have different encodings in ISO-Latin-1
  and UTF-8 (all codepoints from 128 to 255). The functions need to be
  informed of the character encoding explicitly, as the encoding
  information is not present in the binaries.

- Mixed character encodings - As characters can be encoded in
  different ways, two strings in the same program could have different
  encodings. Supplying the functions with non-homogeneous string
  encoding data should be consistently solved throughout the module,
  as should the selection of returned encoding where applicable.

- Default character encoding - As functions will take extra arguments
  to specify encoding, a consistent default might be useful. Choosing
  the default is not entirely simple, as the tradition states
  ISO-Latin-1, while the future suggests UTF-8.

- Languages - Erlang has no notion of "Locale" or preferred number
  format. A general string module can assume neither a specific
  notion of uppercase or lowercase letters, nor a specific number
  encoding format (especially true for floating point numbers).

- Word separators - The space character is certainly not the only word
  separator for textual data (in any language). The notion of words
  separated by spaces restricts which languages can be handled.

- Left to right or right to left - Notions like left or right to
  denote the beginning or end of a string are certainly not language
  independent. While strings in a language have a beginning and an end,
  that beginning and end may be placed to the left, the right or
  even at the top, bottom or center of the graphical representation. A
  string manipulation module should not use naming implying a
  left-to-right script, or any other type of script.

- Naming and duplicated functionality - The original ``string`` module
  has been accused of having somewhat inconsistent naming and
  duplicated functionality. In fact the only duplicated functions are
  ``substr`` and ``sub_string``. Some cleanup of the interface might
  be needed.

- Byte oriented versus character oriented return values - When dealing
  with Unicode data, a character may take more than one byte, so
  e.g. counting the number of characters in a string tells you very
  little about the actual size of the string in bytes. Furthermore,
  later processing of a binary might require byte-oriented
  manipulation of a string rather than character-oriented (e.g. you
  want to manipulate the string using the ``binary`` module or with
  bit-syntax), while characters are actually what constitute a
  string, not bytes. You would want both.

- New or replaced functionality - New functionality has been suggested from several sources,
  most notably [EEP 9][]. For example, the function ``split`` suggested in [EEP 9][] is very similar to
  ``string:tokens/2``. Should we keep ``tokens`` anyway, for example?

I'll address the different issues below.

Unicode
-------

The interface has to support both ISO-Latin-1 and UTF-8. The ``unicode`` module supports even more encodings, but Erlang/OTP uses UTF-8 for all "internal" interfaces and UTF-8 is the expected encoding of a binary Unicode string. Even though UTF-8 is compatible with ISO-Latin-1 in the 7-bit ASCII range, characters with codepoints between 128 and 255 are encoded differently in the "plain" ISO-Latin-1 encoding and in UTF-8. This means that all functions in the ``bstring`` module need to have the actual encoding as one or more extra parameters.
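The difference in the 128 to 255 range can be seen with the existing ``unicode`` module. The following is only an illustration; the module name ``encoding_demo`` and the helper ``encode/2`` are made up for this sketch and are not part of the suggested interface.

```erlang
-module(encoding_demo).
-export([encode/2]).

%% Encode a list of codepoints as a binary in the given encoding
%% (latin1 or utf8), using unicode:characters_to_binary/3.
%% Hypothetical helper, for illustration only.
encode(Codepoints, Encoding) when is_list(Codepoints) ->
    unicode:characters_to_binary(Codepoints, unicode, Encoding).
```

With this helper, ``encode([$a, 229], latin1)`` yields the two bytes ``<<97,229>>``, while ``encode([$a, 229], utf8)`` yields the three bytes ``<<97,195,165>>``. Nothing in either binary says which encoding was used.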

One could invent a more abstract binary string format where the data is for example represented as a tuple with the string and the encoding packed together. However no other module supports such a string construct and I don't think that would really add something, neither functionality nor readability. Consider code like:

    bstring:tokens(Bin,latin1,[$ ,$\n])

compared to:

    bstring:tokens({Bin,latin1}, [$ ,$\n]).

or even:

    bstring:tokens(#bstring{data = Bin, encoding = latin1}, [$ ,$\n]).

In many cases the extra information needs to be added in connection to the call, making the code no more readable or simple to write than with the separate extra argument. Consider if we had a default value for encoding. The code:

    f(Data) ->
        bstring:tokens(Data,[$ ,$\n]).

would not in any way indicate if ``Data`` was supposed to be a binary with the default encoding or some kind of complex data structure indicating both the actual string and its encoding.

I think the extra argument for the encoding is straightforward and simple, and it makes programming easier when using the binary string in other modules as well (e.g. ``re``, ``binary``, ``file`` etc). I think we should simply not have a special string datatype for this module; character encoding should be supplied as a separate argument.

Mixed character encodings
-------------------------

To ease transition between character encodings, I think the interface should accept different encodings for both different parameters and the return value. This makes it possible to convert on the fly and for the functions to decide on the most efficient character conversion path for the supplied arguments and the return value.

The downside of this approach is that some functions will take a lot of parameters telling different character encodings, for example a string concatenation routine could look like:

    concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3

being called like:

    US = bstring:concat(SA,latin1, SB, latin1, unicode),

which might look a little awkward to write. On the other hand, conversion is made on the fly and you will not need to explicitly call the ``unicode`` module to convert the result.

I think implicit conversion is so useful that it is worth the extra arguments. For example, a ``concat`` function would be more or less useless without it: if no conversion were allowed, the bit syntax would be much easier to use.
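As a rough sketch of what on-the-fly conversion could look like, the following builds a five-argument ``concat`` on top of ``unicode:characters_to_binary/3``. The module name ``bstring_concat_sketch`` and the helper ``convert/3`` are assumptions made for this example; a real implementation would presumably pick a more efficient conversion path.

```erlang
-module(bstring_concat_sketch).
-export([concat/5]).

%% Hypothetical sketch of the suggested concat/5: convert both inputs
%% to the output encoding, then append them.
concat(B1, Enc1, B2, Enc2, OutEnc) when is_binary(B1), is_binary(B2) ->
    Out1 = convert(B1, Enc1, OutEnc),
    Out2 = convert(B2, Enc2, OutEnc),
    <<Out1/binary, Out2/binary>>;
concat(_, _, _, _, _) ->
    erlang:error(badarg).

%% Raise badarg if the input is wrongly encoded or cannot be
%% represented in the output encoding.
convert(Bin, InEnc, OutEnc) ->
    case unicode:characters_to_binary(Bin, InEnc, OutEnc) of
        Result when is_binary(Result) -> Result;
        _ErrorOrIncomplete -> erlang:error(badarg)
    end.
```

Here the first argument may be ISO-Latin-1 and the second UTF-8, and the caller gets the result in whichever encoding is asked for, without calling the ``unicode`` module explicitly.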

Default character encoding
--------------------------

Choosing a default character encoding is not obvious. While ISO-Latin-1 is the default in Erlang (i.e. <<"korvsmörgås">> gives an ISO-Latin-1 encoded binary string), UTF-8 usage is expected to grow in the future.

Although it's tempting to select UTF-8 as the default encoding, I think we should stick to ISO-Latin-1 as the default even for this module. There are several reasons:

- We need not, as a rule, impose new standards in every module we add
  to the standard library. Consistency certainly adds value, and
  the bit-syntax, the source code encoding and things like the
  ``io:format`` routine all have ISO-Latin-1 as default. Let's not make this
  module inconsistent with the others.

- The ``string`` module is often used to manipulate arbitrary lists
  of integers, not always actually representing textual data. In the
  same way, ``bstring`` can probably be used to manipulate arbitrary
  blobs of bytes if ISO-Latin-1 versions are used. ISO-Latin-1 is
  actually the raw bytes uninterpreted, so any binary data can be
  worked on in an ISO-Latin-1 oriented routine. Using UTF-8 encoding as
  default would narrow the use for the default functions to only work
  on real text data.

- The pure ISO-Latin-1 implementations of the functions will be the
  most efficient ones as no data checking at all is needed. Any byte
  value is acceptable in any version. Some functions are usable on
  UTF-8 strings even though they expect ISO-Latin-1 data, as the only
  difference between the ISO-Latin-1 version and the UTF-8 version
  is input validation. If the data given to, for example,
  ``bstring:concat`` is already checked for correct UTF-8, the simpler
  ISO-Latin-1 version of the function is both more efficient and
  guaranteed to give output that is as correct as the input:

        CorrectUtf8_1 = give_me_good_string(),
        CorrectUtf8_2 = give_me_another_good_string(),
        CorrectUtf8_3 = bstring:concat(CorrectUtf8_1, latin1, CorrectUtf8_2, latin1, latin1),
        ...

  Simply put, ISO-Latin-1 versions of the functions are more generally
  useful than pure UTF-8 versions and are also more efficient.

- A wrapper module providing pure UTF-8 interfaces can easily be
  written. The overhead of going via a wrapper would be relatively
  lower for a UTF-8 wrapper than for an ISO-Latin-1 ditto, as the
  overhead of character decoding/encoding of UTF-8 strings in the
  module would be quite high. Simply put, a wrapper would cost very
  little compared to the cost of checking the data for UTF-8
  correctness.

  I actually suggest a module ``ubstring`` that has the part of the
  ``bstring`` interface where a default encoding is implied, but with
  the difference that UTF-8 is expected. For example, a function
  ``ubstring:tokens/2`` would look like this:

        tokens(S,L) -> bstring:tokens(S,unicode,L).

  Quite simple.

To conclude, I think all functions should exist in a version where no
encoding is supplied and ISO-Latin-1 encoded data is expected.
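The claim that a byte-oriented (ISO-Latin-1) concatenation keeps already-checked UTF-8 data correct can be checked with a small sketch. The module and function names below are invented for the illustration; ``raw_concat/2`` merely stands in for what a ``latin1`` concatenation amounts to.

```erlang
-module(utf8_concat_demo).
-export([raw_concat/2, is_valid_utf8/1]).

%% Byte-wise concatenation with no inspection of the data, standing in
%% for an ISO-Latin-1 oriented concat (an assumption of this sketch).
raw_concat(B1, B2) when is_binary(B1), is_binary(B2) ->
    <<B1/binary, B2/binary>>.

%% True if Bin is well-formed UTF-8; unicode:characters_to_binary/3
%% returns an error or incomplete tuple otherwise.
is_valid_utf8(Bin) when is_binary(Bin) ->
    is_binary(unicode:characters_to_binary(Bin, utf8, utf8)).
```

Concatenating two binaries for which ``is_valid_utf8/1`` returns ``true`` gives a binary for which it also returns ``true``, since UTF-8 sequences never span the boundary between two complete strings.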

Languages
---------

Even though Unicode characters can be used to express text in most
known, living and dead scripts, language and region knowledge is a
completely different thing. String interfaces often impose language
specific properties of the string, like left-to-right writing
direction, the notion of words built up by space separated groups of
characters, ways of representing numbers and decimal points etc. As
Erlang does not (yet) have a way of specifying such language-, or
region-specific properties of a string, the interface should not
contain language-dependent functionality. The ``string`` module did not
originally contain such functions (except that character alignment
functions were named ``left`` and ``right``), but unfortunately
functions like ``to_float`` and ``to_upper`` have been added.

I think that having language-dependent functions in the ``string``
module was a mistake and I do not want to make that mistake
again. Hence I have not included such functions or names in
``bstring``.

I rather suggest "Locale" functionality as a subject of a future
EEP. For those who consider that simple, try to write a correct
``to_upper`` function for just all European languages, make sure it
works on all platforms that can run Erlang... Maybe not rocket science, but a
_lot_ of metadata is required. Data that is not always available in
the underlying OS, but probably needs to be distributed with Erlang/OTP for
consistent functionality. Definitely worth its own EEP.

Word separators
---------------

In connection with language independence, I think we should drop the
notion of _words_ as a group of characters separated by space. The word
"token" is more general and does not in the same way indicate language
constructs. The ``string`` module has the ASCII space character as a
default for word separation, which I think should be dropped in
``bstring``. Whatever should separate tokens should be supplied,
possibly as alternatives. I therefore suggest the functions
``bstring:num_tokens`` and ``bstring:nth_token`` to fulfill the
functionality of ``string:words`` and ``string:sub_word``.

As in [EEP 9][], I suggest a new function ``split`` to handle the case
of multi-character separators for tokens. A combination of ``split``
and ``join`` makes a convenient ``replace`` function too.
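A minimal sketch of that idea, working on raw bytes (the ISO-Latin-1 view) and using the existing ``binary:split/3`` as a stand-in for the suggested ``bstring:split``. The module name ``replace_sketch`` and the local ``join/2`` are assumptions of this example.

```erlang
-module(replace_sketch).
-export([replace/3]).

%% Replace every occurrence of Pattern in Bin with Replacement by
%% splitting on Pattern and joining the pieces with Replacement.
%% Pattern must be a non-empty binary.
replace(Bin, Pattern, Replacement)
  when is_binary(Bin), is_binary(Pattern), is_binary(Replacement) ->
    Pieces = binary:split(Bin, Pattern, [global]),
    join(Pieces, Replacement).

%% Join a list of binaries with Sep between each pair of elements.
join([], _Sep) ->
    <<>>;
join([First | Rest], Sep) ->
    iolist_to_binary([First | [[Sep, Piece] || Piece <- Rest]]).
```

For example, ``replace(<<"a, b, c">>, <<", ">>, <<"; ">>)`` gives ``<<"a; b; c">>``.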

Left-to-right or Right-to-left
------------------------------

As mentioned earlier, I don't think direction of the graphical
representation should be implied in the interface, so I suggest using
notions like leading and trailing (meaning leading and trailing
characters in the binary) rather than any directional notions. I also
think aligning strings (like in ``string:right`` etc.) could be solved
in one function ``align``, taking one of the atoms ``leading``,
``trailing`` or ``center`` as a parameter, if it should at all be
implemented.

Naming and duplicated functionality
-----------------------------------

I definitely do not think we should have all interfaces from
``string`` duplicated to ``bstring``. Especially interfaces that are
aliases should not be carried along to the ``bstring`` module. Most
functions in the ``string`` module, however, have short and fairly
descriptive names, often similar to names found in other languages. I
think using a ``r`` prefix for functionality working from the end of
the string towards the beginning is a good choice, as is ``c`` for
complement.

Byte oriented versus character oriented return values
-----------------------------------------------------

Some functions in ``string``, that are certainly useful, return numbers
denoting character positions. The same functions should definitely be
present in the ``bstring`` module and the return values should
definitely be character oriented. However byte offsets are definitely
useful, for example if we use a function like ``span`` to find the
first character not in a set of characters, we might want the byte
offset of that first character too.

I suggest adding some interfaces returning byte offsets, or _part()'s_
like the ones used in the ``binary`` module and by ``re``, to cope
with the need for byte offsets and lengths in some circumstances. A
``b`` suffix to the function name could denote such functionality, so
that ``bstring:span`` returns a character position while
``bstring:spanb`` returns a byte position and ``bstring:str`` returns a
character position and ``bstring:strb`` returns a _part()_. Although
this will in the end give rise to more functions in the interface,
having return-type-changing options in an option list is not the way
to go (I know, I have them in ``re``, but it's still not generally a
good idea...).
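To make the distinction concrete, here is a sketch of a character-oriented ``span`` and a byte-oriented ``spanb`` for UTF-8 data, sharing one scanning loop. The module name ``span_sketch`` and the two-argument forms (UTF-8 implied) are assumptions of this example, not the suggested interface.

```erlang
-module(span_sketch).
-export([span/2, spanb/2]).

%% Number of leading characters of a UTF-8 binary that belong to Chars.
span(Bin, Chars) ->
    {CharCount, _ByteCount} = scan(Bin, Chars, 0, 0),
    CharCount.

%% Number of bytes occupied by those same leading characters.
spanb(Bin, Chars) ->
    {_CharCount, ByteCount} = scan(Bin, Chars, 0, 0),
    ByteCount.

%% Walk the binary one UTF-8 character at a time, counting both
%% characters and bytes until a character outside Chars is found.
scan(<<C/utf8, Rest/binary>> = Bin, Chars, NChars, NBytes) ->
    case lists:member(C, Chars) of
        true ->
            Width = byte_size(Bin) - byte_size(Rest),
            scan(Rest, Chars, NChars + 1, NBytes + Width);
        false ->
            {NChars, NBytes}
    end;
scan(<<>>, _Chars, NChars, NBytes) ->
    {NChars, NBytes};
scan(_Invalid, _Chars, _NChars, _NBytes) ->
    erlang:error(badarg).
```

For the UTF-8 binary ``<<195,165,195,182,"abc">>`` and the character set ``[229, 246]``, the character count is 2 while the byte count is 4.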

New or replaced functionality
-----------------------------

When writing a general string module, there is no end to the new, more
or less esoteric, functionality one could add. I think we, at least
in an initial implementation, should stick to the functionality
outlined in [EEP 9][], namely extending ``str`` and friends to
optionally take a list of alternative strings to search for, add a
function ``split`` to take care of multi-character separators (as
opposed to single character separators in the function ``tokens``) and
a substitution function, which I think should be named ``replace`` as
in other modules.

The use of pre-compiled matches from the ``binary`` module is however
not a good idea, as the ``binary`` module has no notion of character
encoding. Search strings need to be given in defined character
encodings, and the encodings of both the "haystack" and the "needles" need to
be known when doing an efficient search. So - no pre-compiled search
expressions.

Excerpt of a suggested manual page
----------------------------------

As made obvious above, I prefer the name ``bstring`` for a binary
string module in favor of the more verbose name ``binary_string``
originally suggested. In that module ``bstring``, I suggest the
following interfaces, expressed as in a manual page of OTP.

DATA TYPES
----------

    encoding() = latin1 | unicode | utf8
        - The encoding of characters in the binary data, both input and output
    bstring()
        - Binary with characters encoded either in ISO-Latin-1 or UTF-8
    unicode_char() = non_negative_integer()
        - An integer representing a valid unicode codepoint
    non_negative_integer()
        - An integer >= 0

EXPORTS
-------

### ``align(BString, Alignment, Number, Char) -> Result``

### ``align(BString, Encoding, Alignment, Number, Char) -> Result``

Types:

    BString = Result = bstring()
    Encoding = encoding()
    Alignment = leading | trailing | center
    Number = non_negative_integer()
    Char = unicode_char()

Aligns the characters in ``BString`` in a ``Result`` of ``Number`` characters according to the ``Alignment`` parameter. Alignment is done by inserting the character ``Char`` in the beginning or end (or both) of the binary string.

The resulting binary string will contain exactly ``Number`` characters. The string is truncated if it contains more characters than ``Number``: at the end if ``Alignment`` is ``leading``, at the beginning if ``Alignment`` is ``trailing``, or at both ends if ``Alignment`` is ``center``. If ``Encoding`` is ``unicode``, the ``Result`` may well contain more bytes than ``Number``, as one character may require several bytes.

Example:

    > bstring:align(<<"Hello">>, latin1, center, 10, $.).
    <<"..Hello...">>

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, ``Encoding`` or ``Alignment`` has an invalid value, the character ``Char`` cannot be encoded in the character encoding given as ``Encoding`` or any of the parameters are of the wrong type.
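The following is a sketch of ``align/4`` for the ``latin1`` case only, where one byte is one character. The module name ``align_sketch`` is made up. How an odd amount of padding or truncation is divided for ``center`` is an assumption taken from the example above (the extra character goes at the end); the text does not specify it further.

```erlang
-module(align_sketch).
-export([align/4]).

%% Hypothetical latin1-only sketch of the suggested align/4.
align(BString, Alignment, Number, Char)
  when is_binary(BString), is_integer(Number), Number >= 0,
       is_integer(Char), Char >= 0, Char =< 255 ->
    Len = byte_size(BString),
    case Len >= Number of
        true  -> truncate(BString, Alignment, Len - Number, Number);
        false -> pad(BString, Alignment, Number - Len, Char)
    end;
align(_, _, _, _) ->
    erlang:error(badarg).

%% Drop Excess bytes from the end (leading), the beginning (trailing)
%% or both ends (center).
truncate(B, leading, _Excess, Number) -> binary:part(B, 0, Number);
truncate(B, trailing, Excess, Number) -> binary:part(B, Excess, Number);
truncate(B, center, Excess, Number)   -> binary:part(B, Excess div 2, Number);
truncate(_, _, _, _)                  -> erlang:error(badarg).

%% Add Missing copies of Char after (leading), before (trailing)
%% or around (center) the string.
pad(B, leading, Missing, Char) ->
    <<B/binary, (fill(Missing, Char))/binary>>;
pad(B, trailing, Missing, Char) ->
    <<(fill(Missing, Char))/binary, B/binary>>;
pad(B, center, Missing, Char) ->
    Before = Missing div 2,
    After = Missing - Before,
    <<(fill(Before, Char))/binary, B/binary, (fill(After, Char))/binary>>;
pad(_, _, _, _) ->
    erlang:error(badarg).

fill(N, Char) ->
    binary:copy(<<Char>>, N).
```

A UTF-8 version would have to count and cut characters rather than bytes, which is where the extra cost discussed earlier comes from.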

### ``chr(BString, Character) -> Position``

### ``chr(BString, Encoding, Character) -> Position``

### ``rchr(BString, Character) -> Position``

### ``rchr(BString, Encoding, Character) -> Position``

Types:

    BString = bstring()
    Encoding = encoding()
    Character = unicode_char()
    Position = integer()

Returns the (zero-based) character position of the first/last occurrence of ``Character`` in ``BString`` . ``-1`` is returned if ``Character`` does not occur.

Note that the character position is not the same as the byte position. Use the ``chrb`` and ``rchrb`` functions to get the byte positions.

If ``Character`` cannot be represented in the encoding, it is not an error; you are simply certain to get ``-1`` as a return value.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the searched part of ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.

### ``chrb(BString, Character) -> {BytePosition, ByteLength}``

### ``chrb(BString, Encoding, Character) -> {BytePosition, ByteLength}``

### ``rchrb(BString, Character) -> {BytePosition, ByteLength}``

### ``rchrb(BString, Encoding, Character) -> {BytePosition, ByteLength}``

Types:

    BString = bstring()
    Encoding = encoding()
    Character = unicode_char()
    BytePosition = integer()
    ByteLength = non_negative_integer()

Works as ``chr`` and ``rchr`` respectively, but returns the byte position and byte length of the character.

If the character is not found, ``{-1,0}`` is returned.
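A sketch of how the character-oriented and byte-oriented variants could relate for UTF-8 data. The module name ``chr_sketch``, the two-argument forms (UTF-8 implied) and the helper ``find/4`` are assumptions of this example.

```erlang
-module(chr_sketch).
-export([chr/2, chrb/2]).

%% Zero-based character position of the first occurrence of Char in a
%% UTF-8 binary, or -1 if it does not occur.
chr(Bin, Char) ->
    case find(Bin, Char, 0, 0) of
        {CharPos, _BytePos, _Width} -> CharPos;
        not_found -> -1
    end.

%% Byte position and byte length of the same character, or {-1,0}.
chrb(Bin, Char) ->
    case find(Bin, Char, 0, 0) of
        {_CharPos, BytePos, Width} -> {BytePos, Width};
        not_found -> {-1, 0}
    end.

%% Decode one UTF-8 character at a time, tracking both the character
%% index and the byte offset. Only the searched part is validated.
find(<<C/utf8, Rest/binary>> = Bin, Char, CharPos, BytePos) ->
    Width = byte_size(Bin) - byte_size(Rest),
    case C =:= Char of
        true  -> {CharPos, BytePos, Width};
        false -> find(Rest, Char, CharPos + 1, BytePos + Width)
    end;
find(<<>>, _Char, _CharPos, _BytePos) ->
    not_found;
find(_Invalid, _Char, _CharPos, _BytePos) ->
    erlang:error(badarg).
```

A successful result of ``chrb/2`` has the same shape as the parts accepted by ``binary:part/2``, so the bytes of the found character can be extracted directly.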

### ``concat(BString1, BString2) -> BString3``

### ``concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3``

Types:

    BString1 = BString2 = BString3 = bstring()
    Encoding1 = Encoding2 = Encoding3 = encoding()

Concatenates two binary strings to form a new string. Returns the new binary string in the encoding given by ``Encoding3``.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString1`` or ``BString2`` does not contain characters encoded according to the ``Encoding1`` and ``Encoding2`` parameters, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding or any of the parameters are of the wrong type.

### ``equal(BString1, BString2) -> bool()``

### ``equal(BString1, Encoding1, BString2, Encoding2) -> bool()``

Types:

    BString1 = BString2 = bstring()
    Encoding1 = Encoding2 = encoding()

Tests whether two binary strings are equal. Returns ``true`` if they are, otherwise ``false`` .

``Encoding1`` is the encoding of ``BString1`` and ``Encoding2`` is the encoding of ``BString2`` .

Note that the strings can have different encoding and that it is the character values encoded in the strings that are compared. The binary strings are scanned as long as they are equal, meaning that if the function returns ``true``, both strings are correctly encoded, while a return value of ``false`` does not guarantee correct encoding in both binary strings. An exception is raised if faulty encoding is determined while comparing the strings, not if parts of the string not inspected contain encoding errors.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if wrongly encoded characters, according to the encoding parameters, are encountered during comparison, the encoding parameters have an invalid value or any of the parameters are of the wrong type.

### ``join(BStringList, Separator) -> Result``

### ``join(BStringList, BStringListEncoding, Separator, SeparatorEncoding, ResultEncoding) -> Result``

Types:

    BStringList = [bstring()]
    BStringListEncoding = SeparatorEncoding = ResultEncoding = encoding()
    Separator = bstring()
    Result = bstring()

Returns a binary string with the elements of ``BStringList`` separated by the binary string in ``Separator``.

All the binary strings in ``BStringList`` should have the same encoding (given as ``BStringListEncoding``). The ``Separator`` can however have a different encoding (given as ``SeparatorEncoding``), as can the ``Result`` (given as ``ResultEncoding``).

Example:

    > bstring:join([<<"one">>, <<"two">>, <<"three">>], latin1, <<", ">>, latin1, latin1).
    <<"one, two, three">>

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if binary strings in ``BStringList`` or the ``Separator`` do not contain characters encoded according to the ``BStringListEncoding`` and ``SeparatorEncoding`` parameters respectively, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding ``ResultEncoding`` or any of the parameters are of the wrong type.

### ``len(BString) -> Length``

### ``len(BString, Encoding) -> Length``

Types:

    BString = bstring()
    Encoding = encoding()
    Length = non_negative_integer()

Returns the number of characters in the binary string.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value or any of the parameters are of the wrong type.
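A sketch of ``len/2`` showing why the ``latin1`` case needs no inspection of the data while the UTF-8 case has to decode every character. The module name ``len_sketch`` is an assumption of this example.

```erlang
-module(len_sketch).
-export([len/2]).

%% Hypothetical sketch of the suggested len/2.
%% latin1: one byte per character, so the byte size is the length.
len(Bin, latin1) when is_binary(Bin) ->
    byte_size(Bin);
%% unicode/utf8: decode and count each character.
len(Bin, Enc) when is_binary(Bin), (Enc =:= unicode orelse Enc =:= utf8) ->
    count(Bin, 0);
len(_, _) ->
    erlang:error(badarg).

count(<<_/utf8, Rest/binary>>, N) -> count(Rest, N + 1);
count(<<>>, N)                    -> N;
count(_Invalid, _N)               -> erlang:error(badarg).
```

The same five bytes ``<<"sm",195,182,"r">>`` have length 5 when viewed as ``latin1`` and length 4 when viewed as ``utf8``.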

### ``nth_token(BString, N, CharList) -> Result``

### ``nth_token(BString, Encoding, N, CharList) -> Result``

Types:

    BString = Result = bstring()
    Encoding = encoding()
    CharList = [ unicode_char() ]
    N = non_negative_integer()

Returns the token number ``N`` of ``BString`` (zero-based). Tokens are separated by the characters in ``CharList`` .

The returned token will have the same encoding as ``BString`` .

For example:

    > bstring:nth_token(<<" Hello old boy !">>,latin1,3,[$o, $ ]).
    <<"y">>

``CharList`` is to be viewed as a _set_ of characters; order is not significant. It is not an error if codepoints given in ``CharList`` cannot be represented in the ``Encoding``.

Values of ``N`` >= number of tokens in ``BString`` will result in the empty binary string ``<<>>`` being returned.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.

### ``num_tokens(BString, CharList) -> Count``

### ``num_tokens(BString, Encoding, CharList) -> Count``

Types:

    BString = bstring()
    Encoding = encoding()
    CharList = [ unicode_char() ]
    Count = non_negative_integer()

Returns the number of tokens in ``BString``, separated by the characters in ``CharList``.

The result is the same as for ``length(bstring:tokens(BString,Encoding,CharList))``, but avoids building the result.

For example:

    > bstring:num_tokens(<<" Hello old boy!">>, latin1, [$o, $ ]).
    4

``CharList`` is to be viewed as a _set_ of characters; order is not significant. It is not an error if codepoints given in ``CharList`` cannot be represented in the ``Encoding``.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.
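The token functions can be sketched for the ``latin1`` case as follows. The module name ``token_sketch`` and the encoding-less arities are assumptions of this example. Unlike the function described above, this ``num_tokens/2`` simply builds the token list and takes its length, so it only illustrates the result, not the optimization.

```erlang
-module(token_sketch).
-export([tokens/2, num_tokens/2, nth_token/3]).

%% Split a latin1 binary into non-empty tokens separated by any of the
%% bytes in CharList (treated as a set).
tokens(Bin, CharList) when is_binary(Bin), is_list(CharList) ->
    tokens(Bin, CharList, <<>>, []).

tokens(<<C, Rest/binary>>, CharList, Current, Acc) ->
    case lists:member(C, CharList) of
        true  -> tokens(Rest, CharList, <<>>, add(Current, Acc));
        false -> tokens(Rest, CharList, <<Current/binary, C>>, Acc)
    end;
tokens(<<>>, _CharList, Current, Acc) ->
    lists:reverse(add(Current, Acc)).

%% Empty tokens (between adjacent separators) are not kept.
add(<<>>, Acc)  -> Acc;
add(Token, Acc) -> [Token | Acc].

%% Number of tokens; builds the list, unlike the suggested function.
num_tokens(Bin, CharList) ->
    length(tokens(Bin, CharList)).

%% Zero-based token lookup; <<>> if N is past the last token.
nth_token(Bin, N, CharList) when is_integer(N), N >= 0 ->
    Tokens = tokens(Bin, CharList),
    case N < length(Tokens) of
        true  -> lists:nth(N + 1, Tokens);
        false -> <<>>
    end.
```

With the separator set ``[$o, $ ]``, the binary ``<<" Hello old boy!">>`` splits into ``<<"Hell">>``, ``<<"ld">>``, ``<<"b">>`` and ``<<"y!">>``.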

### ``span(BString, Chars) -> Length``

### ``span(BString, Encoding, Chars) -> Length``

### ``rspan(BString, Chars) -> Length``

### ``rspan(BString, Encoding, Chars) -> Length``

### ``cspan(BString, Chars) -> Length``

### ``cspan(BString, Encoding, Chars) -> Length``

### ``rcspan(BString, Chars) -> Length``

### ``rcspan(BString, Encoding, Chars) -> Length``

Types:

    BString = bstring()
    Encoding = encoding()
    Chars = [ integer() ]
    Length = non_negative_integer()

Returns the length (in characters) of the maximum initial (``span`` and ``cspan``) or trailing (``rspan`` and ``rcspan``) segment of ``BString``, which consists entirely of characters from (``span`` and ``rspan``), or not from (``cspan`` and ``rcspan``) ``Chars``.

``Chars`` is to be viewed as a _set_ of characters; order is not significant. It is not an error if codepoints given in ``Chars`` cannot be represented in the ``Encoding``.

For example:

    > bstring:span(<<"\t    abcdef">>,latin1," \t").
    5
    > bstring:cspan(<<"\t    abcdef">>,latin1, " \t").
    0

Codepoints in ``Chars`` that cannot be represented by ``Encoding`` are not considered an error.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the searched part of ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.

### ``spanb(BString, Chars) -> ByteLength``

### ``spanb(BString, Encoding, Chars) -> ByteLength``

### ``rspanb(BString, Chars) -> ByteLength``

### ``rspanb(BString, Encoding, Chars) -> ByteLength``

### ``cspanb(BString, Chars) -> ByteLength``

### ``cspanb(BString, Encoding, Chars) -> ByteLength``

### ``rcspanb(BString, Chars) -> ByteLength``

### ``rcspanb(BString, Encoding, Chars) -> ByteLength``

Types:

    BString = bstring()
    Encoding = encoding()
    Chars = [ integer() ]
    ByteLength = non_negative_integer()

Work exactly as the functions ``span``, ``rspan``, ``cspan`` and ``rcspan`` respectively, but return the number of bytes rather than the number of characters.

### ``split(BString, Separators, Where) -> Tokens``

### ``split(BString, Encoding, Separators, SepEncoding, Where, ReturnEncoding) -> Tokens``

Types:

    BString = bstring()
    Encoding = SepEncoding = ReturnEncoding = encoding()
    Separators = [ bstring() ]
    Where = first | last | all
    Tokens = [bstring()]

Returns a list of tokens in ``BString``, separated by the binary strings in ``Separators`` .

The ``Tokens`` returned are encoded according to ``ReturnEncoding`` .

Example:

    > bstring:split(<<"abc defxxghix jkl">>, latin1, [<<"x">>,<<" ">>], latin1, all, latin1).
    [<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]

``Separators`` is to be viewed as a _set_ of binary strings; order is not significant. It is not an error if codepoints given in ``Separators`` cannot be represented in the ``Encoding``.

The ``Where`` parameter specifies at which occurrence of any of the ``Separators`` the binary string is to be split, either at the ``first`` occurrence, the ``last`` occurrence or at ``all`` occurrences, in which case the ``Tokens`` may be an arbitrarily long list.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString`` or ``Separators`` do not contain characters encoded according to the ``Encoding`` and ``SepEncoding`` parameters respectively, the resulting tokens cannot be encoded in the ``ReturnEncoding``, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.
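The effect of the ``Where`` parameter can be illustrated with the existing ``binary`` module. This is only a sketch, assuming ``latin1`` for every encoding argument; the module name ``split_sketch`` is hypothetical. Note that this sketch keeps empty tokens between adjacent separators, which may differ from the behaviour intended by the proposal.

```erlang
-module(split_sketch).
-export([split/3]).

%% Hypothetical latin1-only sketch of the Where parameter,
%% built on binary:split/2,3 and binary:matches/2.
split(BString, Separators, first) ->
    %% Without the global option, only the first occurrence is used.
    binary:split(BString, Separators);
split(BString, Separators, all) ->
    binary:split(BString, Separators, [global]);
split(BString, Separators, last) ->
    case binary:matches(BString, Separators) of
        [] ->
            [BString];
        Matches ->
            %% Matches are reported left to right; take the final one.
            {Pos, Len} = lists:last(Matches),
            <<Head:Pos/binary, _:Len/binary, Tail/binary>> = BString,
            [Head, Tail]
    end;
split(_, _, _) ->
    erlang:error(badarg).
```

For example, ``split_sketch:split(<<"a,b,c">>, [<<",">>], last)`` gives ``[<<"a,b">>,<<"c">>]``, while ``first`` gives ``[<<"a">>,<<"b,c">>]``.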

### ``str(BString, SubBStrings) -> Position``

### ``str(BString, Encoding, SubBStrings, SubEnc) -> Position``

### ``rstr(BString, SubBStrings) -> Position``

### ``rstr(BString, Encoding, SubBStrings, SubEnc) -> Position``

Types:

    BString = bstring()
    SubBStrings = bstring() | [ bstring() ]
    Encoding = SubEnc = encoding()
    Position = integer()

Returns the (zero-based) character position where the first/last occurrence of any of the ``SubBStrings`` begins in ``BString``. ``-1`` is returned if none of the ``SubBStrings`` occurs in ``BString``.

Note that the character position is not the same as the byte position. Use the ``strb`` and ``rstrb`` functions to get the byte positions.

The encoding need not be the same for ``BString`` and ``SubBStrings``; however, all strings in ``SubBStrings`` need to have the same encoding.

If the codepoints in ``SubBStrings`` cannot be represented in the encoding of ``BString``, that is not an error, but it will always result in the return value ``-1``.

Example:

    > bstring:str(<<" Hello Hello World World ">>,latin1,<<"Hello World">>,latin1).
    7

Note that if both encodings are the same and repeated searches with the same ``SubBStrings`` are to be performed, it is more efficient to use the ``binary:match/{2,3}`` functions with a precompiled pattern on the raw binary data.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the searched part of ``BString`` or ``SubBStrings`` does not contain characters encoded according to the ``Encoding`` and ``SubEnc`` parameters respectively, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.

### ``strb(BString, SubBStrings) -> {BytePosition, ByteLength}``

### ``strb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}``

### ``rstrb(BString, SubBStrings) -> {BytePosition, ByteLength}``

### ``rstrb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}``

Types:

    BString = bstring()
    SubBStrings = bstring() | [ bstring() ]
    Encoding = SubEnc = encoding()
    BytePosition = integer()
    ByteLength = non_negative_integer()

Work as ``str`` and ``rstr`` respectively, but return the byte position and byte length of the found substring.

Note that ``ByteLength`` is the length the found substring has in ``BString``, regardless of the encoding of ``SubBStrings``, so ``ByteLength`` may be either larger or smaller than ``byte_size(SubBString)``, depending on the encodings involved.

If the substring is not found, ``{-1,0}`` is returned.
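The relation between the character position and the byte position can be sketched as follows, assuming that ``BString`` and all of ``SubBStrings`` are UTF-8 encoded. The module name ``str_sketch`` is hypothetical; the sketch relies on ``binary:match/2`` and simply counts the characters in front of the match.

```erlang
-module(str_sketch).
-export([strb/2, str/2]).

%% Hypothetical sketch, assuming BString and every pattern are UTF-8.
%% Returns {BytePosition, ByteLength}, or {-1, 0} when nothing matches.
strb(BString, SubBStrings) when is_binary(BString) ->
    case binary:match(BString, SubBStrings) of
        nomatch -> {-1, 0};
        {Pos, Len} -> {Pos, Len}
    end.

%% Character position: count the characters in front of the match.
str(BString, SubBStrings) ->
    case strb(BString, SubBStrings) of
        {-1, 0} -> -1;
        {Pos, _Len} -> char_count(binary:part(BString, 0, Pos), 0)
    end.

char_count(<<_/utf8, Rest/binary>>, N) -> char_count(Rest, N + 1);
char_count(<<>>, N) -> N;
char_count(_Invalid, _N) -> erlang:error(badarg).
```

For the binary ``<<"bl", 229/utf8, "b", 228/utf8, "r">>`` and the substring ``<<"b", 228/utf8>>``, ``str_sketch:strb/2`` gives ``{4,3}`` while ``str_sketch:str/2`` gives ``3``, since the character with codepoint 229 occupies two bytes.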

### ``strip(BString, Which, CharList) -> Result``

### ``strip(BString, Encoding, Which, CharList) -> Result``

Types:

    BString = Result = bstring()
    Encoding = encoding()
    Which = leading | trailing | both
    CharList = [ unicode_char() ]

Removes leading (``Which`` = ``leading``), trailing (``Which`` = ``trailing``) or both leading and trailing (``Which`` = ``both``) characters belonging to the set indicated by ``CharList`` from the binary string ``BString``.

This is essentially the same as using ``spanb`` and/or ``rspanb`` in combination with bit syntax to remove the characters.

Example:

    > bstring:strip(<<"...He.llo.....">>, latin1, both, [$.]).
    <<"He.llo">>

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the scanned part of ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, ``Encoding`` or ``Which`` has an invalid value, or any of the parameters are of the wrong type.
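The combination of span functions and bit syntax mentioned above could look roughly like this for ``latin1``, where one character is one byte. The module name ``strip_sketch`` is hypothetical, and the local helpers ``spanb/3`` and ``rspanb/3`` take an extra index argument, so they are not the functions proposed in this EEP.

```erlang
-module(strip_sketch).
-export([strip/3]).

%% Hypothetical latin1-only sketch: one character is one byte.
strip(BString, leading, Chars) when is_binary(BString), is_list(Chars) ->
    Skip = spanb(BString, Chars, 0),
    <<_:Skip/binary, Result/binary>> = BString,
    Result;
strip(BString, trailing, Chars) when is_binary(BString), is_list(Chars) ->
    Size = byte_size(BString),
    Keep = Size - rspanb(BString, Chars, Size),
    <<Result:Keep/binary, _/binary>> = BString,
    Result;
strip(BString, both, Chars) when is_binary(BString), is_list(Chars) ->
    strip(strip(BString, leading, Chars), trailing, Chars);
strip(_, _, _) ->
    erlang:error(badarg).

%% Number of leading bytes that belong to Chars, scanning from index N.
spanb(BString, Chars, N) when N < byte_size(BString) ->
    case lists:member(binary:at(BString, N), Chars) of
        true -> spanb(BString, Chars, N + 1);
        false -> N
    end;
spanb(_BString, _Chars, N) ->
    N.

%% Number of trailing bytes that belong to Chars; End is the exclusive
%% end index of the part that remains to be scanned.
rspanb(BString, Chars, End) when End > 0 ->
    case lists:member(binary:at(BString, End - 1), Chars) of
        true -> rspanb(BString, Chars, End - 1);
        false -> byte_size(BString) - End
    end;
rspanb(BString, _Chars, _End) ->
    byte_size(BString).
```

The result binaries are sub-binaries of the input, so no character data needs to be copied by the stripping itself.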

### ``replace(BString, Separators, Replacement, Where) -> Result``

### ``replace(BString, Encoding, Separators, SeparatorsEncoding, Replacement, ReplacementEncoding, Where, ResultEncoding) -> Result``

Types:

    BString = bstring()
    Encoding = SeparatorsEncoding = ReplacementEncoding = ResultEncoding = encoding()
    Separators = [ bstring() ]
    Replacement = bstring()
    Where = first | last | all
    Result = bstring()

Produces the same result as

    bstring:join(bstring:split(BString,Encoding,Separators,SeparatorsEncoding,Where,
                               unicode),
                 unicode,Replacement,ReplacementEncoding,ResultEncoding)

but with less overhead.

### ``substr(BString, Start, Length) -> SubBString``

### ``substr(BString, Encoding, Start, Length) -> SubBString``

Types:

    BString = SubBString = bstring()
    Encoding = encoding()
    Start = integer()
    Length = non_negative_integer() | infinity

Returns a substring of ``BString``, starting at the zero-based character position ``Start`` and ending either at the end of the binary string (if ``Length`` is ``infinity``) or just before the character position ``Start+Length`` (if ``Length`` is a non-negative integer).

The returned ``SubBString`` will have the same encoding as ``BString``.

Example:

    > bstring:substr(<<"Hello World">>, latin1, 3, 5).
    <<"lo Wo">>

A negative value of ``Start`` denotes ``abs(Start)`` characters from the _end_ of ``BString``, so that ``-1`` is the last character position in the binary string.

Example:

    > bstring:substr(<<"Hello World">>, latin1, -3, 3).
    <<"rld">>

As the true length of a UTF-8 encoded binary string is quite costly to determine (``O(N)``, where ``N`` is the number of bytes in the binary), the function is very forgiving about positions outside of the string, for both ``Start`` and ``Length``. Character positions outside of the string in either direction are ignored, so a range that lies entirely outside the string yields the empty binary string.

Examples:

    > bstring:substr(<<"01234">>, latin1, 5, 5).
    <<>>
    > bstring:substr(<<"01234">>, latin1, 4, 5).
    <<"4">>
    > bstring:substr(<<"01234">>, latin1, -5, 100).
    <<"01234">>
    > bstring:substr(<<"01234">>, latin1, -6, 1).
    <<>>
    > bstring:substr(<<"01234">>, latin1, -6, 2).
    <<"0">>

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the searched part of ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.
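The forgiving treatment of positions can be sketched for ``latin1`` as an intersection of the requested range with the actual string. The module name ``substr_sketch`` is hypothetical, and the sketch only covers the fixed-width case where character positions equal byte positions.

```erlang
-module(substr_sketch).
-export([substr/3]).

%% Hypothetical latin1-only sketch of the clamping rules.
substr(BString, Start, Length)
  when is_binary(BString), is_integer(Start),
       (Length =:= infinity orelse
        (is_integer(Length) andalso Length >= 0)) ->
    Size = byte_size(BString),
    %% A negative Start counts from the end: -1 is the last position.
    From = if Start < 0 -> Size + Start; true -> Start end,
    To = case Length of
             infinity -> Size;
             _ -> From + Length
         end,
    %% Intersect the requested range [From, To) with [0, Size).
    ClampedFrom = min(max(From, 0), Size),
    ClampedTo = min(max(To, ClampedFrom), Size),
    binary:part(BString, ClampedFrom, ClampedTo - ClampedFrom);
substr(_, _, _) ->
    erlang:error(badarg).
```

With these rules, ``substr_sketch:substr(<<"01234">>, -6, 2)`` gives ``<<"0">>``: the range starts one position before the string and only its second position falls inside.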

### ``tokens(BString, SeparatorList) -> Tokens``

### ``tokens(BString, Encoding, SeparatorList) -> Tokens``

Types:

    BString = bstring()
    Encoding = encoding()
    SeparatorList = [ non_negative_integer() ]
    Tokens = [bstring()]

Returns a list of tokens in ``BString``, separated by the characters in ``SeparatorList``.

The ``Tokens`` returned are encoded in the same character encoding as the ``BString``.

Example:

    > bstring:tokens(<<"abc defxxghix jkl">>, latin1, [$x,$ ]).
    [<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]

``SeparatorList`` is to be viewed as a _set_ of characters; order is not significant. It is not an error if ``SeparatorList`` contains codepoints that cannot be represented in the ``Encoding``.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the searched part of ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.
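A tokenizer of this kind could be sketched as follows for UTF-8 input. The module name ``tokens_sketch`` is hypothetical. The sketch assumes that empty tokens between adjacent separators are dropped, as ``string:tokens/2`` does for lists and as the example above suggests.

```erlang
-module(tokens_sketch).
-export([tokens/2]).

%% Hypothetical sketch assuming UTF-8 input; empty tokens are dropped.
tokens(BString, SeparatorList)
  when is_binary(BString), is_list(SeparatorList) ->
    tokens(BString, SeparatorList, BString, 0, 0, []).

%% Orig is the untouched input; Start and Len delimit the current
%% token in bytes, so each token is cut out as a sub-binary.
tokens(<<C/utf8, Rest/binary>>, Seps, Orig, Start, Len, Acc) ->
    Width = byte_size(<<C/utf8>>),
    case lists:member(C, Seps) of
        true ->
            NewAcc = add(Orig, Start, Len, Acc),
            tokens(Rest, Seps, Orig, Start + Len + Width, 0, NewAcc);
        false ->
            tokens(Rest, Seps, Orig, Start, Len + Width, Acc)
    end;
tokens(<<>>, _Seps, Orig, Start, Len, Acc) ->
    lists:reverse(add(Orig, Start, Len, Acc));
tokens(_Invalid, _Seps, _Orig, _Start, _Len, _Acc) ->
    erlang:error(badarg).

%% Skip empty tokens, otherwise extract the token from the input.
add(_Orig, _Start, 0, Acc) -> Acc;
add(Orig, Start, Len, Acc) -> [binary:part(Orig, Start, Len) | Acc].
```

Since the separators are compared as codepoints rather than bytes, a separator outside the ASCII range is handled the same way as ``$x`` or a space.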

Performance
===========

This module can, and probably should, be implemented entirely in
Erlang; no BIFs or NIFs are needed.  Both the ``binary`` and
``unicode`` modules can be utilized to speed up conversion and input
checking. The Unicode versions will definitely be slower than the
ISO-Latin-1 versions, as character encoding, decoding and checking
are bound to produce overhead.

The suggested wrapper ``ubstring`` should not impose any significant
cost compared to calling ``bstring`` with all encoding arguments set
to ``unicode``.

The idea is to make string manipulation using binaries convenient, as
it has a great positive impact on a system's memory usage. Increased
speed compared to list-oriented strings is not the goal, although it
may well be a side effect.

Reference implementation
========================

No specific reference implementation has been made. The code will,
however, be made available on GitHub during any development.

Copyright
=========

This document is licensed under the [Creative Commons license][CCA3.0].

[EEP 9]: eep-0009.md
    "EEP 9, the original work from which this EEP is derived"

[EEP 11]: eep-0011.md
    "EEP 11, interesting extensions to EEP 9"

[EEP 31]: eep-0031.md
    "EEP 31, rewrite of EEP 9, module binary"

[CCA3.0]: http://creativecommons.org/licenses/by/3.0/
    "Creative Commons Attribution 3.0 License"

[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"