    Author: Patrik Nyblom, Fredrik Svahn
    Status: Draft
    Type: Standards Track
    Created: 29-Sep-2010
    Erlang-Version: R14B
    Post-History:
    Replaces: 9
****
EEP 35: Binary string module(s)
----

Abstract
========

This EEP develops the suggestions regarding the module ``binary_string`` first made in [EEP 9][]. The module name has, however, been changed to ``bstring``.

[EEP 9][] suggests several modules and is partially superseded by later EEPs (namely [EEP 11][] and [EEP 31][]), while still containing valuable suggestions not yet implemented. This last remaining module suggested in [EEP 9][] therefore appears in this separate EEP. This is done in agreement with the original author of [EEP 9][].

The module ``bstring`` is suggested to contain functions for convenient manipulation of textual data stored in binaries, i.e. binary strings. It somewhat resembles the ``string`` module (which is list oriented), but is not to be viewed simply as a ``string`` module for binaries. The suggested module handles binaries in both of the standard character encodings of Erlang, namely ISO-Latin-1 and UTF-8.

Motivation
==========

Text strings are traditionally represented as lists of integers in Erlang. While this is convenient and more or less built into the syntax of the language (i.e. ``"ABC"`` is syntactic sugar for ``[$A,$B,$C]``), a more compact representation is often desired. Also, in some circumstances binaries can be more efficient to manipulate in terms of algorithmic complexity than lists are (especially in the fixed character width case of ISO-Latin-1).

More modules have been added to the standard libraries lately to aid the usage of binaries for text strings, representing both ISO-Latin-1 characters and Unicode strings encoded in UTF-8. Most notably the ``re`` library, but also the ``unicode`` module, are fairly new additions to ``stdlib`` which make life easier for the programmer when it comes to manipulating binary encoded strings.
A module for fast searching and replacing in byte oriented binaries is also present (the module ``binary``), but no traditional string manipulation module is yet in the libraries. To ease the use of binary encoded strings, such a module is needed.

Rationale
=========

The module ``string`` for text oriented operations on lists has been present in the standard libraries for so long that most programmers don't remember a time when it wasn't there. It is said to originally be a merge of two different string modules, written and designed by two different programmers with possibly slightly different goals and definitely slightly different views on function naming. While sometimes criticized for duplicated functionality and inconsistent function naming, among other things, the module has remained useful throughout the entire lifespan of Erlang/OTP. The string representation used has also withstood the evolution of Unicode.

It is worth noting that the only functions in the ``string`` module that actually are language or region dependent are later additions to the module. Those functions (like ``to_upper``, ``to_lower``, ``to_integer`` and ``to_float``), or their binary equivalents, are not part of the module interface I suggest for ``bstring``, for the simple reason that they need language support not yet present in Erlang. A future EEP might suggest such language support (i.e. some kind of "locale" support), but that is future work not covered by this EEP.

So, however criticized, the ``string`` module is very useful for manipulating lists, and the same functionality for binary strings is desirable. While a lot of the functionality will be similar, there are some major issues to consider when implementing a module for manipulating strings encoded in binaries:

- Unicode - Binaries can have different encodings.
  A character encoded as UTF-8 might take more than one (up to four) byte positions, and even the same character can have different encodings in ISO-Latin-1 and UTF-8 (all codepoints from 128 to 255). The functions need to be informed of the character encoding explicitly, as the encoding information is not present in the binaries.

- Mixed character encodings - As characters can be encoded in different ways, two strings in the same program could have different encodings. Supplying the functions with non-homogeneous string encoding data should be handled consistently throughout the module, as should the selection of the returned encoding where applicable.

- Default character encoding - As functions will take extra arguments to specify encoding, a consistent default might be useful. Choosing the default is not entirely simple, as tradition states ISO-Latin-1, while the future suggests UTF-8.

- Languages - Erlang has no notion of "Locale" or preferred number format. A general string module can assume neither a specific notion of uppercase or lowercase letters, nor a specific number encoding format (especially true for floating point numbers).

- Word separators - The space character is certainly not the only word separator for textual data (in any language). The notion of words separated by spaces restricts the set of languages the interface is relevant for.

- Left to right or right to left - Notions like left or right to denote the beginning or end of a string are certainly not language independent. While strings in a language have a beginning and an end, that beginning and end may be placed to the left, to the right or even at the top, bottom or center of the graphical representation. A string manipulation module should not use naming implying a left-to-right script, or any other type of script.

- Naming and duplicated functionality - The original ``string`` module has been accused of having somewhat inconsistent naming and duplicated functionality.
  In fact the only duplicated functions are ``substr`` and ``sub_string``. Some cleanup of the interface might be needed.

- Byte oriented versus character oriented return values - When dealing with Unicode data, a character may take more than one byte, which is why, for example, the number of characters in a string tells you very little about the actual size of the string in bytes. Furthermore, later processing of a binary might require byte oriented rather than character oriented manipulation of a string (e.g. you want to manipulate the string using the ``binary`` module or with bit syntax), while characters, not bytes, are what actually constitute a string. You would want both.

- New or replaced functionality - New functionality has been suggested from several sources, most notably [EEP 9][]. For example, the function ``split`` suggested in [EEP 9][] is very similar to ``string:tokens/2``. Should we keep ``tokens`` anyway?

I'll address the different issues below.

Unicode
-------

The interface has to support both ISO-Latin-1 and UTF-8. The ``unicode`` module supports even more encodings, but Erlang/OTP uses UTF-8 for all "internal" interfaces and UTF-8 is the expected encoding of a binary Unicode string.

Even though UTF-8 is compatible with ISO-Latin-1 in the 7-bit ASCII range, characters with codepoints between 128 and 255 are encoded differently in the "plain" ISO-Latin-1 encoding and in UTF-8. This means that all functions in the ``bstring`` module need to have the actual encoding as one or more extra parameters.

One could invent a more abstract binary string format where the data is, for example, represented as a tuple with the string and the encoding packed together. However, no other module supports such a string construct and I don't think it would really add anything, neither functionality nor readability. Consider code like:

    bstring:tokens(Bin,latin1,[$ ,$\n])

compared to:

    bstring:tokens({Bin,latin1}, [$ ,$\n]).
or even:

    bstring:tokens(#bstring{data = Bin, encoding = latin1}, [$ ,$\n]).

In many cases the extra information needs to be added in connection with the call, making the code no more readable or simple to write than with the separate extra argument.

Consider if we had a default value for encoding. The code:

    f(Data) ->
        bstring:tokens(Data,[$ ,$\n]).

would not in any way indicate whether ``Data`` was supposed to be a binary with the default encoding or some kind of complex data structure indicating both the actual string and its encoding.

I think the extra argument for the encoding is straightforward and simple, and it makes programming easier when using the binary string in other modules as well (e.g. ``re``, ``binary``, ``file`` etc).

I think we should simply not have a special string datatype for this module; character encoding should be supplied as a separate argument.

Mixed character encodings
-------------------------

To ease transition between character encodings, I think the interface should accept different encodings for both different parameters and the return value. This makes it possible to convert on the fly and for the functions to decide on the most efficient character conversion path for the supplied arguments and the return value.

The downside of this approach is that some functions will take a lot of parameters specifying different character encodings. For example, a string concatenation routine could look like:

    concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3

being called like:

    US = bstring:concat(SA,latin1, SB, latin1, unicode),

which might look a little awkward to write. On the other hand, conversion is made on the fly and you will not need to explicitly call the ``unicode`` module to convert the result. I think implicit conversion is so useful that it is worth the extra arguments.
For example, a ``concat`` function would be more or less useless without it, since the bit syntax is much easier to use when no conversion is needed.

Default character encoding
--------------------------

Choosing a default character encoding is not obvious. While ISO-Latin-1 is the default in Erlang (i.e. ``<<"korvsmörgås">>`` gives an ISO-Latin-1 encoded binary string), UTF-8 usage is expected to grow in the future.

Although it's tempting to select UTF-8 as the default encoding, I think we should stick to ISO-Latin-1 as the default even for this module. There are several reasons:

- We need not, as a rule, impose new standards in every module we add to the standard library. Consistency certainly adds value, and the bit syntax, the source code encoding and things like the ``io:format`` routine all have ISO-Latin-1 as default. Let's not make this module inconsistent with the others.

- The ``string`` module is often used to manipulate arbitrary lists of integers, not always actually representing textual data. In the same way, ``bstring`` can probably be used to manipulate arbitrary blobs of bytes if the ISO-Latin-1 versions are used. ISO-Latin-1 is actually the raw bytes uninterpreted, which is why any binary data can be worked on in an ISO-Latin-1 oriented routine. Using UTF-8 encoding as default would narrow the use of the default functions to only work on real text data.

- The pure ISO-Latin-1 implementations of the functions will be the most efficient ones, as no data checking at all is needed. Any byte value is acceptable in any version. Some functions are usable on UTF-8 strings even though they expect ISO-Latin-1 data, the only difference between the ISO-Latin-1 version and the UTF-8 version being the validation of input data.
  If the data given to, for example, ``bstring:concat`` is already checked for correct UTF-8, the simpler ISO-Latin-1 version of the function is both more efficient and guaranteed to give output that is as correct as the input:

        CorrectUtf8_1 = give_me_good_string(),
        CorrectUtf8_2 = give_me_another_good_string(),
        CorrectUtf8_3 = bstring:concat(CorrectUtf8_1, latin1,
                                       CorrectUtf8_2, latin1, latin1),
        ...

  Simply put, ISO-Latin-1 versions of the functions are more generally useful than pure UTF-8 versions and are also more efficient.

- A wrapper module providing pure UTF-8 interfaces can easily be written. The overhead of going via a wrapper would be relatively lower for a UTF-8 wrapper than for an ISO-Latin-1 one, as the overhead of character decoding/encoding of UTF-8 strings in the module would be quite high. Simply put, a wrapper would cost very little compared to the cost of checking the data for UTF-8 correctness. I actually suggest a module ``ubstring`` that has the part of the ``bstring`` interface where a default encoding is implied, but with the difference that UTF-8 is expected. For example, a function ``ubstring:tokens/2`` would look like this:

        tokens(S,L) ->
            bstring:tokens(S,unicode,L).

  Quite simple.

To conclude, I think all functions should exist in a version where no encoding is supplied and ISO-Latin-1 encoded data is expected.

Languages
---------

Even though Unicode characters can be used to express text in most known scripts, living and dead, language and region knowledge is a completely different thing. String interfaces often impose language specific properties on the string, like left-to-right writing direction, the notion of words built up by space separated groups of characters, ways of representing numbers and decimal points etc.

As Erlang does not (yet) have a way of specifying such language- or region-specific properties of a string, the interface should not contain language-dependent functionality.
The ``string`` module did not originally contain such functions (except that character alignment functions were named ``left`` and ``right``), but unfortunately functions like ``to_float`` and ``to_upper`` have been added. I think that having language-dependent functions in the ``string`` module was a mistake and I do not want to make that mistake again. Hence I have not included such functions or names in ``bstring``.

I rather suggest "Locale" functionality as the subject of a future EEP. For those who consider that simple, try to write a correct ``to_upper`` function for just all European languages, and make sure it works on all platforms that can run Erlang... Maybe not rocket science, but a _lot_ of metadata is required. That data is not always available in the underlying OS, and probably needs to be distributed with Erlang/OTP for consistent functionality. Definitely worth its own EEP.

Word separators
---------------

In connection with language independence, I think we should drop the notion of _words_ as groups of characters separated by space. The word "token" is more general and does not in the same way indicate language constructs. The ``string`` module has the ASCII space character as a default for word separation, which I think should be dropped in ``bstring``. Whatever should separate tokens should be supplied, possibly as alternatives.

I therefore suggest the functions ``bstring:num_tokens`` and ``bstring:nth_token`` to fulfill the functionality of ``string:words`` and ``string:sub_word``.

As in [EEP 9][], I suggest a new function ``split`` to handle the case of multi-character separators for tokens. A combination of ``split`` and ``join`` makes a convenient ``replace`` function too.
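The composition of ``split`` and ``join`` into ``replace`` can be sketched for the ISO-Latin-1 (raw byte) case on top of the existing ``binary`` module. This is only an illustration under stated assumptions: the module name ``replace_sketch`` and its three functions are hypothetical, the encoding arguments and the ``Where`` parameter are left out, and empty tokens between adjacent separators are kept, which this EEP does not specify.

```erlang
-module(replace_sketch).
-export([split/2, join/2, replace/3]).

%% Hypothetical helper: split Bin at every occurrence of any of the
%% Separators, treating the data as raw bytes (the latin1 case).
%% Separators must be a non-empty list of non-empty binaries.
split(Bin, Separators) when is_binary(Bin), is_list(Separators) ->
    binary:split(Bin, Separators, [global]).

%% Hypothetical helper: concatenate Parts with Separator between
%% neighbouring elements.
join([], _Separator) ->
    <<>>;
join([First | Rest], Separator) when is_binary(Separator) ->
    lists:foldl(fun(Part, Acc) ->
                        <<Acc/binary, Separator/binary, Part/binary>>
                end, First, Rest).

%% replace is then nothing more than split followed by join.
replace(Bin, Separators, Replacement) ->
    join(split(Bin, Separators), Replacement).
```

For raw bytes, ``replace_sketch:replace(Bin, Separators, Replacement)`` should agree with ``binary:replace(Bin, Separators, Replacement, [global])``. A dedicated ``bstring:replace`` would additionally have to honour the encodings of its arguments.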
Left-to-right or Right-to-left
------------------------------

As mentioned earlier, I don't think the direction of the graphical representation should be implied in the interface, which is why I suggest using notions like leading and trailing (meaning leading and trailing characters in the binary) rather than any directional notions. I also think aligning strings (like in ``string:right`` etc.) could be solved in one function ``align``, taking one of the atoms ``leading``, ``trailing`` or ``center`` as a parameter, if it should be implemented at all.

Naming and duplicated functionality
-----------------------------------

I definitely do not think we should have all interfaces from ``string`` duplicated in ``bstring``. In particular, interfaces that are aliases should not be carried along to the ``bstring`` module.

Most functions in the ``string`` module, however, have short and fairly descriptive names, often similar to names found in other languages. I think using an ``r`` prefix for functionality working from the end of the string towards the beginning is a good choice, as is ``c`` for complement.

Byte oriented versus character oriented return values
-----------------------------------------------------

Some functions in ``string``, that are certainly useful, return numbers denoting character positions. The same functions should definitely be present in the ``bstring`` module and the return values should definitely be character oriented. However, byte offsets are also useful. For example, if we use a function like ``span`` to find the first character not in a set of characters, we might want the byte offset of that first character too. I suggest adding some interfaces returning byte offsets, or _part()'s_ like the ones used in the ``binary`` module and by ``re``, to cope with the need for byte offsets and lengths in some circumstances.
A ``b`` suffix to the function name could denote such functionality, so that ``bstring:span`` returns a character position while ``bstring:spanb`` returns a byte position, and ``bstring:str`` returns a character position while ``bstring:strb`` returns a _part()_. Although this will in the end give rise to more functions in the interface, having return-type-changing options in an option list is not the way to go (I know, I have them in ``re``, but it's still not generally a good idea...).

New or replaced functionality
-----------------------------

When writing a general string module, there is no end to the new, more or less esoteric, functionality one could add. I think we should, at least in an initial implementation, stick to the functionality outlined in [EEP 9][], namely: extending ``str`` and friends to optionally take a list of alternative strings to search for, adding a function ``split`` to take care of multi-character separators (as opposed to the single character separators in the function ``tokens``), and adding a substitution function, which I think should be named ``replace`` as in other modules.

The use of pre-compiled matches from the ``binary`` module is however not a good idea, as the ``binary`` module has no notion of character encoding. Search strings need to be given in defined character encodings, and both the "haystack's" and the "needle's" encoding need to be known when doing an efficient search. So - no pre-compiled search expressions.

Excerpt of a suggested manual page
----------------------------------

As made obvious above, I prefer the name ``bstring`` for a binary string module over the more verbose name ``binary_string`` originally suggested.

In that module ``bstring``, I suggest the following interfaces, expressed as in a manual page of OTP.
## DATA TYPES

    encoding() = latin1 | unicode | utf8

- The encoding of characters in the binary data, both input and output

        bstring()

- Binary with characters encoded either in ISO-Latin-1 or UTF-8

        unicode_char() = non_negative_integer()

- An integer representing a valid unicode codepoint

        non_negative_integer()

- An integer >= 0

## EXPORTS

### ``align(BString, Alignment, Number, Char) -> Result``
### ``align(BString, Encoding, Alignment, Number, Char) -> Result``

Types:

    BString = Result = bstring()
    Encoding = encoding()
    Alignment = leading | trailing | center
    Number = non_negative_integer()
    Char = unicode_char()

Aligns the characters in ``BString`` in a ``Result`` of ``Number`` characters according to the ``Alignment`` parameter. Alignment is done by inserting the character ``Char`` at the beginning or end (or both) of the binary string.

The resulting binary string will contain exactly ``Number`` characters. The string is truncated if it contains more characters than ``Number`` - either at the end if ``Alignment`` is ``leading``, or at the beginning if ``Alignment`` is ``trailing``, or at both ends if ``Alignment`` is ``center``.

If ``Encoding`` is ``unicode``, the ``Result`` may well contain more bytes than ``Number``, as one character may require several bytes.

Example:

    > bstring:align(<<"Hello">>, latin1, center, 10, $.).
    <<"..Hello...">>

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, ``Encoding`` or ``Alignment`` has an invalid value, the character ``Char`` cannot be encoded in the character encoding given as ``Encoding`` or any of the parameters are of the wrong type.
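The description of ``align`` can be turned into a small sketch using the existing ``unicode`` module for decoding and encoding. The module name ``align_sketch`` and the helpers ``to_chars``, ``fit`` and ``from_chars`` are hypothetical. How the padding and the truncation are divided between the two ends for ``center`` is an assumption, chosen so that the example above is reproduced. A real implementation would most likely avoid building the intermediate list.

```erlang
-module(align_sketch).
-export([align/5]).

%% Decode to a list of codepoints, pad or truncate on character level,
%% then encode again in the same encoding.
align(BString, Encoding, Alignment, Number, Char)
  when is_binary(BString), is_integer(Number), Number >= 0,
       is_integer(Char) ->
    Chars = to_chars(BString, Encoding),
    Fitted = fit(Chars, length(Chars), Number, Alignment, Char),
    from_chars(Fitted, Encoding);
align(_, _, _, _, _) ->
    erlang:error(badarg).

%% leading: text first, padding (or truncation) at the end.
fit(Chars, Len, Number, leading, Char) when Len =< Number ->
    Chars ++ lists:duplicate(Number - Len, Char);
fit(Chars, _Len, Number, leading, _Char) ->
    lists:sublist(Chars, Number);
%% trailing: text last, padding (or truncation) at the beginning.
fit(Chars, Len, Number, trailing, Char) when Len =< Number ->
    lists:duplicate(Number - Len, Char) ++ Chars;
fit(Chars, Len, Number, trailing, _Char) ->
    lists:nthtail(Len - Number, Chars);
%% center: an odd padding or truncation puts the extra character at
%% the end (assumption).
fit(Chars, Len, Number, center, Char) when Len =< Number ->
    Before = (Number - Len) div 2,
    After = Number - Len - Before,
    lists:duplicate(Before, Char) ++ Chars ++ lists:duplicate(After, Char);
fit(Chars, Len, Number, center, _Char) ->
    lists:sublist(lists:nthtail((Len - Number) div 2, Chars), Number);
fit(_, _, _, _, _) ->
    erlang:error(badarg).

to_chars(Bin, Encoding) ->
    case unicode:characters_to_list(Bin, Encoding) of
        List when is_list(List) -> List;
        _ -> erlang:error(badarg)      % wrongly encoded input
    end.

from_chars(Chars, Encoding) ->
    case unicode:characters_to_binary(Chars, unicode, Encoding) of
        Bin when is_binary(Bin) -> Bin;
        _ -> erlang:error(badarg)      % e.g. Char > 255 with latin1
    end.
```

With a UTF-8 encoded "å" (codepoint 229) as input, ``align_sketch:align(<<229/utf8>>, unicode, trailing, 3, $.)`` yields three characters in four bytes, which illustrates why ``Number`` counts characters rather than bytes.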
### ``chr(BString, Character) -> Position``
### ``chr(BString, Encoding, Character) -> Position``
### ``rchr(BString, Character) -> Position``
### ``rchr(BString, Encoding, Character) -> Position``

Types:

    BString = bstring()
    Encoding = encoding()
    Character = unicode_char()
    Position = integer()

Returns the (zero-based) character position of the first/last occurrence of ``Character`` in ``BString``. ``-1`` is returned if ``Character`` does not occur.

Note that the character position is not the same as the byte position. Use the ``chrb`` and ``rchrb`` functions to get the byte positions.

If ``Character`` cannot be represented in the encoding, it is not an error, you are just certain to get ``-1`` as a return value.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the searched part of ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.

### ``chrb(BString, Character) -> {BytePosition, ByteLength}``
### ``chrb(BString, Encoding, Character) -> {BytePosition, ByteLength}``
### ``rchrb(BString, Character) -> {BytePosition, ByteLength}``
### ``rchrb(BString, Encoding, Character) -> {BytePosition, ByteLength}``

Types:

    BString = bstring()
    Encoding = encoding()
    Character = unicode_char()
    BytePosition = integer()
    ByteLength = non_negative_integer()

Works as ``chr`` and ``rchr`` respectively, but returns the byte position and byte length of the character. If the character is not found, ``{-1,0}`` is returned.

### ``concat(BString1, BString2) -> BString3``
### ``concat(BString1, Encoding1, BString2, Encoding2, Encoding3) -> BString3``

Types:

    BString1 = BString2 = BString3 = bstring()
    Encoding1 = Encoding2 = Encoding3 = encoding()

Concatenates two binary strings to form a new string.
Returns the new binary string in the encoding given by ``Encoding3``.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString1`` or ``BString2`` does not contain characters encoded according to the ``Encoding1`` and ``Encoding2`` parameters respectively, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding or any of the parameters are of the wrong type.

### ``equal(BString1, BString2) -> bool()``
### ``equal(BString1, Encoding1, BString2, Encoding2) -> bool()``

Types:

    BString1 = BString2 = bstring()
    Encoding1 = Encoding2 = encoding()

Tests whether two binary strings are equal. Returns ``true`` if they are, otherwise ``false``. ``Encoding1`` is the encoding of ``BString1`` and ``Encoding2`` is the encoding of ``BString2``.

Note that the strings can have different encodings and that it is the character values encoded in the strings that are compared.

The binary strings are scanned only as long as they are equal. This means that if the function returns ``true``, both strings are correctly encoded, while a return value of ``false`` does not guarantee correct encoding in both binary strings. An exception is raised if faulty encoding is detected while comparing the strings, but not if the parts of the strings that were never inspected contain encoding errors.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if wrongly encoded characters, according to the encoding parameters, are encountered during comparison, the encoding parameters have an invalid value or any of the parameters are of the wrong type.
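The mixed-encoding behaviour of ``concat`` and ``equal`` can be sketched as follows. The module name ``mixed_sketch`` and the helpers ``convert`` and ``next`` are hypothetical, and this is an illustration rather than the suggested implementation. ``concat`` converts both arguments with ``unicode:characters_to_binary/3``, while ``equal`` decodes one character at a time so that scanning stops at the first difference, as described above.

```erlang
-module(mixed_sketch).
-export([concat/5, equal/4]).

%% Convert both strings to the result encoding, then concatenate.
concat(BString1, Encoding1, BString2, Encoding2, Encoding3) ->
    <<(convert(BString1, Encoding1, Encoding3))/binary,
      (convert(BString2, Encoding2, Encoding3))/binary>>.

convert(Bin, In, Out) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, In, Out) of
        Result when is_binary(Result) -> Result;
        _ -> erlang:error(badarg)   % bad input or unrepresentable output
    end;
convert(_, _, _) ->
    erlang:error(badarg).

%% Compare codepoint by codepoint; stop at the first difference.
equal(<<>>, _Encoding1, <<>>, _Encoding2) ->
    true;
equal(BString1, Encoding1, BString2, Encoding2) ->
    case {next(BString1, Encoding1), next(BString2, Encoding2)} of
        {{Char, Rest1}, {Char, Rest2}} ->
            equal(Rest1, Encoding1, Rest2, Encoding2);
        _ ->
            false                   % different character or length
    end.

%% Decode the next character according to the given encoding.
next(<<>>, _Encoding) ->
    eof;
next(<<Char, Rest/binary>>, latin1) ->
    {Char, Rest};
next(<<Char/utf8, Rest/binary>>, Encoding)
  when Encoding =:= unicode; Encoding =:= utf8 ->
    {Char, Rest};
next(_, _) ->
    erlang:error(badarg).           % faulty encoding or bad argument
```

Under these assumptions, the single ISO-Latin-1 byte 229 ("å") compares equal to its two-byte UTF-8 form ``<<195,165>>``, and concatenating two ISO-Latin-1 strings into ``unicode`` needs no separate call to the ``unicode`` module by the caller.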
### ``join(BStringList, Separator) -> Result``
### ``join(BStringList, BStringListEncoding, Separator, SeparatorEncoding, ResultEncoding) -> Result``

Types:

    BStringList = [bstring()]
    BStringListEncoding = SeparatorEncoding = ResultEncoding = encoding()
    Separator = bstring()
    Result = bstring()

Returns a binary string with the elements of ``BStringList`` separated by the binary string in ``Separator``.

All the binary strings in ``BStringList`` should have the same encoding (given as ``BStringListEncoding``). The ``Separator`` can however have a different encoding (given as ``SeparatorEncoding``), as can the ``Result`` (given as ``ResultEncoding``).

Example:

    > bstring:join([<<"one">>, <<"two">>, <<"three">>], latin1, <<", ">>, latin1, latin1).
    <<"one, two, three">>

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if binary strings in ``BStringList`` or the ``Separator`` do not contain characters encoded according to the ``BStringListEncoding`` and ``SeparatorEncoding`` parameters respectively, the encoding parameters have an invalid value, the codepoints in the in-parameters cannot be represented in the output encoding ``ResultEncoding`` or any of the parameters are of the wrong type.

### ``len(BString) -> Length``
### ``len(BString, Encoding) -> Length``

Types:

    BString = bstring()
    Encoding = encoding()
    Length = non_negative_integer()

Returns the number of characters in the binary string.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value or any of the parameters are of the wrong type.
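The difference between the two encodings shows up already in ``len``. In the sketch below (the module name ``len_sketch`` and the helper ``count`` are hypothetical), the ISO-Latin-1 case is a constant time ``byte_size/1`` call with no data checking, while the UTF-8 case has to decode and thereby validate every character.

```erlang
-module(len_sketch).
-export([len/2]).

%% latin1: one byte per character, any byte value is acceptable.
len(BString, latin1) when is_binary(BString) ->
    byte_size(BString);
%% unicode/utf8: walk the binary one codepoint at a time.
len(BString, Encoding)
  when is_binary(BString), (Encoding =:= unicode orelse Encoding =:= utf8) ->
    count(BString, 0);
len(_, _) ->
    erlang:error(badarg).

count(<<>>, N) ->
    N;
count(<<_/utf8, Rest/binary>>, N) ->
    count(Rest, N + 1);
count(_, _) ->
    erlang:error(badarg).   % not correctly encoded UTF-8
```

The same two bytes ``<<195,165>>`` are thus two characters when read as ``latin1`` but one character ("å") when read as ``unicode``.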
### ``nth_token(BString, N, CharList) -> Result``
### ``nth_token(BString, Encoding, N, CharList) -> Result``

Types:

    BString = Result = bstring()
    Encoding = encoding()
    CharList = [ unicode_char() ]
    N = non_negative_integer()

Returns token number ``N`` (zero-based) of ``BString``. Tokens are separated by the characters in ``CharList``. The returned token will have the same encoding as ``BString``.

For example:

    > bstring:nth_token(<<" Hello old boy !">>,latin1,3,[$o, $ ]).
    <<"y">>

``CharList`` is to be viewed as a _set_ of characters, order is not significant. It is not an error if codepoints given in ``CharList`` cannot be represented in the ``Encoding``.

Values of ``N`` >= the number of tokens in ``BString`` will result in the empty binary string ``<<>>`` being returned.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.

### ``num_tokens(BString, CharList) -> Count``
### ``num_tokens(BString, Encoding, CharList) -> Count``

Types:

    BString = bstring()
    Encoding = encoding()
    CharList = [ unicode_char() ]
    Count = non_negative_integer()

Returns the number of tokens in ``BString``, separated by the characters in ``CharList``. The result is the same as for ``length(bstring:tokens(BString,Encoding,CharList))``, but without building the list of tokens.

For example:

    > bstring:num_tokens(<<" Hello old boy!">>, latin1, [$o, $ ]).
    4

``CharList`` is to be viewed as a _set_ of characters, order is not significant. It is not an error if codepoints given in ``CharList`` cannot be represented in the ``Encoding``.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.
Raises a ``badarg`` exception if ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.

### ``span(BString, Chars) -> Length``
### ``span(BString, Encoding, Chars) -> Length``
### ``rspan(BString, Chars) -> Length``
### ``rspan(BString, Encoding, Chars) -> Length``
### ``cspan(BString, Chars) -> Length``
### ``cspan(BString, Encoding, Chars) -> Length``
### ``rcspan(BString, Chars) -> Length``
### ``rcspan(BString, Encoding, Chars) -> Length``

Types:

    BString = bstring()
    Encoding = encoding()
    Chars = [ integer() ]
    Length = non_negative_integer()

Returns the length (in characters) of the maximum initial (``span`` and ``cspan``) or trailing (``rspan`` and ``rcspan``) segment of ``BString`` which consists entirely of characters from (``span`` and ``rspan``), or not from (``cspan`` and ``rcspan``), ``Chars``.

``Chars`` is to be viewed as a _set_ of characters, order is not significant. It is not an error if codepoints given in ``Chars`` cannot be represented in the ``Encoding``.

For example:

    > bstring:span(<<"\t    abcdef">>,latin1," \t").
    5
    > bstring:cspan(<<"\t    abcdef">>,latin1, " \t").
    0

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the searched part of ``BString`` does not contain characters encoded according to the ``Encoding`` parameter, the ``Encoding`` has an invalid value, or any of the parameters are of the wrong type.
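A character oriented ``span`` and a byte oriented counterpart can share one scanning loop, as the following sketch shows. It covers only the forward, non-complemented case. The module name ``span_sketch``, the byte counting function ``spanb/3`` as written here, and the helpers ``scan``, ``next`` and ``strip_leading`` are all hypothetical illustrations, not the suggested implementation.

```erlang
-module(span_sketch).
-export([span/3, spanb/3, strip_leading/3]).

%% Number of leading characters that belong to the set Chars.
span(BString, Encoding, Chars) ->
    {CharCount, _ByteCount} = scan(BString, Encoding, Chars, 0, 0),
    CharCount.

%% Number of bytes occupied by those leading characters.
spanb(BString, Encoding, Chars) ->
    {_CharCount, ByteCount} = scan(BString, Encoding, Chars, 0, 0),
    ByteCount.

%% A byte count combines directly with bit syntax.
strip_leading(BString, Encoding, Chars) ->
    Skip = spanb(BString, Encoding, Chars),
    <<_:Skip/binary, Rest/binary>> = BString,
    Rest.

%% Only the scanned part of the binary is checked for correct encoding.
scan(Bin, Encoding, Chars, CharCount, ByteCount) ->
    case next(Bin, Encoding) of
        {Char, Size, Rest} ->
            case lists:member(Char, Chars) of
                true ->
                    scan(Rest, Encoding, Chars,
                         CharCount + 1, ByteCount + Size);
                false ->
                    {CharCount, ByteCount}
            end;
        eof ->
            {CharCount, ByteCount}
    end.

%% Decode one character and report how many bytes it occupied.
next(<<>>, _Encoding) ->
    eof;
next(<<Char, Rest/binary>>, latin1) ->
    {Char, 1, Rest};
next(<<Char/utf8, Rest/binary>> = Bin, Encoding)
  when Encoding =:= unicode; Encoding =:= utf8 ->
    {Char, byte_size(Bin) - byte_size(Rest), Rest};
next(_, _) ->
    erlang:error(badarg).
```

For a UTF-8 string beginning with two "å" characters (codepoint 229), the character count is 2 while the byte count is 4, which is the distinction the two kinds of return values are meant to expose.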
### ``spanb(BString, Chars) -> ByteLength``
### ``spanb(BString, Encoding, Chars) -> ByteLength``
### ``rspanb(BString, Chars) -> ByteLength``
### ``rspanb(BString, Encoding, Chars) -> ByteLength``
### ``cspanb(BString, Chars) -> ByteLength``
### ``cspanb(BString, Encoding, Chars) -> ByteLength``
### ``rcspanb(BString, Chars) -> ByteLength``
### ``rcspanb(BString, Encoding, Chars) -> ByteLength``

Types:

    BString = bstring()
    Encoding = encoding()
    Chars = [ integer() ]
    ByteLength = non_negative_integer()

Work exactly as the functions ``span``, ``rspan``, ``cspan`` and ``rcspan`` respectively, but return the number of bytes rather than the number of characters.

### ``split(BString, Separators, Where) -> Tokens``
### ``split(BString, Encoding, Separators, SepEncoding, Where, ReturnEncoding) -> Tokens``

Types:

    BString = bstring()
    Encoding = SepEncoding = ReturnEncoding = encoding()
    Separators = [ bstring() ]
    Where = first | last | all
    Tokens = [bstring()]

Returns a list of tokens in ``BString``, separated by the binary strings in ``Separators``. The ``Tokens`` returned are encoded according to ``ReturnEncoding``.

Example:

    > bstring:split(<<"abc defxxghix jkl">>, latin1, [<<"x">>,<<" ">>],latin1,all,latin1).
    [<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]

``Separators`` is to be viewed as a _set_ of binary strings, order is not significant. It is not an error if codepoints given in ``Separators`` cannot be represented in the ``Encoding``.

The ``Where`` parameter specifies at which occurrence of any of the ``Separators`` the binary string is to be split: at the ``first`` occurrence, at the ``last`` occurrence or at ``all`` occurrences, in which case ``Tokens`` may be an arbitrarily long list.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.
Raises a ``badarg`` exception if ``BString`` or ``Separators`` does not contain characters encoded according to the ``Encoding`` and ``SepEncoding`` parameters respectively, the resulting tokens cannot be encoded in the ``ReturnEncoding``, any of the encoding parameters has an invalid value, or any of the parameters are of the wrong type.

### ``str(BString, SubBStrings) -> Position``
### ``str(BString, Encoding, SubBStrings, SubEnc) -> Position``
### ``rstr(BString, SubBStrings) -> Position``
### ``rstr(BString, Encoding, SubBStrings, SubEnc) -> Position``

Types:

    BString = bstring()
    SubBStrings = bstring() | [ bstring() ]
    Encoding = SubEnc = encoding()
    Position = integer()

Returns the (zero-based) character position where the first/last occurrence of any of the ``SubBStrings`` begins in ``BString``. ``-1`` is returned if none of the ``SubBStrings`` occurs in ``BString``.

Note that the character position is not the same as the byte position. Use the ``strb`` and ``rstrb`` functions to get the byte positions.

The encoding need not be the same for ``BString`` and ``SubBStrings``, however all strings in ``SubBStrings`` need to have the same encoding. If the codepoints in ``SubBStrings`` cannot be represented in the encoding of ``BString``, that is not an error, but will always result in the return value ``-1``.

Example:

    > bstring:str(<<" Hello Hello World World ">>,latin1,<<"Hello World">>,latin1).
    7

Note that if both encodings are the same and repeated searches with the same ``SubBStrings`` are to be performed, it is more efficient to use the ``binary:match/{2,3}`` functions with a precompiled pattern on the raw binary data.

If the encoding is not given, it is assumed to be ``latin1``, implying that no interpretation is given to the bytes in the binary string.
Raises a ``badarg`` exception if the searched part of ``BString`` or
``SubBStrings`` does not contain characters encoded according to the
``Encoding`` and ``SubEnc`` parameters respectively, the ``Encoding``
has an invalid value, or any of the parameters are of the wrong type.

### ``strb(BString, SubBStrings) -> {BytePosition, ByteLength}``
### ``strb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}``
### ``rstrb(BString, SubBStrings) -> {BytePosition, ByteLength}``
### ``rstrb(BString, Encoding, SubBStrings, SubEnc) -> {BytePosition, ByteLength}``

Types:

    BString = bstring()
    SubBStrings = bstring() | [ bstring() ]
    Encoding = SubEnc = encoding()
    BytePosition = integer()
    ByteLength = non_negative_integer()

Work as ``str`` and ``rstr`` respectively, but return the byte
position and byte length of the found substring. Note that
``ByteLength`` is the length the found substring has in ``BString``,
regardless of the encoding of ``SubBStrings``, so ``ByteLength`` may
be either larger or smaller than the byte size of the matching entry
in ``SubBStrings``, depending on the binary string's encoding.

If the substring is not found, ``{-1,0}`` is returned.

### ``strip(BString, Which, CharList) -> Result``
### ``strip(BString, Encoding, Which, CharList) -> Result``

Types:

    BString = Result = bstring()
    Encoding = encoding()
    Which = leading | trailing | both
    CharList = [ unicode_char() ]

Removes leading (``Which`` = ``leading``), trailing (``Which`` =
``trailing``) or both leading and trailing (``Which`` = ``both``)
characters belonging to the set indicated by ``CharList`` from the
binary string ``BString``. This is essentially the same as using
``spanb`` and/or ``rspanb`` in combination with bit syntax to remove
the characters.

Example:

    > bstring:strip(<<"...He.llo.....">>, latin1, both, [$.]).
    <<"He.llo">>

If the encoding is not given, it is assumed to be ``latin1``, implying
that no interpretation is given to the bytes in the binary string.
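The remark that ``strip`` amounts to ``spanb``/``rspanb`` combined with bit syntax can be made concrete for the ``latin1`` case, where every character is one byte. The sketch below is an assumption-laden illustration: the module name ``bstrip_sketch`` is hypothetical, the functions take no ``Encoding`` argument, and no argument checking (and hence no ``badarg`` handling) is done.

```erlang
-module(bstrip_sketch).
-export([spanb/2, rspanb/2, strip/3]).

%% Number of leading bytes that are members of Chars.
spanb(BString, Chars) ->
    spanb(BString, Chars, 0).

spanb(<<C, Rest/binary>>, Chars, N) ->
    case lists:member(C, Chars) of
        true -> spanb(Rest, Chars, N + 1);
        false -> N
    end;
spanb(<<>>, _Chars, N) ->
    N.

%% Number of trailing bytes that are members of Chars.
rspanb(BString, Chars) ->
    rspanb(BString, Chars, byte_size(BString), 0).

rspanb(_BString, _Chars, 0, N) ->
    N;
rspanb(BString, Chars, Left, N) ->
    case lists:member(binary:at(BString, Left - 1), Chars) of
        true -> rspanb(BString, Chars, Left - 1, N + 1);
        false -> N
    end.

%% strip/3 cuts the spans off with bit syntax.
strip(BString, leading, Chars) ->
    Lead = spanb(BString, Chars),
    <<_:Lead/binary, Rest/binary>> = BString,
    Rest;
strip(BString, trailing, Chars) ->
    Keep = byte_size(BString) - rspanb(BString, Chars),
    <<Kept:Keep/binary, _/binary>> = BString,
    Kept;
strip(BString, both, Chars) ->
    strip(strip(BString, leading, Chars), trailing, Chars).
```

Because the result is a sub-binary of the input, stripping in this style copies no character data.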
Raises a ``badarg`` exception if the scanned part of ``BString`` does
not contain characters encoded according to the ``Encoding``
parameter, ``Encoding`` or ``Which`` has an invalid value, or any of
the parameters are of the wrong type.

### ``replace(BString, Separators, Replacement, Where) -> Result``
### ``replace(BString, Encoding, Separators, SeparatorsEncoding, Replacement, ReplacementEncoding, Where, ResultEncoding) -> Result``

Types:

    BString = bstring()
    Encoding = SeparatorsEncoding = ReplacementEncoding = ResultEncoding = encoding()
    Separators = [ bstring() ]
    Replacement = bstring()
    Where = first | last | all
    Result = bstring()

Produces the same result as

    bstring:join(bstring:split(BString,Encoding,Separators,SeparatorsEncoding,Where,
                               unicode),
                 unicode,Replacement,ReplacementEncoding,ResultEncoding)

but with less overhead.

### ``substr(BString, Start, Length) -> SubBString``
### ``substr(BString, Encoding, Start, Length) -> SubBString``

Types:

    BString = SubBString = bstring()
    Encoding = encoding()
    Start = integer()
    Length = non_negative_integer() | infinity

Returns a substring of ``BString``, starting at the zero-based
character position ``Start``, and ending at the end of the binary
string (if ``Length`` is ``infinity``) or up to, but not including,
the character position ``Start+Length`` (if ``Length`` is a
non-negative integer). The returned ``SubBString`` will have the same
encoding as ``BString``.

Example:

    > bstring:substr(<<"Hello World">>, latin1, 3, 5).
    <<"lo Wo">>

A negative value of ``Start`` denotes ``abs(Start)`` characters from
the _end_ of ``BString``, so that ``-1`` is the last character
position in the binary string.

Example:

    > bstring:substr(<<"Hello World">>, latin1, -3, 3).
    <<"rld">>

As the true length of a UTF-8 encoded binary string is quite costly to
determine (``O(N)``, where ``N`` is the number of bytes in the
binary), the function is very forgiving about positions given outside
of the string, for both ``Start`` and ``Length``.
Character positions outside of the string in either direction are
collapsed to the empty binary string.

Examples:

    > bstring:substr(<<"01234">>, latin1, 5, 5).
    <<>>
    > bstring:substr(<<"01234">>, latin1, 4, 5).
    <<"4">>
    > bstring:substr(<<"01234">>, latin1, -5, 100).
    <<"01234">>
    > bstring:substr(<<"01234">>, latin1, -6, 1).
    <<>>
    > bstring:substr(<<"01234">>, latin1, -6, 2).
    <<"0">>

If the encoding is not given, it is assumed to be ``latin1``, implying
that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the searched part of ``BString`` does
not contain characters encoded according to the ``Encoding``
parameter, the ``Encoding`` has an invalid value, or any of the
parameters are of the wrong type.

### ``tokens(BString, SeparatorList) -> Tokens``
### ``tokens(BString, Encoding, SeparatorList) -> Tokens``

Types:

    BString = bstring()
    Encoding = encoding()
    SeparatorList = [ non_negative_integer() ]
    Tokens = [ bstring() ]

Returns a list of tokens in ``BString``, separated by the characters
in ``SeparatorList``. The ``Tokens`` returned are encoded in the same
character encoding as the ``BString``.

Example:

    > bstring:tokens(<<"abc defxxghix jkl">>, latin1, [$x,$ ]).
    [<<"abc">>, <<"def">>, <<"ghi">>, <<"jkl">>]

``SeparatorList`` is to be viewed as a _set_ of characters; order is
not significant. It is not an error for ``SeparatorList`` to contain
codepoints that cannot be represented in ``Encoding``.

If the encoding is not given, it is assumed to be ``latin1``, implying
that no interpretation is given to the bytes in the binary string.

Raises a ``badarg`` exception if the searched part of ``BString`` does
not contain characters encoded according to the ``Encoding``
parameter, the ``Encoding`` has an invalid value, or any of the
parameters are of the wrong type.
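To illustrate how the ``Encoding`` argument changes the scanning in ``tokens/3``, the following sketch walks the binary byte by byte for ``latin1`` and codepoint by codepoint for ``unicode`` (UTF-8), using the ``utf8`` bit syntax type. It is a sketch under assumptions rather than the proposed implementation: the module name ``btokens_sketch`` and its helpers are hypothetical, and only the two encoding atoms used in this EEP's examples are handled.

```erlang
-module(btokens_sketch).
-export([tokens/3]).

tokens(BString, latin1, Seps) ->
    scan_latin1(BString, Seps, <<>>, []);
tokens(BString, unicode, Seps) ->
    scan_utf8(BString, Seps, <<>>, []).

%% latin1: every byte is one character.
scan_latin1(<<C, Rest/binary>>, Seps, Cur, Acc) ->
    case lists:member(C, Seps) of
        true -> scan_latin1(Rest, Seps, <<>>, push(Cur, Acc));
        false -> scan_latin1(Rest, Seps, <<Cur/binary, C>>, Acc)
    end;
scan_latin1(<<>>, _Seps, Cur, Acc) ->
    lists:reverse(push(Cur, Acc)).

%% UTF-8: decode one codepoint at a time; tokens stay UTF-8 encoded.
scan_utf8(<<C/utf8, Rest/binary>>, Seps, Cur, Acc) ->
    case lists:member(C, Seps) of
        true -> scan_utf8(Rest, Seps, <<>>, push(Cur, Acc));
        false -> scan_utf8(Rest, Seps, <<Cur/binary, C/utf8>>, Acc)
    end;
scan_utf8(<<>>, _Seps, Cur, Acc) ->
    lists:reverse(push(Cur, Acc));
scan_utf8(_Invalid, _Seps, _Cur, _Acc) ->
    %% The remaining bytes do not start with a valid UTF-8 sequence.
    erlang:error(badarg).

%% Empty tokens are dropped, as in the EEP example.
push(<<>>, Acc) -> Acc;
push(Token, Acc) -> [Token | Acc].
```

In the ``latin1`` clause a separator above 255 simply never matches any byte, which is consistent with such codepoints not being an error.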
Performance
===========

This module can, and probably should, be implemented entirely in
Erlang; no BIFs or NIFs are needed. Both the ``binary`` and
``unicode`` modules can be utilized to speed up conversion and
checking of input data.

The Unicode versions will definitely be slower than the ISO-Latin-1
versions, as character encoding, decoding and checking are bound to
produce overhead. The suggested wrapper ``ubstring`` should not impose
any significant cost compared to calling ``bstring`` with all encoding
arguments set to ``unicode``.

The idea is to make string manipulation using binaries convenient, as
binaries greatly reduce a system's memory use. Increased speed
compared to list-oriented strings is not the goal, although it may
well be a side effect.

Reference implementation
========================

No specific reference implementation has been made; the code will,
however, be made available on GitHub during any development.

Copyright
=========

This document is licensed under the [Creative Commons license][CCA3.0].

[EEP 9]: eep-0009.md
    "EEP 9, the original work from which this EEP is derived"

[EEP 11]: eep-0011.md
    "EEP 11, interesting extensions to EEP 9"

[EEP 31]: eep-0031.md
    "EEP 31, rewrite of EEP 9, module binary"

[CCA3.0]: http://creativecommons.org/licenses/by/3.0/
    "Creative Commons Attribution 3.0 License"

[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"