EEP: 9
Title: Library for working with binaries
Version: $Revision: 37 $
Last-Modified: $Date: 2008-07-11 15:47:13 +0200 (Fri, 11 Jul 2008) $
Author: Fredrik Svahn [Fredrik(dot)Svahn(at)gmail]
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 28-Dec-2007
Erlang-Version: R12B-2
Post-History: 

Abstract

    This EEP suggests the addition of two binary help libraries with built-in
    functions for time critical activities such as searching and splitting
    erlang binaries as well as library functions for common operations on
    binaries. The EEP also suggest the addition of a regular expressions
    library using built in functions.


Rationale

    For the lists data type there is a help library providing functions for
    common operations such as searching and splitting lists.  This EEP suggests
    that a similar set of library functions should be created for binaries.
    Many of the proposed functions are based on answers to questions regarding
    binaries on the erlang-questions mailing list, e.g. "how do I convert a
    number to a binary?". 

    

Motivation

    Since binaries are typically used for time critical activities on larger
    amounts of data it is suggested that some operations on binaries are
    implemented as built-in functions, BIF:s.

    Specifically there seems to be a huge interest in the community for an 
    efficient regexp implementation for searching binaries. Also for maximum
    performance when searching and splitting binaries it is suggested that the
    the regexp search function is complemented by a high performance function
    for simple searches, e.g. locating and splitting binaries on newline
    characters. Tests show that e.g. the Boyer-Moore algorithm may be 
    significantly faster than regular expression algorithms for such purposes.

    When reviewing the EEP it is clear that there is also a strong demand for
    string operations on binaries for better performance. 

    The reference implementation sent separately to the OTP team gives 
    an indication of the expected performance improvements compared to e.g.
    the current regular expression module for searching on lists. Some results
    are available at the end of this EEP.


Suggested Changes

    This EEP suggests the addition of two new modules; one module named
    binary and one called binary_string.

    The EEP also suggests a new regular expression library based on Perl
    Compatible Regular Expressions (PCRE). The library should be able to 
    operate both on binary_strings and on strings.

    Finally, the following functions should be added to the erlang module:

        binary_to_atom(Binary) -> Atom
        atom_to_binary(Atom) -> Binary
        binary_to_existing_atom(Binary) -> Atom


Not Included

    At the moment the following is not included in the EEP:
    - Support for different encodings, e.g. UTF-8
    - Changes to the string module


The "binary_string" Module

    The binary_string module should be based on the current string module but 
    should operate on strings represented by binaries as opposed to the 
    current strings module which operates on strings represented by lists.

    Apart from operating on binaries the interface of binary_string should be 
    the same as for string with the following exceptions:

        1.  str/2 and rstr/2 should be modified to optionally take a list of 
            binaries or a MatchSpec such as the one returned by 
            binary:match_compile/2 as second argument. If the Keys argument 
            corresponds to several keys the function should return a tuple
            indicating the Key that matched and the matching Index, i.e. 

        str(Binary, Keys) -> Return
        rstr(Binary, Keys) -> Return
    
            Binary = binary()
            Keys = Key | [ Key ] | MatchSpec
            Key = string() | binary()
            MatchSpec = tuple() as returned by binary:match_compile/1
            Return = Index | {NeedleNumber, Index}
            Index = integer()

            str/rstr should be implemented as built-in functions using 
            efficient algorithms such as Boyer-Moore, Aho-Corasick or
            similar. Typically the function could be built on binary:match/2.

        2.  A new function split should be added. It should behave as tokens/2
            but take a list of separator binaries/strings instead of a list of
            separator characters.

        split(Binary, SplitKeys) -> List

            Binary = binary()
            SplitKeys = Key | [ Key ] | MatchSpec
            Key = string() | binary()
            MatchSpec = tuple() as returned by binary:match_compile/1
            List = [ binary() ]

            Splits Binary into a list of binaries based on matching the pattern
            specified in the SplitKeys binary.

            Examples:
            > binary_string:split(<<"cat and dog">>, <<"and">>).
            [<<"cat ">>, <<" dog">>] 

            > binary_string:split(<<"cat and dog">>, "and").
            [<<"cat ">>, <<" dog">>] 

            > binary_string:split(<<"cat and dog">>,["a","n",<<"d">>]).
            [<<"c">>,<<"t ">>,<<" ">>,<<"og">>]

            The resulting list should be the same as for regexp:split/2 
            (with the obvious exception for special characters such as "*", 
            ".", "^", etc).

            Please note that the third example should give the same result as
            binary_string:tokens(<<"cat and dog">>, "and"). 

        3.  The new functions substitute and globally_substitute should be 
            added. 

        substitute(OldBinary, Key, Replacement)-> NewBinary
            
            OldBinary, NewBinary, Replacement = binary()
            Keys = binary() | [ binary() ] | MatchSpec
            MatchSpec = tuple() as returned by binary:match_compile/1
            
            Creates a binary NewBinary from OldBinary by substituting the 
            first occurence of any of the binaries in Keys in OldBinary
            with the Replacement binary. 

            The Replacement binary need not have the same size as the matched 
            Key.

            Example:
            > binary_string:substitute(<<"cat anf dog">>,<<"anf">>,<<"and">>).
            [<<"cat and dog">>] 

        globally_substitute(OldBinary, Key, Replacement)-> NewBinary

            OldBinary, NewBinary, Replacement = binary()
            Keys = binary() | [ binary() ] | MatchSpec
            MatchSpec = tuple() as returned by binary:match_compile/1

            Same as substitute except that all non-overlapping occurrences of 
            a subbinary in OldBinary are replaced by the Replacement binary.



    It is suggested that the same functions are also added to the string
    module, but this is out of the scope of this EEP.

The "binary" Module

    The interface of the binary module should have the following exported 
    functions (please note that some functions are intentionally the same as
    in binary_string since it is believed they can be useful both for string
    and binary data manipulation):

        match(Binary, Keys) -> Return
        match(Binary, Keys, {StartIndex, EndIndex}) -> Return

            Binary = binary()
            Keys = binary() | [ binary() ] | MatchSpec
            MatchSpec = tuple() as returned by binary:match_compile/1
            StartIndex = EndIndex = integer()

            Return = Index | {KeyNumber, Index}
            Index = KeyNumber = integer()

            Returns position of first occurence in Binary of the first 
            matching binary in Keys or 0 if no match. If a list of keys
            is given, the function will return a tuple with the KeyNumber of
            the matched Key and the position in Binary where it was found.

            There has been a discussion on whether the function should return
            the matched Key instead of the KeyNumber. Returning the KeyNumber
            should be slightly more efficient, and since the matched key
            can easily be retrieved by lists:nth(KeyNumber, Keys) if needed it
            is suggested that the function returns the KeyNumber.

            Binary is searched from StartIndex to EndIndex. If StartIndex 
            and EndIndex are not specified the default is to search Binary 
            from the beginning to the end.

            Example:
            > binary:match(<<1,2,3,0,0,0,4>>, <<0,0,0>>).
            4 

            > binary:match(<<1,2,255,0,0,0,4,255>>, [<<0,0,0>>, <<255>>]).
            {2, 3}

            Suggestions on implementation: Should be implemented as one or
            more BIF:s using e.g. Boyer-Moore, Aho-Corasick or similar 
            efficient algorithms.
            
        matches(Binary, Keys) -> Return
        matches(Binary, Keys, {StartIndex, EndIndex}) -> Return

            Binary = binary()
            Keys = binary() | [ binary() ] | MatchSpec
            MatchSpec = tuple() as returned by binary:match_compile/1
            StartIndex = EndIndex = integer() 

            Return = [ Index ] | [ {KeyNumber, Index} ]
            Index = KeyNumber = integer()

            Finds all matches of the Keys in Haystack. Returns a list of the
            indexes for all non-overlapping ocurrences of the key or keys.

        split(Binary, SplitKeys) -> List
        split(Binary, SplitKeys, {StartIndex, EndIndex}) -> List

            Binary = binary()
            SplitKeys = binary() | [ binary() ] | MatchSpec
            MatchSpec = tuple() as returned by binary:match_compile/1
            StartIndex = EndIndex = integer()

            List = [ binary() ]

            Splits Binary into a list of binaries based on matching the pattern
            specified in SplitKeys.

            Example:
            > binary:split(<<1,255,4,0,0,0,2,3>>, <<0,0,0>>).
            [<<1,255,4>>, <<2,3>>] 

            > binary:split(<<0,1,0,0,4,255,255,9>>, [<<0,0>>, <<255,255>>]).
            [<<0,1>>,<<4>>,<<9>>] 

            The resulting list should basically be the same as for 
            regexp:split/2 (with the obvious exception for special characters
            such as "*", ".", "^", etc).

            The binaries in List are all subbinaries of Binary meaning that
            the data in Binary is not actually copied to new binaries.

        substitute(OldBinary, Key, Replacement)-> NewBinary
            
            OldBinary, NewBinary, Replacement = binary()
            Keys = binary() | [ binary() ] | MatchSpec
            MatchSpec = tuple() as returned by binary:match_compile/1
            
            Creates a binary NewBinary from OldBinary by substituting the 
            first occurence of any of the binaries in Keys in OldBinary
            with the Replacement binary. 

            The Replacement binary need not have the same size as the matched 
            Key.

        globally_substitute(OldBinary, Key, Replacement)-> NewBinary

            OldBinary, NewBinary, Replacement = binary()
            Keys = binary() | [ binary() ] | MatchSpec
            MatchSpec = tuple() as returned by binary:match_compile/1

            Same as substitute except that all non-overlapping occurrences of 
            a subbinary in OldBinary are replaced by the Replacement binary.

        match_compile(Keys) -> MatchSpec

            Keys = binary() | [ binary() ] 
            MatchSpec = tuple() 

            Builds an internal structure representing one or more search
            keys. The MatchSpec structure can be used to speed up searching if
            multiple searches with binary:match/2 or binary_string:str/2
            are to be performed with the same search keywords.


        binary:from_unsigned(Integer)-> Binary
        binary:to_unsigned(Binary)-> Integer

            Converts a positive integer the smallest possible representation
            in the binary data type format and vice versa.

            Example:
            > binary:from_unsigned(11111111). 
            <<169,138,199>>

            > binary:to_unsigned(<<169,138,199>>).
            11111111


        first(Binary1)-> Binary2
        first(SizeBytes, Binary1)-> Binary2

            Returns a subbinary with the first byte or the SizeBytes first 
            bytes in Binary1. 

            Example:
            > binary:first(2, <<"abc">>).                                  
            <<"ab">>


        last(Binary1)-> Binary2.
        last(SizeBytes, Binary1)-> Binary2

            Returns a subbinary with the last byte or the SizeBytes last bytes
            in Binary1. 

            Example:
            > binary:last(2, <<"abc">>).                                  
            <<"bc">>


        nth(N, Binary) -> Value

            N = integer(), 1 =< N =< size(Binary)
            Value = integer()

            Extracts a byte at position N from Binary. Same as

            T = N-1,
            <<_:T/binary, Value:Size/binary, _/binary>> = Binary, 
            Value.

            although this function is somewhat shorter and easier to write.

        extract(N, Size, Binary) -> SubBinary

            N = integer(), 1 =< N =< size(Binary)
            Size = integer()
            SubBinary = subbinary()

            Returns a subbinary of size Size starting at position N from 
            Binary. No data is copied in this operation.

            It has been discussed if there should be a function for copying
            a part of a binary rather than getting a subbinary. This would
            make it possible to get a small part of a binary and let the
            rest be garbage collected. Since it is possible to achieve the same
            result by converting the extracted part to a list and then back
            again to a binary and it is a very specialized operation which
            may confuse new users it has been excluded at this stage.

            When talking to designers many seem to prefer the name extract
            over the name subbinary for this function.

        duplicate(N, Byte)-> Binary

            Similar to lists:duplicate/2. Creates a new binary consisting of
            Byte repeated N times. 

            Example:
            > binary:duplicate(5, $a).
            <<"aaaaa">>




The Regular Expressions Library

        It is suggested that a new regular expression library based on
        built in functions is added. It should have the following interface
        functions (name of the module to be decided, for backwards
        compatibility reasons it should probably exists in parallell with
        the old regexp module):

        During a first round of feedback it has been suggested that the 
        final implementation should be a built in function based on the 
        Perl Compatible Regular Expressions (PCRE) library [1]. It is 
        optimised, well supported, and is more or less considered a standard 
        today. It is used in a number of prominent products and projects, e.g.
        Apples Safari, Apache, KDE, PHP, Postfix and Nmap.

        It is suggested that the module has the following exported functions:

        compile(Regex) -> MatchSpec

            Regex = string()
            MatchSpec = tuple() 

            Builds an internal structure representing one or more search
            keys. The MatchSpec structure can be used to speed up searching if
            multiple searches are to be performed with the same search 
            keywords.

        match(BinOrString, RegExp)-> Return
        match(BinOrString, RegExp, {StartIndex, EndIndex})-> Return

            BinOrString = binary() | string()
            RegExp = string() | MatchSpec
            MatchSpec = tuple() as returned by match_compile/1
            StartIndex = EndIndex = integer()
            Return = 0 | {Start, Length, [CapturedPatterns]}

            Finds the first, longest match of the regular expression RegExp 
            in BinOrString. This function searches for the longest possible 
            match and returns the first one found if there are several 
            expressions of the same length.

            The function supports pattern capturing. Patterns captured (if
            any) are returned in a list in the Return tuple.

            Examples:
            > binary:regex_match(<<"abcde">>, "b?cd").
            {2,3,[]}

            > binary:regex_match(<<"127.0.0.1">>, "(\d*)\.(\d*)\.").
            {1,6,[<<"127">>, <<"0">>]}

            Open questions: 
             - It might be a good idea to add an Options parameter (optional 
               of course), e.g. to specify that the partial matching feature 
               should be activated
             - handling of Encodings.

        matches(BinOrString, RegExp)-> Return
        matches(BinOrString, RegExp, {StartIndex, EndIndex})-> Return

            BinOrString = binary() | string()
            RegExp = string() | MatchSpec
            MatchSpec = tuple() as returned by match_compile/1
            StartIndex = EndIndex = integer()

            Return = 0 | [ {Start, Length, [CapturedPatterns]} ]

            Finds all matches of the regular expression RegExp in BinOrString. 

            Example:
            > binary:regex_matches(<<"aaa">>, "a").
            [{1,1,[]},{2,1,[]},{3,1,[]}]

        sub(BinOrString, RegExp, Replacement)-> NewStringOrBinary
            
            BinOrString = NewStringOrBinary = binary() | string()
            RegExp = string() | MatchSpec
            MatchSpec = tuple() as returned by match_compile/1
            Replacement = string()

            Substitutes the first occurence of a substring or subbinary 
            matching RegExp in BinOrString with Replacement. A & in the 
            Replacement string is replaced by the matched substring or
            subbinary of BinOrString. \& puts a literal & into the 
            replacement string or binary. The type of NewStringOrBinary
            will be the same as the type of BinOrString.

        gsub(BinOrString, RegExp, Replacement)-> Binary2

            Same as sub except that all non-overlapping occurrences of 
            a substring or subbinary matching RegExp in BinOrString are 
            replaced by the string Replacement.

        split(BinOrString, RegExp) -> List
        split(BinOrString, RegExp, {StartIndex, EndIndex}) -> List

            BinOrString = binary() | string()
            RegExp = string() | MatchSpec
            MatchSpec = tuple() as returned by match_compile/1
            StartIndex = EndIndex = integer()

            List = [ binary() ]

            Splits Binary into a list of binaries based on the pattern
            specified in RegExp.

            The resulting list should basically be the same as for 
            regexp:split/2. 


Performance

    Performance was measured for the functions considered most 
    important using the reference implementation. Some examples:

    1. Searching for a non-existing 1 and 3 byte binary in a ~1 Mb binary. 
    Notice how binary:match/2 gets faster the longer the needle is thanks to 
    the O(n/m) algorithm. All times in microseconds.

                 Search for:  1 byte   3 bytes   
        binary:match/2:        17598      6045 
        binary:regex_first/2:  47299     46701
        string:str/2:          68969     69637
        regexp:first_match/2: 460858    887485


    2. Splitting a ~1 Mb binary on newline chars. This particular binary 
    contained a newline every 60 chars on average.

        binary:split/2:  89142 microseconds
        regexp:split/2: 564911 microseconds

    3. Regex-DNA benchmark from computer language shootout

        prototype regexp bif:    1.9 seconds
        regexp module in R12B:  99.1 seconds

    In the examples at the computer language shootout PCRE has a slightly 
    lower performance compared to other algorithms such as the one in the 
    reference implementation or in particular the one featured in TCL. This
    may not necessarily mean that this is true for all types of patterns.


Reference implementation

    A reference implementation has been provided to the OTP team.


References

    [1] http://en.wikipedia.org/wiki/PCRE

    [2] see man page for pcrematching, also available here:
        http://www.pcre.org/pcre.txt

    [3] http://swtch.com/~rsc/regexp/regexp1.html

Copyright

    This document is licensed under the Creative Commons license.