Abstract
This EEP suggests the addition of two binary help libraries with built-in
functions for time critical activities such as searching and splitting
erlang binaries as well as library functions for common operations on
binaries. The EEP also suggest the addition of a regular expressions
library using built in functions.
Rationale
For the lists data type there is a help library providing functions for
common operations such as searching and splitting lists. This EEP suggests
that a similar set of library functions should be created for binaries.
Many of the proposed functions are based on answers to questions regarding
binaries on the erlang-questions mailing list, e.g. "how do I convert a
number to a binary?".
Motivation
Since binaries are typically used for time critical activities on larger
amounts of data it is suggested that some operations on binaries are
implemented as built-in functions, BIF:s.
Specifically there seems to be a huge interest in the community for an
efficient regexp implementation for searching binaries. Also for maximum
performance when searching and splitting binaries it is suggested that the
the regexp search function is complemented by a high performance function
for simple searches, e.g. locating and splitting binaries on newline
characters. Tests show that e.g. the Boyer-Moore algorithm may be
significantly faster than regular expression algorithms for such purposes.
When reviewing the EEP it is clear that there is also a strong demand for
string operations on binaries for better performance.
The reference implementation sent separately to the OTP team gives
an indication of the expected performance improvements compared to e.g.
the current regular expression module for searching on lists. Some results
are available at the end of this EEP.
Suggested Changes
This EEP suggests the addition of two new modules; one module named
binary and one called binary_string.
The EEP also suggests a new regular expression library based on Perl
Compatible Regular Expressions (PCRE). The library should be able to
operate both on binary_strings and on strings.
Finally, the following functions should be added to the erlang module:
binary_to_atom(Binary) -> Atom
atom_to_binary(Atom) -> Binary
binary_to_existing_atom(Binary) -> Atom
Not Included
At the moment the following is not included in the EEP:
- Support for different encodings, e.g. UTF-8
- Changes to the string module
The "binary_string" Module
The binary_string module should be based on the current string module but
should operate on strings represented by binaries as opposed to the
current strings module which operates on strings represented by lists.
Apart from operating on binaries the interface of binary_string should be
the same as for string with the following exceptions:
1. str/2 and rstr/2 should be modified to optionally take a list of
binaries or a MatchSpec such as the one returned by
binary:match_compile/2 as second argument. If the Keys argument
corresponds to several keys the function should return a tuple
indicating the Key that matched and the matching Index, i.e.
str(Binary, Keys) -> Return
rstr(Binary, Keys) -> Return
Binary = binary()
Keys = Key | [ Key ] | MatchSpec
Key = string() | binary()
MatchSpec = tuple() as returned by binary:match_compile/1
Return = Index | {NeedleNumber, Index}
Index = integer()
str/rstr should be implemented as built-in functions using
efficient algorithms such as Boyer-Moore, Aho-Corasick or
similar. Typically the function could be built on binary:match/2.
2. A new function split should be added. It should behave as tokens/2
but take a list of separator binaries/strings instead of a list of
separator characters.
split(Binary, SplitKeys) -> List
Binary = binary()
SplitKeys = Key | [ Key ] | MatchSpec
Key = string() | binary()
MatchSpec = tuple() as returned by binary:match_compile/1
List = [ binary() ]
Splits Binary into a list of binaries based on matching the pattern
specified in the SplitKeys binary.
Examples:
> binary_string:split(<<"cat and dog">>, <<"and">>).
[<<"cat ">>, <<" dog">>]
> binary_string:split(<<"cat and dog">>, "and").
[<<"cat ">>, <<" dog">>]
> binary_string:split(<<"cat and dog">>,["a","n",<<"d">>]).
[<<"c">>,<<"t ">>,<<" ">>,<<"og">>]
The resulting list should be the same as for regexp:split/2
(with the obvious exception for special characters such as "*",
".", "^", etc).
Please note that the third example should give the same result as
binary_string:tokens(<<"cat and dog">>, "and").
3. The new functions substitute and globally_substitute should be
added.
substitute(OldBinary, Key, Replacement)-> NewBinary
OldBinary, NewBinary, Replacement = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
Creates a binary NewBinary from OldBinary by substituting the
first occurence of any of the binaries in Keys in OldBinary
with the Replacement binary.
The Replacement binary need not have the same size as the matched
Key.
Example:
> binary_string:substitute(<<"cat anf dog">>,<<"anf">>,<<"and">>).
[<<"cat and dog">>]
globally_substitute(OldBinary, Key, Replacement)-> NewBinary
OldBinary, NewBinary, Replacement = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
Same as substitute except that all non-overlapping occurrences of
a subbinary in OldBinary are replaced by the Replacement binary.
It is suggested that the same functions are also added to the string
module, but this is out of the scope of this EEP.
The "binary" Module
The interface of the binary module should have the following exported
functions (please note that some functions are intentionally the same as
in binary_string since it is believed they can be useful both for string
and binary data manipulation):
match(Binary, Keys) -> Return
match(Binary, Keys, {StartIndex, EndIndex}) -> Return
Binary = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
StartIndex = EndIndex = integer()
Return = Index | {KeyNumber, Index}
Index = KeyNumber = integer()
Returns position of first occurence in Binary of the first
matching binary in Keys or 0 if no match. If a list of keys
is given, the function will return a tuple with the KeyNumber of
the matched Key and the position in Binary where it was found.
There has been a discussion on whether the function should return
the matched Key instead of the KeyNumber. Returning the KeyNumber
should be slightly more efficient, and since the matched key
can easily be retrieved by lists:nth(KeyNumber, Keys) if needed it
is suggested that the function returns the KeyNumber.
Binary is searched from StartIndex to EndIndex. If StartIndex
and EndIndex are not specified the default is to search Binary
from the beginning to the end.
Example:
> binary:match(<<1,2,3,0,0,0,4>>, <<0,0,0>>).
4
> binary:match(<<1,2,255,0,0,0,4,255>>, [<<0,0,0>>, <<255>>]).
{2, 3}
Suggestions on implementation: Should be implemented as one or
more BIF:s using e.g. Boyer-Moore, Aho-Corasick or similar
efficient algorithms.
matches(Binary, Keys) -> Return
matches(Binary, Keys, {StartIndex, EndIndex}) -> Return
Binary = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
StartIndex = EndIndex = integer()
Return = [ Index ] | [ {KeyNumber, Index} ]
Index = KeyNumber = integer()
Finds all matches of the Keys in Haystack. Returns a list of the
indexes for all non-overlapping ocurrences of the key or keys.
split(Binary, SplitKeys) -> List
split(Binary, SplitKeys, {StartIndex, EndIndex}) -> List
Binary = binary()
SplitKeys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
StartIndex = EndIndex = integer()
List = [ binary() ]
Splits Binary into a list of binaries based on matching the pattern
specified in SplitKeys.
Example:
> binary:split(<<1,255,4,0,0,0,2,3>>, <<0,0,0>>).
[<<1,255,4>>, <<2,3>>]
> binary:split(<<0,1,0,0,4,255,255,9>>, [<<0,0>>, <<255,255>>]).
[<<0,1>>,<<4>>,<<9>>]
The resulting list should basically be the same as for
regexp:split/2 (with the obvious exception for special characters
such as "*", ".", "^", etc).
The binaries in List are all subbinaries of Binary meaning that
the data in Binary is not actually copied to new binaries.
substitute(OldBinary, Key, Replacement)-> NewBinary
OldBinary, NewBinary, Replacement = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
Creates a binary NewBinary from OldBinary by substituting the
first occurence of any of the binaries in Keys in OldBinary
with the Replacement binary.
The Replacement binary need not have the same size as the matched
Key.
globally_substitute(OldBinary, Key, Replacement)-> NewBinary
OldBinary, NewBinary, Replacement = binary()
Keys = binary() | [ binary() ] | MatchSpec
MatchSpec = tuple() as returned by binary:match_compile/1
Same as substitute except that all non-overlapping occurrences of
a subbinary in OldBinary are replaced by the Replacement binary.
match_compile(Keys) -> MatchSpec
Keys = binary() | [ binary() ]
MatchSpec = tuple()
Builds an internal structure representing one or more search
keys. The MatchSpec structure can be used to speed up searching if
multiple searches with binary:match/2 or binary_string:str/2
are to be performed with the same search keywords.
binary:from_unsigned(Integer)-> Binary
binary:to_unsigned(Binary)-> Integer
Converts a positive integer the smallest possible representation
in the binary data type format and vice versa.
Example:
> binary:from_unsigned(11111111).
<<169,138,199>>
> binary:to_unsigned(<<169,138,199>>).
11111111
first(Binary1)-> Binary2
first(SizeBytes, Binary1)-> Binary2
Returns a subbinary with the first byte or the SizeBytes first
bytes in Binary1.
Example:
> binary:first(2, <<"abc">>).
<<"ab">>
last(Binary1)-> Binary2.
last(SizeBytes, Binary1)-> Binary2
Returns a subbinary with the last byte or the SizeBytes last bytes
in Binary1.
Example:
> binary:last(2, <<"abc">>).
<<"bc">>
nth(N, Binary) -> Value
N = integer(), 1 =< N =< size(Binary)
Value = integer()
Extracts a byte at position N from Binary. Same as
T = N-1,
<<_:T/binary, Value:Size/binary, _/binary>> = Binary,
Value.
although this function is somewhat shorter and easier to write.
extract(N, Size, Binary) -> SubBinary
N = integer(), 1 =< N =< size(Binary)
Size = integer()
SubBinary = subbinary()
Returns a subbinary of size Size starting at position N from
Binary. No data is copied in this operation.
It has been discussed if there should be a function for copying
a part of a binary rather than getting a subbinary. This would
make it possible to get a small part of a binary and let the
rest be garbage collected. Since it is possible to achieve the same
result by converting the extracted part to a list and then back
again to a binary and it is a very specialized operation which
may confuse new users it has been excluded at this stage.
When talking to designers many seem to prefer the name extract
over the name subbinary for this function.
duplicate(N, Byte)-> Binary
Similar to lists:duplicate/2. Creates a new binary consisting of
Byte repeated N times.
Example:
> binary:duplicate(5, $a).
<<"aaaaa">>
The Regular Expressions Library
It is suggested that a new regular expression library based on
built in functions is added. It should have the following interface
functions (name of the module to be decided, for backwards
compatibility reasons it should probably exists in parallell with
the old regexp module):
During a first round of feedback it has been suggested that the
final implementation should be a built in function based on the
Perl Compatible Regular Expressions (PCRE) library [1]. It is
optimised, well supported, and is more or less considered a standard
today. It is used in a number of prominent products and projects, e.g.
Apples Safari, Apache, KDE, PHP, Postfix and Nmap.
It is suggested that the module has the following exported functions:
compile(Regex) -> MatchSpec
Regex = string()
MatchSpec = tuple()
Builds an internal structure representing one or more search
keys. The MatchSpec structure can be used to speed up searching if
multiple searches are to be performed with the same search
keywords.
match(BinOrString, RegExp)-> Return
match(BinOrString, RegExp, {StartIndex, EndIndex})-> Return
BinOrString = binary() | string()
RegExp = string() | MatchSpec
MatchSpec = tuple() as returned by match_compile/1
StartIndex = EndIndex = integer()
Return = 0 | {Start, Length, [CapturedPatterns]}
Finds the first, longest match of the regular expression RegExp
in BinOrString. This function searches for the longest possible
match and returns the first one found if there are several
expressions of the same length.
The function supports pattern capturing. Patterns captured (if
any) are returned in a list in the Return tuple.
Examples:
> binary:regex_match(<<"abcde">>, "b?cd").
{2,3,[]}
> binary:regex_match(<<"127.0.0.1">>, "(\d*)\.(\d*)\.").
{1,6,[<<"127">>, <<"0">>]}
Open questions:
- It might be a good idea to add an Options parameter (optional
of course), e.g. to specify that the partial matching feature
should be activated
- handling of Encodings.
matches(BinOrString, RegExp)-> Return
matches(BinOrString, RegExp, {StartIndex, EndIndex})-> Return
BinOrString = binary() | string()
RegExp = string() | MatchSpec
MatchSpec = tuple() as returned by match_compile/1
StartIndex = EndIndex = integer()
Return = 0 | [ {Start, Length, [CapturedPatterns]} ]
Finds all matches of the regular expression RegExp in BinOrString.
Example:
> binary:regex_matches(<<"aaa">>, "a").
[{1,1,[]},{2,1,[]},{3,1,[]}]
sub(BinOrString, RegExp, Replacement)-> NewStringOrBinary
BinOrString = NewStringOrBinary = binary() | string()
RegExp = string() | MatchSpec
MatchSpec = tuple() as returned by match_compile/1
Replacement = string()
Substitutes the first occurence of a substring or subbinary
matching RegExp in BinOrString with Replacement. A & in the
Replacement string is replaced by the matched substring or
subbinary of BinOrString. \& puts a literal & into the
replacement string or binary. The type of NewStringOrBinary
will be the same as the type of BinOrString.
gsub(BinOrString, RegExp, Replacement)-> Binary2
Same as sub except that all non-overlapping occurrences of
a substring or subbinary matching RegExp in BinOrString are
replaced by the string Replacement.
split(BinOrString, RegExp) -> List
split(BinOrString, RegExp, {StartIndex, EndIndex}) -> List
BinOrString = binary() | string()
RegExp = string() | MatchSpec
MatchSpec = tuple() as returned by match_compile/1
StartIndex = EndIndex = integer()
List = [ binary() ]
Splits Binary into a list of binaries based on the pattern
specified in RegExp.
The resulting list should basically be the same as for
regexp:split/2.
Performance
Performance was measured for the functions considered most
important using the reference implementation. Some examples:
1. Searching for a non-existing 1 and 3 byte binary in a ~1 Mb binary.
Notice how binary:match/2 gets faster the longer the needle is thanks to
the O(n/m) algorithm. All times in microseconds.
Search for: 1 byte 3 bytes
binary:match/2: 17598 6045
binary:regex_first/2: 47299 46701
string:str/2: 68969 69637
regexp:first_match/2: 460858 887485
2. Splitting a ~1 Mb binary on newline chars. This particular binary
contained a newline every 60 chars on average.
binary:split/2: 89142 microseconds
regexp:split/2: 564911 microseconds
3. Regex-DNA benchmark from computer language shootout
prototype regexp bif: 1.9 seconds
regexp module in R12B: 99.1 seconds
In the examples at the computer language shootout PCRE has a slightly
lower performance compared to other algorithms such as the one in the
reference implementation or in particular the one featured in TCL. This
may not necessarily mean that this is true for all types of patterns.
Reference implementation
A reference implementation has been provided to the OTP team.
References
[1] http://en.wikipedia.org/wiki/PCRE
[2] see man page for pcrematching, also available here:
http://www.pcre.org/pcre.txt
[3] http://swtch.com/~rsc/regexp/regexp1.html
Copyright
This document is licensed under the Creative Commons license.