12  External Term Format

12 External Term Format

The external term format is mainly used in the distribution mechanism of Erlang.

As Erlang has a fixed number of types, there is no need for a programmer to define a specification for the external format used within some application. All Erlang terms have an external representation and the interpretation of the different terms is application-specific.

In Erlang the BIF erlang:term_to_binary/1,2 is used to convert a term into the external format. To convert binary data encoding to a term, the BIF erlang:binary_to_term/1 is used.

The distribution does this implicitly when sending messages across node boundaries.

The overall format of the term format is as follows:

1 1 N
131 Tag Data

Table 12.1:   Term Format

Note

When messages are passed between connected nodes and a distribution header is used, the first byte containing the version number (131) is omitted from the terms that follow the distribution header. This is because the version number is implied by the version number in the distribution header.

The compressed term format is as follows:

1 1 4 N
131 80 UncompressedSize Zlib-compressedData

Table 12.2:   Compressed Term Format

Uncompressed size (unsigned 32-bit integer in big-endian byte order) is the size of the data before it was compressed. The compressed data has the following format when it has been expanded:

1 Uncompressed Size
Tag Data

Table 12.3:   Compressed Data Format when Expanded

As from ERTS 9.0 (OTP 20), atoms may contain any Unicode characters.

Atoms sent over node distribution are always encoded in UTF-8 using either ATOM_UTF8_EXT, SMALL_ATOM_UTF8_EXT or ATOM_CACHE_REF.

Atoms encoded with erlang:term_to_binary/1,2 or erlang:term_to_iovec/1,2 are by default still using the old deprecated Latin-1 format ATOM_EXT for atoms that only contain Latin-1 characters (Unicode code points 0-255). Atoms with higher code points will be encoded in UTF-8 using either ATOM_UTF8_EXT or SMALL_ATOM_UTF8_EXT.

The maximum number of allowed characters in an atom is 255. In the UTF-8 case, each character can need 4 bytes to be encoded.

The distribution header is sent by the erlang distribution to carry metadata about the coming control message and potential payload. It is primarily used to handle the atom cache in the Erlang distribution. Since OTP-22 it is also used to fragment large distribution messages into multiple smaller fragments. For more information about how the distribution uses the distribution header, see the documentation of the protocol between connected nodes in the distribution protocol documentation.

Any ATOM_CACHE_REF entries with corresponding AtomCacheReferenceIndex in terms encoded on the external format following a distribution header refer to the atom cache references made in the distribution header. The range is 0 <= AtomCacheReferenceIndex < 255, that is, at most 255 different atom cache references from the following terms can be made.

The non-fragmented distribution header format is as follows:

1 1 1 NumberOfAtomCacheRefs/2+1 | 0 N | 0
131 68 NumberOfAtomCacheRefs Flags AtomCacheRefs

Table 12.4:   Normal Distribution Header Format

Flags consist of NumberOfAtomCacheRefs/2+1 bytes, unless NumberOfAtomCacheRefs is 0. If NumberOfAtomCacheRefs is 0, Flags and AtomCacheRefs are omitted. Each atom cache reference has a half byte flag field. Flags corresponding to a specific AtomCacheReferenceIndex are located in flag byte number AtomCacheReferenceIndex/2. Flag byte 0 is the first byte after the NumberOfAtomCacheRefs byte. Flags for an even AtomCacheReferenceIndex are located in the least significant half byte and flags for an odd AtomCacheReferenceIndex are located in the most significant half byte.

The flag field of an atom cache reference has the following format:

1 bit 3 bits
NewCacheEntryFlag SegmentIndex

Table 12.5:  

The most significant bit is the NewCacheEntryFlag. If set, the corresponding cache reference is new. The three least significant bits are the SegmentIndex of the corresponding atom cache entry. An atom cache consists of 8 segments, each of size 256, that is, an atom cache can contain 2048 entries.

Another half byte flag field is located along with flag fields for atom cache references. When NumberOfAtomCacheRefs is even, this half byte is the least significant half byte of the byte that follows the atom cache references. When NumberOfAtomCacheRefs is odd, this half byte is the most significant half byte of the last byte of the atom cache references (on the wire, it will appear before the last cache reference). It has the following format:

3 bits 1 bit
CurrentlyUnused LongAtoms

Table 12.6:  

The least significant bit in that half byte is flag LongAtoms. If it is set, 2 bytes are used for atom lengths instead of 1 byte in the distribution header.

After the Flags field follow the AtomCacheRefs. The first AtomCacheRef is the one corresponding to AtomCacheReferenceIndex 0. Higher indices follow in sequence up to index NumberOfAtomCacheRefs - 1.

If the NewCacheEntryFlag for the next AtomCacheRef has been set, a NewAtomCacheRef on the following format follows:

1 1 | 2 Length
InternalSegmentIndex Length AtomText

Table 12.7:  

InternalSegmentIndex together with the SegmentIndex completely identify the location of an atom cache entry in the atom cache. Length is the number of bytes that AtomText consists of. Length is a 2 byte big-endian integer if flag LongAtoms has been set, otherwise a 1 byte integer. When distribution flag DFLAG_UTF8_ATOMS has been exchanged between both nodes in the distribution handshake, characters in AtomText are encoded in UTF-8, otherwise in Latin-1. The following CachedAtomRefs with the same SegmentIndex and InternalSegmentIndex as this NewAtomCacheRef refer to this atom until a new NewAtomCacheRef with the same SegmentIndex and InternalSegmentIndex appear.

For more information on encoding of atoms, see the section on UTF-8 encoded atoms above.

If the NewCacheEntryFlag for the next AtomCacheRef has not been set, a CachedAtomRef on the following format follows:

1
InternalSegmentIndex

Table 12.8:  

InternalSegmentIndex together with the SegmentIndex identify the location of the atom cache entry in the atom cache. The atom corresponding to this CachedAtomRef is the latest NewAtomCacheRef preceding this CachedAtomRef in another previously passed distribution header.

Messages sent between Erlang nodes can sometimes be quite large. Since OTP-22 it is possible to split large messages into smaller fragments in order to allow smaller messages to be interleaved between larges messages. It is only the message part of each distributed message that may be split using fragmentation. Therefore it is recommended to use the PAYLOAD control messages introduced in OTP-22.

Fragmented distribution messages are only used if the receiving node signals that it supports them via the DFLAG_FRAGMENTS distribution flag.

A process must complete the sending of a fragmented message before it can start sending any other message on the same distribution channel.

The start of a sequence of fragmented messages looks like this:

1 1 8 8 1 NumberOfAtomCacheRefs/2+1 | 0 N | 0
131 69 SequenceId FragmentId NumberOfAtomCacheRefs Flags AtomCacheRefs

Table 12.9:   Starting Fragmented Distribution Header Format

The continuation of a sequence of fragmented messages looks like this:

1 1 8 8
131 70 SequenceId FragmentId

Table 12.10:   Continuing Fragmented Distribution Header Format

The starting distribution header is very similar to a non-fragmented distribution header. The atom cache works the same as for normal distribution header and is the same for the entire sequence. The additional fields added are the sequence id and fragment id.

The sequence id is used to uniquely identify a fragmented message sent from one process to another on the same distributed connection. This is used to identify which sequence a fragment is a part of as the same process can be in the process of receiving multiple sequences at the same time.

As one process can only be sending one fragmented message at once, it can be convenient to use the local PID as the sequence id.

The Fragment ID is used to number the fragments in a sequence. The id starts at the total number of fragments and then decrements to 1 (which is the final fragment). So if a sequence consists of 3 fragments the fragment id in the starting header will be 3, and then fragments 2 and 1 are sent.

The fragments must be delivered in the correct order, so if an unordered distribution carrier is used, they must be ordered before delivered to the Erlang run-time.

Example:

As an example, let say that we want to send {call, <0.245.2>, {set_get_state, <<0:1024>>}} to registered process reg using a fragment size of 128. To send this message we need a distribution header, atom cache updates, the control message (which would be {6, <0.245.2>, [], reg} in this case) and finally the actual message. This would all be encoded into:

131,69,0,0,2,168,0,0,5,83,0,0,0,0,0,0,0,2,               %% Header with seq and frag id
5,4,137,9,10,5,236,3,114,101,103,9,4,99,97,108,108,      %% Atom cache updates
238,13,115,101,116,95,103,101,116,95,115,116,97,116,101,
104,4,97,6,103,82,0,0,0,0,85,0,0,0,0,2,82,1,82,2,        %% Control message
104,3,82,3,103,82,0,0,0,0,245,0,0,0,2,2,                 %% Actual message using cached atoms
104,2,82,4,109,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

131,70,0,0,2,168,0,0,5,83,0,0,0,0,0,0,0,1,               %% Cont Header with seq and frag id
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,               %% Rest of payload
0,0,0,0

Let us break that apart into its components. First we have the distribution header tags together with the sequence id and a fragment id of 2.

131,69,                   %% Start fragment header
0,0,2,168,0,0,5,83,       %% The sequence ID
0,0,0,0,0,0,0,2,           %% The fragment ID

Then we have the updates to the atom cache:

5,4,137,9,  %% 5 atoms and their flags
10,5,       %% The already cached atom ids
236,3,114,101,103,  %% The atom 'reg'
9,4,99,97,108,108,  %% The atom 'call'
238,13,115,101,116,95,103,101,116,95,115,116,97,116,101, %% The atom 'set_get_state'

The first byte says that we have 5 atoms that are part of the cache. Then follows three bytes that are the atom cache ref flags. Each of the flags uses 4 bits so they are a bit hard to read in decimal byte form. In binary half-byte form they look like this:

0000, 0100, 1000, 1001, 1001

As the high bit of the first two atoms in the cache are not set we know that they are already in the cache, so they do not have to be sent again (this is the node name of the receiving and sending node). Then follows the atoms that have to be sent, together with their segment ids.

Then the listing of the atoms comes, starting with 10 and 5 which are the atom refs of the already cached atoms. Then the new atoms are sent.

When the atom cache is setup correctly the control message is sent.

104,4,97,6,103,82,0,0,0,0,85,0,0,0,0,2,82,1,82,2,

Note that up until here it is not allowed to fragments the message. The entire atom cache and control message has to be part of the starting fragment. After the control message the payload of the message is sent using 128 bytes:

104,3,82,3,103,82,0,0,0,0,245,0,0,0,2,2,
104,2,82,4,109,0,0,0,128,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Since the payload is larger than 128-bytes it is split into two fragments. The second fragment does not have any atom cache update instructions so it is a lot simpler:

131,70,0,0,2,168,0,0,5,83,0,0,0,0,0,0,0,1, %% Continuation dist header 70 with seq and frag id
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, %% remaining payload
0,0,0,0
Note

The fragment size of 128 is only used as an example. Any fragments size may be used when sending fragmented messages.

1 1
82 AtomCacheReferenceIndex

Table 12.11:   ATOM_CACHE_REF

Refers to the atom with AtomCacheReferenceIndex in the distribution header.

1 1
97 Int

Table 12.12:   SMALL_INTEGER_EXT

Unsigned 8-bit integer.

1 4
98 Int

Table 12.13:   INTEGER_EXT

Signed 32-bit integer in big-endian format.

1 31
99 Float string

Table 12.14:   FLOAT_EXT

A finite float (i.e. not inf, -inf or NaN) is stored in string format. The format used in sprintf to format the float is "%.20e" (there are more bytes allocated than necessary). To unpack the float, use sscanf with format "%lf".

This term is used in minor version 0 of the external format; it has been superseded by NEW_FLOAT_EXT.

1 N 4 1
102 Node ID Creation

Table 12.15:   PORT_EXT

Same as NEW_PORT_EXT except the Creation field is only one byte and only two bits are significant, the rest are to be 0.

1 N 4 4
89 Node ID Creation

Table 12.16:   NEW_PORT_EXT

Same as V4_PORT_EXT except the ID field is only four bytes. Only 28 bits are significant; the rest are to be 0.

NEW_PORT_EXT was introduced in OTP 19, but only to be decoded and echoed back. Not encoded for local ports.

In OTP 23 distribution flag DFLAG_BIG_CREATION became mandatory. All ports are now encoded using NEW_PORT_EXT, even external ports received as PORT_EXT from older nodes.

1 N 8 4
120 Node ID Creation

Table 12.17:   V4_PORT_EXT

Encodes a port identifier (obtained from erlang:open_port/2). Node is the originating node, encoded as an atom. ID is a 64-bit big endian unsigned integer. The Creation works just like in NEW_PID_EXT. Port operations are not allowed across node boundaries.

In OTP 26 distribution flag DFLAG_V4_NC as well as V4_PORT_EXT became mandatory accepting full 64-bit ports to be decoded and echoed back.

1 N 4 4 1
103 Node ID Serial Creation

Table 12.18:   PID_EXT

Same as NEW_PID_EXT except the Creation field is only one byte and only two bits are significant, the rest are to be 0.

1 N 4 4 4
88 Node ID Serial Creation

Table 12.19:   NEW_PID_EXT

Encodes an Erlang process identifier object.

The name of the originating node, encoded as an atom.

A 32-bit big endian unsigned integer.

A 32-bit big endian unsigned integer.

A 32-bit big endian unsigned integer. All identifiers originating from the same node incarnation must have identical Creation values. This makes it possible to separate identifiers from old (crashed) nodes from a new one. The value zero is reserved and must be avoided for normal operations.

NEW_PID_EXT was introduced in OTP 19, but only to be decoded and echoed back. Not encoded for local processes.

In OTP 23 distribution flag DFLAG_BIG_CREATION became mandatory. All pids are now encoded using NEW_PID_EXT, even external pids received as PID_EXT from older nodes.

In OTP 26 distribution flag DFLAG_V4_NC became mandatory accepting full 64-bit pids to be decoded and echoed back.

1 1 N
104 Arity Elements

Table 12.20:   SMALL_TUPLE_EXT

Encodes a tuple. The Arity field is an unsigned byte that determines how many elements that follows in section Elements.

1 4 N
105 Arity Elements

Table 12.21:   LARGE_TUPLE_EXT

Same as SMALL_TUPLE_EXT except that Arity is an unsigned 4 byte integer in big-endian format.

1 4 N
116 Arity Pairs

Table 12.22:   MAP_EXT

Encodes a map. The Arity field is an unsigned 4 byte integer in big-endian format that determines the number of key-value pairs in the map. Key and value pairs (Ki => Vi) are encoded in section Pairs in the following order: K1, V1, K2, V2,..., Kn, Vn. Duplicate keys are not allowed within the same map.

As from Erlang/OTP 17.0

1
106

Table 12.23:   NIL_EXT

The representation for an empty list, that is, the Erlang syntax [].

1 2 Len
107 Length Characters

Table 12.24:   STRING_EXT

String does not have a corresponding Erlang representation, but is an optimization for sending lists of bytes (integer in the range 0-255) more efficiently over the distribution. As field Length is an unsigned 2 byte integer (big-endian), implementations must ensure that lists longer than 65535 elements are encoded as LIST_EXT.

1 4    
108 Length Elements Tail

Table 12.25:   LIST_EXT

Length is the number of elements that follows in section Elements. Tail is the final tail of the list; it is NIL_EXT for a proper list, but can be any type if the list is improper (for example, [a|b]).

1 4 Len
109 Len Data

Table 12.26:   BINARY_EXT

Binaries are generated with bit syntax expression or with erlang:list_to_binary/1, erlang:term_to_binary/1, or as input from binary ports. The Len length field is an unsigned 4 byte integer (big-endian).

1 1 1 n
110 n Sign d(0) ... d(n-1)

Table 12.27:   SMALL_BIG_EXT

Bignums are stored in unary form with a Sign byte, that is, 0 if the bignum is positive and 1 if it is negative. The digits are stored with the least significant byte stored first. To calculate the integer, the following formula can be used:

B = 256
(d0*B^0 + d1*B^1 + d2*B^2 + ... d(N-1)*B^(n-1))

1 4 1 n
111 n Sign d(0) ... d(n-1)

Table 12.28:   LARGE_BIG_EXT

Same as SMALL_BIG_EXT except that the length field is an unsigned 4 byte integer.

1 N 4 1
101 Node ID Creation

Table 12.29:   REFERENCE_EXT

The same as NEW_REFERENCE_EXT except ID is only one word (Len = 1).

1 2 N 1 N'
114 Len Node Creation ID ...

Table 12.30:   NEW_REFERENCE_EXT

The same as NEWER_REFERENCE_EXT except:

In the first word (4 bytes) of ID, only 18 bits are significant, the rest must be 0.

Only one byte long and only two bits are significant, the rest must be 0.

1 2 N 4 N'
90 Len Node Creation ID ...

Table 12.31:   NEWER_REFERENCE_EXT

Encodes a reference term generated with erlang:make_ref/0.

The name of the originating node, encoded as an atom.

A 16-bit big endian unsigned integer not larger than 5.

A sequence of Len big-endian unsigned integers (4 bytes each, so N' = 4 * Len), but is to be regarded as uninterpreted data.

Works just like in NEW_PID_EXT.

NEWER_REFERENCE_EXT was introduced in OTP 19, but only to be decoded and echoed back. Not encoded for local references.

In OTP 23 distribution flag DFLAG_BIG_CREATION became mandatory. All references are now encoded using NEWER_REFERENCE_EXT, even external references received as NEW_REFERENCE_EXT from older nodes.

In OTP 26 distribution flag DFLAG_V4_NC became mandatory. References now can contain up to 5 ID words.

1 4 N1 N2 N3 N4 N5
117 NumFree Pid Module Index Uniq Free vars ...

Table 12.32:   FUN_EXT

Not emitted since OTP R8, and not decoded since OTP 23.

1 4 1 16 4 4 N1 N2 N3 N4 N5
112 Size Arity Uniq Index NumFree Module OldIndex OldUniq Pid Free Vars

Table 12.33:   NEW_FUN_EXT

This is the encoding of internal funs: fun F/A and fun(Arg1,..) -> ... end.

The total number of bytes, including field Size.

The arity of the function implementing the fun.

The 16 bytes MD5 of the significant parts of the Beam file.

An index number. Each fun within a module has an unique index. Index is stored in big-endian byte order.

The number of free variables.

The module that the fun is implemented in, encoded as an atom.

An integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. Is typically a small index into the module's fun table.

An integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. Uniq is the hash value of the parse tree for the fun.

A process identifier as in PID_EXT. Represents the process in which the fun was created.

NumFree number of terms, each one encoded according to its type.

1 N1 N2 N3
113 Module Function Arity

Table 12.34:   EXPORT_EXT

This term is the encoding for external funs: fun M:F/A.

Module and Function are encoded as atoms.

Arity is an integer encoded using SMALL_INTEGER_EXT.

1 4 1 Len
77 Len Bits Data

Table 12.35:   BIT_BINARY_EXT

This term represents a bitstring whose length in bits does not have to be a multiple of 8. The Len field is an unsigned 4 byte integer (big-endian). The Bits field is the number of bits (1-8) that are used in the last byte in the data field, counting from the most significant bit to the least significant.

1 8
70 IEEE float

Table 12.36:   NEW_FLOAT_EXT

A finite float (i.e. not inf, -inf or NaN) is stored as 8 bytes in big-endian IEEE format.

This term is used in minor version 1 of the external format.

1 2 Len
118 Len AtomName

Table 12.37:   ATOM_UTF8_EXT

An atom is stored with a 2 byte unsigned length in big-endian order, followed by Len bytes containing the AtomName encoded in UTF-8.

For more information, see the section on encoding atoms in the beginning of this page.

1 1 Len
119 Len AtomName

Table 12.38:   SMALL_ATOM_UTF8_EXT

An atom is stored with a 1 byte unsigned length, followed by Len bytes containing the AtomName encoded in UTF-8. Longer atoms encoded in UTF-8 can be represented using ATOM_UTF8_EXT.

For more information, see the section on encoding atoms in the beginning of this page.

1 2 Len
100 Len AtomName

Table 12.39:   ATOM_EXT

An atom is stored with a 2 byte unsigned length in big-endian order, followed by Len numbers of 8-bit Latin-1 characters that forms the AtomName. The maximum allowed value for Len is 255.

1 1 Len
115 Len AtomName

Table 12.40:   SMALL_ATOM_EXT

An atom is stored with a 1 byte unsigned length, followed by Len numbers of 8-bit Latin-1 characters that forms the AtomName.

Note

SMALL_ATOM_EXT was introduced in ERTS 5.7.2 and require an exchange of distribution flag DFLAG_SMALL_ATOM_TAGS in the distribution handshake.

1 ...
121 ...

Table 12.41:   LOCAL_EXT

Marks that this is encoded on an alternative local external term format intended to only be decoded by a specific local decoder. The bytes following from here on may contain any unspecified type of encoding of terms. It is the responsibility of the user to only attempt to decode terms on the local external term format which has been produced by a matching encoder.

This tag is used by the Erlang runtime system upon encoding the local external term format when the local option is passed to term_to_binary/2, but can be used by other encoders as well providing similar functionality. The Erlang runtime system adds a hash immediately following the LOCAL_EXT tag which is verified on decoding in order to verify that encoder and decoder match which might be a good practice. This will very likely catch mistakes made by users, but is not guaranteed to, and is not intended to, prevent decoding of an intentionally forged encoding on the local external term format.

LOCAL_EXT was introduced in OTP 26.0.