Michał Muskała <micmus(at)whatsapp(dot)com>
Standards Track

EEP 68: JSON library #

Abstract #

This EEP proposes introducing a module json to the Erlang standard library with support for encoding and decoding JSON documents from and to Erlang data structures. The main reason is to cover a gap in the Erlang standard library with regards to such a vastly popular and widespread data format.

Rationale #

JSON is commonly used in many different use-cases:

  • by web services as a lightweight and human-readable data interchange format;
  • as a configuration language in static files;
  • as data interchange format by developer tooling;
  • and more.

There are many existing JSON libraries for Erlang and other BEAM languages, however adding such a support to standard library would offer unique benefits. Most notably being able to use it in situations where leveraging third-party libraries is complex or cumbersome – such as stand-alone escripts or fundamental tooling like a build system, or inside OTP itself.

There have been previous attempts to bring JSON support into OTP, most notably EEP 18, which ultimately weren’t adopted previously for various reasons. However, I believe the time is right to revisit this subject with a fresh take on an interface such support could take.

JSON is a well defined format specified in parallel in RFC 8259 and ECMA 404, however how this representation should be translated into Erlang is not fully clear since the data structures don’t present a direct, 1:1 mapping. To help with this, this EEP proposes an interface that presents both a convenient and “canonical” simple API, as well as an extensible and highly-customisable API with common underlying implementation.

This EEP proposes a JSON library which:

  • should be easy to adopt in large codebases using one of the popular, existing, open-source JSON libraries;
  • will allow the existing open-source libraries with custom features (like support for Elixir protocols) to become thin wrappers around this library;
  • will improve, or at least not regress, performance compared to leading open-source JSON libraries.

The proposed JSON library will provide:

  • JSON encoding, allowing for single-pass encoding of custom data types –- in particular, for Elixir, integrating with a protocol through a thin layer (implemented outside of OTP);
  • JSON decoding with some streaming support allowing to decode messages that don’t fully fit into memory;
  • JSON decoding with support for decoding values split across separate messages without fully concatenating them upfront;
  • focus on high-performance encoding and decoding;
  • full conformance to RFC 8259 and ECMA 404 standards, the decoder should pass the entire JSONTestSuite;
  • simple API for common use-cases with canonical data type mapping.

Design choices #

Data mapping #

We propose, in the “canonical” API to map JSON data structues to Erlang and back in the following way:

Decoding from JSON Erlang Encoding into JSON
Number integer() | float() Number
Boolean true | false Boolean
Null null Null
String binary() String
  atom() String
Array list() Array
Object #{binary() => _} Object
  #{atom() => _} Object
  #{integer() => _} Object

Erlang has generally a richer value system than JSON, therefore there’s generally more types that can be encoded into JSON, even if they can never be produced directly by the decoder.

However, with the flexible API, as demonstrated below, the user will be able to customize the decoding & encoding routines to produce and consume any Erlang term as necessary in the particular application.

Note: A decode-encode rountrip might not produce the same data, even with custom decoders – since JSON has such a limited data-type options, compared to Erlang, some information will be commonly be lost, for example, coercing all keys in maps to binaries.

Streaming vs value-based parser #

When it comes to data-structure parsers it’s common to encounter two types: ones that given the data produce a complete parsed value, and others the same data produce a stream of events that can later be processed to extract values.

The first kind, which we’ll call here value-based, is generally simpler, usually more efficient, and more convenient to use. The second one offers unique advantages in specific use-cases: for example, where data can’t fully fit into memory.

For the proposed json library this EEP suggests a hybrid approach.

First, a simple, value-based API:

-type value() ::
    integer() |
    float() |
    boolean() |
    null |
    binary() |
    list(value()) |
    #{binary() => value()}.

-spec decode(binary()) -> value().

Error handling is achieved through exceptions. The following errors are possible:

-type error() ::
    unexpected_end |
    {unexpected_sequence, binary()} |
    {invalid_byte, byte()}

The exceptions might be enhanced through the Error Info mechanism with additional meta-data like byte offset where the error occurred.

For the advanced and customizable API, this EEP proposes a callback-based API that the decoder will use to produce values from the data it parses.

-type from_binary_fun() :: fun((binary()) -> dynamic()).
-type array_start_fun() :: fun((Acc :: dynamic()) -> ArrayAcc :: dynamic()).
-type array_push_fun() :: fun((Value :: dynamic(), Acc :: dynamic()) -> NewAcc :: dynamic()).
-type array_finish_fun() :: fun((ArrayAcc :: dynamic(), OldAcc :: dynamic()) -> {dynamic(), Acc :: dynamic()}).
-type object_start_fun() :: fun((Acc :: dynamic()) -> ObjectAcc :: dynamic()).
-type object_push_fun() :: fun((Key :: dynamic(), Value :: dynamic(), Acc :: dynamic()) -> NewAcc :: dynamic()).
-type object_finish_fun() :: fun((ObjectAcc :: dynamic(), OldAcc :: dynamic()) -> {dynamic(), Acc :: dynamic()}).

-type decoders() :: #{
    array_start => array_start_fun(),
    array_push => array_push_fun(),
    array_finish => array_finish_fun(),
    object_start => object_start_fun(),
    object_push => object_push_fun(),
    object_finish => object_finish_fun(),
    float => from_binary_fun(),
    integer => from_binary_fun(),
    string => from_binary_fun(),
    null => term()

-spec decode(binary(), Acc :: dynamic(), decoders()) ->
    {Value :: dynamic(), FinalAcc :: dynamic(), Rest :: binary()}.

This allows the user to fully customize the decoded format, including features seen in open-source JSON libraries:

  • decoding string keys as atoms;
  • decoding objects as lists of pairs;
  • decoding floats as custom structures with decimal precision;
  • decoding null as another atom, in particular undefined or nil;
  • using binary:copy/1 on strings that will be retained in memory;
  • decoding multiple JSON messages from a single binary blob;
  • and more.

Furthermore, this allows the user to only retain parts of the data structure to achieve results similar to using a streaming SAX-like parser for data that doesn’t fully fit into memory.

The array_finish and object_finish callbacks are responsible for restoring the accumulator to continue processing the parent object. To simplify the case where accumulators are not connected, these callbacks receive value of the accumulator that was passed to the corresponding _start call.

All the callbacks are optional and have a default value corresponding to the “simple” API behaviour, using lists as accumulators, in particular:

  • for array_start: fun(_) -> [] end
  • for array_push: fun(Elem, Acc) -> [Elem | Acc] end
  • for array_finish: fun(Acc, OldAcc) -> {lists:reverse(Acc), OldAcc} end
  • for object_start: fun(_) -> [] end
  • for object_push: fun(Key, Value, Acc) -> [{Key, Value} | Acc] end
  • for object_finish: fun(Acc, OldAcc) -> {maps:from_list(Acc), OldAcc} end
  • for float: fun erlang:binary_to_float/1
  • for integer: fun erlang:binary_to_integer/1
  • for string: fun (Value) -> Value end
  • for null: the atom null

Incomplete data parsing #

We propose a future enhancement to the full decode/3 API, where it can return an {incomplete, continuation()} value that can be used to decode values split across multiple binary blobs (for example as received from a TCP socket).

-spec decode_continue(binary(), continuation()) ->
    {Value :: dynamic(), FinalAcc :: dynamic(), Rest :: binary()} |
    {incomplete, continuation()}.

Encoding API #

For encoding this EEP again proposes two separate sets of APIs. A simple API using “canonical” data types:

-type encode_value() ::
    integer() |
    float() |
    boolean() |
    null |
    binary() |
    atom() |
    list(encode_value()) |
    #{binary() | atom() | integer() => encode_value()}.

-spec encode(encode_value()) -> iodata().

And an advanced, callback-based API allowing for single-pass encoding of custom data structures. This API is accompanied by a set of functions facilitating the implementation of custom encoding callbacks.

-type encoder() :: fun((dynamic(), encoder()) -> iodata()).

-spec encode(dynamic(), encoder()) -> iodata().

-spec encode_value(dynamic(), encoder()) -> iodata().
-spec encode_atom(atom(), encoder()) -> iodata().
-spec encode_integer(integer()) -> iodata().
-spec encode_float(float()) -> iodata().
-spec encode_list(list(), encoder()) -> iodata().
-spec encode_map(map(), encoder()) -> iodata().
-spec encode_map_checked(map(), encoder()) -> iodata().
-spec encode_key_value_list([{dynamic(), dynamic()}], encoder()) -> iodata().
-spec encode_key_value_list_checked([{dynamic(), dynamic()}], encoder()) -> iodata().
-spec encode_binary(binary()) -> iodata().
-spec encode_binary_escape_all(binary()) -> iodata().

The encoder() callback is invoked on every value during traversal. The simple API specified above is equivalent to using the fun json:encode_value/2 function as the encoder.

The *_checked/2 variants of functions offer verifying the encoder doesn’t produce repeated keys. The default encode_binary/1 function will emit unescaped unicode values as allowed by the specifications; however for compatibility reasons we provide the optional encode_binary_escape_all/1 function that will always produce purely ASCII messages encoding all higher unicode values with the \u escape sequences.

Formatting and pretty-printing #

This EEP further proposes an additional API for formatting (and pretty-printing) JSON messages. This API consists of transforming a textual JSON message into a formatted JSON message. This is the most flexible solution that orthogonally supports formatting results of custom encoding functions like described above, without adding the burden of complex formatting options in the middle of the encoders. Formatting isn’t usually done in critical hot-paths of high-performance services, therefore the overhead of a two-pass formatting is deemed acceptable.

-type format_option() :: #{
    indent => iodata(),
    line_separator => iodata(),
    after_colon => iodata()
-spec format(iodata()) -> iodata().
-spec format(iodata(), format_option()) -> iodata().

Reference Implementation #

PR-8111 Implements the encode/1, encode/2, decode/1, and decode/3 functions as proposed in this EEP. The formatting API and the support for incomplete message decoding is left as a follow-up task.

Appendix #

Example of a decoding trace #

Given the following data:

{"a": [[], {}, true, false, null, {"foo": "baz"}], "b": [1, 2.0, "three"]}

the decoding APIs will be called with the following arguments:

object_start(Acc0) => Acc1
  string(<<"a">>) => Str1
  array_start(Acc1) => Acc2
    empty_array() => Arr1
    array_push(Acc2, Arr1) => Acc3
    empty_object() => Obj1
    array_push(Obj1, Acc3) => Acc4
    array_push(true, Acc4) => Acc5
    array_push(false, Acc5) => Acc6
    null() => Null
    array_push(Null, Acc6) => Acc7
    object_start(Acc7) => Acc8
      string(<<"foo">>) => Str2
      string(<<"baz">>) => Str3
      object_push(Str2, Str3, Acc8) => Acc9
    object_finish(Acc9) => Obj2
    array_push(Obj2, Acc7) => Acc10
  array_finish(Acc10, Acc1) => {Arr1, Acc11}
  object_push(Arr1, Acc11) => Acc12
  string(<<"b">>) => Str4
  array_start(Acc12) => Acc13
    integer(<<"1">>) => Int1
    array_push(Int1, Acc13) => Acc14
    float(<<"2.0">>) => Float1
    array_push(Float1, Acc14) => Acc15
    string(<<"three">>) => Str5
    array_push(Str5, Acc15) => Acc16
  array_finish(Acc16, Acc12) => {Arr2, Acc17}
  object_push(Str4, Arr2, Acc17) => Acc18
object_finish(Acc18, Acc0) => {Obj3, Acc19}
% final decode/3 return
{Obj3, Acc19, <<"">>}

Example of a custom encoder #

An example of a custom encoder that would support using a heuristic to differentiate pairs of object-like key-value lists from plain lists of values could look as follows:

custom_encode(Value) -> json:encode(Value, fun encoder/2).

encoder([{_, _} | _] = Value, Encode) -> json:encode_key_value_list(Value, Encode);
encoder(Other, Encode) -> json:encode_value(Other, Encode).

Another encoder that supports using Elixir nil as Null and protocols for further customisation could look as follows:

encoder(nil, _Encode) -> <<"null">>;
encoder(null, _Encode) -> <<"\"null\"">>;
encoder(#{__struct__ => _} = Struct, Encode) -> 'Elixir.JSONProtocol':encode(Struct, Encode);
encoder(Other, Encode) -> json:encode_value(Other, Encode).

Copyright #

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.