<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://www.erlang.org/blog.xml" rel="self" type="application/atom+xml" /><link href="https://www.erlang.org/" rel="alternate" type="text/html" /><updated>2026-04-13T02:13:20+00:00</updated><id>https://www.erlang.org/blog.xml</id><title type="html">Erlang/OTP</title><subtitle>The official home of the Erlang Programming Language</subtitle><entry><title type="html">Erlang/OTP 28 Highlights</title><link href="https://www.erlang.org/blog/highlights-otp-28/" rel="alternate" type="text/html" title="Erlang/OTP 28 Highlights" /><published>2025-05-20T00:00:00+00:00</published><updated>2025-05-20T00:00:00+00:00</updated><id>https://www.erlang.org/blog/highlights-otp-28</id><content type="html" xml:base="https://www.erlang.org/blog/highlights-otp-28/"><![CDATA[<p>Erlang/OTP 28 is finally here. This blog post will introduce the new
features that we are most excited about.</p>

<p>A list of all changes is found in <a href="https://erlang.org/patches/OTP-28.0">Erlang/OTP 28 Readme</a>.
Or, as always, look at the release notes of the application you are interested in.
For instance:
<a href="https://www.erlang.org/doc/apps/erts/notes.html#erts-16.0">Erlang/OTP 28 - Erts Release Notes - Version 16.0</a>.</p>

<p>This year’s highlights mentioned in this blog post are:</p>

<ul>
  <li><a href="#priority-messages">Priority Messages</a></li>
  <li><a href="#improvements-of-comprehensions">Improvements of Comprehensions</a></li>
  <li><a href="#smarter-error-suggestions">Smarter Error Suggestions</a></li>
  <li><a href="#improvements-to-the-shell">Improvements to the Shell</a></li>
  <li><a href="#new-erlanghibernate0">New <code>erlang:hibernate/0</code></a></li>
  <li><a href="#warnings-for-use-of-old-style-catch">Warnings for Use of Old-style Catch</a></li>
  <li><a href="#pcre2">PCRE2</a></li>
  <li><a href="#optimizations-to-tls-13">Optimizations to TLS 1.3</a></li>
  <li><a href="#based-floating-point-literals">Based Floating Point Literals</a></li>
  <li><a href="#nominal-types">Nominal Types</a></li>
  <li><a href="#new-emacs-erlang-mode">New Emacs Erlang Mode</a></li>
</ul>

<h1 id="priority-messages">Priority Messages</h1>

<p>Sometimes, it is important for urgent messages to <em>skip the queue</em> and
be read by the receiving process as soon as possible. Erlang/OTP 28 introduces
priority messages, an opt-in mechanism that allows the receiving process
to let certain messages get priority status.</p>

<p>By default, all messages are inserted at the end of the message queue of
a process. This becomes a problem when the queue is long and an urgent
message needs to be read as soon as possible.</p>

<p>For example, the current message overload protection mechanism for <a href="https://www.erlang.org/doc/apps/kernel/logger.html"><code>logger</code></a>
polls its message queue length in order to know when it should start shedding
messages. It would have benefitted from using the <a href="https://www.erlang.org/doc/apps/erts/erlang.html#system_monitor/2"><code>long_message_queue</code></a>
monitoring functionality introduced in Erlang/OTP 27, but the only way to
get information like that is via a message, which would be inserted at the
end of the very long queue.</p>

<p>Priority messages solve this problem by letting selected messages be inserted
before all ordinary messages, but still in the order they are received.</p>

<p>A receiver process can allow other processes to send priority messages
to it in two simple steps. The first step is to create a process alias
using <a href="https://erlang.org/doc/apps/erts/erlang.html#alias/1"><code>alias/1</code></a>:</p>

<pre><code class="language-erlang">PrioAlias = alias([priority])
</code></pre>

<p>This alias can then be distributed to other processes that should be able
to send priority messages to the receiver process. A sender process can
send a priority message by using <a href="https://erlang.org/doc/apps/erts/erlang.html#send/3"><code>erlang:send/3</code></a>,
passing the <code>PrioAlias</code> as the first argument, and the option <code>priority</code> in
the option list as the third argument:</p>

<pre><code class="language-erlang">erlang:send(PrioAlias, Message, [priority])
</code></pre>

<p>In this way, messages sent to a priority alias with the <code>priority</code> flag will
be inserted before ordinary messages in the message queue. Other processes
can still send ordinary messages to the priority alias by omitting the
<code>priority</code> flag; such messages are treated as ordinary messages.</p>
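<p>The effect on ordering can be illustrated with a minimal sketch, where a
process sends both kinds of messages to itself (the message terms are
illustrative):</p>

<pre><code class="language-erlang">PrioAlias = alias([priority]),
self() ! ordinary,
erlang:send(PrioAlias, urgent, [priority]),
receive First -&gt; First end.
%% First =:= urgent: the priority message was inserted
%% before the ordinary message already in the queue.
</code></pre>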

<p>It is also possible to send an exit signal as a priority signal, like this:</p>

<pre><code class="language-erlang">exit(PrioAlias, Message, [priority])
</code></pre>

<p>If the receiver process wants to stop receiving priority messages, it
can do so by deactivating its priority alias:</p>

<pre><code class="language-erlang">true = unalias(PrioAlias)
</code></pre>

<p>After this, no priority messages can be sent to the receiver process, because
the priority alias is no longer active. The receiver process can create a new
priority alias and deactivate it again at any time.</p>

<p>Priority message reception can also be enabled for exit signals due to broken
links and messages triggered due to monitors. Since these signals are not
sent when a process calls a specific function for sending a signal, but
when specific events occur in the system, a priority alias cannot be used
for this. In order to enable such priority messages, you can pass the
<code>priority</code> option to either <a href="https://erlang.org/doc/apps/erts/erlang.html#monitor/3"><code>erlang:monitor/3</code></a>
or <a href="https://erlang.org/doc/apps/erts/erlang.html#link/2"><code>erlang:link/2</code></a>.</p>

<p>Priority messages respect Erlang’s existing guarantee: Signals still arrive
in the same order as they are sent, if they arrive at all. This change
only affects where messages are inserted in the queue. Performance-wise,
this feature preserves Erlang’s selective receive optimization. There is
no performance penalty for ordinary messages or priority messages.</p>

<p>For more details, see <a href="https://www.erlang.org/doc/system/ref_man_processes.html#priority-messages">the documentation of <em>priority messages</em></a>
and <a href="https://www.erlang.org/eeps/eep-0076">EEP-76</a>.</p>

<h1 id="improvements-of-comprehensions">Improvements of Comprehensions</h1>

<p>Erlang/OTP 28 introduces many useful updates to comprehensions. All
of them are new language features that were proposed as <a href="https://www.erlang.org/eeps">EEPs</a>.
Between the releases of Erlang/OTP 27 and 28, four EEPs related to
comprehensions were accepted. The features described by two of them are
included in Erlang/OTP 28; the other two are postponed to a later release.
The <a href="https://www.erlang.org/doc/system/expressions.html#comprehensions">documentation for comprehensions</a>
contains an up-to-date overview of all relevant features.</p>

<h2 id="strict-generators">Strict Generators</h2>

<p>Strict generators, as described in <a href="https://www.erlang.org/eeps/eep-0070">EEP 70</a>,
aim to improve the expressiveness and safety of comprehensions.</p>

<p>In OTP 27 and earlier, when the right-hand side expression does not match
the left-hand side pattern in a comprehension generator, the term is ignored
and evaluation continues. In the following example, the element
<code>error</code> is silently skipped in the comprehension.</p>

<pre><code class="language-erlang">1&gt; [X ||{ok, X} &lt;- [{ok, 1}, error, {ok, 3}]].
[1,3]
</code></pre>

<p>This behavior can hide the presence of unexpected elements in the input
data. In the example above, what if the list should not contain anything
other than 2-tuples with the first element being <code>ok</code>? By using a strict
generator, the comprehension crashes when the pattern-matching fails with
the element <code>error</code>.</p>

<pre><code class="language-erlang">2&gt; [X ||{ok, X} &lt;:- [{ok, 1}, error, {ok, 3}]].
** exception error: no match of right hand side value error
</code></pre>

<p>Strict generators can be used in list generators (<code>&lt;:-</code>), binary generators
(<code>&lt;:=</code>), and map generators (<code>&lt;:-</code>). In contrast, the previously existing
generators are called <em>relaxed</em> generators.</p>

<p>Strict generators and relaxed generators can convey different intentions from
the programmer. The following example is rewritten from a comprehension in
the Erlang linter. It finds all NIFs in an abstract form and outputs them.
Obviously, not all forms are NIFs, and we want to ignore the forms that are
not. Using a relaxed generator here is correct.</p>

<pre><code class="language-erlang">Nifs = [Args || {attribute, _Anno, nifs, Args} &lt;- Forms].
</code></pre>

<p>More examples about strict and relaxed generators can be found in
<a href="https://www.erlang.org/doc/system/list_comprehensions.html">List Comprehensions</a>.</p>

<p>Sometimes, either a strict or a relaxed generator will do. When the
left-hand side pattern is a fresh variable, pattern matching cannot fail,
so both kinds behave the same. While preferences and use cases vary, it is
recommended to use strict generators whenever either would do: defaulting
to strict generators aligns with Erlang’s “Let it crash” philosophy.</p>
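<p>For example, because the pattern <code>X</code> below always matches, the
relaxed and strict versions produce the same result:</p>

<pre><code class="language-erlang">1&gt; [X + 1 || X &lt;- [1, 2, 3]].
[2,3,4]
2&gt; [X + 1 || X &lt;:- [1, 2, 3]].
[2,3,4]
</code></pre>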

<p>Now you can pick a more fitting tool for the job, without losing the brevity
of comprehensions. It is also a good time to review old code, and see if
strict generators are more fitting in certain places. The compiler team in
OTP has done <a href="https://github.com/erlang/otp/pull/9004">that</a>. Take a look
if you are curious.</p>

<h2 id="zip-generators">Zip Generators</h2>

<p>Zip generators as described in <a href="https://www.erlang.org/eeps/eep-0073">EEP 73</a>
makes it easier to iterate over multiple lists, binaries, or maps in parallel.</p>

<p>By default, multiple generators in an Erlang comprehension are combined
in a nested, Cartesian way:</p>

<pre><code class="language-erlang">1&gt; [{X, Y} || X &lt;- [1, 2], Y &lt;- [a, b]].
[{1,a},{1,b},{2,a},{2,b}]
</code></pre>

<p>Using zip generators <code>&amp;&amp;</code>, we can change the default behavior and “zip”
generators together as if using <a href="https://erlang.org/doc/apps/stdlib/lists.html#zip/2"><code>lists:zip/2</code></a>:</p>

<pre><code class="language-erlang">2&gt; [{X, Y} || X &lt;- [1, 2] &amp;&amp; Y &lt;- [a, b]].
[{1,a},{2,b}]
</code></pre>

<p>Zip generators can be used with lists, binaries, and maps, and can be
mixed freely with all existing generators and filters. Unlike <a href="https://erlang.org/doc/apps/stdlib/lists.html#zip/2"><code>lists:zip/2</code></a>
and <a href="https://erlang.org/doc/apps/stdlib/lists.html#zip/3"><code>lists:zip/3</code></a>, you
can zip any number of generators together using <code>&amp;&amp;</code>. The compiler avoids
creating intermediate tuples while preserving the same error behavior as
these helper functions.</p>
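<p>For example, a sketch zipping three list generators:</p>

<pre><code class="language-erlang">3&gt; [{X, Y, Z} || X &lt;- [1, 2] &amp;&amp; Y &lt;- [a, b] &amp;&amp; Z &lt;- ["u", "v"]].
[{1,a,"u"},{2,b,"v"}]
</code></pre>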

<h1 id="smarter-error-suggestions">Smarter Error Suggestions</h1>

<p>The Erlang/OTP 28 compiler has levelled up its ability to spot typos.
It now suggests fixes whenever possible.</p>

<p>For example, the following code exports an undefined function <code>bar/1</code>.</p>

<pre><code class="language-erlang">-export([bar/1]).
baz(X) -&gt; X.
</code></pre>

<p>The Erlang/OTP 27 compiler correctly points out the undefined function.</p>

<pre><code class="language-erlang">t.erl:3:2: function bar/1 undefined
%   3| -export([bar/1]).
%    |  ^
</code></pre>

<p>The Erlang/OTP 28 compiler goes one step further. It suggests a possible
correction, based on the functions defined in the module.</p>

<pre><code class="language-erlang">t.erl:3:2: function bar/1 undefined, did you mean baz/1?
%   3| -export([bar/1]).
%    |  ^
</code></pre>

<p>This applies to common error types, like <code>undefined_nif</code>, <code>unbound_var</code>,
<code>undefined_function</code>, <code>undefined_record</code>, and so on.</p>

<p>It also works for wrong arity. If you call a function with the wrong number
of arguments, the compiler will suggest available arities, like the following:</p>

<pre><code class="language-erlang">t.erl:6:12: function bar/2 undefined, did you mean bar/1,3,4?
</code></pre>

<p>This makes compilation errors easier to understand, and small mistakes
faster to fix. Try it out and you’ll notice the change!</p>

<h1 id="improvements-to-the-shell">Improvements to the Shell</h1>

<p>Erlang/OTP 28 brings several improvements to the shell interface, making
it more flexible, interactive and powerful than before.</p>

<h2 id="lazy-reads-from-stdin">Lazy Reads from <code>stdin</code></h2>

<p>Previously, Erlang’s <code>stdin</code> greedily read all input data, which could cause
problems with special characters. This was changed by <a href="https://github.com/erlang/otp/pull/8962">PR-8962</a>.
In Erlang/OTP 28, reads from <code>stdin</code> happen on demand, that is,
only when <a href="https://www.erlang.org/doc/apps/stdlib/io.html#get_line/2"><code>io:get_line/2</code></a>
or an equivalent function is called. This removes the need for the <code>-noinput</code> flag
and resolves issues like <a href="https://github.com/erlang/otp/issues/8113">Issue-8113</a>.</p>
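<p>For example, the following escript (a minimal sketch) echoes <code>stdin</code>
line by line; each line is read only when <code>io:get_line/1</code> is called:</p>

<pre><code class="language-erlang">#!/usr/bin/env escript
%% echo.es: read stdin line by line, on demand.
main(_Args) -&gt;
    case io:get_line("") of
        eof -&gt; ok;
        Line -&gt; io:put_chars(Line), main([])
    end.
</code></pre>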

<h2 id="raw-and-cooked-modes-for-noshell">Raw and Cooked Modes for <code>noshell</code></h2>

<p>The <code>noshell</code> mode now supports two “submodes”:</p>
<ul>
  <li><code>cooked</code> is the default behavior, same as before.</li>
  <li><code>raw</code> is a new option that bypasses the line editing support of the
native terminal.</li>
</ul>

<p>In <code>raw</code> mode, you can build more interactive applications. It makes
it possible to read keystrokes as they happen, without the user pressing Enter,
while disabling line editing and echoing to <code>stdout</code>. The
following example is an escript that reads raw input (and immediately
prints it back out) without requiring the user to press Enter:</p>

<pre><code class="language-erlang">#!/usr/bin/env escript
%% t.es

main(_Args) -&gt;
    shell:start_interactive({noshell, raw}),
    io:format("Press any key, or press q to quit.\n"),
    loop().

loop() -&gt;
    case io:get_chars("", 1024) of
        "q" -&gt;
            io:format("Exit now.\n");
        {error, Reason} -&gt;
            %% This clause must come before the catch-all
            %% Chars clause, or it could never match.
            io:format("Error reason: ~p~n", [Reason]);
        eof -&gt;
            ok;
        Chars -&gt;
            io:format("~p", [Chars]),
            loop()
    end.
</code></pre>

<p>With this, Erlang’s shell becomes a platform for building interactive
terminal applications.
The <a href="https://www.erlang.org/doc/apps/stdlib/custom_shell.html">custom shell</a>
documentation shows how to create a custom shell. The <a href="https://www.erlang.org/doc/apps/stdlib/terminal_interface.html">terminal interface</a>
documentation shows how to implement a tic-tac-toe game.</p>

<p>Try it out. We look forward to seeing more interactive applications created
using this feature.</p>

<h2 id="using-fun-namearity-to-create-funs-in-shell">Using <code>fun Name/Arity</code> to create funs in shell</h2>

<p>Thanks to <a href="https://github.com/erlang/otp/pull/8987">PR-8987</a>, you can
now use <code>fun Name/Arity</code> to create funs in the shell. The fun can be created
from an auto-imported BIF, such as <a href="https://www.erlang.org/doc/apps/erts/erlang.html#is_atom/1"><code>is_atom/1</code></a>, as in the example below.</p>

<pre><code class="language-erlang">1&gt; F = fun is_atom/1.
fun erlang:is_atom/1
2&gt; F(a).
true
3&gt; F(42).
false
</code></pre>

<p>The fun can also be created from a local function defined in the shell. Note
that the fun can be created before the function itself is defined; the function
is looked up when the fun is called, as in the following example.</p>

<pre><code class="language-erlang">1&gt; I = fun id/1.
#Fun&lt;erl_eval.42.18682967&gt;
2&gt; I(42).
** exception error: undefined shell command id/1
3&gt; id(I) -&gt; I.
ok
4&gt; I(42).
42
</code></pre>

<h1 id="new-erlanghibernate0">New <code>erlang:hibernate/0</code></h1>

<p>Erlang/OTP 28 introduces a new <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>
function. This built-in function puts the calling process into a wait state
where its memory footprint is reduced as much as possible. When the process
receives its next message, it will wake up. Unlike the existing <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/3"><code>erlang:hibernate/3</code></a>,
it does not discard the call stack.</p>

<p>This makes <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>
useful for processes that expect long idle times but want a simpler form
of hibernation.</p>
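<p>For example, a server loop could hibernate after a period of inactivity
without restructuring its code. A minimal sketch, where <code>handle/2</code>
is a hypothetical helper:</p>

<pre><code class="language-erlang">loop(State) -&gt;
    receive
        Msg -&gt; loop(handle(Msg, State))
    after 60_000 -&gt;
        %% Idle for a minute: shrink the memory footprint and
        %% wait. hibernate/0 returns when the next message
        %% arrives, and the call stack is kept.
        erlang:hibernate(),
        loop(State)
    end.
</code></pre>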

<h2 id="memory-usage-experiment">Memory Usage Experiment</h2>

<p>To demonstrate how efficient <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>
is, we can write a benchmark that spawns different numbers of processes
(starting from 1, going up to 1 million), lets them wait for a message
either in a plain <code>receive</code> or in <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>,
and then compares memory usage.</p>

<p>Here’s the test code for the first scenario, which uses <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>:</p>

<pre><code class="language-erlang">-module(benchmark_hibernate).
-export([worker/0, spawn_all/1]).

worker() -&gt;
    erlang:hibernate().

spawn_all(0) -&gt;
    timer:sleep(1000),
    io:format("Memory usage: ~p~n", [erlang:memory()]),
    timer:sleep(1000),
    io:format("Memory usage after 1s: ~p~n", [erlang:memory()]),
    ok;
spawn_all(N) -&gt;
    spawn(?MODULE, worker, []),
    spawn_all(N-1).
</code></pre>

<p>Here’s the test code for the second scenario. Processes stay idle but they
do not hibernate:</p>

<pre><code class="language-erlang">-module(benchmark_receive).
-export([worker2/0, spawn_all/1]).

worker2() -&gt;
    receive
        _  -&gt; ok
    end.

spawn_all(0) -&gt;
    timer:sleep(1000),
    io:format("Memory usage: ~p~n", [erlang:memory()]),
    timer:sleep(1000),
    io:format("Memory usage after 1s: ~p~n", [erlang:memory()]),
    ok;
spawn_all(N) -&gt;
    spawn(?MODULE, worker2, []),
    spawn_all(N-1).
</code></pre>

<p>Memory usage is measured by <a href="https://www.erlang.org/doc/apps/erts/erlang.html#memory/0"><code>erlang:memory()</code></a>
after all processes have been spawned. For the final result, we take
the average of the two measurements.</p>

<p>We spawn 1, 10 thousand, 100 thousand, and 1 million processes for both
scenarios. Results are summarized in the following table:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Number of Processes</th>
      <th>Memory Used (MB)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Hibernated</td>
      <td>1</td>
      <td>44.8</td>
    </tr>
    <tr>
      <td>Without <code>hibernate/0</code></td>
      <td>1</td>
      <td>47.0</td>
    </tr>
    <tr>
      <td>Hibernated</td>
      <td>10,000</td>
      <td>55.5</td>
    </tr>
    <tr>
      <td>Without <code>hibernate/0</code></td>
      <td>10,000</td>
      <td>73.4</td>
    </tr>
    <tr>
      <td>Hibernated</td>
      <td>100,000</td>
      <td>130.3</td>
    </tr>
    <tr>
      <td>Without <code>hibernate/0</code></td>
      <td>100,000</td>
      <td>307.1</td>
    </tr>
    <tr>
      <td>Hibernated</td>
      <td>1,000,000</td>
      <td>827.9</td>
    </tr>
    <tr>
      <td>Without <code>hibernate/0</code></td>
      <td>1,000,000</td>
      <td>2687.1</td>
    </tr>
  </tbody>
</table>

<p>With only 1 process, the memory usage reduction is not yet apparent.
With 1 million mostly idle processes, using <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>
reduces memory usage by close to 70%!</p>

<h1 id="warnings-for-use-of-old-style-catch">Warnings for Use of Old-Style Catch</h1>

<p>Erlang/OTP 28 introduces a warning for using the old-style <code>catch Expr</code>
instead of <code>try ... catch ... end</code>.</p>

<p>The simpler <code>catch Expr</code> is problematic in that it catches
<em>all</em> exceptions and can therefore hide bugs. For example, if the
intention is to catch exceptions raised by <a href="https://www.erlang.org/doc/apps/erts/erlang.html#throw/1"><code>throw/1</code></a>,
the old-style <code>catch</code> will also catch runtime errors. Its alternative,
<code>try ... catch ... end</code>, offers better clarity.</p>

<p>In a future release, the use of the old <code>catch</code> construct will by
default result in compiler warnings. To facilitate removing usages of
the old-style <code>catch</code>, the compiler now has an option
<code>warn_deprecated_catch</code>. It can be enabled at the project level or the
module level to prevent new uses of the old-style catch.</p>

<p>If you have added <code>warn_deprecated_catch</code> at the project level, the
warning can be suppressed in individual modules that have not yet been
updated by adding <code>-compile(nowarn_deprecated_catch)</code> to them.</p>
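<p>For example, the option can be enabled on the command line or per
module (a sketch):</p>

<pre><code class="language-erlang">%% On the command line:
%%     erlc +warn_deprecated_catch *.erl

%% In a module that should warn for old-style catch:
-compile(warn_deprecated_catch).

%% In a module that has not yet been updated:
-compile(nowarn_deprecated_catch).
</code></pre>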

<p>Here are some common uses of the old-style <code>catch Expr</code>. We will show how
to replace them with <code>try ... catch ... end</code> and briefly explain why that is
a better solution.</p>

<p><em>Example 1</em>: Using <code>catch Expr</code> to handle a possible <code>throw</code></p>

<p><a href="https://www.erlang.org/doc/apps/erts/erlang.html#throw/1"><code>throw/1</code></a> is
often used to quickly return from a deep recursion. If <code>tree_walker/1</code> is
a function that traverses a tree and sometimes throws a value, the old-style
catch could be used like this:</p>

<pre><code class="language-erlang">Result = catch maybe_throw().
</code></pre>

<p>It can be refactored to the following code:</p>

<pre><code class="language-erlang">Result = try tree_walker(Tree) of
             Value -&gt; Value
         catch
             throw:Reason -&gt; Reason
         end.
</code></pre>

<p>This is a bit longer, but it is also safer. For example, if the caller
of <code>tree_walker/1</code> passes in an invalid tree (such as <code>not_a_tree</code>),
the <code>try/catch</code> will not catch the resulting crash, allowing the
bug to be noticed and fixed early.</p>

<p>To have the same assurance that crashes are not hidden when using the
old-style <code>catch</code>, you would have to write the following, which is as
much code as the new <code>try</code>/<code>catch</code>:</p>

<pre><code class="language-erlang">Result = case catch tree_walker(Tree) of
            {'EXIT',Error} -&gt;
                 error(Error);
            Value -&gt;
                 Value
         end.
</code></pre>

<p><em>Example 2</em>: Using <code>catch Expr</code> to match a specific error in a test case</p>

<pre><code class="language-erlang">test_bad_argument(Term) -&gt;
    {'EXIT',{badarg,_}} = (catch list_to_atom(Term)).
</code></pre>

<p>It can be refactored to the following code:</p>

<pre><code class="language-erlang">test_bad_argument(Term) -&gt;
    try list_to_atom(Term) of
        _Value -&gt; error(not_supposed_to_succeed)
    catch
        error:badarg -&gt; ok
    end.
</code></pre>

<p>An easier way is to include the following header file:</p>

<pre><code class="language-erlang">-include_lib("stdlib/include/assert.hrl").
</code></pre>

<p>With that in place, you can simply write:</p>

<pre><code class="language-erlang">test_bad_argument(Term) -&gt;
    ?assertError(badarg, list_to_atom(Term)).
</code></pre>

<p>That will also result in more information being given if the test case
fails:</p>

<pre><code class="language-erlang">1&gt; t:test_bad_argument("ok").
** exception error: {assertException,[{module,t},
                                      {line,6},
                                      {expression,"list_to_atom ( Term )"},
                                      {pattern,"{ error , badarg , [...] }"},
                                      {unexpected_success,ok}]}
     in function  t:test_bad_argument/1 (t.erl:6)
</code></pre>

<p>It is likely that the compiler will start generating warnings for the
old-style <code>catch</code> in Erlang/OTP 29 or 30. If you are still using the
old-style <code>catch Expr</code> in your code, now is a good time to start
refactoring.</p>

<h1 id="based-floating-point-literals">Based Floating Point Literals</h1>

<p>Erlang/OTP 28 extends the floating point syntax to support floating point
literals in any base, similar to Ada and C99/C++17. This is based on
<a href="https://www.erlang.org/eeps/eep-0075">EEP-75</a>.</p>

<p>In Erlang, you can already write integers in different bases:</p>

<pre><code class="language-erlang">1&gt; 2#100.
4
2&gt; 3#100.
9
</code></pre>

<p>Now, you can do the same with floating point numbers:</p>

<pre><code class="language-erlang">1&gt; 2#0.011.
0.375
2&gt; 3#0.011.
0.14814814814814814
3&gt; 16#0.011#e5.
4352.0
</code></pre>

<p>Such an exact representation of floating point numbers is especially useful
in code-generating tools. With only the base 10 representation, it is difficult
to convert floats from and to other bases without precision loss. With
based literals, you can even preserve bit-level precision. For example,
<code>2#0.10101#e8</code> represents the exact layout of a binary float.</p>
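<p>To check the arithmetic: <code>2#0.10101</code> is 0.65625, and the exponent
scales it by 2^8 = 256:</p>

<pre><code class="language-erlang">4&gt; 2#0.10101#e8.
168.0
</code></pre>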

<h1 id="pcre2">PCRE2</h1>

<p>Erlang/OTP 28 uplifts the <a href="https://www.erlang.org/doc/apps/stdlib/re.html"><code>re</code></a>
module to use PCRE2 instead of the PCRE library. This change is mostly
backward compatible with PCRE with respect to regular expression syntax,
but it also introduces some behavioral differences.</p>

<p>The full documentation about breaking changes and incompatibilities can
be found in <a href="https://www.erlang.org/doc/apps/stdlib/re_incompat.html">PCRE2 Migration</a>.</p>

<h2 id="why-pcre2-instead-of-pcre">Why PCRE2 instead of PCRE?</h2>

<p>PCRE2 is more in line with modern standards, especially Perl regular
expressions. It is stricter about pattern syntax and catches invalid patterns
early. This makes your regex code safer, at the cost of breaking some old
regex patterns.</p>

<h2 id="notable-changes">Notable Changes:</h2>

<ul>
  <li>Stricter Syntax Validation: For example, <code>\i</code>, <code>\M</code>, and <code>\8</code> all result
in errors.</li>
</ul>

<pre><code class="language-erlang">% Erlang/OTP 27
1&gt; re:run("AMM", ~S"\M").
{match,[{1,1}]}

% Erlang/OTP 28
1&gt; re:run("AMM", ~S"\M").
** exception error: bad argument
     in function  re:run/2
        called as re:run("AMM",~S"\M")
        *** argument 2: could not parse regular expression
                        unrecognized character follows \ on character 1
</code></pre>

<ul>
  <li>
    <p>Unicode Property Updates: Characters matched by properties using <code>\p{...}</code>
may have changed, according to the updated Unicode character property data.</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/apps/stdlib/re.html#split/3"><code>re:split/3</code></a>
with Branch Reset Groups (<code>(?|...)</code>): The following example may evaluate to
<code>[[],"abc",[],[]]</code> in some interpretations of PCRE and Perl versions,
differing from PCRE2’s result.</p>
  </li>
</ul>

<pre><code class="language-erlang">1&gt; re:split("abcabc", ~S"(?|(abc)|(xyz))\1", [{return, list}]).
[[],"abc",[]]
</code></pre>

<p>It is worth noting that the internal format produced by <a href="https://www.erlang.org/doc/apps/stdlib/re.html#compile/2"><code>re:compile/2</code></a>
has changed in Erlang/OTP 28. It cannot be reused across nodes or OTP versions.</p>
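<p>If compiled patterns are cached or stored, a safe approach is to keep the
pattern source and recompile it locally (a minimal sketch):</p>

<pre><code class="language-erlang">%% Store or transmit the pattern source, not the term
%% returned by re:compile/1.
Source = "[a-z]+",
{ok, MP} = re:compile(Source),
{match, _} = re:run("hello", MP).
</code></pre>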

<p>This upgrade offers better long-term maintainability, but you may need to
test your existing regex code before upgrading.</p>

<h1 id="optimizations-to-tls-13">Optimizations to TLS 1.3</h1>

<p>The performance of <a href="https://www.erlang.org/doc/apps/ssl/ssl.html">SSL</a>
with TLS 1.3 has been optimized. The optimization reduces the general
overhead for application data transmission. To measure the improvement
from Erlang/OTP 27.1 to Erlang/OTP 28, we ran a small message echo benchmark
and measured the time for roundtrips.</p>

<p>Results are shown in the following table:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Samples</th>
      <th>Average</th>
      <th>Std Dev</th>
      <th>Median</th>
      <th>P99</th>
      <th>Time per Iteration</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Erlang/OTP 28</td>
      <td>25</td>
      <td>65186</td>
      <td>5.87%</td>
      <td>66828</td>
      <td>68749</td>
      <td>38352 ns</td>
    </tr>
    <tr>
      <td>Erlang/OTP 27</td>
      <td>25</td>
      <td>51730</td>
      <td>4.64%</td>
      <td>51418</td>
      <td>57296</td>
      <td>48328 ns</td>
    </tr>
  </tbody>
</table>

<p>In general, you can expect a 15% - 25% speed-up in Erlang/OTP 28 if you
are using TLS 1.3. No changes are needed in your code. If your application
uses TLS 1.3, this is a good reason to upgrade to Erlang/OTP 28.</p>

<h1 id="nominal-types">Nominal Types</h1>

<p>Nominal type-checking as described in <a href="https://www.erlang.org/eeps/eep-0069">EEP 69</a>
adds an alternative type system to Dialyzer. Nominal types can be declared
using the syntax <code>-nominal</code>. The main use case of nominal types is to prevent
accidental misuse of types with the same structure.</p>

<p>To start with, we can declare two nominal types <code>meter/0</code> and <code>foot/0</code>
like the following:</p>

<pre><code class="language-erlang">-nominal meter() :: integer().
-nominal foot() :: integer().
</code></pre>

<p>Because <code>meter/0</code> and <code>foot/0</code> have different names and they are both nominal
types, they are not compatible. Dialyzer performs nominal type-checking
on input and output types of functions and specifications. For example,
we can define functions <code>int_to_meter/1</code> and <code>foo/0</code> like the following:</p>

<pre><code class="language-erlang">-spec int_to_meter(integer()) -&gt; meter().
int_to_meter(X) -&gt; X.

-spec foo() -&gt; foot().
foo() -&gt; int_to_meter(24).
</code></pre>

<p>The specification of <code>int_to_meter/1</code> declares the function’s return type
to be <code>meter()</code>, so the result of <code>int_to_meter(24)</code> has type <code>meter()</code>.
However, the specification of <code>foo/0</code> declares the function’s return type
to be <code>foot()</code>. The two nominal types are not compatible. Therefore, Dialyzer
raises the following warning for our example:</p>

<pre><code class="language-erlang">Invalid type specification for function foo/0.
The success typing is foo() -&gt; (meter() :: integer())
But the spec is foo() -&gt; foot()
The return types do not overlap
</code></pre>

<p>On the other hand, a nominal type is compatible with a non-opaque, non-nominal
type with the same structure. We can define the function <code>return_integer/0</code>
like this:</p>

<pre><code class="language-erlang">-spec return_integer() -&gt; integer().
return_integer() -&gt; int_to_meter(24).
</code></pre>

<p>The specification says that <code>return_integer/0</code> returns an <code>integer()</code>.
The result of <code>int_to_meter(24)</code> has type <code>meter()</code>, so
<code>return_integer/0</code> actually returns a <code>meter()</code>. Since <code>integer()</code>
is not a nominal type and the structure of <code>meter()</code> is compatible
with <code>integer()</code>, Dialyzer can analyze the function above without
raising a warning.</p>

<p>There are exceptions to the nominal type-checking rules shown above. For more
details, see <a href="https://www.erlang.org/doc/system/nominals.html">Nominals</a> in the
reference manual.</p>

<h1 id="new-emacs-erlang-mode">New Emacs Erlang Mode</h1>

<p>Although this is not included in the Erlang/OTP 28 release, members of the OTP
team are developing a new Emacs Erlang mode using tree-sitter. If you are an
Emacs user, you can get it from <a href="https://github.com/erlang/emacs-erlang-ts">GitHub</a>
or <a href="https://melpa.org/#/erlang-ts">MELPA</a> and try it out.</p>

<p>The new Erlang mode handles strings and documentation a lot better than the
old one. See the screenshot below for an example:</p>

<p><img src="/blog/images/28-emacs.png" alt="Source Code of `ssl:send/2` in the New Emacs Mode" /></p>

<p>If you are interested in contributing to this project, all help is appreciated.</p>]]></content><author><name>Isabell Huang</name></author><category term="erlang" /><category term="otp" /><category term="28" /><category term="release" /><summary type="html"><![CDATA[Erlang/OTP 28 is finally here. This blog post will introduce the new features that we are most excited about.]]></summary></entry><entry><title type="html">Erlang/OTP 27 Highlights</title><link href="https://www.erlang.org/blog/highlights-otp-27/" rel="alternate" type="text/html" title="Erlang/OTP 27 Highlights" /><published>2024-05-20T00:00:00+00:00</published><updated>2024-05-20T00:00:00+00:00</updated><id>https://www.erlang.org/blog/highlights-otp-27</id><content type="html" xml:base="https://www.erlang.org/blog/highlights-otp-27/"><![CDATA[<p>Erlang/OTP 27 is finally here. This blog post will introduce the new
features that we are most excited about.</p>

<p>A list of all changes is found in <a href="https://erlang.org/patches/OTP-27.0">Erlang/OTP 27 Readme</a>.
Or, as always, look at the release notes of the application you are interested in.
For instance:
<a href="https://www.erlang.org/doc/apps/erts/notes.html#erts-15.0">Erlang/OTP 27 - Erts Release Notes - Version 15.0</a>.</p>

<p>This year’s highlights mentioned in this blog post are:</p>

<ul>
  <li><a href="#overhauled-documentation-system">Overhauled documentation system</a></li>
  <li><a href="#triple-quoted-strings">Triple-Quoted strings</a></li>
  <li><a href="#sigils">Sigils</a></li>
  <li><a href="#no-need-to-enable-feature-maybe">No need to enable feature <code>maybe</code></a></li>
  <li><a href="#the-new-json-module">The new <code>json</code> module</a></li>
  <li><a href="#process-labels">Process labels</a></li>
  <li><a href="#new-functionality-in-stdlib">New functionality in STDLIB</a></li>
  <li><a href="#new-ssl-client-side-stapling-support">New SSL client-side stapling support</a></li>
  <li><a href="#tprof-yet-another-profiling-tool"><code>tprof</code>: Yet another profiling tool</a></li>
  <li><a href="#multiple-trace-sessions">Multiple trace sessions</a></li>
  <li><a href="#native-coverage-support">Native coverage support</a></li>
  <li><a href="#deprecating-archives">Deprecating archives</a></li>
</ul>

<h1 id="overhauled-documentation-system">Overhauled documentation system</h1>

<p>The Erlang/OTP documentation before Erlang/OTP 27 was authored in
<a href="https://en.wikipedia.org/wiki/XML">XML</a>, from which the
<a href="https://www.erlang.org/docs/26/apps/erl_docgen/">Erl_Docgen</a>
application could generate HTML web pages, PDFs, or Unix man pages.
The reason for generating PDFs is that the documentation used to be
printed as
<a href="https://erlangforums.com/t/old-printed-otp-documentation-cover/1989/2">actual paper books</a>.
The last time the books were printed was for Erlang/OTP R7, released in 2000.</p>

<p>As an example, here is the XML code for
<a href="https://www.erlang.org/docs/26/man/lists#duplicate-2"><code>lists:duplicate/2</code></a>
from Erlang/OTP 26:</p>

<pre><code class="language-xml">    &lt;func&gt;
      &lt;name name="duplicate" arity="2" since=""/&gt;
      &lt;fsummary&gt;Make &lt;c&gt;N&lt;/c&gt; copies of element.&lt;/fsummary&gt;
      &lt;desc&gt;
        &lt;p&gt;Returns a list containing &lt;c&gt;&lt;anno&gt;N&lt;/anno&gt;&lt;/c&gt; copies of term
          &lt;c&gt;&lt;anno&gt;Elem&lt;/anno&gt;&lt;/c&gt;.&lt;/p&gt;
        &lt;p&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/p&gt;
        &lt;pre&gt;
&gt; &lt;input&gt;lists:duplicate(5, xx).&lt;/input&gt;
[xx,xx,xx,xx,xx]&lt;/pre&gt;
      &lt;/desc&gt;
    &lt;/func&gt;
</code></pre>

<p>The XML code was stored in separate files, not in the source
code. When building the documentation, the function specs from the
source code would be combined with the text from the documentation
file. It was the responsibility of the writer to ensure that variables
mentioned in the documentation body matched the names in the function
spec.</p>

<p>One thing never said about Erl_Docgen and the old documentation system
was that it made writing documentation enjoyable and effortless. That
was one thing we wanted to change with the new documentation system.
We wanted to make it fun to write documentation, or at least to
require less attention to tedious details such as using XML tags
correctly.</p>

<p>In Erlang/OTP 27, the documentation is written in
<a href="https://en.wikipedia.org/wiki/Markdown">Markdown</a> and is placed in
the source code before the function spec and implementation. Here is
the documentation and implementation of
<a href="https://www.erlang.org/doc/man/lists#duplicate/2"><code>lists:duplicate/2</code></a>
in Erlang/OTP 27:</p>

<pre><code>-doc """
Returns a list containing `N` copies of term `Elem`.

_Example:_

```erlang
&gt; lists:duplicate(5, xx).
[xx,xx,xx,xx,xx]
```
""".

-spec duplicate(N, Elem) -&gt; List when
      N :: non_neg_integer(),
      Elem :: T,
      List :: [T],
      T :: term().

duplicate(N, X) when is_integer(N), N &gt;= 0 -&gt; duplicate(N, X, []).

duplicate(0, _, L) -&gt; L;
duplicate(N, X, L) -&gt; duplicate(N-1, X, [X|L]).
</code></pre>

<p>The documentation is placed in a
<a href="#triple-quoted-strings">triple-quoted string</a>
following
the <a href="https://www.erlang.org/eeps/eep-0059"><code>-doc</code> attribute</a>.</p>

<p>Having the documentation near the spec makes it easy to ensure that
the text refers to variables defined in the function spec.</p>

<p>Another goal we had was to replace Erl_Docgen with a tool more widely
used so that we wouldn’t have to carry the entire burden for
maintaining it. We did that by using
<a href="https://hexdocs.pm/ex_doc/readme.html">ExDoc</a>, which is also used by
the <a href="https://elixir-lang.org">Elixir</a> language and most, if not all,
Elixir projects.</p>

<p>An issue that arose is whether it’s advisable to include user
documentation within the source code. Wouldn’t this make it much harder
to maintain the code?</p>

<p>I don’t claim to have a universal response to that concern, but in the
case of Erlang/OTP, most actively developed code exists within modules
lacking documentation. Typically, OTP applications consist of one or
a few modules containing the documented API, while the bulk of the
implementation is found in other modules.</p>

<p>For example, the interface to the Erlang compiler is found in the
<a href="https://www.erlang.org/doc/man/compile">compile</a> module, while most
of the code being executed resides in one of the other 59 modules
of the Compiler application. Similarly, the <a href="https://www.erlang.org/doc/apps/ssl">SSL
application</a> comprises 76 modules,
of which merely four contain documentation.</p>

<p>Another application that is frequently updated is
<a href="https://www.erlang.org/doc/apps/erts">ERTS</a>. However, most of ERTS is
implemented in C (and some C++), while much of the actual
Erlang code within ERTS is located in modules without documentation.</p>

<p>There are, of course, some exceptions to how applications are
structured, for example the STDLIB application, where most modules are
documented. However, STDLIB is a mature application that is updated
relatively infrequently.</p>

<h1 id="triple-quoted-strings">Triple-Quoted strings</h1>

<p>To facilitate writing documentation attributes containing many lines
of text, triple-quoted strings as described in <a href="https://www.erlang.org/eeps/eep-0064">EEP
64</a> have been
implemented. Triple-quoted strings come in handy whenever one needs
to include multiple lines of text in Erlang source code. For example,
assume that we want to define a function that outputs some
quotations:</p>

<pre><code>1&gt; t:quotes().
"I always have a quotation for everything -
it saves original thinking." - Dorothy L. Sayers

"Real stupidity beats artificial intelligence every time."
- Terry Pratchett
ok
</code></pre>

<p>In Erlang/OTP 26, there are several different ways to do that, but none
of them is particularly satisfying. For example, the text can be put into a
single string:</p>

<pre><code class="language-erlang">quotes() -&gt;
    S = "\"I always have a quotation for everything -
it saves original thinking.\" - Dorothy L. Sayers

\"Real stupidity beats artificial intelligence every time.\"
- Terry Pratchett\n",
    io:put_chars(S).
</code></pre>

<p>This works, but is ugly. We must also remember to escape every quote
character.</p>

<p>A cleaner way is to use multiple strings, one for each line, letting
the compiler combine them:</p>

<pre><code class="language-erlang">quotes() -&gt;
    S = "\"I always have a quotation for everything -\n"
        "it saves original thinking.\" - Dorothy L. Sayers\n"
        "\n"
        "\"Real stupidity beats artificial intelligence every time.\"\n"
        "- Terry Pratchett\n",
    io:put_chars(S).
</code></pre>

<p>That is a little bit nicer, but we’ll need to type more quote characters
and we must not forget to add <code>\n</code> at the end of each string. To
make sure that we don’t forget to insert the newlines, we could delegate
that mundane chore to the computer:</p>

<pre><code class="language-erlang">quotes() -&gt;
    S = ["\"I always have a quotation for everything -",
         "it saves original thinking.\" - Dorothy L. Sayers",
         "",
         "\"Real stupidity beats artificial intelligence every time.\"",
         "- Terry Pratchett"],
    io:put_chars(lists:join("\n", S)),
    io:nl().
</code></pre>

<p>In Erlang/OTP 27, we can use a triple-quoted string:</p>

<pre><code>quotes() -&gt;
    S = """
        "I always have a quotation for everything -
        it saves original thinking." - Dorothy L. Sayers

        "Real stupidity beats artificial intelligence every time."
        - Terry Pratchett
        """,
    io:put_chars(S),
    io:nl().
</code></pre>

<p>The ending <code>"""</code> determines how much each line in the string should be
indented. The same characters that precede <code>"""</code> are deleted from all
lines between the beginning and terminating delimiters. For this
particular example, all space characters are removed since all have
the same indentation as the terminating <code>"""</code>.  Neither quote
characters nor backslashes are special in the lines enclosed by the
triple-quotes, so there is no need to escape anything.</p>

<p>Here is another example to show the versatility of triple-quoted
strings:</p>

<pre><code>effect_warning() -&gt;
    """
    f() -&gt;
        %% Test that the compiler warns for useless tuple building.
        {a,b,c},
        ok.
    """.
</code></pre>

<p>The function returns a string containing a short Erlang function.</p>

<p>Assuming that <code>effect_warning/0</code> is defined in module <code>t</code>, it can be
called like so:</p>

<pre><code>1&gt; io:format("~ts\n", [t:effect_warning()]).
f() -&gt;
    %% Test that the compiler warns for useless tuple building.
    {a,b,c},
    ok.
</code></pre>

<p>Note that indentation of the Erlang code for function <code>f/0</code> is retained.</p>

<p>For more information, see section <a href="https://www.erlang.org/doc/reference_manual/data_types#string">String</a>
in the Reference Manual.</p>

<h1 id="sigils">Sigils</h1>

<p>Sigils for string literals as described in <a href="https://www.erlang.org/eeps/eep-0066">EEP 66</a>
have been implemented.</p>

<p>Continuing with the theme of quotes, let’s explore why sigils were
introduced into Erlang, drawing inspiration from the wisdom of ancient
Greek philosophers:</p>

<pre><code class="language-erlang">1&gt; t:greek_quote().
"Know thyself" (Greek: Γνῶθι σαυτόν)
ok
</code></pre>

<p>In Erlang/OTP 26, this can be implemented as follows:</p>

<pre><code class="language-erlang">greek_quote() -&gt;
    S = "\"Know thyself\" (Greek: Γνῶθι σαυτόν)",
    io:format("~ts\n", [S]).
</code></pre>

<p>At this point, we get some customer feedback indicating that the
modules containing all the quotes are consuming an excessive amount of
memory. Each character in a string consumes 16 bytes of memory (on a
64-bit computer). That could be reduced to one byte for each character
if a binary were to be used instead of a string.  (Actually, one byte
for each US ASCII character and two bytes for each Greek letter.)</p>

<p>That change should be really easy. Let’s try:</p>

<pre><code class="language-erlang">greek_quote() -&gt;
    S = &lt;&lt;"\"Know thyself\" (Greek: Γνῶθι σαυτόν)"&gt;&gt;,
    io:format("~ts\n", [S]).
</code></pre>

<p>That works for the English text, but not for the Greek characters:</p>

<pre><code class="language-erlang">2&gt; t:greek_quote().
"Know thyself" (Greek: ½ö¸¹ Ã±ÅÄÌ½)
</code></pre>

<p>What’s wrong?</p>

<p>Strings in binary expressions are by default assumed to be sequences
of byte-sized characters. Therefore, this expression:</p>

<pre><code class="language-erlang">1&gt; &lt;&lt;"Γνῶθι"&gt;&gt;.
&lt;&lt;147,189,246,184,185&gt;&gt;
</code></pre>

<p>is <a href="https://en.wikipedia.org/wiki/Syntactic_sugar">syntactic sugar</a> for:</p>

<pre><code class="language-erlang">2&gt; &lt;&lt;$Γ:8, $ν:8, $ῶ:8, $θ:8, $ι:8&gt;&gt;.
&lt;&lt;147,189,246,184,185&gt;&gt;
</code></pre>

<p>It is necessary to specify that the characters are to be encoded as
<a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a>
encoded characters by appending an <code>/utf8</code> suffix:</p>

<pre><code class="language-erlang">greek_quote() -&gt;
    S = &lt;&lt;"\"Know thyself\" (Greek: Γνῶθι σαυτόν)"/utf8&gt;&gt;,
    io:format("~ts\n", [S]).
</code></pre>

<p>That works because <code>&lt;&lt;"Γνῶθι"/utf8&gt;&gt;</code> is syntactic sugar for
<code>&lt;&lt;$Γ/utf8, $ν/utf8, $ῶ/utf8, $θ/utf8, $ι/utf8&gt;&gt;</code>.</p>

<p>Enter sigils.</p>

<pre><code>greek_quote() -&gt;
    S = ~B["Know thyself" (Greek: Γνῶθι σαυτόν)],
    io:format("~ts\n", [S]).
</code></pre>

<p>The <code>~</code> character begins a sigil. It is usually followed by a letter that
indicates how the characters in the string should be interpreted or encoded.</p>

<p>In this case the character <code>B</code> means that the characters should be put into a binary in UTF-8 encoding,
and also that no escape characters are allowed.</p>

<p>After <code>B</code> follows the start delimiter, in this case <code>[</code>.  Since no escape characters
are allowed, it is necessary to choose delimiters that don’t occur in the string
contents. After the contents follows the end delimiter, in this case <code>]</code>.</p>

<p><code>~b</code> creates a binary in the same way as <code>~B</code>, except that backslashes
will be interpreted as escape characters. This can be useful if one
wants to insert control characters such as TAB (<code>\t</code>) into a binary:</p>

<pre><code>1&gt; ~b"abc\txyz".
&lt;&lt;"abc\txyz"&gt;&gt;
</code></pre>

<p>Here we used the <code>"</code> character as delimiters as it is not used within
the string.</p>

<p>If we omit the letter after <code>~</code>, we will get the same result:</p>

<pre><code>2&gt; ~"abc\txyz".
&lt;&lt;"abc\txyz"&gt;&gt;
</code></pre>

<p>The default sigil (no letter following <code>~</code>) creates a binary, just
like <code>~b</code> and <code>~B</code>, but whether escape characters are interpreted
depends on the form of the string. Triple-quoted strings do not by
default interpret escape sequences such as <code>\n</code>, but plain inline
strings do, so <code>~"abc\ndef"</code> works as you might expect, and you can
always prefix an existing string like <code>"abc\ndef"</code> with a <code>~</code> to turn
it into a binary without fear of changing its content.</p>
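<p>A minimal sketch of the difference (the comments show the resulting
binaries):</p>

<pre><code class="language-erlang">S1 = ~"abc\ndef",   %% escapes interpreted: &lt;&lt;"abc\ndef"&gt;&gt;
S2 = ~"""
     abc\ndef
     """.           %% escapes not interpreted: &lt;&lt;"abc\\ndef"&gt;&gt;
</code></pre>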

<p>Returning to the quotations example from the previous section, let’s
see how a binary literal can be created by inserting <code>~</code> before the
leading <code>"""</code>:</p>

<pre><code>quotes() -&gt;
    S = ~"""
         "I always have a quotation for everything -
         it saves original thinking." - Dorothy L. Sayers

         "Real stupidity beats artificial intelligence every time."
         - Terry Pratchett
         """,
    io:put_chars(S),
    io:nl().
</code></pre>

<p>For a triple-quoted string, the default sigil and <code>~B</code> always produce
the same binary. The <code>~b</code> sigil can be used when escape characters
must be supported.</p>

<p><code>~s</code> creates a string in the usual way. The only useful way it differs
from a plain quoted string is that the delimiters can be switched. That
way, one can avoid the hassle of escaping quote characters and still
get to use control characters such as TAB:</p>

<pre><code>3&gt; ~s{"abc\txyz"}.
"\"abc\txyz\""
</code></pre>

<p>Used for a triple-quoted string it enables the use of escape characters:</p>

<pre><code>4&gt; ~s"""
    \tabc
    \tdef
    """.
"\tabc\n\tdef"
</code></pre>

<p><code>~S</code> creates a string, but does not support escaping of characters
within the string, similar to <code>~B</code>.</p>

<p>For more information, see section <a href="https://www.erlang.org/doc/reference_manual/data_types#sigil">Sigil</a>
in the Reference Manual.</p>

<p>(<strong>UPDATE</strong>: The description of the default sigil has been corrected. Thanks
to Richard Carlsson for pointing out this error.)</p>

<h1 id="no-need-to-enable-feature-maybe">No need to enable feature <code>maybe</code></h1>

<p>The <a href="https://www.erlang.org/doc/reference_manual/expressions#maybe">maybe expression</a>
was introduced as a <a href="https://www.erlang.org/doc/reference_manual/features.html">feature</a>
in Erlang/OTP 25. In that release, it was necessary to enable it both in
the compiler and the runtime system.</p>

<p>Erlang/OTP 26 lifted the necessity to enable <code>maybe</code> in the runtime system.</p>

<p>Now in Erlang/OTP 27, <code>maybe</code> is enabled by default in the compiler.
In the example from
<a href="https://www.erlang.org/blog/otp-26-highlights/#no-need-to-enable-feature-maybe-in-the-runtime-system">last year’s blog post</a>,
the line <code>-feature(maybe_expr, enable).</code> can now be removed:</p>

<pre><code>$ cat t.erl
-module(t).
-export([listen_port/2]).
listen_port(Port, Options) -&gt;
    maybe
        {ok, ListenSocket} ?= inet_tcp:listen(Port, Options),
        {ok, Address} ?= inet:sockname(ListenSocket),
        {ok, {ListenSocket, Address}}
    end.
$ erlc t.erl
$ erl
Erlang/OTP 27 . . .

Eshell V15.0  (abort with ^G)
1&gt; t:listen_port(50000, []).
{ok,{#Port&lt;0.5&gt;,{{0,0,0,0},50000}}}
</code></pre>

<p>When <code>maybe</code> is used as an atom, it needs to be quoted. For example:</p>

<pre><code class="language-erlang">will_succeed(. . .) -&gt; yes;
will_succeed(. . .) -&gt; no;
   .
   .
   .
will_succeed(_) -&gt; 'maybe'.
</code></pre>

<p>Alternatively, it is still possible to disable the <code>maybe_expr</code> feature. With
the feature disabled, <code>maybe</code> can be used as an atom without quotes.</p>

<p>One way to disable <code>maybe</code> is to use the <code>-disable-feature</code> option when compiling.
For example:</p>

<pre><code>erlc -disable-feature maybe_expr *.erl
</code></pre>

<p>Another way to disable <code>maybe</code> is to add the following directive to
the source code:</p>

<pre><code>-feature(maybe_expr, disable).
</code></pre>

<h1 id="the-new-json-module">The new <code>json</code> module</h1>

<p>There is a new module <a href="https://www.erlang.org/doc/man/json"><code>json</code></a> in
STDLIB for generating and parsing
<a href="https://en.wikipedia.org/wiki/JSON">JSON (JavaScript Object Notation)</a>.</p>

<p>It is implemented by <a href="https://github.com/michalmuskala">Michał
Muskała</a> who has also implemented
the <a href="https://github.com/michalmuskala/jason"><code>Jason</code></a> library for
Elixir. <code>Jason</code> is known for being faster than other pure Erlang or
Elixir JSON libraries. The <code>json</code> module is not a pure translation of
the Elixir code for Jason, but a re-implementation with even better
performance than <code>Jason</code>.</p>

<p>As an example, imagine that we have this file <code>quotes.json</code> with
quotes from the film <a href="https://en.wikipedia.org/wiki/Jason_and_the_Argonauts_(1963_film)">Jason and the
Argonauts</a>:</p>

<pre><code class="language-json">[
    {"quote": "The gods are best served by those who need their help the least.",
     "attribution": "Zeus",
     "verified": true},
    {"quote": "Now the voyage is over, I don't want any trouble to begin.",
     "attribution": "Jason",
     "verified": true}
]
</code></pre>

<p>The JSON contents of the file can be decoded by calling
<a href="https://www.erlang.org/doc/man/json#decode/1">json:decode/1</a>:</p>

<pre><code class="language-erlang">1&gt; {ok,JSON} = file:read_file("quotes.json").
{ok,&lt;&lt;"[\n   {\"quote\": \"The gods are best served by those who need their help the least.\",\n    \"attribution\": \"Zeus\""...&gt;&gt;}
2&gt; json:decode(JSON).
[#{&lt;&lt;"attribution"&gt;&gt; =&gt; &lt;&lt;"Zeus"&gt;&gt;,
   &lt;&lt;"quote"&gt;&gt; =&gt;
       &lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;,
   &lt;&lt;"verified"&gt;&gt; =&gt; true},
 #{&lt;&lt;"attribution"&gt;&gt; =&gt; &lt;&lt;"Jason"&gt;&gt;,
   &lt;&lt;"quote"&gt;&gt; =&gt;
       &lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;,
   &lt;&lt;"verified"&gt;&gt; =&gt; true}]
</code></pre>

<p>By default, for safety, the keys of objects are translated to binaries. Using atoms
could open the door to
<a href="https://en.wikipedia.org/wiki/Denial-of-service_attack">denial-of-service attacks</a>
if a malicious JSON object were to define millions of unique keys.</p>

<p>For convenience, it is still possible to convert keys to atoms in
a safe way by using a <em>decoder callback</em>. Here is an example:</p>

<pre><code class="language-erlang">1&gt; Push = fun(Key, Value, Acc) -&gt; [{binary_to_existing_atom(Key), Value} | Acc] end.
#Fun&lt;erl_eval.40.39164016&gt;
</code></pre>

<p>This fun converts the key for a JSON object to an <strong>existing</strong> atom,
or raises an exception if no such atom exists.</p>

<p>Since this example is run from the shell, we’ll need to make sure that all possible keys
are known atoms:</p>

<pre><code class="language-erlang">2&gt; {quote,attribution,verified}.
{quote,attribution,verified}
</code></pre>

<p>This would normally not be necessary when JSON decoding is done in an Erlang module,
because the atoms to be used as keys would presumably be defined naturally by being used
when processing the decoded JSON objects.</p>

<p>With this preparation done, the JSON decoder can be called using the <code>Push</code> fun
as an <code>object_push</code> decoder callback:</p>

<pre><code class="language-erlang">3&gt; {Qs,_,&lt;&lt;&gt;&gt;} = json:decode(JSON, [], #{object_push =&gt; Push}), Qs.
[#{quote =&gt;
       &lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;,
   attribution =&gt; &lt;&lt;"Zeus"&gt;&gt;,verified =&gt; true},
 #{quote =&gt;
       &lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;,
   attribution =&gt; &lt;&lt;"Jason"&gt;&gt;,verified =&gt; true}]
</code></pre>

<p>The <a href="https://www.erlang.org/doc/man/json#encode/1">json:encode/1</a> function encodes
an Erlang term to JSON:</p>

<pre><code class="language-erlang">4&gt; io:format("~ts\n", [json:encode(Qs)]).
[{"quote":"The gods are best served by those who need their help the least.","attribution":"Zeus","verified":true},{"quote":"Now the voyage is over, I don't want any trouble to begin.","attribution":"Jason","verified":true}]
ok
</code></pre>

<p>The encoder accepts binaries, atoms, and integers as keys for objects,
so there is no need to customize encoding for this particular example.</p>

<p>However, when necessary, it is possible to customize the encoding. For
example, assume that we want to store each quotation in a three-tuple
instead of in a map:</p>

<pre><code class="language-erlang">1&gt; Q = [{~"The gods are best served by those who need their help the least.",
~"Zeus",true},
{~"Now the voyage is over, I don't want any trouble to begin.",
~"Jason",true}].
[{&lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;,
  &lt;&lt;"Zeus"&gt;&gt;,true},
 {&lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;,
  &lt;&lt;"Jason"&gt;&gt;,true}]
</code></pre>

<p>The <code>json:encode/1</code> function does not handle that format by default, but it can be
handled by defining an <em>encoder function</em>:</p>

<pre><code class="language-erlang">quote_encoder({Q, A, V}, Encode)
  when is_binary(Q), is_binary(A), is_boolean(V) -&gt;
    json:encode_map(#{quote =&gt; Q,
                      attribution =&gt; A,
                      verified =&gt; V},
                    Encode);
quote_encoder(Other, Encode) -&gt;
    json:encode_value(Other, Encode).
</code></pre>

<p>The first clause matches a tuple of size three that looks like a
quotation. If it matches, it is converted to the map representation
for a JSON object, which is then converted to JSON by the utility function
<a href="https://www.erlang.org/doc/man/json#encode_map/2">json:encode_map/2</a>.</p>

<p>The second clause handles all other Erlang terms by calling the
default encoding function
<a href="https://www.erlang.org/doc/man/json#encode_value/2">json:encode_value/2</a>
for converting a term to JSON.</p>

<p>Assuming that this function is defined in module <code>t</code>, the conversion to JSON
is invoked as follows:</p>

<pre><code class="language-erlang">2&gt; io:format("~ts\n", [json:encode(Q, fun t:quote_encoder/2)]).
[{"quote":"The gods are best served by those who need their help the least.","attribution":"Zeus","verified":true},{"quote":"Now the voyage is over, I don't want any trouble to begin.","attribution":"Jason","verified":true}]
</code></pre>

<p>The JSON encoder calls the callback recursively for each term. That can
be clearly seen if we modify the second clause of <code>quote_encoder/2</code> to also
print the value of <code>Other</code>:</p>
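<p>A minimal sketch of such a modified clause (assuming the printout is done
with <code>io:format/2</code>; the exact formatting is unimportant):</p>

<pre><code class="language-erlang">quote_encoder(Other, Encode) -&gt;
    io:format("-- ~p\n", [Other]),
    json:encode_value(Other, Encode).
</code></pre>

<p>With that change, the printout looks like this:</p>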

<pre><code class="language-erlang">3&gt; json:encode(Q, fun t:quote_encoder/2), ok.
-- [{&lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;,
     &lt;&lt;"Zeus"&gt;&gt;,true},
    {&lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;,
     &lt;&lt;"Jason"&gt;&gt;,true}]
-- &lt;&lt;"quote"&gt;&gt;
-- &lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;
-- &lt;&lt;"attribution"&gt;&gt;
-- &lt;&lt;"Zeus"&gt;&gt;
-- &lt;&lt;"verified"&gt;&gt;
-- true
-- &lt;&lt;"quote"&gt;&gt;
-- &lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;
-- &lt;&lt;"attribution"&gt;&gt;
-- &lt;&lt;"Jason"&gt;&gt;
-- &lt;&lt;"verified"&gt;&gt;
-- true
</code></pre>

<h1 id="process-labels">Process labels</h1>

<p>As a help for debugging and observability in general, labels can now
be set on non-registered processes using
<a href="https://www.erlang.org/doc/man/proc_lib#set_label/1"><code>proc_lib:set_label/1</code></a>.</p>

<p>The label is an arbitrary term. It is shown by the shell
command <code>i/0</code> and by <a href="https://www.erlang.org/doc/man/observer"><code>observer</code></a>.
Labels can also be found in the dictionary section of a
<a href="https://www.erlang.org/doc/man/crashdump_viewer">crash dump</a>.</p>

<p>Here is an example where five labeled quote-handler processes are started and
inspected:</p>

<pre><code class="language-erlang">1&gt; F = fun(I) -&gt;
   spawn_link(fun() -&gt;
     proc_lib:set_label({quote_handler, I}),
     receive _ -&gt; ok end
   end)
   end.
#Fun&lt;erl_eval.42.39164016&gt;
2&gt; Ps = [F(I) || I &lt;- lists:seq(1, 5)].
[&lt;0.91.0&gt;,&lt;0.92.0&gt;,&lt;0.93.0&gt;,&lt;0.94.0&gt;,&lt;0.95.0&gt;]
3&gt; proc_lib:get_label(hd(Ps)).
{quote_handler,1}
4&gt; i().
Pid                   Initial Call                          Heap     Reds Msgs
Registered            Current Function                     Stack
&lt;0.0.0&gt;               erl_init:start/2                       987     5347    0
init                  init:loop/1                              2
   .
   .
   .
{quote_handler,1}     prim_eval:'receive'/2                    9
&lt;0.92.0&gt;              erlang:apply/2                         233     4006    0
{quote_handler,2}     prim_eval:'receive'/2                    9
&lt;0.93.0&gt;              erlang:apply/2                         233     4006    0
{quote_handler,3}     prim_eval:'receive'/2                    9
&lt;0.94.0&gt;              erlang:apply/2                         233     4006    0
{quote_handler,4}     prim_eval:'receive'/2                    9
&lt;0.95.0&gt;              erlang:apply/2                         233     4006    0
{quote_handler,5}     prim_eval:'receive'/2                    9
Total                                                     642876  1156835    0
                                                             438
ok
</code></pre>

<p>The SSH and SSL applications have been updated to label the processes they
create.</p>

<h1 id="new-functionality-in-stdlib">New functionality in STDLIB</h1>

<h2 id="new-utility-functions-for-set-modules">New utility functions for set modules</h2>

<p>The three sets modules in STDLIB —
<a href="https://www.erlang.org/doc/man/sets"><code>sets</code></a>,
<a href="https://www.erlang.org/doc/man/gb_sets"><code>gb_sets</code></a>, and
<a href="https://www.erlang.org/doc/man/ordsets"><code>ordsets</code></a> —
have new functions <code>is_equal/2</code>, <code>map/2</code>, and <code>filtermap/2</code>.</p>

<p>The <code>is_equal/2</code> function is useful when one needs to find out whether two
sets contain the same elements. Comparing with <code>==</code> or <code>=:=</code> is not always
reliable. For example:</p>

<pre><code class="language-erlang">1&gt; Seq = lists:seq(1, 20, 2).
[1,3,5,7,9,11,13,15,17,19]
2&gt; gb_sets:from_list(Seq) == gb_sets:delete(10, gb_sets:from_list([10|Seq])).
false
3&gt; gb_sets:is_equal(gb_sets:from_list(Seq), gb_sets:delete(10, gb_sets:from_list([10|Seq]))).
true
</code></pre>

<p>The <code>map/2</code> function maps the elements of a set, producing a new set:</p>

<pre><code class="language-erlang">4&gt; Seq = lists:seq(1, 20, 2).
[1,3,5,7,9,11,13,15,17,19]
#Fun&lt;erl_eval.42.39164016&gt;
5&gt; ordsets:to_list(ordsets:map(fun(N) -&gt; N div 4 end, ordsets:from_list(Seq))).
[0,1,2,3,4]
</code></pre>

<p>The <code>filtermap/2</code> function can map and filter at the same time. Here is an example
showing how to multiply each integer in a set by 100 and remove non-integers:</p>

<pre><code class="language-erlang">1&gt; Mixed = [1,2,3,a,b,c].
[1,2,3,a,b,c]
2&gt; F = fun(N) when is_integer(N) -&gt; {true,N * 100};
   (_) -&gt; false
   end.
#Fun&lt;erl_eval.42.39164016&gt;
3&gt; sets:to_list(sets:filtermap(F, sets:from_list(Mixed))).
[300,200,100]
</code></pre>

<h2 id="new-timer-convenience-functions-that-take-funs">New <code>timer</code> convenience functions that take funs</h2>

<p>In Erlang/OTP 26, the functions in the
<a href="https://www.erlang.org/doc/man/timer"><code>timer</code></a> module don’t accept funs.
It is certainly possible to pass a fun in the argument list for
<a href="https://www.erlang.org/doc/man/erlang#apply/2"><code>erlang:apply/2</code></a>,
but if one makes a mistake it will only be noticed when the
timer expires:</p>

<pre><code class="language-erlang">1&gt; timer:apply_after(10, erlang, apply, [fun() -&gt; io:put_chars("now!\n") end]).
{ok,{once,#Ref&lt;0.2380540714.1485570051.86513&gt;}}
=ERROR REPORT==== 10-Apr-2024::05:56:43.894073 ===
Error in process &lt;0.109.0&gt; with exit value:
{undef,[{erlang,apply,[#Fun&lt;erl_eval.43.105768164&gt;],[]}]}
</code></pre>

<p>Here the empty argument list for the fun was forgotten. It should have been:</p>

<pre><code class="language-erlang">2&gt; timer:apply_after(10, erlang, apply, [fun() -&gt; io:put_chars("now!\n") end, []]).
{ok,{once,#Ref&lt;0.2380540714.1485570051.86522&gt;}}
now!
</code></pre>

<p>In Erlang/OTP 27, using a fun is much easier:</p>

<pre><code class="language-erlang">1&gt; timer:apply_after(10, fun() -&gt; io:put_chars("now!\n") end).
{ok,{once,#Ref&lt;0.3845681669.1215561736.51634&gt;}}
now!
</code></pre>

<p>In systems that use hot code updating, using a local fun for a long-running
timer is not ideal. The code that defines the fun could have been replaced,
and when the timer finally expires the call will fail. Therefore, it is also
possible to pass a fun together with its arguments, making it possible to
use a remote fun that will survive hot code updating:</p>

<pre><code class="language-erlang">2&gt; timer:apply_after(10, fun io:put_chars/1, ["now\n"]).
{ok,{once,#Ref&lt;0.3845681669.1215561736.51650&gt;}}
now
</code></pre>

<p>The <code>apply_interval/*</code> and <code>apply_repeatedly/*</code> functions now also accept
funs.</p>
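
<p>For example, a periodic timer using a fun could look like this (a sketch;
the interval and the printout are arbitrary):</p>

<pre><code class="language-erlang">{ok, TRef} = timer:apply_interval(1000, fun() -&gt; io:put_chars("tick\n") end),
%% ... and later, to stop the periodic timer:
timer:cancel(TRef).
</code></pre>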

<h2 id="new-ets-functions">New <code>ets</code> functions</h2>

<p>The new functions
<a href="https://www.erlang.org/doc/man/ets#first_lookup/1"><code>ets:first_lookup/1</code></a>
and
<a href="https://www.erlang.org/doc/man/ets#next_lookup/2"><code>ets:next_lookup/2</code></a>
simplify and speed up traversing an ETS table:</p>

<pre><code class="language-erlang">1&gt; T = ets:new(example, [ordered_set]).
#Ref&lt;0.1968915180.2077884419.247786&gt;
2&gt; ets:insert(T, [{I,I*I} || I &lt;- lists:seq(1, 10)]).
true
3&gt; {K1,_} = ets:first_lookup(T).
{1,[{1,1}]}
4&gt; {K2,_} = ets:next_lookup(T, K1).
{2,[{2,4}]}
5&gt; {K3,_} = ets:next_lookup(T, K2).
{3,[{3,9}]}
6&gt; {K4,_} = ets:next_lookup(T, K3).
{4,[{4,16}]}
</code></pre>

<p>Similarly,
<a href="https://www.erlang.org/doc/man/ets#last_lookup/1"><code>ets:last_lookup/1</code></a>
and
<a href="https://www.erlang.org/doc/man/ets#prev_lookup/2"><code>ets:prev_lookup/2</code></a>
can be used to traverse a table in reverse order.</p>
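
<p>Here is a sketch of how these can be used to collect all objects in
descending key order (assuming the same <code>'$end_of_table'</code> return
convention as for <code>first_lookup/next_lookup</code>):</p>

<pre><code class="language-erlang">reverse_objects(T) -&gt;
    case ets:last_lookup(T) of
        '$end_of_table' -&gt; [];
        {Key, Objs} -&gt; Objs ++ reverse_objects(T, Key)
    end.

reverse_objects(T, Key) -&gt;
    case ets:prev_lookup(T, Key) of
        '$end_of_table' -&gt; [];
        {Prev, Objs} -&gt; Objs ++ reverse_objects(T, Prev)
    end.
</code></pre>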

<p>The new function
<a href="https://www.erlang.org/doc/man/ets#update_element/4"><code>ets:update_element/4</code></a>
is similar to
<a href="https://www.erlang.org/doc/man/ets#update_element/3"><code>ets:update_element/3</code></a>,
but makes it possible to supply a default object when there is no existing
object with the given key:</p>

<pre><code class="language-erlang">1&gt; T = ets:new(example, []).
#Ref&lt;0.878413430.1983512583.205850&gt;
2&gt; ets:update_element(T, a, {2, true}, {a, true}).
true
3&gt; ets:lookup(T, a).
[{a,true}]
</code></pre>
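
<p>If an object with the key already exists, the default object is ignored
and the call behaves like <code>ets:update_element/3</code>. Continuing the
session above:</p>

<pre><code class="language-erlang">4&gt; ets:update_element(T, a, {2, false}, {a, true}).
true
5&gt; ets:lookup(T, a).
[{a,false}]
</code></pre>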

<h1 id="new-ssl-client-side-stapling-support">New SSL client-side stapling support</h1>

<p>A new feature in the SSL client in Erlang/OTP 27 is support for <a href="https://en.wikipedia.org/wiki/OCSP_stapling">OCSP
stapling</a> for easier and
faster verification of the revocation status of server
certificates.</p>

<p>With OCSP stapling, the SSL client can streamline the validation of
revocation status. Normally the client would have to query the
<a href="https://en.wikipedia.org/wiki/Certificate_authority">CA (Certificate
Authority)</a> using
<a href="https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol">OCSP (Online Certificate Status
Protocol)</a>
to ensure that the server’s certificate has not been
<a href="https://en.wikipedia.org/wiki/Certificate_revocation">revoked</a>.</p>

<p>The basic idea behind OCSP stapling is that the server itself will
proactively query the CA regarding the revocation status for its own
certificate and “staple” the time-stamped OCSP response from the CA to
the certificate. When a client connects, the server passes along
its OCSP-stapled certificate to the client. To verify the revocation
status, the client only needs to check that the OCSP response was
signed by the CA.</p>

<p>Here follows an example showing how OCSP stapling can be enabled in the
SSL client:</p>

<pre><code class="language-erlang">1&gt; ssl:start().
ok
2&gt; {ok, Socket} = ssl:connect("duckduckgo.com", 443,
                              [{cacerts, public_key:cacerts_get()},
                               {stapling, staple}]).
{ok,{sslsocket,{gen_tcp,#Port&lt;0.5&gt;,tls_connection,undefined},
               [&lt;0.122.0&gt;,&lt;0.121.0&gt;]}}
</code></pre>

<h1 id="tprof-yet-another-profiling-tool"><code>tprof</code>: Yet another profiling tool</h1>

<p>In Erlang/OTP 27, the new profiling tool
<a href="https://www.erlang.org/doc/man/tprof"><code>tprof</code></a>
joins the existing profiling tools
<a href="https://www.erlang.org/doc/man/cprof"><code>cprof</code></a>,
<a href="https://www.erlang.org/doc/man/eprof"><code>eprof</code></a>,
and <a href="https://www.erlang.org/doc/man/fprof"><code>fprof</code></a>.</p>

<p>Why introduce a new profiling tool?</p>

<p>One reason is that <code>cprof</code> and <code>eprof</code> perform similar profiling
tasks, but the naming of the API functions is different. It is quite
easy to mix up the names when running one tool after the other, and
running them after each other is not uncommon.  For example, when
trying to find a
<a href="https://en.wikipedia.org/wiki/Bottleneck_(software)">bottleneck</a> in a
complex running Erlang system, one approach is to first use
<code>cprof</code> to get a rough idea of the general part of the system where a
bottleneck could be located. After that, <code>eprof</code> is run on a limited
part of the system trying to narrow it down. Directly running <code>eprof</code>
on a large Erlang application could overload it and bring it down.</p>

<p>Using <code>tprof</code>, the same function is used for both counting calls and
measuring the time for each call. Here is how to count calls when
<code>lists:seq(1, 1000)</code> is called:</p>

<pre><code>1&gt; tprof:profile(lists, seq, [1, 1000], #{type =&gt; call_count}).
FUNCTION          CALLS  [    %]
lists:seq/2           1  [ 0.40]
lists:seq_loop/3    251  [99.60]
                         [100.0]
ok
</code></pre>

<p>Note that call counting is always done for all processes.</p>

<p>The bulk of the work for <code>lists:seq/2</code> is done in <code>lists:seq_loop/3</code>,
which was called 251 times. Since we asked for 1000 integers, we
reach the conclusion that each tail-recursive call to <code>seq_loop/3</code>
creates four list elements at once. That can be confirmed by
looking at the
<a href="https://github.com/erlang/otp/blob/ca50a5d73703f74e2eae1ca40bbe6c4f027f9f98/lib/stdlib/src/lists.erl#L409-L416">source code</a>.</p>

<p>To measure the time for each call, we only need to replace
<code>call_count</code> with <code>call_time</code>:</p>

<pre><code>2&gt; tprof:profile(lists, seq, [1, 1000], #{type =&gt; call_time}).

****** Process &lt;0.94.0&gt;  --  100.00% of total ***
FUNCTION          CALLS  TIME (μs)  PER CALL  [     %]
lists:seq/2           1          0      0.00  [  0.00]
lists:seq_loop/3    251         50      0.20  [100.00]
                                50            [ 100.0]
ok
</code></pre>

<p>Call time is only measured for the process that called
<a href="https://erlang.org/doc/man/tprof#profile/4"><code>tprof:profile/4</code></a>
and any processes spawned by that process.</p>

<p>By replacing <code>call_time</code> with <code>call_memory</code> the amount of memory consumed
by each call will be measured:</p>

<pre><code>3&gt; tprof:profile(lists, seq, [1, 1000], #{type =&gt; call_memory}).

****** Process &lt;0.97.0&gt;  --  100.00% of total ***
FUNCTION          CALLS  WORDS  PER CALL  [     %]
lists:seq_loop/3    251   2000      7.97  [100.00]
                          2000            [ 100.0]
ok
</code></pre>

<p>The total number of words created is 2000, which makes sense since each
list element needs 2 words. The number of words consumed per call is
<code>2000 / 251</code>, which is approximately 7.97, or almost 8. That also makes
sense since each tail-recursive call creates 4 list elements, or 8
words, and there are 250 such calls. The remaining call creates the final
empty list (<code>[]</code>).</p>
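
<p>The two-words-per-element figure can be double-checked with
<code>erts_debug:size/1</code>, which returns the size of a term in words:</p>

<pre><code class="language-erlang">4&gt; erts_debug:size([a]).
2
5&gt; erts_debug:size(lists:seq(1, 1000)).
2000
</code></pre>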

<p><code>call_memory</code> tracing was introduced in the runtime system in
Erlang/OTP 26, but was not exposed in any existing profiling tool
because it didn’t really fit in any of them. It made more sense to enable
support for it in a new tool.</p>

<h1 id="multiple-trace-sessions">Multiple trace sessions</h1>

<p>Tracing makes it possible to observe, debug, analyse, and measure the
performance of a running Erlang system. Over the years, numerous tools
based on tracing have been developed. In Erlang/OTP alone, several tools
leverage tracing for different purposes:</p>

<ul>
  <li>
    <p><a href="https://www.erlang.org/doc/man/dbg"><code>dbg</code></a>, <a href="https://www.erlang.org/doc/man/ttb"><code>ttb</code></a> -
general tracing tools</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/man/etop"><code>etop</code></a> - similar to <code>top</code> in Unix</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/man/eprof"><code>eprof</code></a>,
<a href="https://www.erlang.org/doc/man/cprof"><code>cprof</code></a>,
<a href="https://www.erlang.org/doc/man/fprof"><code>fprof</code></a>,
<a href="https://www.erlang.org/doc/man/tprof"><code>tprof</code></a> - profiling tools</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/apps/et"><code>et</code></a> - event tracer</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/man/debugger"><code>debugger</code></a> - uses tracing
internally when evaluating <code>receive</code> expressions</p>
  </li>
</ul>

<p>In Erlang/OTP 26 and earlier, tracing had some limitations:</p>

<ul>
  <li>
    <p>There could only be a single tracer per traced process.</p>
  </li>
  <li>
<p>The configuration for which processes and functions to trace was
global within the runtime system.</p>
  </li>
</ul>

<p>Those limitations meant that different tracing tools could easily step
on each other’s toes. The treacherous part was that using multiple tracing
tools at the same time would seem to work for a while… until it didn’t.</p>

<p>In Erlang/OTP 27, multiple trace sessions can be created. Each trace
session has its own tracer process and configuration for which
processes and functions to trace.</p>

<p>To create a trace session and set up tracing, there is the new
<a href="https://www.erlang.org/doc/man/trace"><code>trace</code></a> module in the Kernel
application. Tools that set up tracing using that module will no longer
interfere with each other. Tools that use the
<a href="https://www.erlang.org/doc/man/erlang#trace/3">old API</a>
will share a single global trace session.</p>

<p>In the initial Erlang/OTP 27 release, some of the tools using tracing
have been updated to use trace sessions. Other tools will be updated in
upcoming maintenance releases.</p>

<p>We have tried to design the new API in a way that makes it relatively
easy for maintainers of external tools to migrate their code.  Apart
from the names of the functions and the first argument (the session
argument), the other arguments and their semantics are almost entirely
identical to the old API.</p>
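
<p>As a rough illustration of the similarity, here is the same call tracing
set up with both APIs (a sketch; <code>Tracer</code> and <code>Pid</code> are
placeholders):</p>

<pre><code class="language-erlang">%% Old API: settings go into the single global trace session.
erlang:trace(Pid, true, [call]),
erlang:trace_pattern({some_module, '_', '_'}, [], [local]),

%% New API: the same settings, confined to an isolated session.
Session = trace:session_create(my_session, Tracer, []),
trace:process(Session, Pid, true, [call]),
trace:function(Session, {some_module, '_', '_'}, [], [local]).
</code></pre>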

<h2 id="quick-trace-session-example">Quick trace session example</h2>

<p>Here is an example to show how the new API is used. First we’ll need
a tracer process that prints all trace messages it receives:</p>

<pre><code class="language-erlang">1&gt; Tracer = spawn(fun F() -&gt; receive M -&gt; io:format("== ~p ==\n", [M]), F() end end).
&lt;0.90.0&gt;
</code></pre>

<p>Having a tracer process, we can create a trace session:</p>

<pre><code class="language-erlang">2&gt; Session = trace:session_create(my_session, Tracer, []).
{#Ref&lt;0.179442114.3923902468.103849&gt;,{my_session,0}}
</code></pre>

<p>Next we turn on call tracing on the current process:</p>

<pre><code class="language-erlang">3&gt; trace:process(Session, self(), true, [call]).
1
</code></pre>

<p>Make sure that module <code>array</code> is loaded and trace all calls in it:</p>

<pre><code class="language-erlang">4&gt; l(array).
{module,array}
5&gt; trace:function(Session, {array,'_','_'}, [], [local]).
89
</code></pre>

<p>Next create a new array:</p>

<pre><code class="language-erlang">6&gt; array:new(10).
== {trace,&lt;0.88.0&gt;,call,{array,new,"\n"}} ==
{array,10,0,undefined,10}
== {trace,&lt;0.88.0&gt;,call,{array,new_0,[10,0,false]}} ==
== {trace,&lt;0.88.0&gt;,call,{array,new_1,["\n",0,false,undefined]}} ==
== {trace,&lt;0.88.0&gt;,call,{array,new_1,[[],10,true,undefined]}} ==
== {trace,&lt;0.88.0&gt;,call,{array,new,[10,true,undefined]}} ==
== {trace,&lt;0.88.0&gt;,call,{array,find_max,"\t\n"}} ==
</code></pre>

<p>Note that trace messages are randomly intermingled with the return value
of the call.</p>

<p>When we are done, we can destroy the session:</p>

<pre><code class="language-erlang">7&gt; trace:session_destroy(Session).
</code></pre>

<p>If we don’t destroy the session, it will be automatically destroyed when
the last reference to it goes away.</p>

<h1 id="native-coverage-support">Native coverage support</h1>

<p>The <a href="https://www.erlang.org/doc/man/cover">Cover</a> tool for determining
<a href="https://en.wikipedia.org/wiki/Code_coverage">code coverage</a> has long been
part of Erlang/OTP.</p>

<p>Traditionally, Cover collected its coverage metrics without the
help of any specialized functionality in the runtime system. To count how
many times each line in a module was executed, Cover
<a href="https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)">instrumented</a>
the abstract code for the module by inserting calls to
<a href="https://www.erlang.org/doc/man/ets#update_counter/3"><code>ets:update_counter/3</code></a>
on each executable line.</p>

<p>That worked, but the cover-instrumented Erlang code would always run
slower. How much slower depended on the nature of the code being
tested.</p>

<p>In Erlang/OTP 27, runtime systems supporting the
<a href="https://www.erlang.org/blog/a-first-look-at-the-jit/">JIT (just-in-time compiler)</a>
can now collect coverage metrics in the runtime system with minimal
performance overhead.</p>

<p>The Cover tool has been updated to automatically take advantage of
native coverage support if supported by the runtime system. When
running the test suites for most OTP applications, there is no
noticeable difference in execution time running with and without
Cover.</p>

<p>The native coverage support can also be used directly for performing
measurements that Cover cannot accomplish, such as collecting metrics
for code that is executed while the Erlang runtime system is starting.</p>

<p>Here is a quick example showing how we can collect coverage metrics
for <code>init</code>, which is the first module executed when starting up the
runtime system. First we need to instruct the runtime system to
instrument all functions in all modules with extra code to count the
number of times each function is called:</p>

<pre><code>$ bin/erl +JPcover function_counters
</code></pre>

<p>The runtime system starts normally. We can now read out the counters
for the <code>init</code> module:</p>

<pre><code>1&gt; lists:reverse(lists:keysort(2, code:get_coverage(function, init))).
[{{archive_extension,0},392},
 {{get_argument1,2},198},
 {{objfile_extension,0},101},
 {{boot_loop,2},64},
 {{request,1},55},
 {{to_strings,1},44},
 {{do_handle_msg,2},38},
 {{handle_msg,2},38},
 {{b2s,1},38},
 {{get_argument,2},33},
 {{get_argument,1},31},
 {{'-load_modules/2-lc$^0/1-0-',1},30},
 {{'-load_modules/2-lc$^1/1-2-',1},30},
 {{'-load_modules/2-lc$^2/1-3-',1},30},
 {{'-load_modules/2-lc$^3/1-4-',1},30},
 {{extract_var,2},30},
 {{'-prepare_loading_fun/0-fun-0-',3},29},
 {{eval_script,2},23},
 {{append,1},18},
 {{get_arguments,1},18},
 {{reverse,1},17},
 {{check,2},17},
 {{ensure_loaded,2},16},
 {{ensure_loaded,1},16},
 {{do_load_module,2},14},
 {{do_ensure_loaded,2},14},
 {{get_flag_args,...},12},
 {{...},...},
 {...}|...]
</code></pre>

<p>The returned list of counter values for each function is sorted in
descending order on the number of times each function was executed.</p>

<p>For more information, see
<a href="https://www.erlang.org/doc/man/code#module-native-coverage-support">Native Coverage Support</a>
in the documentation for the <code>code</code> module.</p>

<h1 id="deprecating-archives">Deprecating archives</h1>

<p><a href="https://www.erlang.org/doc/man/code#module-loading-of-code-from-archive-files">Archives</a>
is experimental functionality that has existed in Erlang/OTP for a
long time. Part of the support for archives is deprecated in Erlang/OTP 27.</p>

<p>The reason is that the performance of code loading from archives has
never been great. Even worse is that the very existence of the archive
functionality degrades the performance of code loading even when no
archives are used, and complicates or prevents optimizations aimed at
reducing startup time.</p>

<p>In Erlang/OTP 27, the following functionality is deprecated:</p>

<ul>
  <li>
    <p>Using archives for packaging a single application or parts of a single application
into an archive file that is included in the code path. This functionality will
likely be removed in Erlang/OTP 28.</p>
  </li>
  <li>
    <p>The <a href="https://www.erlang.org/doc/man/code#lib_dir/2"><code>code:lib_dir/2</code></a>
function. This function was introduced to allow reading files
inside archives. In Erlang/OTP 28, the function itself will not be
removed, but it will most likely no longer support looking into
archives.</p>
  </li>
  <li>
    <p>All functionality to handle archives in module
<a href="https://www.erlang.org/doc/man/erl_prim_loader"><code>erl_prim_loader</code></a>.
That same functionality is likely to be removed in Erlang/OTP 28.</p>
  </li>
  <li>
    <p>The <code>-code_path_choice</code> flag for <code>erl</code>. In Erlang/OTP 27, the default
has changed from <code>relaxed</code> to <code>strict</code>. This flag is likely to be removed
in Erlang/OTP 28.</p>
  </li>
</ul>

<p>In order to use archives in Erlang/OTP 27, it is necessary to use the flag
<code>-code_path_choice relaxed</code>.</p>
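
<p>For example (the archive name and the path into it are illustrative):</p>

<pre><code>$ erl -code_path_choice relaxed -pa my_app.ez/my_app/ebin
</code></pre>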

<h2 id="using-a-single-archive-in-an-escript-is-not-deprecated">Using a single archive in an Escript is <strong>not</strong> deprecated</h2>

<p>An archive can still be used to hold all files needed by an
<a href="https://www.erlang.org/doc/apps/erts/escript_cmd.html">Escript</a>.
However, to access files in the archive (for example, to read templates or other
data files), the only supported way guaranteed to work in future
releases is to use the
<a href="https://www.erlang.org/doc/man/escript#extract/2"><code>escript:extract/2</code></a>
function.</p>]]></content><author><name>Björn Gustavsson</name></author><category term="erlang" /><category term="otp" /><category term="27" /><category term="release" /><summary type="html"><![CDATA[Erlang/OTP 27 is finally here. This blog post will introduce the new features that we are most excited about.]]></summary></entry><entry><title type="html">The Optimizations in Erlang/OTP 27</title><link href="https://www.erlang.org/blog/optimizations/" rel="alternate" type="text/html" title="The Optimizations in Erlang/OTP 27" /><published>2024-04-23T00:00:00+00:00</published><updated>2024-04-23T00:00:00+00:00</updated><id>https://www.erlang.org/blog/optimizations</id><content type="html" xml:base="https://www.erlang.org/blog/optimizations/"><![CDATA[<p>This post explores the new optimizations for record updates as well as
some of the other improvements. It also gives a brief historic
overview of recent optimizations leading up to Erlang/OTP 27.</p>

<h3 id="a-brief-history-of-recent-optimizations">A brief history of recent optimizations</h3>

<p>The modern history of optimizations for Erlang begins in
January 2018. We had realized that we had reached the limit of the
optimizations that were possible working on <a href="https://www.erlang.org/blog/a-brief-beam-primer/">BEAM
code</a> in the Erlang
compiler.</p>

<ul>
  <li>
    <p>Erlang/OTP 22 introduced a new <a href="https://en.wikipedia.org/wiki/Static_single-assignment_form">SSA-based intermediate
representation</a>
in the compiler. Read the full story in <a href="https://www.erlang.org/blog/ssa-history/">SSA
History</a>.</p>
  </li>
  <li>
    <p>Erlang/OTP 24 introduced the <a href="https://www.erlang.org/blog/a-first-look-at-the-jit/">JIT (Just In Time
compiler)</a>,
which improved performance by emitting native code for BEAM instructions
at load-time.</p>
  </li>
  <li>
<p>Erlang/OTP 25 introduced <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/">type-based optimization in the
JIT</a>,
which allowed the Erlang compiler to pass type information to the
JIT to help it emit better native code. While that improved the
native code emitted by the JIT, limitations in both the compiler and
the JIT prevented the JIT from taking full advantage of the type information.</p>
  </li>
  <li>
    <p>Erlang/OTP 26 <a href="https://www.erlang.org/blog/more-optimizations/">improved the type-based
optimizations</a>.
The most noticeable performance improvements were matching and
construction of binaries using the bit syntax. Those improvements,
combined with changes to the <code>base64</code> module itself, made encoding
to Base64 about 4 times as fast and decoding from Base64 more than 3
times as fast.</p>
  </li>
</ul>

<h3 id="what-to-expect-of-the-jit-in-erlangotp-27">What to expect of the JIT in Erlang/OTP 27</h3>

<p>The major compiler and JIT improvement in Erlang/OTP 27 is
optimization of record operations, but there are also many smaller
optimizations that make the code smaller and/or faster.</p>

<h3 id="please-try-this-at-home">Please try this at home!</h3>

<p>While this blog post will show many examples of generated code, I have
attempted to explain the optimizations in English as well. Feel free
to skip the code examples.</p>

<p>On the other hand, if you want more code examples…</p>

<p>To examine the native code for loaded modules, start the runtime system like this:</p>

<pre><code class="language-bash">erl +JDdump true
</code></pre>

<p>The native code for all modules that are loaded will be dumped to files with the
extension <code>.asm</code>.</p>

<p>To examine the BEAM code for a module, use the <code>-S</code> option when
compiling. For example:</p>

<pre><code class="language-bash">erlc -S base64.erl
</code></pre>

<h3 id="a-simple-record-optimization">A simple record optimization</h3>

<p>To get started, let’s look at a simple record optimization that was not done
in Erlang/OTP 26 and earlier. Suppose we have this module:</p>

<pre><code class="language-erlang">-record(foo, {a,b,c,d,e}).

update(N) -&gt;
    R0 = #foo{},
    R1 = R0#foo{a=N},
    R2 = R1#foo{b=2},
    R2#foo{c=3}.
</code></pre>

<p>Here is <a href="https://www.erlang.org/blog/a-brief-beam-primer/">BEAM code</a> for the
record operations:</p>

<pre><code>    {update_record,{atom,reuse},
                   6,
                   {literal,{foo,undefined,undefined,undefined,undefined,
                                 undefined}},
                   {x,0},
                   {list,[2,{x,0}]}}.
    {update_record,{atom,copy},6,{x,0},{x,0},{list,[3,{integer,2}]}}.
    {update_record,{atom,copy},6,{x,0},{x,0},{list,[4,{integer,3}]}}.
</code></pre>

<p>That is, all three record update operations have been retained as separate
<a href="https://www.erlang.org/blog/more-optimizations/#updating-records-in-otp-26"><code>update_record</code></a>
instructions. Each operation creates a new record by copying the unchanged parts of the
record and filling in the new values in the correct position.</p>

<p>The compiler in Erlang/OTP 27 will essentially rewrite <code>update/1</code> to:</p>

<pre><code class="language-erlang">update(N) -&gt;
    #foo{a=N,b=2,c=3}.
</code></pre>

<p>which will produce the following BEAM code for the record creation:</p>

<pre><code>    {put_tuple2,{x,0},
                {list,[{atom,foo},
                       {x,0},
                       {integer,2},
                       {integer,3},
                       {atom,undefined},
                       {atom,undefined}]}}.
</code></pre>

<p>Those optimizations were implemented in the following pull requests:</p>

<ul>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7491">#7491: Merge consecutive record updates</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/8086">#8086: Combine creation of a record with subsequent record updates</a></p>
  </li>
</ul>

<h3 id="updating-records-in-place">Updating records in place</h3>

<p>To explore the more sophisticated record optimization introduced in Erlang/OTP 27,
consider this example:</p>

<pre><code class="language-erlang">-module(count1).
-export([count/1]).

-record(s, {atoms=0,other=0}).

count(L) -&gt;
    count(L, #s{}).

count([X|Xs], #s{atoms=C}=S) when is_atom(X) -&gt;
    count(Xs, S#s{atoms=C+1});
count([_|Xs], #s{other=C}=S) -&gt;
    count(Xs, S#s{other=C+1});
count([], S) -&gt;
    S.
</code></pre>

<p><code>count(List)</code> counts the number of atoms and the number of other terms in the
given list. For example:</p>

<pre><code class="language-erlang">1&gt; -record(s, {atoms=0,other=0}).
ok
2&gt; count1:count([a,b,c,1,2,3,4,5]).
#s{atoms = 3,other = 5}
</code></pre>

<p>Here follows the BEAM code emitted for <code>count/2</code>:</p>

<pre><code>    {test,is_nonempty_list,{f,6},[{x,0}]}.
    {get_list,{x,0},{x,2},{x,0}}.
    {test,is_atom,{f,5},[{x,2}]}.
    {get_tuple_element,{x,1},1,{x,2}}.
    {gc_bif,'+',{f,0},3,[{tr,{x,2},{t_integer,{0,'+inf'}}},{integer,1}],{x,2}}.
    {test_heap,4,3}.
    {update_record,{atom,inplace},
                   3,
                   {tr,{x,1},
                       {t_tuple,3,true,
                                #{1 =&gt; {t_atom,[s]},
                                  2 =&gt; {t_integer,{0,'+inf'}},
                                  3 =&gt; {t_integer,{0,'+inf'}}}}},
                   {x,1},
                   {list,[2,{tr,{x,2},{t_integer,{1,'+inf'}}}]}}.
    {call_only,2,{f,4}}. % count/2
  {label,5}.
    {get_tuple_element,{x,1},2,{x,2}}.
    {gc_bif,'+',{f,0},3,[{tr,{x,2},{t_integer,{0,'+inf'}}},{integer,1}],{x,2}}.
    {test_heap,4,3}.
    {update_record,{atom,inplace},
                   3,
                   {tr,{x,1},
                       {t_tuple,3,true,
                                #{1 =&gt; {t_atom,[s]},
                                  2 =&gt; {t_integer,{0,'+inf'}},
                                  3 =&gt; {t_integer,{0,'+inf'}}}}},
                   {x,1},
                   {list,[3,{tr,{x,2},{t_integer,{1,'+inf'}}}]}}.
    {call_only,2,{f,4}}. % count/2
  {label,6}.
    {test,is_nil,{f,3},[{x,0}]}.
    {move,{x,1},{x,0}}.
    return.
</code></pre>

<p>The first two instructions test whether the first argument in <code>{x,0}</code> is a non-empty list
and if so extracts the first element of the list:</p>

<pre><code>    {test,is_nonempty_list,{f,6},[{x,0}]}.
    {get_list,{x,0},{x,2},{x,0}}.
</code></pre>

<p>The next instruction tests whether the first element is an atom. If not, a jump
is made to the code for the second clause.</p>

<pre><code>    {test,is_atom,{f,5},[{x,2}]}.
</code></pre>

<p>Next the counter for the number of atoms seen is fetched from the record and
incremented by one:</p>

<pre><code>    {get_tuple_element,{x,1},1,{x,2}}.
    {gc_bif,'+',{f,0},3,[{tr,{x,2},{t_integer,{0,'+inf'}}},{integer,1}],{x,2}}.
</code></pre>

<p>Next follows allocation of heap space and the updating of the record:</p>

<pre><code>    {test_heap,4,3}.
    {update_record,{atom,inplace},
                   3,
                   {tr,{x,1},
                       {t_tuple,3,true,
                                #{1 =&gt; {t_atom,[s]},
                                  2 =&gt; {t_integer,{0,'+inf'}},
                                  3 =&gt; {t_integer,{0,'+inf'}}}}},
                   {x,1},
                   {list,[2,{tr,{x,2},{t_integer,{1,'+inf'}}}]}}.
</code></pre>

<p>The <code>test_heap</code> instruction ensures that there is sufficient room on the heap
for copying the record (4 words).</p>

<p>The <code>update_record</code> instruction was introduced in Erlang/OTP 26. Its
first operand is an atom that is a hint from the compiler to help the
JIT emit better code. In Erlang/OTP 26 the hints <code>reuse</code> and <code>copy</code>
are used. For more about those hints, see
<a href="https://www.erlang.org/blog/more-optimizations/#updating-records-in-otp-26">Updating records in OTP 26</a>.</p>

<p>In Erlang/OTP 27, there is a new hint called <code>inplace</code>. The compiler
emits that hint when it has determined that nowhere in the runtime
system is there another reference to the tuple except for the
reference used for the <code>update_record</code> instruction. In other
words, from the <strong>compiler’s</strong> point of view, if the runtime system
were to directly update the existing record without first copying it,
the observable behavior of the program would not change. As soon will
be seen, from the <strong>runtime system’s</strong> point of view, directly updating
the record is not always safe.</p>

<p>This new optimization was implemented by Frej Drejhammar. It builds
on and extends the compiler passes added in Erlang/OTP 26 for
<a href="https://www.erlang.org/blog/more-optimizations/#appending-to-binaries-in-otp-26">appending to a binary</a>.</p>

<p>Now let’s see what the JIT will do when an <code>update_record</code> instruction
has an <code>inplace</code> hint. Here is the complete native code for the
instruction:</p>

<pre><code># update_record_in_place_IsdI
    mov rax, qword ptr [rbx+8]
    mov rcx, qword ptr [rbx+16]
    test cl, 1
    short je L38           ; Update directly if small integer.

    ; The new value is a bignum.
    ; Test whether the tuple is in the safe part of the heap.

    mov rdi, [r13+480]     ; Get the high water mark
    cmp rax, r15           ; Compare tuple pointer to heap top
    short jae L39          ; Jump and copy if above
    cmp rax, rdi           ; Compare tuple pointer to high water
    short jae L38          ; Jump and overwrite if above high water

    ; The tuple is not in the safe part of the heap.
    ; Fall through to the copy code.

L39:                       ; Copy the current record
    vmovups ymm0, [rax-2]
    vmovups [r15], ymm0
    lea rax, qword ptr [r15+2] ; Set up tagged pointer to copy
    add r15, 32            ; Advance heap top past the copy

L38:
    mov rdi, rcx           ; Get new value for atoms field
    mov qword ptr [rax+22], rdi
    mov qword ptr [rbx+8], rax
</code></pre>

<p>(Lines starting with <code>#</code> are comments emitted by the JIT, while the text
that follows <code>;</code> is a comment added by me for clarification.)</p>

<p>The BEAM loader renames an <code>update_record</code> instruction with an <code>inplace</code> hint
to <code>update_record_in_place</code>.</p>

<p>The first two instructions load the tuple to be updated into CPU register <code>rax</code> and
the new counter value (<code>C + 1</code>) into <code>rcx</code>.</p>

<pre><code>    mov rax, qword ptr [rbx+8]
    mov rcx, qword ptr [rbx+16]
</code></pre>

<p>The next two instructions test whether the new counter value is a
small integer that fits into a word. The test has been simplified to a
more efficient test that is only safe when the value is known to be an
integer. If it is a small integer, it is always safe to jump to the code
that updates the existing tuple:</p>

<pre><code>    test cl, 1
    short je L38           ; Update directly if small integer.
</code></pre>

<p>If it is not a small integer, it must be a <strong>bignum</strong>, that is, a
signed integer that does not fit in 60 bits and therefore has to be
stored on the heap, with <code>rcx</code> containing a tagged pointer to the bignum on
the heap.</p>

<p>If <code>rcx</code> is a pointer to a term on the heap, it is not always safe to
directly update the existing tuple. That is because of the way the
Erlang <a href="https://www.erlang.org/doc/apps/erts/garbagecollection#generational-garbage-collection">generational garbage
collector</a>
works. Each Erlang process has two heaps for keeping Erlang terms:
the young heap and the old heap. Terms on the young heap are allowed to
reference terms on the old heap, but not vice versa. That means that
if the tuple to be updated resides on the old heap, it is not safe to
update one of its elements so that it will reference a term on the young
heap.</p>

<p>Therefore, the JIT needs to emit code to ensure that the pointer to
the tuple resides in the “safe part” of the young heap:</p>

<pre><code>    mov rdi, [r13+480]     ; Get the high water mark
    cmp rax, r15           ; Compare tuple pointer to heap top
    short jae L39          ; Jump and copy if above
    cmp rax, rdi           ; Compare tuple pointer to high water
    short jae L38          ; Jump and overwrite if above high water
</code></pre>

<p>The safe part of the heap is between the high water mark and the heap
top. If the tuple is below the high water mark and still alive, it
will be copied to the old heap in the next garbage collection.</p>

<p>If the tuple is in the safe part, the copy code is skipped by jumping
to the code that stores the new value into the existing tuple.</p>

<p>If not, the next part will copy the existing record to the heap.</p>

<pre><code>L39:                       ; Copy the current record
    vmovups ymm0, [rax-2]
    vmovups [r15], ymm0
    lea rax, qword ptr [r15+2] ; Set up tagged pointer to copy
    add r15, 32            ; Advance heap top past the copy
</code></pre>

<p>The copying is done using <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX instructions</a>.</p>

<p>Next follows the code that writes the new value into the tuple:</p>

<pre><code>L38:
    mov rdi, rcx           ; Get new value for atoms field
    mov qword ptr [rax+22], rdi
    mov qword ptr [rbx+8], rax
</code></pre>

<p>If all the new values being written into the existing record are known
never to be tagged pointers, the native instructions can be
simplified. Consider this module:</p>

<pre><code class="language-erlang">-module(whatever).
-export([main/1]).

-record(bar, {bool,pid}).

main(Bool) when is_boolean(Bool) -&gt;
    flip_state(#bar{bool=Bool,pid=self()}).

flip_state(R) -&gt;
    R#bar{bool=not R#bar.bool}.
</code></pre>

<p>The <code>update_record</code> instruction looks like this:</p>

<pre><code>    {update_record,{atom,inplace},
                   3,
                   {tr,{x,0},
                       {t_tuple,3,true,
                                #{1 =&gt; {t_atom,[bar]},
                                  2 =&gt; {t_atom,[false,true]},
                                  3 =&gt; pid}}},
                   {x,0},
                   {list,[2,{tr,{x,1},{t_atom,[false,true]}}]}}.
</code></pre>

<p>Based on the type for the new value, <code>{t_atom,[false,true]}</code>, the
JIT is able to generate much shorter code than for the previous example:</p>

<pre><code># update_record_in_place_IsdI
    mov rax, qword ptr [rbx]
# skipped copy fallback because all new values are safe
    mov rdi, qword ptr [rbx+8]
    mov qword ptr [rax+14], rdi
    mov qword ptr [rbx], rax
</code></pre>

<p>References to literals (such as <code>[1,2,3]</code>) are also safe, because literals
are stored in a special literal area, and the garbage collector
handles them specially. Consider this code:</p>

<pre><code class="language-erlang">-record(state, {op, data}).

update_state(R0, Op0, Data) -&gt;
    R = R0#state{data=Data},
    case Op0 of
        add -&gt; R#state{op=fun erlang:'+'/2};
        sub -&gt; R#state{op=fun erlang:'-'/2}
    end.
</code></pre>

<p>Both of the record updates in the <code>case</code> can be done in place. Here
is the BEAM code for the record update in the first clause:</p>

<pre><code>    {update_record,{atom,inplace},
                   3,
                   {tr,{x,0},{t_tuple,3,true,#{1 =&gt; {t_atom,[state]}}}},
                   {x,0},
                   {list,[2,{literal,fun erlang:'+'/2}]}}.
</code></pre>

<p>Since the value to be written is a literal, the JIT emits simpler
code without the copy fallback:</p>

<pre><code># update_record_in_place_IsdI
    mov rax, qword ptr [rbx]
# skipped copy fallback because all new values are safe
    long mov rdi, 9223372036854775807  ; Placeholder for address to fun
    mov qword ptr [rax+14], rdi
    mov qword ptr [rbx], rax
</code></pre>
<p>The large integer <code>9223372036854775807</code> is a placeholder that will be
patched later, when the address of the literal fun is known.</p>

<p>Here is the pull request for updating tuples in place:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/8090">#8090: Destructive tuple update</a></li>
</ul>

<h3 id="optimizing-by-generating-less-garbage">Optimizing by generating less garbage</h3>

<p>When updating a record in place, omitting the copying of the existing
record should be a clear win, except perhaps for very small records.</p>

<p>What is less clear is the effect on garbage collection. Updating a
tuple in place is an example of optimizing by generating less
garbage. By creating less garbage, the expectation is that garbage
collections should occur less often, which should improve the
performance of the program.</p>

<p>Because of the highly variable execution time for doing a garbage
collection, it is notoriously difficult to benchmark optimizations
that reduce the amount of garbage created. Often the outcomes of
benchmarks do not apply to performing the same tasks in a real
application.</p>

<p>My own <a href="https://en.wikipedia.org/wiki/Anecdotal_evidence">anecdotal evidence</a>
suggests that in most cases there are no measurable performance wins by
producing less garbage.</p>

<p>I also remember when an optimization that reduced the size of an
Erlang term resulted in a benchmark being consistently slower. It took
the author of that optimization several days of investigation to confirm
that the slowdown in the benchmark was not the fault of his optimization:
by creating less garbage, garbage collection happened at a later time,
when it happened to be much more expensive.</p>

<p>On average we expect that this optimization should improve performance,
especially for large records.</p>

<h3 id="optimization-of-funs">Optimization of funs</h3>

<p>The internal representation of funs in the runtime system has
changed in Erlang/OTP 27, making possible several new optimizations.</p>

<p>As an example, consider this function:</p>

<pre><code class="language-erlang">madd(A, C) -&gt;
    fun(B) -&gt; A * B + C end.
</code></pre>

<p>In Erlang/OTP 26, the native code for creating the fun looks like so:</p>

<pre><code># i_make_fun3_FStt
L38:
    long mov rsi, 9223372036854775807 ; Placeholder for dispatch table
    mov edx, 1
    mov ecx, 2
    mov qword ptr [r13+80], r15
    mov rbp, rsp
    lea rsp, qword ptr [rbx-128]
    vzeroupper
    mov rdi, r13
    call 4337160320       ; Call helper function in runtime system
    mov rsp, rbp
    mov r15, qword ptr [r13+80]
# Move fun environment
    mov rdi, qword ptr [rbx]
    mov qword ptr [rax+40], rdi
    mov rdi, qword ptr [rbx+8]
    mov qword ptr [rax+48], rdi
# Create boxed ptr
    or al, 2
    mov qword ptr [rbx], rax
</code></pre>
<p>The large integer <code>9223372036854775807</code> is a placeholder
for a value that will be filled in later.</p>

<p>Most of the work of actually creating the fun object is done by
calling a helper function (the <code>call 4337160320</code> instruction) in the
runtime system.</p>

<p>In Erlang/OTP 27, the part of a fun that resides on the heap of the
calling process has been simplified so that it is now smaller than in
Erlang/OTP 26, and most importantly does not contain anything that is
too tricky to initialize in inline code.</p>

<p>The code for creating the fun is not only shorter, but it also doesn’t
need to call any function in the runtime system:</p>

<pre><code># i_make_fun3_FStt
L38:
    long mov rax, 9223372036854775807 ; Placeholder for dispatch table
# Create fun thing
    mov qword ptr [r15], 196884
    mov qword ptr [r15+8], rax
# Move fun environment
# (moving two items)
    vmovups xmm0, xmmword ptr [rbx]
    vmovups xmmword ptr [r15+16], xmm0
L39:
    long mov rdi, 9223372036854775807 ; Placeholder for fun reference
    mov qword ptr [r15+32], rdi
# Create boxed ptr
    lea rax, qword ptr [r15+2]
    add r15, 40
    mov qword ptr [rbx], rax
</code></pre>

<p>The difference from Erlang/OTP 26 is that the parts of the fun that are only
needed when loading and unloading code are no longer stored on the heap.
Instead those parts are stored in the literal pool area belonging to the loaded
code for the module, and are shared by all instances of the same fun.</p>

<p>The part of the fun that resides on the process heap is two words smaller
compared to Erlang/OTP 26.</p>

<p>The creation of the fun environment has also been optimized. In Erlang/OTP 26,
four instructions were needed:</p>

<pre><code># Move fun environment
    mov rdi, qword ptr [rbx]
    mov qword ptr [rax+40], rdi
    mov rdi, qword ptr [rbx+8]
    mov qword ptr [rax+48], rdi
</code></pre>

<p>In Erlang/OTP 27, using <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX instructions</a> both variables (<code>A</code> and <code>C</code>)
can be moved using only two instructions:</p>

<pre><code># Move fun environment
# (moving two items)
    vmovups xmm0, xmmword ptr [rbx]
    vmovups xmmword ptr [r15+16], xmm0
</code></pre>

<p>Another optimization made possible by the changed fun representation
is testing for a fun having a specific arity (the number of expected
arguments when calling it). For example:</p>

<pre><code class="language-erlang">ensure_fun_0(F) when is_function(F, 0) -&gt; ok.
</code></pre>

<p>Here is the native code emitted by the JIT in Erlang/OTP 26:</p>

<pre><code># is_function2_fss
    mov rdi, qword ptr [rbx]   ; Fetch `F` from {x,0}.

    rex test dil, 1            ; Test whether the term is a tagged pointer...
    short jne label_3          ; ... otherwise fail.

    mov eax, dword ptr [rdi-2] ; Pick up the header word.
    cmp eax, 212               ; Test whether it is a fun...
    short jne label_3          ; ... otherwise fail.

    cmp byte ptr [rdi+22], 0   ; Test whether the arity is 0...
    short jne label_3          ; ... otherwise fail.
</code></pre>

<p>In Erlang/OTP 27, the arity for the fun (the number of expected arguments) is
stored in the header word of the fun term, which means that the test
for a fun can be combined with the test for its arity:</p>

<pre><code># is_function2_fss
    mov rdi, qword ptr [rbx]   ; Fetch `F` from {x,0}.

    rex test dil, 1            ; Test whether the term is a tagged pointer...
    short jne label_3          ; ... otherwise fail.

    cmp word ptr [rdi-2], 20   ; Test whether this is a fun with arity 0...
    short jne label_3          ; ... otherwise fail.
</code></pre>

<p>All external funs are now literals stored outside all process heaps. As an
example, consider the following functions:</p>

<pre><code class="language-erlang">my_fun() -&gt;
    fun ?MODULE:some_function/0.

mfa(M, F, A) -&gt;
    fun M:F/A.
</code></pre>

<p>In Erlang/OTP 26, the external fun returned by <code>my_fun/0</code> would not occupy
any room on the heap of the calling process, while the dynamic external fun
returned by <code>mfa/3</code> would need 5 words on the heap of the calling process.</p>

<p>In Erlang/OTP 27, neither of the funs will require any room on the heap of
the calling process.</p>

<p>Those optimizations were implemented in the following pull requests:</p>

<ul>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7948">#7948: Optimize reference counting of local funs</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7314">#7314: Shrink and optimize funs (again)</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7894">#7894: Share external funs globally</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7713/commits/ae127203ac2423d057e1ef151d4ca8b114740b84">x86_64: Optimize creation of fun
environment</a>
(part of <a href="https://github.com/erlang/otp/pull/7713">#7713</a>)</p>
  </li>
</ul>

<h3 id="integer-arithmetic-improvements">Integer arithmetic improvements</h3>

<p>In the end of June last year, we released the <a href="https://www.erlang.org/patches/otp-26.0.2">OTP 26.0.2 patch</a>
for Erlang/OTP 26 that made
<a href="https://www.erlang.org/doc/man/erlang#binary_to_integer-1"><code>binary_to_integer/1</code></a> faster.</p>

<p>To find out how much faster, run this benchmark:</p>

<pre><code class="language-erlang">bench() -&gt;
    Size = 1_262_000,
    String = binary:copy(&lt;&lt;"9"&gt;&gt;, Size),
    {Time, _Val} = timer:tc(erlang, binary_to_integer, [String]),
    io:format("Size: ~p, seconds: ~p\n", [Size, Time / 1_000_000]).
</code></pre>

<p>It measures the time to convert a binary holding 1,262,000 digits to an integer.</p>

<p>Running an unpatched Erlang/OTP 26 on my Intel-based iMac from 2017,
the benchmark finishes in about 10 seconds.</p>

<p>The same benchmark run using Erlang/OTP 26.0.2 finishes in about 0.4 seconds.</p>

<p>The speed-up was achieved by three separate optimizations:</p>

<ul>
  <li>
    <p><code>binary_to_integer/1</code> was implemented as a BIF in C using a naive
algorithm that didn’t scale well.  It was replaced with a
<a href="https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm">divide-and-conquer
algorithm</a>
implemented in Erlang; a sketch of the idea is shown after this list.
(Implementing the new algorithm as a BIF wasn’t
faster than the Erlang version.)</p>
  </li>
  <li>
    <p>The runtime system’s function for doing multiplication of large
integers was modified to use the <a href="https://en.wikipedia.org/wiki/Karatsuba_algorithm">Karatsuba
algorithm</a>, which
is a divide-and-conquer multiplication algorithm invented in the 1960s.</p>
  </li>
  <li>
    <p>Some of the low-level helper functions for arithmetic with large
integers (bignums) were modified to take advantage of a 128-bit integer data
type on 64-bit CPUs when supported by the C compiler.</p>
  </li>
</ul>
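
<p>Here is a minimal sketch of the divide-and-conquer idea mentioned above
(not the actual OTP implementation):</p>

<pre><code class="language-erlang">%% Convert a binary of decimal digits to an integer by splitting the
%% digit string in half, converting each half recursively, and
%% combining the results with one large multiplication (which benefits
%% from the Karatsuba-based bignum multiplication).
to_integer(Bin) when byte_size(Bin) &lt; 32 -&gt;
    binary_to_integer(Bin);
to_integer(Bin) -&gt;
    Half = byte_size(Bin) div 2,
    &lt;&lt;Upper:Half/binary, Lower/binary&gt;&gt; = Bin,
    to_integer(Upper) * pow10(byte_size(Lower)) + to_integer(Lower).

pow10(N) -&gt; ipow(10, N).

%% Exponentiation by squaring.
ipow(_, 0) -&gt; 1;
ipow(B, N) when N rem 2 =:= 1 -&gt; B * ipow(B, N - 1);
ipow(B, N) -&gt; H = ipow(B, N div 2), H * H.
</code></pre>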

<p>Those improvements were implemented in the following pull request:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/7426">#7426: Optimize binary_to_integer/1 and friends</a></li>
</ul>

<p>In Erlang/OTP 27, some additional improvements of integer arithmetic
were implemented.  That reduced the execution time for the
<code>binary_to_integer/1</code> benchmark to about 0.3 seconds.</p>

<p>Those improvements are found in the following pull request:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/7553">#7553: Optimize integer arithmetic</a></li>
</ul>

<p>Those arithmetic enhancements improve the running times for the
<a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/pidigits-erlang-2.html">pidigits
benchmark</a>:</p>

<table>
  <thead>
    <tr>
      <th>Version</th>
      <th> </th>
      <th>Seconds</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>26.0</td>
      <td> </td>
      <td><code>7.635</code></td>
    </tr>
    <tr>
      <td>26.2.1</td>
      <td> </td>
      <td><code>2.959</code></td>
    </tr>
    <tr>
      <td>27.0</td>
      <td> </td>
      <td><code>2.782</code></td>
    </tr>
  </tbody>
</table>

<p>(Run on my M1 MacBook Pro.)</p>

<h3 id="numerous-miscellaneous-enhancements">Numerous miscellaneous enhancements</h3>

<p>Numerous enhancements have been made to the code generation for many
instructions, as well as a few to the Erlang compiler.  Here follows a
single example to show one of the improvements to the <code>=:=</code> operator:</p>

<pre><code class="language-erlang">ensure_empty_map(Map) when Map =:= #{} -&gt;
    ok.
</code></pre>

<p>Here is the BEAM code for the <code>=:=</code> operator as used in this example:</p>

<pre><code>    {test,is_eq_exact,{f,1},[{x,0},{literal,#{}}]}.
</code></pre>

<p>Here is the native code for Erlang/OTP 26:</p>

<pre><code># is_eq_exact_fss
L45:
    long mov rsi, 9223372036854775807
    mov rdi, qword ptr [rbx]
    cmp rdi, rsi
    short je L44                  ; Succeeded if the same term.

    rex test dil, 1
    short jne label_1             ; Fail quickly if not a tagged pointer.

    ; Call the general runtime function for comparing two terms.
    mov rbp, rsp
    lea rsp, qword ptr [rbx-128]
    vzeroupper
    call 4549723200
    mov rsp, rbp

    test eax, eax
    short je label_1               ; Fail if unequal.
L44:
</code></pre>

<p>The code begins with a few tests to quickly succeed or fail, but in
practice those are unlikely to trigger for this example, which means
that the general routine in the runtime system for comparing two terms
will almost always be called.</p>

<p>In Erlang/OTP 27, the JIT emits special code for testing whether a
term is an empty map:</p>

<pre><code># is_eq_exact_fss
# optimized equality test with empty map
    mov rdi, qword ptr [rbx]
    rex test dil, 1
    short jne label_1              ; Fail if not a tagged pointer.

    cmp dword ptr [rdi-2], 300
    short jne label_1              ; Fail if not a map.

    cmp dword ptr [rdi+6], 0
    short jne label_1              ; Fail if size is not zero.
</code></pre>

<p>Here follows the main pull requests for miscellaneous enhancements in Erlang/OTP 27:</p>

<ul>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7563">#7563: Enhance type analysis</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7956">#7956: Do some minor enhancements of the code generation in the JIT</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7713">#7713: Improve code generation for the JIT</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/8040">#8040: Improve caching of BEAM registers</a></p>
  </li>
</ul>]]></content><author><name>Björn Gustavsson</name></author><category term="BEAM" /><category term="JIT" /><summary type="html"><![CDATA[This post explores the new optimizations for record updates as well as some of the other improvements. It also gives a brief historic overview of recent optimizations leading up to Erlang/OTP 27.]]></summary></entry><entry><title type="html">Erlang/OTP 26 Highlights</title><link href="https://www.erlang.org/blog/otp-26-highlights/" rel="alternate" type="text/html" title="Erlang/OTP 26 Highlights" /><published>2023-05-16T00:00:00+00:00</published><updated>2023-05-16T00:00:00+00:00</updated><id>https://www.erlang.org/blog/otp-26-highlights</id><content type="html" xml:base="https://www.erlang.org/blog/otp-26-highlights/"><![CDATA[<p>Erlang/OTP 26 is finally here. This blog post will introduce the new
features that we are most excited about.</p>

<p>A list of all changes is found in <a href="/patches/OTP-26.0">Erlang/OTP 26 Readme</a>.
Or, as always, look at the release notes of the application you are interested in.
For instance: <a href="https://www.erlang.org/doc/apps/erts/notes.html#erts-14.0">Erlang/OTP 26 - Erts Release Notes - Version 14.0</a>.</p>

<p>This year’s highlights mentioned in this blog post are:</p>

<ul>
  <li><a href="#the-shell">The shell</a></li>
  <li><a href="#improvements-of-maps">Improvements of maps</a></li>
  <li><a href="#improvements-of-the-lists-module">Improvements of the <code>lists</code> module</a></li>
  <li><a href="#no-need-to-enable-feature-maybe-in-the-runtime-system">No need to enable feature <code>maybe</code> in the runtime system</a></li>
  <li><a href="#improvements-in-the-Erlang-compiler-and-jit">Improvements in the Erlang compiler and JIT</a></li>
  <li><a href="#incremental-mode-for-dialyzer">Incremental mode for Dialyzer</a></li>
  <li><a href="#argparse-a-command-line-parser-for-erlang">argparse: A command line parser for Erlang</a></li>
  <li><a href="#ssl-safer-defaults">SSL: Safer defaults</a></li>
  <li><a href="#ssl-improved-checking-of-options">SSL: Improved checking of options</a></li>
</ul>

<h1 id="the-shell">The shell</h1>

<p>OTP 26 brings many improvements to the experience of using the Erlang shell.</p>

<p>For example, functions can now be defined directly in the shell:</p>

<pre><code>1&gt; factorial(N) -&gt; factorial(N, 1).
ok
2&gt; factorial(N, F) when N &gt; 1 -&gt; factorial(N - 1, F * N);
.. factorial(_, F) -&gt; F.
ok
3&gt; factorial(5).
120
</code></pre>

<p>The shell prompt changes to <code>..</code> when the previous line is not a
complete Erlang construct.</p>

<p>Functions defined in this way are evaluated using the
<a href="https://www.erlang.org/doc/man/erl_eval.html">erl_eval</a> module, not
compiled by the Erlang compiler. That means that the performance will
not be comparable to compiled Erlang code.</p>

<p>It is also possible to define types, specs, and records, making it
possible to paste code from a module directly into the shell for
testing. For example:</p>

<pre><code>1&gt; -record(coord, {x=0.0 :: float(), y=0.0 :: float()}).
ok
2&gt; -type coord() :: #coord{}.
ok
3&gt; -spec add(coord(), coord()) -&gt; coord().
ok
4&gt; add(#coord{x=X1, y=Y1}, #coord{x=X2, y=Y2}) -&gt;
..     #coord{x=X1+X2, y=Y1+Y2}.
ok
5&gt; Origin = #coord{}.
#coord{x = 0.0,y = 0.0}
6&gt; add(Origin, #coord{y=10.0}).
#coord{x = 0.0,y = 10.0}
</code></pre>

<p>The auto-completion feature in the shell has been vastly improved,
supporting auto-completion of variables, record names, record field
names, map keys, function parameter types, and file names.</p>

<p>For example, instead of typing the variable name <code>Origin</code>, I can just
type <code>O</code> and press TAB to expand it to <code>Origin</code> since the only
variable defined in the shell with the initial letter <code>O</code> is
<code>Origin</code>. That is a little bit difficult to illustrate in a blog post,
so let’s introduce another variable starting with <code>O</code>:</p>

<pre><code>7&gt; Oxford = #coord{x=51.752022, y=-1.257677}.
#coord{x = 51.752022,y = -1.257677}
</code></pre>

<p>If I now press <code>O</code> and TAB, the shell shows the possible completions:</p>

<pre><code>8&gt; O
bindings
Origin    Oxford
</code></pre>

<p>(The word <code>bindings</code> is shown in bold and underlined.)</p>

<p>If I press <code>x</code> and TAB, the word is completed to <code>Oxford</code>:</p>

<pre><code>8&gt; Oxford.
#coord{x = 51.752022,y = -1.257677}
</code></pre>

<p>To type <code>#coord{</code> it is sufficient to type <code>#</code> and TAB (because there is
only one record currently defined in the shell):</p>

<pre><code>9&gt; #coord{
</code></pre>

<p>Pressing TAB one more time causes the field names in the record to be
printed:</p>

<pre><code>9&gt; #coord{
fields
x=    y=
</code></pre>

<p>When trying to complete something which has many possible expansions,
the shell attempts to show the most likely completions first.  For
example, if I type <code>l</code> and press TAB, the shell shows a list of BIFs
beginning with the letter <code>l</code>:</p>

<pre><code>10&gt; l
bifs
length(                   link(                     list_to_atom(
list_to_binary(           list_to_bitstring(        list_to_existing_atom(
list_to_float(            list_to_integer(          list_to_pid(
list_to_port(             list_to_ref(              list_to_tuple(
Press tab to see all 37 expansions
</code></pre>

<p>Pressing TAB again, more BIFs are shown, as well as possible shell commands
and modules:</p>

<pre><code>10&gt; l
bifs
length(                   link(                     list_to_atom(
list_to_binary(           list_to_bitstring(        list_to_existing_atom(
list_to_float(            list_to_integer(          list_to_pid(
list_to_port(             list_to_ref(              list_to_tuple(
load_module(
commands
l(     lc(    lm(    ls(
modules
lcnt:                      leex:                      lists:
local_tcp:                 local_udp:                 log_mf_h:
logger:                    logger_backend:            logger_config:
logger_disk_log_h:         logger_filters:            logger_formatter:
logger_h_common:           logger_handler_watcher:    logger_olp:
logger_proxy:              logger_server:             logger_simple_h:
logger_std_h:              logger_sup:
</code></pre>

<p>Typing <code>ists:</code> (to complete the word <code>lists</code>) and pressing TAB, a
partial list of functions in the <code>lists</code> module is shown:</p>

<pre><code>10&gt; lists:
functions
all(            any(            append(         concat(         delete(
droplast(       dropwhile(      duplicate(      enumerate(      filter(
filtermap(      flatlength(     flatmap(        flatten(        foldl(
foldr(          foreach(        join(           keydelete(      keyfind(
Press tab to see all 72 expansions
</code></pre>

<p>Typing <code>m</code> and pressing TAB, the list of functions is narrowed down to
just those beginning with the letter <code>m</code>:</p>

<pre><code>10&gt; lists:m
functions
map(            mapfoldl(       mapfoldr(       max(            member(
merge(          merge3(         min(            module_info(
</code></pre>

<h2 id="animations-showing-shell-features">Animations showing shell features</h2>

<ul>
  <li>
    <p><a href="https://asciinema.org/a/iLU2CVuH7kOHFLaCxe6GLI2D5">Local functions in the shell</a></p>
  </li>
  <li>
    <p><a href="https://asciinema.org/a/iZTr7Wz2HBbDUOikhkplT2VUS">Multi-line editing in the shell</a></p>
  </li>
  <li>
    <p><a href="https://asciinema.org/a/RmBrWarb1wiUUBg6Rz0Ylqii8">File name and function name completion</a></p>
  </li>
  <li>
    <p><a href="https://asciinema.org/a/I2DsfnEaeXijVGiW8aI6YzJNT">Bindings and records in the shell</a></p>
  </li>
</ul>

<h1 id="improvements-of-maps">Improvements of maps</h1>

<h2 id="changed-ordering-of-atom-keys">Changed ordering of atom keys</h2>

<p>OTP 25 and earlier releases printed small maps (up to 32 elements)
with atom keys according to the term order of their keys:</p>

<pre><code class="language-erlang">1&gt; AM = #{a =&gt; 1, b =&gt; 2, c =&gt; 3}.
#{a =&gt; 1,b =&gt; 2,c =&gt; 3}
2&gt; maps:to_list(AM).
[{a,1},{b,2},{c,3}]
</code></pre>

<p>In OTP 26, as an optimization for certain map operations, such as
<code>maps:from_list/1</code>, maps with atom keys are now sorted in a different
order. The new order is undefined and may change between different
invocations of the Erlang VM. On my computer at the time of writing,
I got the following order:</p>

<pre><code class="language-erlang">1&gt; AM = #{a =&gt; 1, b =&gt; 2, c =&gt; 3}.
#{c =&gt; 3,a =&gt; 1,b =&gt; 2}
2&gt; maps:to_list(AM).
[{c,3},{a,1},{b,2}]
</code></pre>

<p>There is a new modifier <code>k</code> for format strings to specify that maps should
be sorted according to the term order of their keys before printing:</p>

<pre><code class="language-erlang">3&gt; io:format("~kp\n", [AM]).
#{a =&gt; 1,b =&gt; 2,c =&gt; 3}
ok
</code></pre>

<p>It is also possible to use a <a href="https://www.erlang.org/doc/man/io.html#format-1">custom ordering
fun</a>.  For example,
to order the map elements in reverse order based on their keys:</p>

<pre><code class="language-erlang">4&gt; io:format("~Kp\n", [fun(A, B) -&gt; A &gt; B end, AM]).
#{c =&gt; 3,b =&gt; 2,a =&gt; 1}
ok
</code></pre>

<p>There is also a new
<a href="https://www.erlang.org/doc/man/maps.html#iterator-2">maps:iterator/2</a>
function that supports iterating over the elements of the map in a more
intuitive order. Examples will be shown in the next section.</p>

<h2 id="map-comprehensions">Map comprehensions</h2>

<p>In OTP 25 and earlier, it was common to combine <code>maps:from_list/1</code> and
<code>maps:to_list/1</code> with list comprehensions. For example:</p>

<pre><code class="language-erlang">1&gt; M = maps:from_list([{I,I*I} || I &lt;- lists:seq(1, 5)]).
#{1 =&gt; 1,2 =&gt; 4,3 =&gt; 9,4 =&gt; 16,5 =&gt; 25}
</code></pre>

<p>In OTP 26, that can be written more succinctly with a <a href="https://www.erlang.org/doc/reference_manual/expressions.html#comprehensions"><strong>map comprehension</strong></a>:</p>

<pre><code class="language-erlang">1&gt; M = #{I =&gt; I*I || I &lt;- lists:seq(1, 5)}.
#{1 =&gt; 1,2 =&gt; 4,3 =&gt; 9,4 =&gt; 16,5 =&gt; 25}
</code></pre>

<p>With a <strong>map generator</strong>, a comprehension can now iterate over the
elements of a map. For example:</p>

<pre><code class="language-erlang">2&gt; [K || K := V &lt;- M, V &lt; 10].
[1,2,3]
</code></pre>

<p>Using a map comprehension with a map generator, here is an example
showing how keys and values can be swapped:</p>

<pre><code class="language-erlang">3&gt; #{V =&gt; K || K := V &lt;- M}.
#{1 =&gt; 1,4 =&gt; 2,9 =&gt; 3,16 =&gt; 4,25 =&gt; 5}
</code></pre>

<p>Map generators accept map iterators as well as maps. Especially useful
are the ordered iterators returned from the new
<a href="https://www.erlang.org/doc/man/maps.html#iterator-2">maps:iterator/2</a>
function:</p>

<pre><code class="language-erlang">4&gt; AM = #{a =&gt; 1, b =&gt; 2, c =&gt; 1}.
#{c =&gt; 1,a =&gt; 1,b =&gt; 2}
5&gt; [{K,V} || K := V &lt;- maps:iterator(AM, ordered)].
[{a,1},{b,2},{c,1}]
6&gt; [{K,V} || K := V &lt;- maps:iterator(AM, reversed)].
[{c,1},{b,2},{a,1}]
7&gt; [{K,V} || K := V &lt;- maps:iterator(AM, fun(A, B) -&gt; A &gt; B end)].
[{c,1},{b,2},{a,1}]

</code></pre>

<p>Map comprehensions were first suggested in <a href="https://www.erlang.org/eeps/eep-0058">EEP 58</a>.</p>

<h2 id="inlined-mapsget3">Inlined <code>maps:get/3</code></h2>

<p>In OTP 26, the compiler will inline calls to
<a href="https://www.erlang.org/doc/man/maps.html#get-3">maps:get/3</a>, making them slightly
more efficient.</p>
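
<p>As a reminder, <code>maps:get/3</code> returns its third argument as a
default when the key is absent, so the calls that benefit from the
inlining look like this:</p>

<pre><code class="language-erlang">1&gt; maps:get(x, #{x =&gt; 1}, 0).
1
2&gt; maps:get(y, #{x =&gt; 1}, 0).
0
</code></pre>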

<h2 id="improved-mapsmerge2">Improved <code>maps:merge/2</code></h2>

<p>When merging two maps, the
<a href="https://www.erlang.org/doc/man/maps.html#merge-2">maps:merge/2</a>
function will now try to reuse the <a href="https://www.erlang.org/doc/efficiency_guide/maps.html#how-small-maps-are-implemented">key
tuple</a>
from one of the maps in order to reduce the memory usage for maps.</p>

<p>For example:</p>

<pre><code class="language-erlang">1&gt; maps:merge(#{x =&gt; 13, y =&gt; 99, z =&gt; 100}, #{x =&gt; 0, z =&gt; -7}).
#{y =&gt; 99,x =&gt; 0,z =&gt; -7}
</code></pre>

<p>The resulting map has the same three keys as the first map, so it can reuse the
key tuple from the first map.</p>

<p>This optimization is not possible if one of the maps has any key not present
in the other map. For example:</p>

<pre><code class="language-erlang">2&gt; maps:merge(#{x =&gt; 1000}, #{y =&gt; 2000}).
#{y =&gt; 2000,x =&gt; 1000}
</code></pre>

<h2 id="improved-map-updates">Improved map updates</h2>

<p>Updating of a map using the <code>=&gt;</code> operator has been improved to avoid
updates that don’t change the value of the map or its <a href="https://www.erlang.org/doc/efficiency_guide/maps.html#how-small-maps-are-implemented">key
tuple</a>.
For example:</p>

<pre><code class="language-erlang">1&gt; M = #{a =&gt; 42}.
#{a =&gt; 42}
2&gt; M#{a =&gt; 42}.
#{a =&gt; 42}
</code></pre>

<p>The update operation does not change the value of the map, so in order
to save memory, the original map is returned.</p>

<p>(A <a href="https://github.com/erlang/otp/pull/1889">similar optimization for the <code>:=</code>
operator</a> was implemented 5
years ago.)</p>

<p>When updating the values of keys that already exist in a map using the
<code>=&gt;</code> operator, the key tuple will now be re-used. For example:</p>

<pre><code class="language-erlang">3&gt; M#{a =&gt; 100}.
#{a =&gt; 100}
</code></pre>
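
<p>Both behaviors can be observed with <code>erts_debug:same/2</code>, an
internal debugging aid (not something to rely on in production code)
that tests whether two terms are represented by the very same memory:</p>

<pre><code class="language-erlang">4&gt; erts_debug:same(M, M#{a =&gt; 42}).
true
5&gt; erts_debug:same(M, M#{a =&gt; 100}).
false
</code></pre>

<p>The same-value update hands back the original map, while the update
to <code>100</code> builds a new map (which shares the key tuple, although
<code>erts_debug:same/2</code> cannot show that).</p>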

<h2 id="the-pull-requests-for-map-improvements">The pull requests for map improvements</h2>

<p>For anyone who wants to dig deeper, here are the main pull requests
for maps for OTP 26:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/6727">Implement map comprehensions (EEP-58)</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6151">Use in-memory atom ordering for map ordering</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6718">Add maps:iterator/2 with ~k and ~K format options for printing ordered maps</a></li>
  <li><a href="https://github.com/erlang/otp/pull/7003">sys_core_fold: Inline maps:get/3</a></li>
  <li><a href="https://github.com/erlang/otp/pull/7004">Optimize maps:merge/2 of small maps</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6267">Inline creation of small maps with literal keys</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6178">Enhance creation of maps with literal keys</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6657">Do not allocate a new map when the value is the same encore</a></li>
</ul>

<h1 id="improvements-of-the-lists-module">Improvements of the <code>lists</code> module</h1>

<h2 id="new-function-listsenumerate3">New function <code>lists:enumerate/3</code></h2>

<p>In OTP 25, <a href="https://erlang.org/doc/man/lists.html#enumerate-1">lists:enumerate/1</a>
and <code>lists:enumerate/2</code> were introduced. For example:</p>

<pre><code class="language-erlang">1&gt; lists:enumerate([a,b,c]).
[{1,a},{2,b},{3,c}]
2&gt; lists:enumerate(0, [a,b,c]).
[{0,a},{1,b},{2,c}]
</code></pre>

<p>In OTP 26, <a href="https://erlang.org/doc/man/lists.html#enumerate-3">lists:enumerate/3</a>
completes the family of functions by allowing an increment to be specified:</p>

<pre><code class="language-erlang">3&gt; lists:enumerate(0, 10, [a,b,c]).
[{0,a},{10,b},{20,c}]
4&gt; lists:enumerate(0, -1, [a,b,c]).
[{0,a},{-1,b},{-2,c}]
</code></pre>

<h2 id="new-options-for-the-zip-family-of-functions">New options for the <code>zip</code> family of functions</h2>

<p>The <code>zip</code> family of functions in the <code>lists</code> module combines two or three lists
into a single list of tuples. For example:</p>

<pre><code class="language-erlang">1&gt; lists:zip([a,b,c], [1,2,3]).
[{a,1},{b,2},{c,3}]

</code></pre>

<p>The existing <code>zip</code> functions fail if the lists don’t have the same length:</p>

<pre><code class="language-erlang">2&gt; lists:zip([a,b,c,d], [1,2,3]).
** exception error: no function clause matching . . .
</code></pre>

<p>In OTP 26, the <a href="https://www.erlang.org/doc/man/lists.html#zip-2"><code>zip</code>
functions</a> now take
an extra <code>How</code> parameter that determines what should happen when the
lists are of unequal length.</p>

<p>For some use cases for <code>zip</code>, ignoring the superfluous elements in the
longer list or lists can make sense. That can be done using the <code>trim</code>
option:</p>

<pre><code class="language-erlang">3&gt; lists:zip([a,b,c,d], [1,2,3], trim).
[{a,1},{b,2},{c,3}]
</code></pre>

<p>For other use cases it could make more sense to extend the shorter
list or lists to the length of the longest list. That can be done
using the <code>{pad, Defaults}</code> option, where <code>Defaults</code> should be a tuple
having the same number of elements as the number of lists. For
<code>lists:zip/3</code>, that means that the <code>Defaults</code> tuple should have two
elements:</p>

<pre><code class="language-erlang">4&gt; lists:zip([a,b,c,d], [1,2,3], {pad, {zzz, 999}}).
[{a,1},{b,2},{c,3},{d,999}]
5&gt; lists:zip([a,b,c], [1,2,3,4,5], {pad, {zzz, 999}}).
[{a,1},{b,2},{c,3},{zzz,4},{zzz,5}]
</code></pre>

<p>For <code>lists:zip3/3</code> the <code>Defaults</code> tuple should have three elements:</p>

<pre><code class="language-erlang">6&gt; lists:zip3([], [a], [1,2,3], {pad, {0.0, zzz, 999}}).
[{0.0,a,1},{0.0,zzz,2},{0.0,zzz,3}]
</code></pre>
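
<p>To make the semantics of <code>How</code> concrete, here is a small
sketch (an illustration, not the actual <code>lists</code> source) of how
<code>trim</code> and <code>{pad, Defaults}</code> can be interpreted for two
lists (the failing default behavior is omitted):</p>

<pre><code class="language-erlang">%% How only matters once one of the lists runs out.
zip2([X | Xs], [Y | Ys], How) -&gt; [{X, Y} | zip2(Xs, Ys, How)];
zip2([], [], _How) -&gt; [];
zip2(_Xs, _Ys, trim) -&gt; [];
zip2([], Ys, {pad, {Dx, _}}) -&gt; [{Dx, Y} || Y &lt;- Ys];
zip2(Xs, [], {pad, {_, Dy}}) -&gt; [{X, Dy} || X &lt;- Xs].
</code></pre>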

<h1 id="no-need-to-enable-feature-maybe-in-the-runtime-system">No need to enable feature <code>maybe</code> in the runtime system</h1>

<p>In OTP 25, the <a href="https://www.erlang.org/doc/reference_manual/features.html">feature
concept</a>
and the <a href="https://www.erlang.org/doc/reference_manual/expressions.html#maybe">maybe
feature</a>
were introduced. In order to use <code>maybe</code> in OTP 25, it is necessary to
enable it in both the compiler and the runtime system. For example:</p>

<pre><code>$ cat t.erl
-module(t).
-feature(maybe_expr, enable).
-export([listen_port/2]).
listen_port(Port, Options) -&gt;
    maybe
        {ok, ListenSocket} ?= inet_tcp:listen(Port, Options),
        {ok, Address} ?= inet:sockname(ListenSocket),
        {ok, {ListenSocket, Address}}
    end.
$ erlc t.erl
$ erl
Erlang/OTP 25 . . .

Eshell V13.1.1  (abort with ^G)
1&gt; t:listen_port(50000, []).
=ERROR REPORT==== 6-Apr-2023::12:01:20.373223 ===
Loading of . . ./t.beam failed: {features_not_allowed,
                                 [maybe_expr]}

** exception error: undefined function t:listen_port/2
2&gt; q().
$ erl -enable-feature maybe_expr
Erlang/OTP 25 . . .

Eshell V13.1.1  (abort with ^G)
1&gt; t:listen_port(50000, []).
{ok,{#Port&lt;0.5&gt;,{{0,0,0,0},50000}}}
</code></pre>

<p>In OTP 26, it is no longer necessary to enable a feature in the
runtime system in order to load modules that are using it.
It is sufficient to have <code>-feature(maybe_expr, enable).</code> in the module.</p>

<p>For example:</p>

<pre><code>$ erlc t.erl
$ erl
Erlang/OTP 26 . . .

Eshell V14.0 (press Ctrl+G to abort, type help(). for help)
1&gt; t:listen_port(50000, []).
{ok,{#Port&lt;0.4&gt;,{{0,0,0,0},50000}}}
</code></pre>

<h1 id="improvements-in-the-erlang-compiler-and-jit">Improvements in the Erlang compiler and JIT</h1>

<p>OTP 26 improves on the type-based optimizations in the JIT introduced
last year, but the most noticeable improvements are for matching and
construction of binaries using the bit syntax. Those improvements,
combined with changes to the <code>base64</code> module itself, make encoding to
Base64 about 4 times faster and decoding from Base64 more than 3
times faster.</p>

<p>More details about these improvements can be found in the blog post
<a href="https://www.erlang.org/blog/more-optimizations">More Optimizations in the Compiler and JIT</a>.</p>

<p>Worth mentioning here is also the re-introduction of an optimization
that was lost when the JIT was introduced in OTP 24:</p>

<p><a href="https://github.com/erlang/otp/pull/6963">erts: Reintroduce literal fun optimization</a></p>

<p>It turns out that this optimization is important for the
<a href="https://github.com/michalmuskala/jason">jason</a> library. Without it,
<a href="https://github.com/michalmuskala/jason/pull/161">JSON decoding is 10 percent
slower</a>.</p>

<h1 id="incremental-mode-for-dialyzer">Incremental mode for Dialyzer</h1>

<p>Dialyzer has a new incremental mode implemented by Tom Davies. The
incremental mode can greatly speed up the analysis when only small
changes have been done to a code base.</p>

<p>Let’s jump straight into an example. Assuming that we want to prepare
a pull request for the <code>stdlib</code> application, here is how we can use Dialyzer’s
incremental mode to show warnings for any issues in <code>stdlib</code>:</p>

<pre><code>$ dialyzer --incremental --apps erts kernel stdlib compiler crypto --warning_apps stdlib
Proceeding with incremental analysis... done in 0m14.91s
done (passed successfully)
</code></pre>

<p>Let’s break down the command line:</p>

<ul>
  <li>
    <p>The <code>--incremental</code> option tells Dialyzer to use the incremental mode.</p>
  </li>
  <li>
    <p>The <code>--warning_apps stdlib</code> option lists the application that we want
warnings for. In this case, it’s the <code>stdlib</code> application.</p>
  </li>
  <li>
    <p>The <code>--apps erts kernel stdlib compiler crypto</code> option lists the
applications that should be analyzed, but without generating any
warnings.</p>
  </li>
</ul>

<p>Dialyzer analyzed all modules given for the <code>--apps</code> and
<code>--warning_apps</code> options. On my computer, the analysis finished in
about 15 seconds.</p>

<p>If I immediately run Dialyzer with the same arguments, it finishes pretty much
instantaneously because nothing has been changed:</p>

<pre><code>$ dialyzer --incremental --warning_apps stdlib --apps erts kernel stdlib compiler crypto
done (passed successfully)
</code></pre>

<p>If I make any change to the <code>lists</code> module (for example, by adding a new
function), Dialyzer will re-analyze all modules that depend on it
directly or indirectly:</p>

<pre><code>$ dialyzer --incremental --warning_apps stdlib --apps erts kernel stdlib compiler crypto
There have been changes to analyze
    Of the 270 files being tracked, 1 have been changed or removed,
    resulting in 270 requiring analysis because they depend on those changes
Proceeding with incremental analysis... done in 0m14.95s
done (passed successfully)
</code></pre>

<p>It turns out that all modules in the analyzed applications depend on
the <code>lists</code> module directly or indirectly.</p>

<p>If I change something in the <code>base64</code> module, the re-analysis will be
much quicker because there are fewer dependencies:</p>

<pre><code>$ dialyzer --incremental --warning_apps stdlib --apps erts kernel stdlib compiler crypto
There have been changes to analyze
    Of the 270 files being tracked, 1 have been changed or removed,
    resulting in 3 requiring analysis because they depend on those changes
Proceeding with incremental analysis... done in 0m1.07s
done (passed successfully)
</code></pre>

<p>In this case only three modules needed to be re-analyzed, which was
done in about one second.</p>

<h2 id="using-the-dialyzerconfig-file">Using the dialyzer.config file</h2>

<p>Note that all of the examples above used the same command line.</p>

<p>When running Dialyzer in the incremental mode, the list of
applications to be analyzed and the list of applications to produce
warnings for must be supplied every time Dialyzer is invoked.</p>

<p>To avoid having to supply the application lists on the command line,
they can be put into a configuration file named <code>dialyzer.config</code>.
To find out in which directory Dialyzer will look for the configuration
file, run the following command:</p>

<pre><code>$ dialyzer --help
  .
  .
  .
Configuration file:
     Dialyzer's configuration file may also be used to augment the default
     options and those given directly to the Dialyzer command. It is commonly
     used to avoid repeating options which would otherwise need to be given
     explicitly to Dialyzer on every invocation.

     The location of the configuration file can be set via the
     DIALYZER_CONFIG environment variable, and defaults to
     within the user_config location given by filename:basedir/3.

     On your system, the location is currently configured as:
       /Users/bjorng/Library/Application Support/erlang/dialyzer.config

     An example configuration file's contents might be:

       {incremental,
         {default_apps,[stdlib,kernel,erts]},
         {default_warning_apps,[stdlib]}
       }.
       {warnings, [no_improper_lists]}.
       {add_pathsa,["/users/samwise/potatoes/ebin"]}.
       {add_pathsz,["/users/smeagol/fish/ebin"]}.

  .
  .
  .

</code></pre>

<p>Near the end there is information about the configuration file and where Dialyzer
will look for it.</p>

<p>To shorten the command line for our previous examples, the following term can
be stored in the <code>dialyzer.config</code> file:</p>

<pre><code>{incremental,
 {default_apps, [erts,kernel,stdlib,compiler,crypto]},
 {default_warning_apps, [stdlib]}
}.
</code></pre>

<p>Now it is sufficient to just give the <code>--incremental</code> option to Dialyzer:</p>

<pre><code>$ dialyzer --incremental
done (passed successfully)
</code></pre>

<h2 id="running-dialyzer-on-proper">Running Dialyzer on proper</h2>

<p>As a final example, let’s run Dialyzer on
<a href="https://github.com/proper-testing/proper/">PropEr</a>.</p>

<p>To do that, the <code>default_warning_apps</code> option in the configuration
file must be changed to <code>proper</code>. It is also necessary to add the
<code>add_pathsa</code> option to prepend the path of the <code>proper</code> application to
the code path:</p>

<pre><code>{incremental,
 {default_apps, [erts,kernel,stdlib,compiler,crypto]},
 {default_warning_apps, [proper]}
}.
{add_pathsa, ["/Users/bjorng/git/proper/_build/default/lib/proper"]}.
</code></pre>

<p>Running Dialyzer:</p>

<pre><code>$ dialyzer --incremental
There have been changes to analyze
    Of the 296 files being tracked,
    26 have been changed or removed,
    resulting in 26 requiring analysis because they depend on those changes
Proceeding with incremental analysis...
proper.erl:2417:13: Unknown function cover:start/1
proper.erl:2426:13: Unknown function cover:stop/1
proper_symb.erl:249:9: Unknown function erl_syntax:atom/1
proper_symb.erl:250:5: Unknown function erl_syntax:revert/1
proper_symb.erl:250:23: Unknown function erl_syntax:application/3
proper_symb.erl:257:51: Unknown function erl_syntax:nil/0
proper_symb.erl:259:49: Unknown function erl_syntax:cons/2
proper_symb.erl:262:5: Unknown function erl_syntax:revert/1
proper_symb.erl:262:23: Unknown function erl_syntax:tuple/1
 done in 0m2.36s
done (warnings were emitted)
</code></pre>

<p>Dialyzer found 26 new files to analyze (the BEAM files in the <code>proper</code> application).
Those were analyzed in about two and a half seconds.</p>

<p>Dialyzer emitted warnings for unknown functions because <code>proper</code> calls
functions in applications that were not being analyzed. To eliminate those warnings,
the <code>tools</code> and <code>syntax_tools</code> applications can be added to the
<code>default_apps</code> list:</p>

<pre><code>{incremental,
 {default_apps, [erts,kernel,stdlib,compiler,crypto,tools,syntax_tools]},
 {default_warning_apps, [proper]}
}.
{add_pathsa, ["/Users/bjorng/git/proper/_build/default/lib/proper"]}.
</code></pre>

<p>With that change to the configuration file, no more warnings are printed:</p>

<pre><code>$ dialyzer --incremental
There have been changes to analyze
    Of the 319 files being tracked,
    23 have been changed or removed,
    resulting in 38 requiring analysis because they depend on those changes
Proceeding with incremental analysis... done in 0m6.47s
</code></pre>

<p>It is also possible to include warning options in the configuration
file, for example to disable warnings for non-proper lists or to enable
warnings for unmatched returns. Let’s enable warnings for unmatched
returns:</p>

<pre><code>{incremental,
 {default_apps, [erts,kernel,stdlib,compiler,crypto,tools,syntax_tools]},
 {default_warning_apps, [proper]}
}.
{warnings, [unmatched_returns]}.
{add_pathsa, ["/Users/bjorng/git/proper/_build/default/lib/proper"]}.
</code></pre>

<p>When warnings options are changed, Dialyzer will re-analyze all modules:</p>

<pre><code>$ dialyzer --incremental
PLT was built for a different set of enabled warnings,
so an analysis must be run for 319 modules to rebuild it
Proceeding with incremental analysis... done in 0m19.43s
done (passed successfully)
</code></pre>

<h2 id="pull-request">Pull request</h2>

<p><a href="https://github.com/erlang/otp/pull/5997">dialyzer: Add incremental analysis mode</a></p>

<h1 id="argparse-a-command-line-parser-for-erlang">argparse: A command line parser for Erlang</h1>

<p>New in OTP 26 is the
<a href="https://www.erlang.org/doc/man/argparse.html">argparse</a> module, which
simplifies parsing of the command line in
<a href="https://www.erlang.org/doc/man/escript.html">escripts</a>.  <code>argparse</code>
was implemented by Maxim Fedorov.</p>

<p>To show only a few of the features, let’s implement the command-line
parsing for an escript called <code>ehead</code>, inspired by the Unix command
<a href="https://en.wikipedia.org/wiki/Head_(Unix)">head</a>:</p>

<pre><code class="language-erlang">#!/usr/bin/env escript
%% -*- erlang -*-

main(Args) -&gt;
    argparse:run(Args, cli(), #{progname =&gt; ehead}).

cli() -&gt;
    #{
      arguments =&gt;
          [#{name =&gt; lines, type =&gt; {integer, [{min, 1}]},
             short =&gt; $n, long =&gt; "-lines", default =&gt; 10,
             help =&gt; "number of lines to print"},
           #{name =&gt; files, nargs =&gt; nonempty_list, action =&gt; extend,
             help =&gt; "lists of files"}],
      handler =&gt; fun(Args) -&gt;
                         io:format("~p\n", [Args])
                 end
     }.
</code></pre>

<p>As currently written, the <code>ehead</code> script will simply print the
arguments collected by <code>argparse</code> and quit.</p>

<p>If <code>ehead</code> is run without any arguments an error message will be
shown:</p>

<pre><code>$ ehead
error: ehead: required argument missing: files
Usage:
  ehead [-n &lt;lines&gt;] [--lines &lt;lines&gt;] &lt;files&gt;...

Arguments:
  files       lists of files

Optional arguments:
  -n, --lines number of lines to print (int &gt;= 1, 10)
</code></pre>

<p>The message tells us that at least one file name must be given:</p>

<pre><code>$ ehead foo bar baz
#{lines =&gt; 10,files =&gt; ["foo","bar","baz"]}
</code></pre>

<p>Since the command line was valid, <code>argparse</code> collected the arguments
into a map, which was then printed by the <code>handler</code> fun.</p>

<p>The number of lines to be printed from each file defaults to <code>10</code>, but
can be changed using either the <code>-n</code> or <code>--lines</code> option:</p>

<pre><code>$ ehead -n 42 foo bar baz
#{lines =&gt; 42,files =&gt; ["foo","bar","baz"]}
$ ehead foo --lines=42 bar baz
#{lines =&gt; 42,files =&gt; ["foo","bar","baz"]}
$ ehead --lines 42 foo bar baz
#{lines =&gt; 42,files =&gt; ["foo","bar","baz"]}
$ ehead foo bar --lines 42 baz
#{lines =&gt; 42,files =&gt; ["foo","bar","baz"]}
</code></pre>

<p>Attempting to give the number of lines as <code>0</code> results in an error message:</p>

<pre><code>$ ehead -n 0 foobar
error: ehead: invalid argument for lines: 0 is less than accepted minimum
Usage:
  ehead [-n &lt;lines&gt;] [--lines &lt;lines&gt;] &lt;files&gt;...

Arguments:
  files       lists of files

Optional arguments:
  -n, --lines number of lines to print (int &gt;= 1, 10)
</code></pre>
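
<p>To turn <code>ehead</code> into a working <code>head</code> clone, the
<code>handler</code> fun can consume the argument map directly. Here is one
possible way to finish the script (a sketch; <code>print_head/2</code> and
<code>do_print/2</code> are made-up helpers, not part of <code>argparse</code>):</p>

<pre><code class="language-erlang">%% Replace the io:format/2 handler in cli/0 with:
handler =&gt; fun(#{lines := N, files := Files}) -&gt;
                   lists:foreach(fun(F) -&gt; print_head(F, N) end, Files)
           end

%% ...and add these helpers to the script:
print_head(File, N) -&gt;
    {ok, Dev} = file:open(File, [read]),
    try
        do_print(Dev, N)
    after
        file:close(Dev)
    end.

%% Print at most N lines from the open device.
do_print(_Dev, 0) -&gt; ok;
do_print(Dev, N) -&gt;
    case io:get_line(Dev, "") of
        eof  -&gt; ok;
        Line -&gt; io:put_chars(Line), do_print(Dev, N - 1)
    end.
</code></pre>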

<h2 id="pull-request-1">Pull request</h2>

<p><a href="https://github.com/erlang/otp/pull/6852">[argparse] Command line parser for Erlang</a></p>

<h1 id="ssl-safer-defaults">SSL: Safer defaults</h1>

<p>In OTP 25, the default options for
<a href="https://www.erlang.org/doc/man/ssl.html#connect-3">ssl:connect/3</a>
would allow setting up a connection without verifying the
authenticity of the server (that is, without checking the server’s
certificate chain). For example:</p>

<pre><code class="language-erlang">Erlang/OTP 25 . . .

Eshell V13.1.1  (abort with ^G)
1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, []).
=WARNING REPORT==== 6-Apr-2023::12:29:20.824457 ===
Description: "Authenticity is not established by certificate path validation"
     Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"

{ok,{sslsocket,{gen_tcp,#Port&lt;0.6&gt;,tls_connection,undefined},
               [&lt;0.122.0&gt;,&lt;0.121.0&gt;]}}
</code></pre>

<p>A warning report would be generated, but a connection would be set up.</p>

<p>In OTP 26, the default value for the <code>verify</code> option is now
<code>verify_peer</code> instead of <code>verify_none</code>. Host verification
requires trusted CA certificates to be supplied using one of the options
<code>cacerts</code> or <code>cacertfile</code>. Therefore, a connection attempt with an empty
option list will fail in OTP 26:</p>

<pre><code class="language-erlang">Erlang/OTP 26 . . .

Eshell V14.0 (press Ctrl+G to abort, type help(). for help)
1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, []).
{error,{options,incompatible,
                [{verify,verify_peer},{cacerts,undefined}]}}
</code></pre>

<p>The default value for the <code>cacerts</code> option is <code>undefined</code>,
which is not compatible with the <code>{verify,verify_peer}</code> option.</p>

<p>To make the connection succeed, the recommended way is to
use the <code>cacerts</code> option to supply the CA certificates to be used
for verification. For example:</p>

<pre><code class="language-erlang">1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, [{cacerts, public_key:cacerts_get()}]).
{ok,{sslsocket,{gen_tcp,#Port&lt;0.5&gt;,tls_connection,undefined},
               [&lt;0.137.0&gt;,&lt;0.136.0&gt;]}}
</code></pre>

<p>Alternatively, host verification can be explicitly disabled. For example:</p>

<pre><code class="language-erlang">1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, [{verify,verify_none}]).
{ok,{sslsocket,{gen_tcp,#Port&lt;0.6&gt;,tls_connection,undefined},
               [&lt;0.143.0&gt;,&lt;0.142.0&gt;]}}
</code></pre>

<p>OTP 26 is also safer in that legacy algorithms such as SHA1 and
DSA are no longer allowed by default.</p>

<h1 id="ssl-improved-checking-of-options">SSL: Improved checking of options</h1>

<p>In OTP 26, the checking of options is strengthened to return errors
for incorrect options that used to be silently ignored. For example,
<code>ssl</code> now rejects the <code>fail_if_no_peer_cert</code> option if used for the
client:</p>

<pre><code class="language-erlang">1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, [{fail_if_no_peer_cert, true}, {verify, verify_peer}, {cacerts, public_key:cacerts_get()}]).
{error,{option,server_only,fail_if_no_peer_cert}}
</code></pre>

<p>In OTP 25, the option would be silently ignored.</p>

<p><code>ssl</code> in OTP 26 also returns clearer error reasons. In the example in
the previous section the following connection attempt was shown:</p>

<pre><code class="language-erlang">2&gt; ssl:connect("www.erlang.org", 443, []).
{error,{options,incompatible,
                [{verify,verify_peer},{cacerts,undefined}]}}
</code></pre>

<p>In OTP 25, the corresponding error return is less clear:</p>

<pre><code class="language-erlang">2&gt; ssl:connect("www.erlang.org", 443, [{verify,verify_peer}]).
{error,{options,{cacertfile,[]}}}
</code></pre>]]></content><author><name>Björn Gustavsson</name></author><category term="erlang" /><category term="otp" /><category term="26" /><category term="release" /><summary type="html"><![CDATA[Erlang/OTP 26 is finally here. This blog post will introduce the new features that we are most excited about.]]></summary></entry><entry><title type="html">More Optimizations in the Compiler and JIT</title><link href="https://www.erlang.org/blog/more-optimizations/" rel="alternate" type="text/html" title="More Optimizations in the Compiler and JIT" /><published>2023-04-19T00:00:00+00:00</published><updated>2023-04-19T00:00:00+00:00</updated><id>https://www.erlang.org/blog/more-optimizations</id><content type="html" xml:base="https://www.erlang.org/blog/more-optimizations/"><![CDATA[<p>This post explores the enhanced type-based optimizations
and the other performance improvements in Erlang/OTP 26.</p>

<h3 id="what-to-expect-of-the-jit-in-otp-26">What to expect of the JIT in OTP 26</h3>

<p>In OTP 25, the compiler was updated to embed type information in
the BEAM file and the JIT was extended to emit better code based
on that type information. Those improvements were described in
the blog post <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/">Type-Based Optimizations in the JIT</a>.</p>

<p>As mentioned in that blog post, there were limitations in both the
compiler and the JIT that prevented many optimizations. In OTP 26, the
compiler will produce better type information and the JIT will take
better advantage of the improved type information, typically resulting
in fewer redundant type tests and smaller native code size.</p>

<p>A new BEAM instruction introduced in OTP 26 makes record updates
faster by a small but measurable amount.</p>

<p>The most noticeable performance improvements in OTP 26 are probably for
matching and construction of binaries using the bit syntax. Those
improvements, combined with changes to the <code>base64</code> module itself,
make encoding to Base64 about 4 times as fast and decoding from
Base64 more than 3 times as fast.</p>

<h3 id="please-try-this-at-home">Please try this at home!</h3>

<p>While this blog post will show many examples of generated code, I have
attempted to explain the optimizations in English as well. Feel free
to skip the code examples.</p>

<p>On the other hand, if you want more code examples…</p>

<p>To examine the native code for loaded modules, start the runtime system like this:</p>

<pre><code class="language-bash">erl +JDdump true
</code></pre>

<p>The native code for all modules that are loaded will be dumped to files with the
extension <code>.asm</code>.</p>

<p>To examine the BEAM code for a module, use the <code>-S</code> option when
compiling. For example:</p>

<pre><code class="language-bash">erlc -S base64.erl
</code></pre>

<h3 id="quick-overview-of-type-based-optimizations-in-otp-25">Quick overview of type-based optimizations in OTP 25</h3>

<p>Let’s quickly summarize the type-based optimizations in OTP 25. For more
details, see the <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/">aforementioned blog post</a>.</p>

<p>First consider an addition of two values with nothing known about
their types:</p>

<pre><code class="language-erlang">add1(X, Y) -&gt;
    X + Y.
</code></pre>

<p>The <a href="https://www.erlang.org/blog/a-brief-beam-primer">BEAM code</a> looks like this:</p>

<pre><code>    {gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
    return.
</code></pre>

<p>Without any information about the operands, the JIT must emit code
that can handle all possible types for the operands. For the x86_64
architecture, 14 native instructions are needed.</p>

<p>If the operands are known to be integers small enough
that overflow is impossible, the JIT needs to emit only 5 native
instructions for the addition.</p>

<p>Here is an example where the types and ranges of the operands for the
<code>+</code> operator are known:</p>

<pre><code class="language-erlang">add5(X, Y) when X =:= X band 16#3FF,
                Y =:= Y band 16#3FF -&gt;
    X + Y.
</code></pre>

<p>The BEAM code for this function is as follows:</p>

<pre><code>    {gc_bif,'band',{f,24},2,[{x,0},{integer,1023}],{x,2}}.
    {test,is_eq_exact,
          {f,24},
          [{tr,{x,0},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
    {gc_bif,'band',{f,24},2,[{x,1},{integer,1023}],{x,2}}.
    {test,is_eq_exact,
          {f,24},
          [{tr,{x,1},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
            {x,0}}.
    return.
</code></pre>

<p>The register operands (<code>{x,0}</code> and <code>{x,1}</code>) have now been annotated with
type information:</p>

<pre><code class="language-erlang">{tr,Register,Type}
</code></pre>

<p>That is, each register operand is a three-tuple with <code>tr</code> as the first
element. <code>tr</code> stands for <strong>typed register</strong>. The second element is the
BEAM register (<code>{x,0}</code> or <code>{x,1}</code> in this case), and the third element
is the type of the register in the compiler’s internal type
representation. <code>{t_integer,{0,1023}}</code> means that the value is an
integer in the inclusive range 0 through 1023.</p>

<p>With that type information, the JIT emits the following native code
for the <code>+</code> operator:</p>

<pre><code class="language-nasm"># i_plus_ssjd
# add without overflow check
    mov rax, qword ptr [rbx]
    mov rsi, qword ptr [rbx+8]
    and rax, -16               ; Zero the tag bits
    add rax, rsi
    mov qword ptr [rbx], rax
</code></pre>

<p>(Lines starting with <code>#</code> are comments emitted by the JIT, while the
text that follows <code>;</code> is a comment added by me for clarification.)</p>
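
<p>The reason a single <code>and</code>/<code>add</code> pair suffices follows
from the tagging scheme: a small integer <code>N</code> is stored as
<code>16 * N + 15</code>, so zeroing the tag bits of one operand makes a
plain machine addition produce a correctly tagged result. The identity
is easy to check in Erlang (the <code>Tag</code> fun is just for
illustration):</p>

<pre><code class="language-erlang">%% (16*X) + (16*Y + 15) =:= 16*(X + Y) + 15
Tag = fun(N) -&gt; 16 * N + 15 end,
true = (Tag(700) band -16) + Tag(300) =:= Tag(1000).
</code></pre>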

<p>The reduction in code size from 14 instructions down to 5 is nice, but
having to express the range check in that convoluted way using <code>band</code>
can hardly be called natural.</p>

<p>If we try to express the range checks in a more natural way:</p>

<pre><code class="language-erlang">add4(X, Y) when is_integer(X), 0 =&lt; X, X &lt; 16#400,
                is_integer(Y), 0 =&lt; Y, Y &lt; 16#400 -&gt;
    X + Y.
</code></pre>

<p>the compiler in OTP 25 will no longer be able to figure out the
ranges for the operands. Here is the BEAM code:</p>

<pre><code>    {test,is_integer,{f,22},[{x,0}]}.
    {test,is_ge,{f,22},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
    {test,is_lt,{f,22},[{tr,{x,0},{t_integer,any}},{integer,1024}]}.
    {test,is_integer,{f,22},[{x,1}]}.
    {test,is_ge,{f,22},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
    {test,is_lt,{f,22},[{tr,{x,1},{t_integer,any}},{integer,1024}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
            {x,0}}.
    return.
</code></pre>

<p>Because of that severe limitation in the compiler’s value range
analysis, I wrote:</p>

<blockquote>
  <p>We aim to improve the type analysis and optimizations in OTP 26 and
generate better code for this example.</p>
</blockquote>

<h3 id="the-enhanced-type-based-optimizations-in-otp-26">The enhanced type-based optimizations in OTP 26</h3>

<p>Compiling the same example with OTP 26, the result is:</p>

<pre><code>    {test,is_integer,{f,19},[{x,0}]}.
    {test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
    {test,is_ge,{f,19},[{integer,1023},{tr,{x,0},{t_integer,{0,'+inf'}}}]}.
    {test,is_integer,{f,19},[{x,1}]}.
    {test,is_ge,{f,19},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
    {test,is_ge,{f,19},[{integer,1023},{tr,{x,1},{t_integer,{0,'+inf'}}}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
            {x,0}}.
</code></pre>

<p>The BEAM instruction for the <code>+</code> operator now has ranges for its operands.</p>

<p>Let’s look a little closer at the first three instructions, which
correspond to the guard test <code>is_integer(X), 0 =&lt; X, X &lt; 16#400</code>.</p>

<p>First is the guard check for an integer:</p>

<pre><code>    {test,is_integer,{f,19},[{x,0}]}.
</code></pre>

<p>It is followed by the guard test <code>0 =&lt; X</code> (rewritten to <code>X &gt;= 0</code> by the compiler):</p>

<pre><code>    {test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
</code></pre>

<p>As a result of the <code>is_integer/1</code> test it is known that <code>{x,0}</code>
is an integer.</p>

<p>The third instruction corresponds to <code>X &lt; 16#400</code>, which the compiler
has rewritten to <code>16#3FF &gt;= X</code> (<code>1023 &gt;= X</code>):</p>

<pre><code>    {test,is_ge,{f,19},[{integer,1023},{tr,{x,0},{t_integer,{0,'+inf'}}}]}.
</code></pre>

<p>In the type for the <code>{x,0}</code> register there is something new for
OTP 26. It says that the range is 0 through <code>'+inf'</code>, that is, from 0 up
to positive infinity. Combining that range with the range from this
instruction, the Erlang compiler can infer that if this instruction
succeeds, the type for <code>{x,0}</code> is <code>{t_integer,{0,1023}}</code>.</p>

<h3 id="combining-guard-tests">Combining guard tests</h3>

<p>In OTP 25, the JIT would emit native code for each BEAM instruction
in the guard individually. When translated individually, the three guard
tests for one of the variables each require 11 native instructions, or 33
instructions for all three.</p>

<p>By having the BEAM loader combine the three guard tests into a
single <code>is_int_in_range</code> instruction, the JIT is capable of doing a much
better job, emitting a mere 6 native instructions.</p>

<p>How is that possible?</p>

<p>As individual BEAM instructions, each guard test needs 5 instructions
to fetch the value from <code>{x,0}</code> and test that the value is a small
integer. As a combined instruction, that only needs to be done once.
Other parts of the guard tests also become redundant in the combined
instruction and can be omitted. For example, the <code>is_integer/1</code> type
test will also succeed if its argument is a <strong>bignum</strong> (an integer
that does not fit in a machine word). Clearly, a bignum will fall well
outside the range 0 through 1023, so if the argument is not a small
integer, the combined guard test will fail immediately.</p>

<p>With those and some other simplifications, we end up with the following
native instructions:</p>

<pre><code class="language-nasm"># is_int_in_range_fScc
    mov rax, qword ptr [rbx]
    sub rax, 15
    test al, 15
    short jne label_19
    cmp rax, 16368
    short ja label_19
</code></pre>

<p>The first instruction fetches the value of <code>{x,0}</code> to the CPU
register <code>rax</code>:</p>

<pre><code class="language-nasm">    mov rax, qword ptr [rbx]
</code></pre>

<p>The next instruction subtracts the <a href="http://www.it.uu.se/research/publications/reports/2000-029/2000-029-nc.pdf">tagged value</a> for the lower
bound of the range. Since the lower bound of the range is 0 and the
tag for small integers is 15, the value that is subtracted
is <code>16 * 0 + 15</code> or simply 15. (For small integers, the runtime system
uses the 4 least significant bits of the word as tag bits.)
If the lower bound had been 1, the value to subtract would
have been <code>16 * 1 + 15</code>, or 31:</p>

<pre><code class="language-nasm">    sub rax, 15
</code></pre>

<p>The subtraction achieves two aims at once. Firstly, it simplifies the
tag test in the next two instructions because if the value of
<code>{x,0}</code> is a small integer, the 4 least significant bits will now be
zero:</p>

<pre><code class="language-nasm">    test al, 15
    short jne label_19
</code></pre>

<p>The <code>test al, 15</code> instruction does a bitwise AND operation of the
lower byte of the CPU register <code>rax</code>, discarding the result but
setting CPU flags depending on the value. The next instruction tests
whether the result was nonzero (the tag was not the tag for a small
integer), in which case the test fails and a jump to the failure
label is made.</p>

<p>The second aim for the subtraction is to simplify the range check.
If the value being tested was below the lower bound, the value
of <code>rax</code> will be negative after the subtraction.</p>

<p>Since integers are represented in <a href="https://en.wikipedia.org/wiki/Two%27s_complement">two’s complement notation</a>, a
signed negative integer interpreted as an unsigned integer will be a
very large integer. Therefore, both bounds can be checked at once
using the old trick of treating the value in <code>rax</code> as unsigned:</p>

<pre><code class="language-nasm">    cmp rax, 16368
    short ja label_19
</code></pre>

<p>The <code>cmp rax, 16368</code> instruction compares the value in <code>rax</code> with the
difference of the tagged upper bound and the tagged lower bound, that
is:</p>

<pre><code>(16 * 1023 + 15) - (16 * 0 + 15)
</code></pre>

<p><code>ja</code> stands for “Jump (if) Above”, that is, jump if the CPU flags
indicates that in previous comparison of unsigned integers the first
integer was greater than the second. Since a negative number
represented in two’s complement notation looks like a huge integer
when interpreted as an unsigned integer, <code>short ja label_19</code> will
transfer control to the failure label for values both below the lower
bound and above the upper bound.</p>
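
<p>Putting the constants together, the values in the listing are easy to
reproduce in the shell:</p>

<pre><code class="language-erlang">1&gt; Lower = 16 * 0 + 15.     % tagged lower bound
15
2&gt; Upper = 16 * 1023 + 15.  % tagged upper bound
16383
3&gt; Upper - Lower.           % the constant in "cmp rax, 16368"
16368
</code></pre>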

<h3 id="more-code-generation-improvements">More code generation improvements</h3>

<p>The JIT in OTP 26 generates better code for common combinations of
relational operators. In order to reduce the number of combinations
that the JIT will need to handle, the compiler rewrites the <code>&lt;</code>
operator to <code>&gt;=</code> if possible. In the previous example, it was shown
that the compiler rewrote <code>X &lt; 1024</code> to <code>1023 &gt;= X</code>.</p>

<p>Let’s look at a contrived example to show (off) a few more
improvements in the code generation:</p>

<pre><code class="language-erlang">add6(M) when is_map(M) -&gt;
    A = map_size(M),
    if
        9 &lt; A, A &lt; 100 -&gt;
            A + 6
    end.
</code></pre>

<p>The main part of the BEAM code looks like this:</p>

<pre><code>    {test,is_map,{f,41},[{x,0}]}.
    {gc_bif,map_size,{f,0},1,[{tr,{x,0},{t_map,any,any}}],{x,0}}.
    {test,is_ge,
          {f,43},
          [{tr,{x,0},{t_integer,{0,288230376151711743}}},{integer,10}]}.
    {test,is_ge,
          {f,43},
          [{integer,99},{tr,{x,0},{t_integer,{10,288230376151711743}}}]}.
    {gc_bif,'+',{f,0},1,[{tr,{x,0},{t_integer,{10,99}}},{integer,6}],{x,0}}.
    return.
</code></pre>

<p>In OTP 26, the JIT will inline the code for many of the most
frequently used guard BIFs. Here is the native code for the
<code>map_size/1</code> call:</p>

<pre><code class="language-nasm"># bif_map_size_jsd
    mov rax, qword ptr [rbx]      ; Fetch map from {x,0}
# skipped type check because the argument is always a map
    mov rax, qword ptr [rax+6]    ; Fetch size of map
    shl rax, 4
    or al, 15                     ; Tag as small integer
    mov qword ptr [rbx], rax      ; Store size in {x,0}
</code></pre>

<p>The two <code>is_ge</code> instructions are combined by the BEAM loader into
an <code>is_in_range</code> instruction:</p>

<pre><code class="language-nasm"># is_in_range_ffScc
# simplified fetching of BEAM register
    mov rdi, rax
# skipped test for small operand since it always small
    sub rdi, 175
    cmp rdi, 1424
    ja label_43
</code></pre>

<p>The first instruction is a new optimization in OTP 26. Normally <code>{x,0}</code> is
fetched using the instruction <code>mov rax, qword ptr [rbx]</code>. However, in this
case, the last instruction emitted for the previous BEAM instruction is
<code>mov qword ptr [rbx], rax</code>. Therefore, since it is known that the contents of
<code>{x,0}</code> are already in CPU register <code>rax</code>, the instruction can be simplified
to:</p>

<pre><code class="language-nasm"># simplified fetching of BEAM register
    mov rdi, rax
</code></pre>

<p>The size of a map that will fit in memory on a 64-bit computer is always
a small integer, so the test for a small integer is skipped:</p>

<pre><code class="language-nasm"># skipped test for small operand since it always small
    sub rdi, 175     ; Subtract 16 * 10 + 15
    cmp rdi, 1424    ; Compare with (16*99+15)-(16*10+15)
    ja label_43
</code></pre>

<p>The native code for the <code>+</code> operator looks like this:</p>

<pre><code class="language-nasm"># i_plus_ssjd
# add without overflow check
    mov rax, qword ptr [rbx]
    add rax, 96      ; 16 * 6 + 0
    mov qword ptr [rbx], rax
</code></pre>

<h3 id="new-beam-instructions-in-otp-26">New BEAM instructions in OTP 26</h3>

<p>The previous example of combining guard tests showed that the JIT can
often generate better code if multiple BEAM instructions are combined
into one. While the <a href="https://www.erlang.org/blog/beam-compiler-history/#the-ever-changing-beam-instructions">BEAM loader</a> is capable of combining
instructions, it is often more practical to let the Erlang compiler
emit combined instructions.</p>

<p>OTP 26 introduces two new instructions, each of which replaces a sequence of
any number of simpler instructions:</p>

<ul>
  <li>
    <p><code>update_record</code> for updating any number of fields in a record.</p>
  </li>
  <li>
    <p><code>bs_match</code> for matching multiple segments of fixed size.</p>
  </li>
</ul>

<p>In OTP 25, the <code>bs_create_bin</code> instruction for constructing a binary
with any number of segments was introduced, but its full potential for
generating efficient code was not leveraged in OTP 25.</p>

<h3 id="updating-records-in-otp-25">Updating records in OTP 25</h3>

<p>Consider the following example of a record definition and three functions
that update the record:</p>

<pre><code class="language-erlang">-record(r, {a,b,c,d,e}).

update_a(R) -&gt;
    R#r{a=42}.

update_ce(R) -&gt;
    R#r{c=99,e=777}.

update_bcde(R) -&gt;
    R#r{b=2,c=3,d=4,e=5}.
</code></pre>

<p>In OTP 25 and earlier, the way in which a record is updated depends on
both the number of fields being updated and the size of the record.</p>

<p>When a single field in a record is updated, as in <code>update_a/1</code>, the
<a href="https://www.erlang.org/doc/man/erlang.html#setelement-3">setelement/3</a>
BIF is called:</p>

<pre><code>    {test,is_tagged_tuple,{f,34},[{x,0},6,{atom,r}]}.
    {move,{x,0},{x,1}}.
    {move,{integer,42},{x,2}}.
    {move,{integer,2},{x,0}}.
    {call_ext_only,3,{extfunc,erlang,setelement,3}}.
</code></pre>

<p>When updating more than one field but fewer than approximately half of
the fields, as in <code>update_ce/1</code>, code similar to the following is
emitted:</p>

<pre><code>    {test,is_tagged_tuple,{f,37},[{x,0},6,{atom,r}]}.
    {allocate,0,1}.
    {move,{x,0},{x,1}}.
    {move,{integer,777},{x,2}}.
    {move,{integer,6},{x,0}}.
    {call_ext,3,{extfunc,erlang,setelement,3}}.
    {set_tuple_element,{integer,99},{x,0},3}.
    {deallocate,0}.
    return.
</code></pre>

<p>Here the <code>e</code> field is updated using <code>setelement/3</code>, followed by
<code>set_tuple_element</code> to update the <code>c</code> field destructively. Erlang does
not allow mutation of terms, but here it is done “under the hood” in a
safe way.</p>

<p>When a majority of the fields are updated, as in <code>update_bcde/1</code>, a
new tuple is built:</p>

<pre><code>    {test,is_tagged_tuple,{f,40},[{x,0},6,{atom,r}]}.
    {test_heap,7,1}.
    {get_tuple_element,{x,0},1,{x,0}}.
    {put_tuple2,{x,0},
                {list,[{atom,r},
                       {x,0},
                       {integer,2},
                       {integer,3},
                       {integer,4},
                       {integer,5}]}}.
    return.
</code></pre>

<h3 id="updating-records-in-otp-26">Updating records in OTP 26</h3>

<p>In OTP 26, all records are updated using the new BEAM instruction
<code>update_record</code>.  For example, here is the main part of the BEAM code
for <code>update_a/1</code>:</p>

<pre><code>    {test,is_tagged_tuple,{f,34},[{x,0},6,{atom,r}]}.
    {test_heap,7,1}.
    {update_record,{atom,reuse},6,{x,0},{x,0},{list,[2,{integer,42}]}}.
    return.
</code></pre>

<p>The last operand is a list of positions in the tuple and their corresponding
new values.</p>

<p>The first operand, <code>{atom,reuse}</code>, is a hint to the JIT that it is possible
that the source tuple is already up to date and does not need to be updated.
Another possible value for the hint operand is <code>{atom,copy}</code>, meaning that
the source tuple is definitely not up to date.</p>

<p>The JIT emits the following native code for the <code>update_record</code> instruction:</p>

<pre><code class="language-nasm"># update_record_aIsdI
    mov rax, qword ptr [rbx]
    mov rdi, rax
    cmp qword ptr [rdi+14], 687
    je L130
    vmovups xmm0, [rax-2]
    vmovups [r15], xmm0
    mov qword ptr [r15+16], 687
    vmovups ymm0, [rax+22]
    vmovups [r15+24], ymm0
    lea rax, qword ptr [r15+2]
    add r15, 56
L130:
    mov qword ptr [rbx], rax
</code></pre>

<p>Let’s walk through those instructions. First the value of <code>{x,0}</code> is fetched:</p>

<pre><code class="language-nasm">    mov rax, qword ptr [rbx]
</code></pre>

<p>Since the hint operand is the atom <code>reuse</code>, it is possible that it is
unnecessary to copy the tuple. Therefore, the JIT emits an instruction
sequence to test whether the <code>a</code> field (position 2 in the tuple)
already contains the value 42. If so, the source tuple can be reused:</p>

<pre><code class="language-nasm">    mov rdi, rax
    cmp qword ptr [rdi+14], 687   ; 42
    je L130                       ; Reuse source tuple
</code></pre>

<p>Next follows the copy and update sequence. First the header word for
the tuple and its first element (the <code>r</code> atom) are copied using
<a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX instructions</a>:</p>

<pre><code class="language-nasm">    vmovups xmm0, [rax-2]
    vmovups [r15], xmm0
</code></pre>

<p>Next the value 42 is stored into position 2 of the copy of the tuple:</p>

<pre><code class="language-nasm">    mov qword ptr [r15+16], 687   ; 42
</code></pre>

<p>Finally the remaining four elements of the tuple are copied:</p>

<pre><code class="language-nasm">    vmovups ymm0, [rax+22]
    vmovups [r15+24], ymm0
</code></pre>

<p>All that remains is to create a tagged pointer to the newly created
tuple and increment the heap pointer:</p>

<pre><code class="language-nasm">    lea rax, qword ptr [r15+2]
    add r15, 56
</code></pre>

<p>The last instruction stores the tagged pointer to either the original
or the updated tuple into <code>{x,0}</code>:</p>

<pre><code class="language-nasm">L130:
    mov qword ptr [rbx], rax
</code></pre>

<p>The BEAM code for <code>update_ce/1</code> is very similar to the code for <code>update_a/1</code>:</p>

<pre><code>    {test,is_tagged_tuple,{f,37},[{x,0},6,{atom,r}]}.
    {test_heap,7,1}.
    {update_record,{atom,reuse},
                   6,
                   {x,0},
                   {x,0},
                   {list,[4,{integer,99},6,{integer,777}]}}.
    return.
</code></pre>

<p>The native code looks like this:</p>

<pre><code class="language-nasm"># update_record_aIsdI
    mov rax, qword ptr [rbx]
    vmovups ymm0, [rax-2]
    vmovups [r15], ymm0
    mov qword ptr [r15+32], 1599   ; 99
    mov rdi, [rax+38]
    mov [r15+40], rdi
    mov qword ptr [r15+48], 12447  ; 777
    lea rax, qword ptr [r15+2]
    add r15, 56
    mov qword ptr [rbx], rax
</code></pre>

<p>Note that the copying and updating is done unconditionally, despite
the <code>reuse</code> hint. The JIT is free to ignore the hints. When multiple
fields are being updated, the test for whether the update is
unnecessary would be more expensive, and it is also much less likely
that all of the fields would turn out to be unchanged. Therefore,
trying to reuse the original tuple is more likely to be a
<a href="https://en.wiktionary.org/wiki/pessimization">pessimization</a>
than an optimization.</p>

<h3 id="matching-and-constructing-binaries-in-otp-25">Matching and constructing binaries in OTP 25</h3>

<p>To explore the optimizations of binaries, the following example will
be used:</p>

<pre><code class="language-erlang">bin_swap(&lt;&lt;A:8,B:24&gt;&gt;) -&gt;
    &lt;&lt;B:24,A:8&gt;&gt;.
</code></pre>
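<p>To make the semantics concrete before looking at the generated code, here is an example call (my addition, assuming the function is in a module named <code>t</code>):</p>

<pre><code class="language-erlang">1&gt; t:bin_swap(&lt;&lt;1,2,3,4&gt;&gt;).
&lt;&lt;2,3,4,1&gt;&gt;
</code></pre>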

<p>Somewhat simplified, the main part of the BEAM code as emitted by
the compiler in OTP 25 looks like this:</p>

<pre><code>    {test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
    {bs_get_position,{x,1},{x,0},2}.
    {test,bs_get_integer2,
          {f,2},
          2,
          [{x,1},
           {integer,8},
           1,
           {field_flags,[unsigned,big]}],
          {x,2}}.
    {test,bs_get_integer2,
          {f,2},
          3,
          [{x,1},
           {integer,24},
           1,
           {field_flags,[unsigned,big]}],
          {x,3}}.
    {test,bs_test_tail2,{f,2},[{x,1},0]}.
    {bs_create_bin,{f,0},
                   0,4,1,
                   {x,0},
                   {list,[{atom,integer},
                          1,1,nil,
                          {tr,{x,3},{t_integer,{0,16777215}}},
                          {integer,24},
                          {atom,integer},
                          2,1,nil,
                          {tr,{x,2},{t_integer,{0,255}}},
                          {integer,8}]}}.
    return.
</code></pre>

<p>Let’s walk through the code. The first instruction sets up a <a href="https://www.erlang.org/doc/efficiency_guide/binaryhandling.html#how-binaries-are-implemented">match
context</a>:</p>

<pre><code>    {test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
</code></pre>

<p>A match context holds several pieces of information needed for
matching a binary, such as a pointer to the binary&#8217;s data, the current
bit offset into the binary, and the size of the binary in bits.</p>

<p>The next instruction saves information that will be needed if matching
of the binary fails for some reason:</p>

<pre><code>    {bs_get_position,{x,1},{x,0},2}.
</code></pre>

<p>The next two instructions match out two segments as integers (comments added by me):</p>

<pre><code>    {test,bs_get_integer2,
          {f,2},          % Failure label
          2,              % Number of live X registers (needed for GC)
          [{x,1},         % Match context register
           {integer,8},   % Size of segment in units
           1,             % Unit value
           {field_flags,[unsigned,big]}],
          {x,2}}.         % Destination register
    {test,bs_get_integer2,
          {f,2},
          3,
          [{x,1},
           {integer,24},
           1,
           {field_flags,[unsigned,big]}],
          {x,3}}.
</code></pre>

<p>The next instruction makes sure that the end of the binary has now been
reached:</p>

<pre><code>    {test,bs_test_tail2,{f,2},[{x,1},0]}.
</code></pre>

<p>The next instruction creates the binary with the segments swapped:</p>

<pre><code>    {bs_create_bin,{f,0},
                   0,4,1,
                   {x,0},
                   {list,[{atom,integer},
                          1,1,nil,
                          {tr,{x,3},{t_integer,{0,16777215}}},
                          {integer,24},
                          {atom,integer},
                          2,1,nil,
                          {tr,{x,2},{t_integer,{0,255}}},
                          {integer,8}]}}.
</code></pre>

<p>Before OTP 25, creation of binaries was done using multiple
instructions, similar to how binary matching is still done in
OTP 25. The reason for creating the <code>bs_create_bin</code> instruction in OTP 25
was to be able to provide improved error information when construction
of a binary fails, similar to the <a href="https://www.erlang.org/blog/my-otp-24-highlights/#eep-54-improved-bif-error-information">improved BIF error
information</a>.</p>

<p>When a segment of size 8, 16, 32, or 64 is matched, specialized
instructions are used on x86_64. The specialized instructions do
everything inline, provided that the segment is byte-aligned. (The
JIT in OTP 25 for AArch64/ARM64 does not have these specialized
instructions.) Here is the instruction for matching a segment of
size 8:</p>

<pre><code class="language-nasm"># i_bs_get_integer_8_Stfd
    mov rcx, qword ptr [rbx+8]
    mov rsi, qword ptr [rcx+22]
    lea rdx, qword ptr [rsi+8]
    cmp rdx, qword ptr [rcx+30]
    ja label_25
    rex test sil, 7
    short je L91
    mov edx, 64
    call L92
    short jmp L90
L91:
    mov rdi, qword ptr [rcx+14]
    shr rsi, 3
    mov qword ptr [rcx+22], rdx
    movzx rax, byte ptr [rdi+rsi]
    shl rax, 4
    or rax, 15
L90:
    mov qword ptr [rbx+16], rax
</code></pre>

<p>The first two instructions pick up the pointer to the match context
and from the match context the current bit offset into the binary:</p>

<pre><code class="language-nasm">    mov rcx, qword ptr [rbx+8]   ; Load pointer to match context
    mov rsi, qword ptr [rcx+22]  ; Get offset in bits into binary
</code></pre>

<p>The next three instructions ensure that at least 8 bits remain in
the binary:</p>

<pre><code class="language-nasm">    lea rdx, qword ptr [rsi+8]   ; Add 8 to the offset
    cmp rdx, qword ptr [rcx+30]  ; Compare offset+8 with size of binary
    ja label_25                  ; Fail if the binary is too short
</code></pre>

<p>The next five instructions test whether the current byte in the binary
is aligned at a byte boundary. If not, a helper code fragment is
called:</p>

<pre><code class="language-nasm">    rex test sil, 7    ; Test the 3 least significant bits
    short je L91       ; Jump if 0 (meaning byte-aligned)
    mov edx, 64        ; Load size and flags
    call L92           ; Call helper fragment
    short jmp L90      ; Done
</code></pre>

<p>A <strong>helper code fragment</strong> is a shared block of code that can be
called from the native code generated for BEAM instructions, typically
to handle cases that are uncommon and/or would require more native
instructions than are practical to include inline. Each such code
fragment has its own calling convention, typically tailor-made to be
as convenient for the caller as possible. (See <a href="https://www.erlang.org/blog/jit-part-2/">Further adventures in
the JIT</a> for more information
about helper code fragments.)</p>

<p>The remaining instructions read one byte from memory, convert it to a
tagged Erlang term, store it in <code>{x,2}</code>, and advance the bit offset
in the match context:</p>

<pre><code class="language-nasm">L91:
    mov rdi, qword ptr [rcx+14]    ; Load base pointer for binary
    shr rsi, 3                     ; Convert bit offset to byte offset
    mov qword ptr [rcx+22], rdx    ; Update bit offset in match context
    movzx rax, byte ptr [rdi+rsi]  ; Read one byte from the binary
    shl rax, 4                     ; Multiply by 16...
    or rax, 15                     ; ... and add tag for a small integer

L90:
    mov qword ptr [rbx+16], rax    ; Store extracted integer
</code></pre>

<p>When matching a segment of size other than one of the special sizes
mentioned earlier, the JIT will always emit a call to a general
routine that can handle matching of any integer segment with any
alignment, endianness, and signedness.</p>
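<p>As a hypothetical illustration (my example, not code from OTP), the following clause matches a 4-bit and a 12-bit segment; neither size is one of the special ones, so the OTP 25 JIT emits calls to the general routine for both:</p>

<pre><code class="language-erlang">%% Neither segment has size 8, 16, 32, or 64, so the general
%% matching routine is called for each of them.
version_and_flags(&lt;&lt;Version:4, Flags:12, _Rest/binary&gt;&gt;) -&gt;
    {Version, Flags}.
</code></pre>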

<p>In OTP 25, the full potential for optimization of the <code>bs_create_bin</code>
instruction is not realized. The construction of each segment is done
by calling a helper routine that builds the segment. Here is the
native code for the part of the <code>bs_create_bin</code> instruction that builds the
integer segments:</p>

<pre><code class="language-nasm"># construct integer segment
    mov edx, 24
    mov rsi, qword ptr [rbx+24]
    xor ecx, ecx
    lea rdi, qword ptr [rbx-80]
    call 4387496416
# construct integer segment
    mov edx, 8
    mov rsi, qword ptr [rbx+16]
    xor ecx, ecx
    lea rdi, qword ptr [rbx-80]
    call 4387496416
</code></pre>

<h3 id="binary-pattern-matching-in-otp-26">Binary pattern matching in OTP 26</h3>

<p>In OTP 26, there is a new BEAM <code>bs_match</code> instruction used for
matching segments with sizes known at compile time. The BEAM code for
the matching code in the function head for <code>bin_swap/1</code> is as follows:</p>

<pre><code>    {test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
    {bs_get_position,{x,1},{x,0},2}.
    {bs_match,{f,2},
              {x,1},
              {commands,[{ensure_exactly,32},
                         {integer,2,{literal,[]},8,1,{x,2}},
                         {integer,3,{literal,[]},24,1,{x,3}}]}}.
</code></pre>

<p>The first two instructions are identical to their OTP 25 counterparts.</p>

<p>The first operand of the <code>bs_match</code> instruction, <code>{f,2}</code>, is the
failure label, and the second operand, <code>{x,1}</code>, is the register holding
the match context. The third operand, <code>{commands,[...]}</code>, is a list of
matching commands.</p>

<p>The first command in the <code>commands</code> list, <code>{ensure_exactly,32}</code>, tests
that the remaining number of bits in the binary being matched is
exactly 32. If not, a jump is made to the failure label.</p>

<p>The second command extracts an integer of 8 bits and stores it in
<code>{x,2}</code>. The third command extracts an integer of 24 bits and stores it
in <code>{x,3}</code>.</p>

<p>Having matching of multiple segments contained in a single BEAM
instruction makes it much easier for the JIT to generate efficient
code. Here is what the native code will do:</p>

<ul>
  <li>
    <p>Test that there are exactly 32 bits left in the binary.</p>
  </li>
  <li>
    <p>If the segment is byte-aligned, read a 4-byte word from the binary
and store it in a CPU register.</p>
  </li>
  <li>
    <p>If the segment is not byte-aligned, read an 8-byte word from the binary
and shift to extract the 32 bits needed.</p>
  </li>
  <li>
    <p>Shift and mask out 8 bits and tag as an integer. Store into <code>{x,2}</code>.</p>
  </li>
  <li>
    <p>Shift and mask out 24 bits and tag as an integer. Store into <code>{x,3}</code>.</p>
  </li>
</ul>

<p>The native code for the <code>bs_match</code> instruction (slightly simplified) is
as follows:</p>

<pre><code class="language-nasm"># i_bs_match_fS
# ensure_exactly 32
    mov rsi, qword ptr [rbx+8]
    mov rax, qword ptr [rsi+30]
    mov rcx, qword ptr [rsi+22]
    sub rax, rcx
    cmp rax, 32
    jne label_3
# read 32
    mov rdi, qword ptr [rsi+14]
    add qword ptr [rsi+22], 32
    mov rax, rcx
    shr rax, 3
    add rdi, rax
    and ecx, 7
    jnz L38
    movbe edx, dword ptr [rdi]
    add ecx, 32
    short jmp L40
L38:
    mov rdx, qword ptr [rdi-3]
    shr rdx, 24
    bswap rdx
L40:
    shl rdx, cl
# extract integer 8
    mov rax, rdx
# store extracted integer as a small
    shr rax, 52
    or rax, 15
    mov qword ptr [rbx+16], rax
    shl rdx, 8
# extract integer 24
    shr rdx, 36
    or rdx, 15
    mov qword ptr [rbx+24], rdx
</code></pre>

<p>The first part of the code ensures that there are exactly 32 bits
remaining in the binary:</p>

<pre><code class="language-nasm"># ensure_exactly 32
    mov rsi, qword ptr [rbx+8]    ; Get pointer to match context
    mov rax, qword ptr [rsi+30]   ; Get size of binary in bits
    mov rcx, qword ptr [rsi+22]   ; Get offset in bits into binary
    sub rax, rcx
    cmp rax, 32
    jne label_3
</code></pre>

<p>The next part of the code does not directly correspond to the commands
in the <code>bs_match</code> BEAM instruction. Instead, the code reads 32 bits
from the binary:</p>

<pre><code class="language-nasm"># read 32
    mov rdi, qword ptr [rsi+14]
    add qword ptr [rsi+22], 32  ; Increment bit offset in match context
    mov rax, rcx
    shr rax, 3
    add rdi, rax
    and ecx, 7                  ; Test alignment
    jnz L38                     ; Jump if segment not byte-aligned

    ; Read 32 bits (4 bytes) byte-aligned and convert to big-endian
    movbe edx, dword ptr [rdi]
    add ecx, 32
    short jmp L40

L38:
    ; Read an 8-byte word and extract the 32 bits that are needed.
    mov rdx, qword ptr [rdi-3]
    shr rdx, 24
    bswap rdx                   ; Convert to big-endian

L40:
    ; Shift the read bytes to the most significant bytes of the word
    shl rdx, cl
</code></pre>

<p>The 4 bytes read will be converted to big-endian and placed as the
most significant bytes of CPU register <code>rdx</code> with the rest of the
register zeroed.</p>

<p>The following instructions extract the 8-bit value for the first segment and
store it as a tagged integer in <code>{x,2}</code>:</p>

<pre><code class="language-nasm"># extract integer 8
    mov rax, rdx
# store extracted integer as a small
    shr rax, 52
    or rax, 15
    mov qword ptr [rbx+16], rax
    shl rdx, 8
</code></pre>

<p>The following instructions extract the 24-bit value for the second segment and
store it as a tagged integer in <code>{x,3}</code>:</p>

<pre><code class="language-nasm"># extract integer 24
    shr rdx, 36
    or rdx, 15
    mov qword ptr [rbx+24], rdx
</code></pre>

<h3 id="binary-construction-in-otp-26">Binary construction in OTP 26</h3>

<p>For binary construction in OTP 26, the compiler emits a
<code>bs_create_bin</code> BEAM instruction just as in OTP 25. However, the
native code that the JIT in OTP 26 emits for that instruction bears
little resemblance to the native code emitted by OTP 25. The native
code will do the following:</p>

<ul>
  <li>
    <p>Allocate room on the heap for a binary and initialize it with
inlined native code. A helper code fragment is called to do a garbage
collection if there is not sufficient room left on the heap.</p>
  </li>
  <li>
    <p>Read the integer from <code>{x,3}</code> and untag it.</p>
  </li>
  <li>
    <p>Read the integer from <code>{x,2}</code> and untag it. Combine the value with
the previous 24-bit value to obtain a 32-bit value.</p>
  </li>
  <li>
    <p>Write the combined 32 bits into the binary.</p>
  </li>
</ul>

<p>Here follows the complete native code for the <code>bs_create_bin</code>
instruction (somewhat simplified):</p>

<pre><code class="language-nasm"># i_bs_create_bin_jItd
# allocate heap binary
    lea rdx, qword ptr [r15+56]
    cmp rdx, rsp
    short jbe L43
    mov ecx, 4
.db 0x90
    call 4343630296
L43:
    lea rax, qword ptr [r15+2]
    mov qword ptr [rbx-120], rax
    mov qword ptr [r15], 164
    mov qword ptr [r15+8], 4
    add r15, 16
    mov qword ptr [rbx-64], r15
    mov qword ptr [rbx-56], 0
    add r15, 8
# accumulate value for integer segment
    xor r8d, r8d
    mov rdi, qword ptr [rbx+24]
    sar rdi, 4
    or r8, rdi
# accumulate value for integer segment
    shl r8, 8
    mov rdi, qword ptr [rbx+16]
    sar rdi, 4
    or r8, rdi
# construct integer segment from accumulator
    bswap r8d
    mov rdi, qword ptr [rbx-64]
    mov qword ptr [rbx-56], 32
    mov dword ptr [rdi], r8d
</code></pre>

<p>Let’s walk through it.</p>

<p>The first part of the code, starting with <code># allocate heap binary</code> and
ending before the next comment line, allocates a <strong>heap binary</strong> with
inlined native code. The only call to a helper code fragment happens when
there is not sufficient space left on the heap.</p>

<p>Next follows the construction of the segments of the binary.</p>

<p>Instead of writing the value of each segment to memory one at a time,
multiple segments are accumulated into a CPU register. Here
follows the code for the first segment to be constructed (24 bits):</p>

<pre><code class="language-nasm"># accumulate value for integer segment
    xor r8d, r8d                ; Initialize accumulator
    mov rdi, qword ptr [rbx+24] ; Fetch {x,3}
    sar rdi, 4                  ; Untag
    or r8, rdi                  ; OR into accumulator
</code></pre>

<p>Here follows the code for the second segment (8 bits):</p>

<pre><code class="language-nasm"># accumulate value for integer segment
    shl r8, 8                   ; Make room for 8 bits
    mov rdi, qword ptr [rbx+16] ; Fetch {x,2}
    sar rdi, 4                  ; Untag
    or r8, rdi                  ; OR into accumulator
</code></pre>

<p>Since there are no segments of the binary left, the accumulated
value will be written out to memory:</p>

<pre><code class="language-nasm"># construct integer segment from accumulator
    bswap r8d                   ; Make accumulator big-endian
    mov rdi, qword ptr [rbx-64] ; Get pointer into binary
    mov qword ptr [rbx-56], 32  ; Update size of binary
    mov dword ptr [rdi], r8d    ; Write 32 bits
</code></pre>

<h3 id="appending-to-binaries-in-otp-25">Appending to binaries in OTP 25</h3>

<p>The ancient OTP R12B release introduced an optimization for
<a href="https://www.erlang.org/doc/efficiency_guide/binaryhandling.html">efficiently appending to a
binary</a>. Let’s
look at an example to see the optimization in action:</p>

<pre><code class="language-erlang">-module(append).
-export([expand/1, expand_bc/1]).

expand(Bin) when is_binary(Bin) -&gt;
    expand(Bin, &lt;&lt;&gt;&gt;).

expand(&lt;&lt;B:8,T/binary&gt;&gt;, Acc) -&gt;
    expand(T, &lt;&lt;Acc/binary,B:16&gt;&gt;);
expand(&lt;&lt;&gt;&gt;, Acc) -&gt;
    Acc.

expand_bc(Bin) when is_binary(Bin) -&gt;
    &lt;&lt; &lt;&lt;B:16&gt;&gt; || &lt;&lt;B:8&gt;&gt; &lt;= Bin &gt;&gt;.
</code></pre>

<p>Both <code>append:expand/1</code> and <code>append:expand_bc/1</code> take a binary and
double its size by expanding each byte to two bytes. For example:</p>

<pre><code class="language-erlang">1&gt; append:expand(&lt;&lt;1,2,3&gt;&gt;).
&lt;&lt;0,1,0,2,0,3&gt;&gt;
2&gt; append:expand_bc(&lt;&lt;4,5,6&gt;&gt;).
&lt;&lt;0,4,0,5,0,6&gt;&gt;
</code></pre>

<p>Both functions accept only binaries:</p>

<pre><code class="language-erlang">3&gt; append:expand(&lt;&lt;1,7:4&gt;&gt;).
** exception error: no function clause matching append:expand(&lt;&lt;1,7:4&gt;&gt;,&lt;&lt;&gt;&gt;)
4&gt; append:expand_bc(&lt;&lt;1,7:4&gt;&gt;).
** exception error: no function clause matching append:expand_bc(&lt;&lt;1,7:4&gt;&gt;)
</code></pre>

<p>Before looking at the BEAM code, let’s do some benchmarking using
<a href="https://github.com/max-au/erlperf">erlperf</a> to find out which function is faster:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(10_000).' \
        'r(Bin) -&gt; append:expand(Bin).' \
        'r(Bin) -&gt; append:expand_bc(Bin).'
Code                                     ||        QPS       Time   Rel
r(Bin) -&gt; append:expand_bc(Bin).          1       7936     126 us  100%
r(Bin) -&gt; append:expand(Bin).             1       4369     229 us   55%
</code></pre>

<p>The expression for the <code>--init_runner_all</code> option uses
<a href="https://www.erlang.org/doc/man/rand.html#bytes-1">rand:bytes/1</a> to create a binary with 10,000 random
bytes, which will be passed to both expand functions.</p>

<p>From the benchmark results, it can be seen that the <code>expand_bc/1</code> function is
almost twice as fast.</p>

<p>To find out why, let’s compare the BEAM code for the two functions. Here is
the instruction that appends to the binary in <code>expand/1</code>:</p>

<pre><code>    {bs_create_bin,{f,0},
                   0,3,8,
                   {x,1},
                   {list,[{atom,append},  % Append operation
                          1,8,nil,
                          {tr,{x,1},{t_bitstring,1}}, % Source/destination
                          {atom,all},
                          {atom,integer},
                          2,1,nil,
                          {tr,{x,2},{t_integer,{0,255}}},
                          {integer,16}]}}.
</code></pre>

<p>The first segment is an <code>append</code> operation. The operand
<code>{tr,{x,1},{t_bitstring,1}}</code> denotes both source and destination of
the operation. That is, the binary referenced by <code>{x,1}</code> will be
mutated. Erlang normally does not allow mutation, but this mutation
is done under the hood in a way not observable from outside. That
makes the append operation much more efficient than it would be if the
source binary had to be copied.</p>

<p>For the binary comprehension in <code>expand_bc/1</code>, there is a similar
BEAM instruction for appending to the binary:</p>

<pre><code>    {bs_create_bin,{f,0},
                   0,3,1,
                   {x,1},
                   {list,[{atom,private_append}, % Private append operation
                          1,1,nil,
                          {x,1},
                          {atom,all},
                          {atom,integer},
                          2,1,nil,
                          {tr,{x,2},{t_integer,{0,255}}},
                          {integer,16}]}}.
</code></pre>

<p>The main difference is that the binary comprehension uses the more
efficient <code>private_append</code> operation instead of <code>append</code>.</p>

<p>The <code>append</code> operation has more overhead because it must produce the
correct result for code such as:</p>

<pre><code class="language-erlang">bins(Bin) -&gt;
    bins(Bin, &lt;&lt;&gt;&gt;).

bins(&lt;&lt;H,T/binary&gt;&gt;, Acc) -&gt;
    [Acc|bins(T, &lt;&lt;Acc/binary,H&gt;&gt;)];
bins(&lt;&lt;&gt;&gt;, Acc) -&gt;
    [Acc].
</code></pre>

<p>Running it:</p>

<pre><code class="language-erlang">1&gt; example:bins(&lt;&lt;"abcde"&gt;&gt;).
[&lt;&lt;&gt;&gt;,&lt;&lt;"a"&gt;&gt;,&lt;&lt;"ab"&gt;&gt;,&lt;&lt;"abc"&gt;&gt;,&lt;&lt;"abcd"&gt;&gt;,&lt;&lt;"abcde"&gt;&gt;]
</code></pre>

<p>In the <code>expand/1</code> function, only the final value of the binary being
appended to was needed. In <code>bins/1</code>, all of the intermediate values of
the binary are collected in a list. For correctness, the <code>append</code>
operation must ensure that the binary <code>Acc</code> is copied before <code>H</code> is
appended to it. To be able to know when it is necessary to copy the binary,
the <code>append</code> operation does some extra bookkeeping that does not come
for free.</p>

<h3 id="appending-to-binaries-in-otp-26">Appending to binaries in OTP 26</h3>

<p>In OTP 26, there is a new optimization in the compiler, implemented by
Frej Drejhammar, that replaces an <code>append</code> operation with a
<code>private_append</code> operation whenever it is correct and safe to do
so. That is, the optimization will rewrite <code>append:expand/2</code>
to use <code>private_append</code>, but not <code>example:bins/2</code>.</p>

<p>The difference between <code>append:expand/1</code> and <code>append:expand_bc/1</code> is now
much smaller:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(10_000).' \
        'r(Bin) -&gt; append:expand(Bin).' \
        'r(Bin) -&gt; append:expand_bc(Bin).'
Code                                     ||        QPS       Time   Rel
r(Bin) -&gt; append:expand_bc(Bin).          1      13164   75988 ns  100%
r(Bin) -&gt; append:expand(Bin).             1      12419   80550 ns   94%
</code></pre>

<p><code>expand_bc/1</code> is still a bit faster because the compiler emits
somewhat more efficient binary matching code for it than for the
<code>expand/1</code> function.</p>

<h3 id="the-benefit-of-is_binary1-guards">The benefit of <code>is_binary/1</code> guards</h3>

<p>The <code>expand/1</code> function has an <code>is_binary/1</code> guard test that may seem
unnecessary:</p>

<pre><code class="language-erlang">expand(Bin) when is_binary(Bin) -&gt;
    expand(Bin, &lt;&lt;&gt;&gt;).
</code></pre>

<p>The guard test is not necessary for correctness, because <code>expand/2</code>
will raise a <code>function_clause</code> exception if its argument is not a
binary. However, better code will be generated for <code>expand/2</code> with
the guard test.</p>

<p>With the guard test, the first BEAM instruction in <code>expand/2</code> is:</p>

<pre><code>    {bs_start_match4,{atom,no_fail},2,{x,0},{x,0}}.
</code></pre>

<p>Without the guard test, the first BEAM instruction is:</p>

<pre><code>    {test,bs_start_match3,{f,3},2,[{x,0}],{x,2}}.
</code></pre>

<p>The <code>bs_start_match4</code> instruction is more efficient because it does
not have to test that <code>{x,0}</code> contains a binary.</p>

<p>The benchmark results show a measurable increase in execution time for
<code>expand/1</code> if the guard test is removed:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(10_000).' \
        'r(Bin) -&gt; append:expand(Bin).' \
        'r(Bin) -&gt; append:expand_bc(Bin).'
Code                                     ||        QPS       Time   Rel
r(Bin) -&gt; append:expand_bc(Bin).          1      13273   75366 ns  100%
r(Bin) -&gt; append:expand(Bin).             1      11875   84236 ns   89%
</code></pre>

<h3 id="revisiting-the-base64-module">Revisiting the <code>base64</code> module</h3>

<p>Traditionally, up to OTP 25, the clause in the <code>base64</code> module that does
most of the work of encoding a binary to Base64 looked like this:</p>

<pre><code class="language-erlang">encode_binary(&lt;&lt;B1:8, B2:8, B3:8, Ls/bits&gt;&gt;, A) -&gt;
    BB = (B1 bsl 16) bor (B2 bsl 8) bor B3,
    encode_binary(Ls,
                  &lt;&lt;A/bits,(b64e(BB bsr 18)):8,
                    (b64e((BB bsr 12) band 63)):8,
                    (b64e((BB bsr 6) band 63)):8,
                    (b64e(BB band 63)):8&gt;&gt;).
</code></pre>

<p>The reason is that matching out segments of size 8 has always been
specially optimized and has been much faster than matching out a
segment of size 6. That is no longer true in OTP 26. With the
improvements in binary matching described in this blog post, the
clause can be written in a more natural way:</p>

<pre><code class="language-erlang">encode_binary(&lt;&lt;B1:6, B2:6, B3:6, B4:6, Ls/bits&gt;&gt;, A) -&gt;
    encode_binary(Ls,
                  &lt;&lt;A/bits,
                    (b64e(B1)):8,
                    (b64e(B2)):8,
                    (b64e(B3)):8,
                    (b64e(B4)):8&gt;&gt;);
</code></pre>

<p>(This is not the exact code in OTP 26, because of
<a href="https://github.com/erlang/otp/pull/6280">additional</a>
<a href="https://github.com/erlang/otp/pull/6711">features</a> added later.)</p>

<p>The benchmark results for encoding a random binary of 1,000,000 bytes
to Base64 on OTP 25 are:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(1_000_000).' \
        'r(Bin) -&gt; base64:encode(Bin).'
Code                                  ||        QPS       Time
r(Bin) -&gt; base64:encode(Bin).          1         61   16489 us
</code></pre>

<p>The benchmark results for encoding a random binary of 1,000,000 bytes
to Base64 on OTP 26 are:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(1_000_000).' \
        'r(Bin) -&gt; base64:encode(Bin).'
Code                                  ||        QPS       Time
r(Bin) -&gt; base64:encode(Bin).          1        249    4023 us
</code></pre>

<p>That is, encoding is about 4 times faster.</p>

<h3 id="pull-requests">Pull requests</h3>

<p>Here are the main pull requests for the optimizations mentioned in
this blog post:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/5999">compiler: Improve the type analysis</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6025">JIT: Optimise common combinations of relational operators</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6298">JIT: Minor optimizations</a>, which includes
the optimization that avoids fetching an operand that is already in a CPU register.</li>
  <li><a href="https://github.com/erlang/otp/pull/6033">compiler: Optimize record updates</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6259">JIT: Optimize binary matching for fixed-width segments</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6031">JIT: Optimize creation of binaries</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6804">compiler: <code>private_append</code> optimization for binaries</a></li>
</ul>]]></content><author><name>Björn Gustavsson</name></author><category term="BEAM" /><category term="JIT" /><summary type="html"><![CDATA[This post explores the enhanced type-based optimizations and the other performance improvements in Erlang/OTP 26.]]></summary></entry><entry><title type="html">Erlang/OTP 25 Highlights</title><link href="https://www.erlang.org/blog/My-OTP-25-highlights/" rel="alternate" type="text/html" title="Erlang/OTP 25 Highlights" /><published>2022-05-18T00:00:00+00:00</published><updated>2022-05-18T00:00:00+00:00</updated><id>https://www.erlang.org/blog/My-OTP-25-highlights</id><content type="html" xml:base="https://www.erlang.org/blog/My-OTP-25-highlights/"><![CDATA[<p>OTP 25 is finally here. This post will introduce the new features that I am most excited about.</p>

<p>You can download the readme describing all the changes here:
<a href="/patches/OTP-25.0">Erlang/OTP 25 Readme</a>.
Or, as always, look at the release notes of the application you are interested in.
For instance here: <a href="https://www.erlang.org/doc/apps/erts/notes.html#erts-13.0">Erlang/OTP 25 - Erts Release Notes - Version 13.0</a>.</p>

<p>This year’s highlights are:</p>

<ul>
  <li><a href="#new-functions-in-the-maps-and-lists-modules">New functions in the <code>maps</code>and <code>lists</code> modules</a></li>
  <li><a href="#selectable-features-and-the-new-maybe_expr-feature">Selectable features and the new <code>maybe_expr</code> feature</a></li>
  <li><a href="#dialyzer">Dialyzer</a></li>
  <li><a href="#improvements-of-the-jit">Improvements of the JIT</a></li>
  <li><a href="#better-support-for-perf-and-gdb">Better support for perf and gdb</a></li>
  <li><a href="#relocatable-installation-directory">Relocatable installation directory</a></li>
  <li><a href="#ets-tables-with-adaptive-support-for-write-concurrency">ETS-tables with adaptive support for write concurrency</a></li>
  <li><a href="#new-option-short-for-erlangfloat_to_list2-and-erlangfloat_to_binary2">New option <code>short</code> for <code>erlang:float_to_list/2</code> and <code>erlang:float_to_binary/2</code></a></li>
  <li><a href="#the-new-module-peer-supersedes-the-slave-module">The new module <code>peer</code> supersedes the slave module</a></li>
  <li><a href="#gen_xxx-modules-has-got-a-new-format_status1-callback"><code>gen_xxx</code> modules has got a new <code>format_status/1</code> callback</a></li>
  <li><a href="#the-timer-module-has-been-modernized-and-made-more-efficient">The <code>timer</code> module has been modernized and made more efficient</a></li>
  <li><a href="#crypto-and-openssl-30">Crypto and OpenSSL 3.0</a></li>
  <li><a href="#ca-certificates-can-be-fetched-from-the-os-standard-place">CA-certificates can be fetched from the OS standard place</a></li>
  <li><a href="#a-new-fast-pseudo-random-generator">A new fast Pseudo Random Generator</a></li>
</ul>

<h1 id="new-functions-in-the-maps-and-lists-modules">New functions in the <code>maps</code> and <code>lists</code> modules</h1>

<p>Triggered by suggestions from users, we have introduced new functions in the <a href="/doc/man/maps.html"><code>maps</code></a> and <a href="/doc/man/lists.html"><code>lists</code></a> modules in <code>stdlib</code>.</p>

<h2 id="mapsgroups_from_list23"><code>maps:groups_from_list/2,3</code></h2>

<p>In short, this function takes a list of elements and groups them. The result is a map <code>#{Group1 =&gt; [Group1Elements], GroupN =&gt; [GroupNElements]}</code>.</p>

<p>Let us look at some examples from the shell:</p>

<pre><code class="language-erlang">&gt; maps:groups_from_list(fun(X) -&gt; X rem 2 end, [1,2,3]).
#{0 =&gt; [2], 1 =&gt; [1, 3]}
</code></pre>

<p>The provided fun calculates <code>X rem 2</code> for every element <code>X</code> in the input list and then groups the elements into a map with the result of <code>X rem 2</code> as the key and the corresponding elements as a list value for that key.</p>

<pre><code class="language-erlang">&gt; maps:groups_from_list(fun erlang:length/1, ["ant", "buffalo", "cat", "dingo"]).
#{3 =&gt; ["ant", "cat"], 5 =&gt; ["dingo"], 7 =&gt; ["buffalo"]}
</code></pre>

<p>In the example above the strings in the input list are grouped into a map based on their length.</p>

<p>There is also a variant of <code>groups_from_list</code> with an additional fun by which the values can be converted before they are put into their groups.</p>

<pre><code class="language-erlang">&gt; maps:groups_from_list(fun(X) -&gt; X rem 2 end, fun(X) -&gt; X*X end, [1,2,3]).
#{0 =&gt; [4], 1 =&gt; [1, 9]}
</code></pre>

<p>In the example above the elements <code>X</code> in the list are grouped according to the <code>X rem 2</code> calculation, but the values stored in the groups are the elements multiplied by themselves (<code>X * X</code>).</p>

<pre><code class="language-erlang">&gt; maps:groups_from_list(fun erlang:length/1, fun lists:reverse/1, ["ant", "buffalo", "cat", "dingo"]).
#{3 =&gt; ["tna","tac"],5 =&gt; ["ognid"],7 =&gt; ["olaffub"]}
</code></pre>

<p>In the example above the strings from the input list are grouped according to their length and they are reversed before they are stored in the groups.</p>

<p>For more details see the <a href="/doc/man/maps.html#groups_from_list-2"><code>maps:groups_from_list/2</code></a> documentation.</p>

<h2 id="listsenumerate12"><code>lists:enumerate/1,2</code></h2>

<p>Takes a list of elements and returns a new list of tuples of the form <code>{I, H}</code>, where <code>I</code> is the position of the element <code>H</code> in the original list. The enumeration starts with 1 and increases by 1 in each step.</p>

<p>Example:</p>

<pre><code class="language-erlang">&gt; lists:enumerate([a,b,c]).
[{1,a},{2,b},{3,c}]
</code></pre>

<p>There is also an <code>enumerate/2</code> function which can be used to set the initial number to something other than 1. See the example below:</p>

<pre><code class="language-erlang">&gt; lists:enumerate(10, [a,b,c]).
[{10,a},{11,b},{12,c}]
</code></pre>

<p>For more details see the <a href="/doc/man/lists.html#enumerate-1"><code>lists:enumerate/1</code></a> documentation.</p>

<h2 id="listsuniq12"><code>lists:uniq/1,2</code></h2>

<p>Removes duplicates from a list while preserving the order of the elements. The first occurrence of each element is kept. 
We already have <code>lists:usort</code> which also removes duplicates but returns a sorted list.</p>

<p>Examples:</p>

<pre><code class="language-erlang">&gt; lists:uniq([3,3,1,2,1,2,3]).
[3,1,2]
&gt; lists:uniq([a, a, 1, b, 2, a, 3]).
[a, 1, b, 2, 3]
</code></pre>
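<p>For comparison, here is what <code>lists:usort/1</code> returns for the same input:</p>

<pre><code class="language-erlang">&gt; lists:usort([3,3,1,2,1,2,3]).
[1,2,3]
</code></pre>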

<p><code>lists:uniq/2</code> lets the user provide a fun that computes the key by which elements are compared for equality. In the example below the provided fun returns the first element of each tuple, so two tuples are considered duplicates when their first elements are equal.</p>

<p>Examples:</p>
<pre><code class="language-erlang">&gt; lists:uniq(fun({X, _}) -&gt; X end, [{b, 2}, {a, 1}, {c, 3}, {a, 2}]).
[{b, 2}, {a, 1}, {c, 3}]
</code></pre>
<p>For more details see the <a href="/doc/man/lists.html#uniq-1"><code>lists:uniq/1</code></a> documentation.</p>

<h1 id="selectable-features-and-the-new-maybe_expr-feature">Selectable features and the new <code>maybe_expr</code> feature</h1>

<p>Selectable features is a new mechanism and concept whereby a new, potentially incompatible feature (language or runtime) can be introduced and tested without causing trouble for those who don’t use it.</p>

<p>When it comes to language features the intention is that they can be activated per module with no impact on modules where they are not activated.</p>

<p>Let’s use the new <code>maybe_expr</code> feature as an example.</p>

<p>In module <code>my_experiment</code> the feature is activated and used like this:</p>

<pre><code class="language-erlang">-module(my_experiment).
-export([foo/1]).

%% Enable the feature maybe_expr in this module only
%% Makes maybe a keyword which might be incompatible
%% in modules using maybe as a function name or an atom
-feature(maybe_expr,enable). 
foo(Foo) -&gt;
  maybe
    {ok, X} ?= f(Foo),
    [H|T] ?= g([1,2,3]),
    ...
  else
    {error, Y} -&gt;
        {ok, "default"};
    {ok, _Term} -&gt;
        {error, "unexpected wrapper"}
  end.
</code></pre>

<p>The compiler will note that the feature <code>maybe_expr</code> is enabled and will handle the maybe construct correctly. In the generated <code>.beam</code> file it will also be noted that
the module has enabled the feature.</p>

<p>When starting an Erlang node, the specific feature (or all features) must be enabled; otherwise, a <code>.beam</code> file using the feature will not be allowed to load.</p>

<pre><code class="language-text">erl -enable-feature maybe_expr
</code></pre>

<p>Or</p>

<pre><code class="language-text">erl -enable-feature all
</code></pre>

<p>For more details see the <a href="/doc/reference_manual/features.html">feature section</a> in the Erlang Reference Manual.</p>

<h2 id="the-new-maybe_expr-feature-eep-49">The new <code>maybe_expr</code> feature EEP-49</h2>

<p>The <a href="/eeps/eep-0049">EEP-49 “Value-Based Error Handling Mechanisms”</a>, was suggested by Fred Hebert already 2018 and now it has finally been implemented as the first feature within the new feature concept.</p>

<p>The <code>maybe ... end</code> construct is similar to <code>begin ... end</code> in that it is
used to group multiple distinct expressions as a single block. One
important difference is that the <code>maybe</code> block does not export its
variables, while <code>begin</code> does.</p>
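<p>A minimal sketch of the scoping difference (my example, assuming the <code>maybe_expr</code> feature is enabled in the module):</p>

<pre><code class="language-erlang">with_begin() -&gt;
    begin X = 1 end,
    X.    % ok: begin ... end exports X

with_maybe() -&gt;
    maybe Y = 2 end,
    Y.    % error: maybe ... end does not export Y
</code></pre>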

<p>A new type of expression (denoted <code>MatchOrReturnExprs</code>) is introduced,
which is only valid within a <code>maybe ... end</code> expression:</p>

<pre><code class="language-erlang">maybe
    Exprs | MatchOrReturnExprs
end
</code></pre>

<p><code>MatchOrReturnExprs</code> are defined as having the following form:</p>

<pre><code class="language-erlang">Pattern ?= Expr
</code></pre>

<p>This definition means that <code>MatchOrReturnExprs</code> are only allowed at the
top-level of <code>maybe ... end</code> expressions.</p>

<p>The <code>?=</code> operator takes the value returned by <code>Expr</code> and pattern matches
it against <code>Pattern</code>.</p>

<p>If the pattern matches, all variables from <code>Pattern</code> are bound in the local
environment, and the expression is equivalent to a successful <code>Pattern = Expr</code>
call. If the value does not match, the <code>maybe ... end</code> expression returns the
non-matching value directly.</p>
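<p>For example, when the match fails, the whole expression evaluates to the unmatched value:</p>

<pre><code class="language-erlang">maybe
    {ok, V} ?= {error, enoent},   % does not match, so ...
    V
end
%% ... the maybe expression returns {error, enoent}
</code></pre>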

<p>A special case exists in which we extend <code>maybe ... end</code> into the following form:</p>

<pre><code class="language-erlang">maybe
    Exprs | MatchOrReturnExprs
else
    Pattern -&gt; Exprs;
    ...
    Pattern -&gt; Exprs
end
</code></pre>

<p>This form exists to capture non-matching expressions in a <code>MatchOrReturnExprs</code>
to handle failed matches rather than returning their value. In such a case, an
unhandled failed match will raise an <code>else_clause</code> error, otherwise identical to
a <code>case_clause</code> error.</p>

<p>This extended form is useful to properly identify and handle successful and
unsuccessful matches within the same construct without risking confusion
between the happy and unhappy paths.</p>

<p>Given the structure described here, the final expression may look like:</p>

<pre><code class="language-erlang">maybe
    Foo = bar(),            % normal exprs still allowed
    {ok, X} ?= f(Foo),
    [H|T] ?= g([1,2,3]),
    ...
else
    {error, Y} -&gt;
        {ok, "default"};
    {ok, _Term} -&gt;
        {error, "unexpected wrapper"}
end
</code></pre>

<p>For more details see the <a href="/doc/reference_manual/expressions.html#maybe">maybe section</a> in the Erlang Reference Manual.</p>

<h3 id="motivation">Motivation</h3>

<p>With the <code>maybe</code> construct it is possible to reduce deeply nested conditional expressions and make messy patterns found in the wild unnecessary. It also provides a better separation of concerns when implementing functions.</p>

<h4 id="reducing-nesting">Reducing Nesting</h4>

<p>One common pattern that can be seen in Erlang is deep nesting of <code>case
... end</code> expressions used to check complex conditionals.</p>

<p>Take the following code taken from
<a href="https://github.com/erlang/otp/blob/a0ae44f324576104760a63fe6cf63e0ca31756fc/lib/mnesia/src/mnesia_backup.erl#L106-L126">Mnesia</a>,
for example:</p>

<pre><code class="language-erlang">commit_write(OpaqueData) -&gt;
    B = OpaqueData,
    case disk_log:sync(B#backup.file_desc) of
        ok -&gt;
            case disk_log:close(B#backup.file_desc) of
                ok -&gt;
                    case file:rename(B#backup.tmp_file, B#backup.file) of
                        ok -&gt;
                            {ok, B#backup.file};
                        {error, Reason} -&gt;
                            {error, Reason}
                    end;
                {error, Reason} -&gt;
                    {error, Reason}
            end;
        {error, Reason} -&gt;
            {error, Reason}
    end.
</code></pre>

<p>The code is nested to the extent that shorter aliases must be introduced
for variables (<code>OpaqueData</code> renamed to <code>B</code>), and half of the code just
transparently returns the exact values each function was given.</p>

<p>By comparison, the same code could be written as follows with the new
construct:</p>

<pre><code class="language-erlang">commit_write(OpaqueData) -&gt;
    maybe
        ok ?= disk_log:sync(OpaqueData#backup.file_desc),
        ok ?= disk_log:close(OpaqueData#backup.file_desc),
        ok ?= file:rename(OpaqueData#backup.tmp_file, OpaqueData#backup.file),
        {ok, OpaqueData#backup.file}
    end.
</code></pre>

<p>Or, to protect against <code>disk_log</code> calls returning something else than <code>ok |
{error, Reason}</code>, the following form could be used:</p>

<pre><code class="language-erlang">commit_write(OpaqueData) -&gt;
    maybe
        ok ?= disk_log:sync(OpaqueData#backup.file_desc),
        ok ?= disk_log:close(OpaqueData#backup.file_desc),
        ok ?= file:rename(OpaqueData#backup.tmp_file, OpaqueData#backup.file),
        {ok, OpaqueData#backup.file}
    else
        {error, Reason} -&gt; {error, Reason}
    end.
</code></pre>

<p>The semantics of these calls are identical, except that it is now
much easier to focus on the flow of individual operations and either
success or error paths.</p>

<h1 id="dialyzer">Dialyzer</h1>

<ul>
  <li>
    <p>Dialyzer now supports the <code>missing_return</code> and <code>extra_return</code> options to raise warnings when specifications differ from inferred types. These are similar to, but not quite as verbose as, <code>overspecs</code> and <code>underspecs</code>.</p>
  </li>
  <li>
    <p>Dialyzer now better understands the types for <code>min/2</code>, <code>max/2</code>, and <code>erlang:raise/3</code>. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use <code>erlang:raise/3</code> could now need a spec with a <code>no_return()</code> return type to avoid an unwanted warning; see the sketch after this list.</p>
  </li>
</ul>
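<p>A minimal sketch of such a spec (my example, not from the release notes):</p>

<pre><code class="language-erlang">%% Without the no_return() spec, Dialyzer may now warn that
%% this function has no local return.
-spec fail(term()) -&gt; no_return().
fail(Reason) -&gt;
    erlang:raise(error, Reason, []).
</code></pre>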

<h1 id="improvements-of-the-jit">Improvements of the JIT</h1>

<p>The <a href="https://www.erlang.org/blog/my-otp-24-highlights/#beamasm---the-jit-compiler-for-erlang">JIT compiler</a> introduced in Erlang/OTP 24 improved
the performance for Erlang applications.</p>

<p>Erlang/OTP 25 introduces some major improvements of the JIT:</p>

<ul>
  <li>
    <p>The JIT now supports the <a href="https://en.wikipedia.org/wiki/AArch64">AArch64 (ARM64)</a> architecture,
used by (for example) <a href="https://en.wikipedia.org/wiki/Apple_silicon">Apple Silicon</a> Macs and newer
<a href="https://en.wikipedia.org/wiki/Raspberry_Pi">Raspberry Pi</a> devices.</p>
  </li>
  <li>
    <p>Better code generated based on types provided by the Erlang compiler.</p>
  </li>
  <li>
    <p>Better support for <code>perf</code> and <code>gdb</code> with line numbers for Erlang code.</p>
  </li>
</ul>

<h3 id="support-for-aarch64-arm64">Support for AArch64 (ARM64)</h3>

<p>How much speedup one can expect from the JIT compared to the interpreter
varies from nothing to up to four times.</p>

<p>To get some more concrete figures we have run three different
benchmarks with the JIT disabled and enabled on a MacBook Pro (M1
processor; released in 2020).</p>

<p>First we ran the <a href="https://github.com/erlang/otp/blob/be860185407d6747dca32e8d328b041cc75ffdb3/erts/emulator/test/estone_SUITE.erl">EStone benchmark</a>. Without the JIT, 691,962
EStones were achieved and with the JIT 1,597,949 EStones. That is,
more than twice as many EStones with the JIT.</p>

<p>Next we tried running Dialyzer to build a small PLT:</p>

<pre><code class="language-text">dialyzer --build_plt --apps erts kernel stdlib
</code></pre>

<p>With the JIT, the time for building the PLT was reduced from 18.38 seconds
down to 9.64 seconds. That is, almost but not quite twice as fast.</p>

<p>Finally, we ran a benchmark for the <a href="https://www.erlang.org/doc/man/base64.html">base64</a> module included
in <a href="https://github.com/erlang/otp/issues/5639">this Github issue</a>.</p>

<p>With the JIT:</p>

<pre><code class="language-text">== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 11846 ms: 84 it/sec
fun base64:decode/1: 1000 iterations in 14617 ms: 68 it/sec
</code></pre>

<p>Without the JIT:</p>

<pre><code class="language-text">== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 25938 ms: 38 it/sec
fun base64:decode/1: 1000 iterations in 20603 ms: 48 it/sec
</code></pre>

<p>Encoding with the JIT is more than twice as fast, while the
decoding time with the JIT is about 70 percent of the decoding time
without the JIT.</p>

<h3 id="type-based-optimizations">Type-based optimizations</h3>

<p>The JIT translates one BEAM instruction at a time to native code
without any knowledge of previous instructions. For example, the native
code for the <code>+</code> operator must work for any operands: small integers that
fit in a 64-bit word, large integers, floats, and non-numbers that should
result in raising an exception.</p>

<p>In Erlang/OTP 25, the compiler embeds type information in the BEAM file
to help the JIT generate better native code without unnecessary type
tests.</p>
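<p>As a small sketch of the idea (my example), in the following function the guards let the compiler infer that both operands of <code>+</code> are small integers within known ranges, so the JIT can emit a native add without type tests or overflow checks:</p>

<pre><code class="language-erlang">add(X, Y) when is_integer(X), is_integer(Y),
               0 =&lt; X, X &lt; 256,
               0 =&lt; Y, Y &lt; 256 -&gt;
    X + Y.
</code></pre>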

<p>For more details, see the blog post <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit">Type-Based Optimizations in the
JIT</a>.</p>

<h1 id="better-support-for-perf-and-gdb">Better support for <code>perf</code> and <code>gdb</code></h1>

<p>It is now possible to profile Erlang systems with perf and get a mapping from the JIT code to the corresponding Erlang code. This will make it easy to find bottlenecks in the code.</p>

<p>The same goes for <code>gdb</code>, which can also show which line of Erlang code a specific address in the JIT code corresponds to.</p>

<p>Perf is a Linux command-line tool for lightweight CPU profiling; it checks CPU performance counters, trace points, uprobes, and kprobes, monitors program events, and creates reports.</p>

<p>An Erlang node running under <code>perf</code> can be started like this:</p>

<pre><code class="language-text">perf record --call-graph fp -- erl +JPperf true
</code></pre>
<p>The result from perf could then be viewed like this:</p>

<pre><code class="language-text">perf report
</code></pre>

<p>It is also possible to attach <code>perf</code> to an already running Erlang node like this:</p>

<pre><code class="language-text"># start Erlang at get the Pid
erl +JPperf true
</code></pre>

<p>Assume the pid of the node is <code>4711</code>.</p>

<p>You can then attach <code>perf</code> to the node like this:</p>

<pre><code class="language-text">sudo perf record --call-graph fp -p 4711
</code></pre>
<p>Below is an example where <code>perf</code> is run to analyze <code>dialyzer</code> building a PLT like this:</p>

<pre><code class="language-text"> ERL_FLAGS="+JPperf true +S 1" perf record --call-graph=fp \
   dialyzer --build_plt -Wunknown --apps compiler crypto erts kernel stdlib \
   syntax_tools asn1 edoc et ftp inets mnesia observer public_key \
   sasl runtime_tools snmp ssl tftp wx xmerl tools
</code></pre>

<p>The above code is run using <code>+S 1</code> to make the perf output easier to understand.
If you then run <code>perf report -f --no-children</code> you may get something similar to this:</p>

<p><img src="/blog/images/otp25/perf_callgraph.png" alt="alt text" title="perf call-graph" /></p>

<p>Frame pointers are enabled when the <code>+JPperf true</code> option is passed, so you can
use <code>perf record --call-graph=fp</code> to get more context.</p>

<p>Any Erlang function in the report is prefixed with a <code>$</code> and all C functions have
their normal names. Any Erlang function that has the prefix <code>$global::</code> refers
to a global shared fragment.</p>

<p>So in the above, we can see that we spend the most time doing <code>eq</code>, i.e. comparing two terms.
By expanding it and looking at its parents we can see that it is the function
<code>erl_types:t_is_equal/2</code> that contributes the most to this value. Go and have a look
at it in the source code to see if you can figure out why so much time is spent there.</p>

<p>After <code>eq</code> we see the function <code>erl_types:t_has_var/1</code>, in which we spend
almost 5% of the entire execution time. A bit further down you can see <code>copy_struct_x</code>
which is the function used to copy terms. If we expand it to view the parents
we find that it is mostly <code>ets:lookup_element/3</code> that contributes to this time
via the Erlang function <code>dialyzer_plt:ets_table_lookup/2</code>.</p>

<h3 id="perf-tips-and-tricks"><code>perf</code> tips and tricks</h3>

<p>You can do a lot of neat things with <code>perf</code>. Below is a list of some of the options
we have found useful:</p>

<ul>
  <li><code>perf report --no-children</code>
  Do not include the accumulation of all children in a call.</li>
  <li><code>perf report  --call-graph callee</code>
  Show the callee rather than the caller when expanding a function call.</li>
  <li><code>perf archive</code>
  Create an archive with all the artifacts needed to inspect the data
  on another host. In early versions of perf this command does not work;
  instead you can use <a href="https://github.com/torvalds/linux/blob/master/tools/perf/perf-archive.sh">this bash script</a>.</li>
  <li><code>perf report</code> gives “failed to process sample” and/or “failed to process type: 68”
  This probably means that you are running a buggy version of perf. We have
  seen this when running Ubuntu 18.04 with kernel version 4. If you update
  to Ubuntu 20.04 or use Ubuntu 18.04 with kernel version 5 the problem
  should go away.</li>
</ul>

<h1 id="improved-error-information-for-failing-binary-construction">Improved error information for failing binary construction</h1>

<p>Erlang/OTP 24 introduced <a href="https://www.erlang.org/blog/my-otp-24-highlights/#eep-54-improved-bif-error-information">improved BIF error information</a> to provide
more information when a call to a BIF failed.</p>

<p>In Erlang/OTP 25, improved error information is also given when the
creation of a binary using the <a href="https://www.erlang.org/doc/reference_manual/expressions.html#bit-syntax-expressions">bit syntax</a> fails.</p>

<p>Consider this function:</p>

<pre><code class="language-erlang">bin(A, B, C, D) -&gt;
    &lt;&lt;A/float,B:4/binary,C:16,D/binary&gt;&gt;.
</code></pre>

<p>If we call this function with incorrect arguments in past releases,
we are just told that something was wrong and given the line number:</p>

<pre><code class="language-text">1&gt; t:bin(&lt;&lt;"abc"&gt;&gt;, 2.0, 42, &lt;&lt;1:7&gt;&gt;).
** exception error: bad argument
     in function  t:bin/4 (t.erl, line 5)
</code></pre>

<p>But which part of line 5? Imagine that <code>t:bin/4</code> was called from deep
within an application and we had no idea what the actual values for
the arguments were. It could take a while to figure out exactly what
went wrong.</p>

<p>Erlang/OTP 25 gives us more information:</p>

<pre><code>1&gt; c(t).
{ok,t}
2&gt; t:bin(&lt;&lt;"abc"&gt;&gt;, 2.0, 42, &lt;&lt;1:7&gt;&gt;).
** exception error: construction of binary failed
     in function  t:bin/4 (t.erl, line 5)
        *** segment 1 of type 'float': expected a float or an integer but got: &lt;&lt;"abc"&gt;&gt;
</code></pre>

<p>Note that the module must be compiled by the compiler in Erlang/OTP 25 in
order to get the more informative error message. The old-style message will
be shown if the module was compiled by a previous release.</p>

<p>Here the message tells us that the first segment in the construction was given
the binary <code>&lt;&lt;"abc"&gt;&gt;</code> instead of a float or an integer, which is the expected
type for a <code>float</code> segment.</p>

<p>It seems that we switched the first and second arguments for <code>bin/4</code>,
so we try again:</p>

<pre><code class="language-text">3&gt; t:bin(2.0, &lt;&lt;"abc"&gt;&gt;, 42, &lt;&lt;1:7&gt;&gt;).
** exception error: construction of binary failed
     in function  t:bin/4 (t.erl, line 5)
        *** segment 2 of type 'binary': the value &lt;&lt;"abc"&gt;&gt; is shorter than the size of the segment
</code></pre>

<p>It seems that there was more than one incorrect argument. In this
case, the message tells us that the given binary is shorter than the
size of the segment.</p>

<p>Fixing that:</p>

<pre><code>4&gt; t:bin(2.0, &lt;&lt;"abcd"&gt;&gt;, 42, &lt;&lt;1:7&gt;&gt;).
** exception error: construction of binary failed
     in function  t:bin/4 (t.erl, line 5)
        *** segment 4 of type 'binary': the size of the value &lt;&lt;1:7&gt;&gt; is not a multiple of the unit for the segment
</code></pre>

<p>A <code>binary</code> segment has a default unit of 8. Therefore, passing a bit string of
size 7 will fail.</p>
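<p>If the trailing data really may have a bit size that is not a multiple of 8, a <code>bitstring</code> segment (which has unit 1) can be used instead. A small sketch (the function name is made up for illustration):</p>

<pre><code class="language-erlang">%% A bitstring segment accepts values of any bit size,
%% so bin2(42, &lt;&lt;1:7&gt;&gt;) succeeds and returns a 23-bit bitstring.
bin2(C, D) -&gt;
    &lt;&lt;C:16,D/bitstring&gt;&gt;.
</code></pre>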

<p>Finally:</p>

<pre><code class="language-text">5&gt; t:bin(2.0, &lt;&lt;"abcd"&gt;&gt;, 42, &lt;&lt;1:8&gt;&gt;).
&lt;&lt;64,0,0,0,0,0,0,0,97,98,99,100,0,42,1&gt;&gt;
</code></pre>

<h1 id="improved-error-information-for-failed-record-matching">Improved error information for failed record matching</h1>

<p>Another improvement is to the exception raised when matching of a record fails.</p>

<p>Consider this record and function:</p>

<pre><code class="language-erlang">-record(rec, {count}).

rec_add(R) -&gt;
    R#rec{count = R#rec.count + 1}.

</code></pre>

<p>In past releases, failure to match a record or retrieve an element from
a record would result in the following exception:</p>

<pre><code class="language-text">1&gt; t:rec_add({wrong,0}).
** exception error: {badrecord,rec}
     in function  t:rec_add/1 (t.erl, line 8)
</code></pre>

<p>Before Erlang/OTP 15, which introduced line numbers in exceptions, knowing
which record was expected could be useful if the error occurred in
a large function.</p>

<p>Nowadays, unless several different records are accessed on the same
line, the line number makes it obvious which record was expected.</p>

<p>Therefore, in Erlang/OTP 25 the <code>badrecord</code> exception has been changed
to show the actual incorrect value:</p>

<pre><code class="language-text">2&gt; t:rec_add({wrong,0}).
** exception error: {badrecord,{wrong,0}}
     in function  t:rec_add/1 (t.erl, line 8)
</code></pre>

<p>The new <code>badrecord</code> exceptions will show up for code that has been compiled
with Erlang/OTP 25.</p>

<h1 id="relocatable-installation-directory">Relocatable installation directory</h1>

<p>Previously, shell scripts (e.g., <code>erl</code> and <code>start</code>) and the <code>RELEASES</code> file
for an Erlang installation depended on a hard-coded absolute path to the
installation’s root directory. This made it cumbersome to move an
installation to a different directory which can be problematic for platforms
such as Android (<a href="https://github.com/erlang/otp/pull/2879">#2879</a>) where the
installation directory is unknown at compile time. This is fixed by:</p>

<ul>
  <li>
    <p>Changing the shell scripts so that they can dynamically find the
<code>ROOTDIR</code>. The dynamically found <code>ROOTDIR</code> is selected if it differs
from the hard-coded <code>ROOTDIR</code> and seems to point to a valid Erlang
installation. The <code>dyn_erl</code> program has been changed so that it can
return its absolute canonicalized path when given the <code>--realpath</code>
argument (<code>dyn_erl</code> gets its absolute canonicalized path from the
<code>realpath</code> POSIX function). The <code>dyn_erl</code> <code>--realpath</code>
functionality is used by the scripts to get the root dir dynamically.</p>
  </li>
  <li>
    <p>Changing the <code>release_handler</code> module that reads and writes the
<code>RELEASES</code> file so that it prepends <code>code:root_dir()</code> whenever it
encounters relative paths. This is necessary since the current
working directory can be changed so that it differs from
<code>code:root_dir()</code>.</p>
  </li>
</ul>

<h1 id="ets-tables-with-adaptive-support-for-write-concurrency">ETS-tables with adaptive support for write concurrency</h1>

<p>It has long been possible to optimize an ETS table for write concurrency like this:</p>

<pre><code class="language-erlang">ets:new(my_table, [{write_concurrency, true}]).
</code></pre>

<p>Now we also introduce adaptive support for write concurrency which can be configured like this:</p>

<pre><code class="language-erlang">ets:new(my_table, [{write_concurrency, auto}]).
</code></pre>

<p>This option makes tables automatically adjust the number of locks used at run-time, depending on how much concurrency is detected. When you enable automatic write concurrency, <code>decentralized_counters</code> are also activated for even more scalable ETS tables. Use this option when you know that a lot of processes will be accessing an ETS table on systems with many cores.</p>

<p>For more details you can read <a href="https://github.com/erlang/otp/pull/5208">PR 5208</a> that introduced the change and the <a href="/blog/scalable-ets-counters/">blog post about decentralized counters</a>.</p>

<h1 id="new-option-short-for-erlangfloat_to_list2-and-erlangfloat_to_binary2">New option <code>short</code> for <code>erlang:float_to_list/2</code> and <code>erlang:float_to_binary/2</code></h1>

<p>A new option called <code>short</code> has been added to the functions <code>erlang:float_to_list/2</code> and <code>erlang:float_to_binary/2</code>. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.</p>

<p>If option <code>short</code> is specified, the float is formatted
with the smallest number of digits that still guarantees that</p>

<pre><code class="language-erlang">F =:= list_to_float(float_to_list(F, [short]))
</code></pre>

<p>When the float is inside the range (-2⁵³, 2⁵³), the notation
that yields the smallest number of characters is used (scientific
notation or normal decimal notation). Floats outside the range
(-2⁵³, 2⁵³) are always formatted using scientific notation to avoid confusing
results when doing arithmetic operations.</p>
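<p>As an illustration, here is an assumed shell session comparing the default format (scientific notation with 20 digits) with the <code>short</code> option:</p>

<pre><code class="language-text">1&gt; float_to_list(0.1).
"1.00000000000000005551e-01"
2&gt; float_to_list(0.1, [short]).
"0.1"
</code></pre>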

<p>The implementation was contributed by Thomas Depierre and uses the Ryū algorithm.</p>

<p>Ryū is a new algorithm that converts binary floating point numbers to their decimal representations using only fixed-size integer operations. Ryū is simpler and approximately three times faster than the previously fastest implementation.
<a href="https://github.com/ulfjack/ryu">https://github.com/ulfjack/ryu</a></p>

<h1 id="the-new-module-peer-supersedes-the-slave-module">The new module <code>peer</code> supersedes the slave module</h1>

<p>The <a href="/doc/man/peer.html"><code>peer</code></a> module provides functions for starting linked Erlang nodes. The Erlang node spawning new “peer” nodes is called <code>origin</code>, and the newly started nodes are peers.</p>

<p>A peer node automatically terminates when it loses the control connection to the origin. This connection could be an Erlang distribution connection, or an alternative one: TCP or standard I/O. The alternative connection provides a way to execute remote procedure calls even when Erlang Distribution is not available, making it possible to test the distribution itself.</p>

<p>Peer node terminal input/output is relayed through the origin. If a standard I/O alternative connection is requested, console output also goes via the origin, allowing debugging of node startup and boot script execution (see <a href="/doc/man/erl#flags">-init_debug</a>). File I/O is not redirected, contrary to <a href="/doc/man/slave.html"><code>slave</code></a> behavior.</p>

<p>The peer node can start on the same or a different host (via ssh) or in a separate container (for example Docker). When the peer starts on the same host as the origin, it inherits the current directory and environment variables from the origin.</p>
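<p>As a hedged sketch (not taken from the documentation), starting a named peer on the same host, running a remote call on it, and stopping it could look like this:</p>

<pre><code class="language-erlang">%% Start a linked peer node with a unique name, call a function
%% on it, and stop it.  peer:call/4 runs the call on the peer node.
{ok, Pid, Node} = peer:start_link(#{name =&gt; peer:random_name()}),
Node = peer:call(Pid, erlang, node, []),  %% the call runs on the peer
peer:stop(Pid).
</code></pre>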

<h2 id="note">Note</h2>

<p>This module is designed to facilitate multi-node testing with Common Test. Use the <code>?CT_PEER()</code> macro to start a linked peer node according to Common Test conventions (crash dumps written to a specific location; node name prefixed with the module name, calling function, and origin OS process ID). Use <a href="/doc/man/peer.html#random_name-1"><code>random_name/1</code></a> to create sufficiently unique node names if you need more control.</p>

<p>A peer node started without an alternative connection behaves similarly to <code>slave(3)</code>.</p>

<h1 id="gen_xxx-modules-has-got-a-new-format_status1-callback"><code>gen_XXX</code> modules has got a new <code>format_status/1</code> callback.</h1>

<p>The <a href="/doc/man/gen_server.html#Module:format_status-2"><code>format_status/2</code></a> callback for <code>gen_server</code>, <code>gen_statem</code> and <code>gen_event</code> has been deprecated in favor of the new <a href="/doc/man/gen_server.html#Module:format_status-1"><code>format_status/1</code></a> callback.</p>

<p>The new callback adds the possibility to limit and change many more things than just the state.</p>

<p>The purpose of both the old and the new <code>format_status</code> callbacks is to let the user filter out sensitive information, and possibly very large data, from crash reports.</p>
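<p>As a minimal sketch, assuming a <code>gen_server</code> whose state is a map containing a <code>password</code> key, the new callback could redact that field like this (the status map is the one passed to the callback by <code>gen_server</code>):</p>

<pre><code class="language-erlang">%% Remove a sensitive field from the state before it appears in
%% crash reports and sys:get_status/1 output.
format_status(Status) -&gt;
    maps:map(fun(state, State) -&gt; maps:remove(password, State);
                (_Key, Value) -&gt; Value
             end, Status).
</code></pre>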

<h1 id="the-timer-module-has-been-modernized-and-made-more-efficient">The <code>timer</code> module has been modernized and made more efficient</h1>

<p>The <code>timer</code> module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The <code>timer:sleep/1</code> function now accepts an arbitrarily large integer.</p>
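<p>For example (a sketch; the value exceeds the 2<sup>32</sup> - 1 millisecond limit of a plain <code>receive ... after</code> timeout):</p>

<pre><code class="language-erlang">%% Sleep for 100 days in a single call.
timer:sleep(100 * 24 * 60 * 60 * 1000).
</code></pre>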

<h1 id="crypto-and-openssl-30">Crypto and OpenSSL 3.0</h1>

<p>Some applications in OTP like SSL/TLS and SSH need cryptography to
work. That is provided by the OTP application crypto, which interfaces
Erlang to an external cryptolib in C using NIFs. The main example of
such an external cryptolib is <a href="https://www.openssl.org">OpenSSL</a>.</p>

<p>The OpenSSL cryptolib exists in many versions. OTP/crypto supports
0.9.8c and later, although only 1.1.1 is still maintained by OpenSSL.</p>

<p>OpenSSL has released its version 3.0 series, which is its platform
for the future, completely rebuilt with a new API.  The APIs of previous
versions (1.1.1 and older) are partly deprecated, although still
available in 3.0.  Support for 1.1.1 will also end at some point in the future.</p>

<p>Since it is vital to get security patches for the cryptolib, and in the
future only the 3.0 API might be available, OTP/crypto from
OTP 25.0 now interfaces OpenSSL 3.0 using the new 3.0 API.  A few functions
from old APIs are still used, but they will be replaced as soon as
possible.</p>

<p>You as a user will hopefully not notice any difference: if you have
OpenSSL 1.1.1 (or older - not recommended) and build OTP, that one will
be used as previously.  If you have any OpenSSL 3.0 version installed,
that one will be used without the need to do anything special, except
for normal handling of dynamic loading paths in the OS.</p>

<h1 id="ca-certificates-can-be-fetched-from-the-os-standard-place">CA-certificates can be fetched from the OS standard place</h1>

<p>With the new functions <code>public_key:cacerts_load/0,1</code> and <code>public_key:cacerts_get/0</code> the CA certificates can be fetched from the standard place of the OS (or from a file).</p>

<p>They will then be cached in decoded form by use of <code>persistent_term</code> which makes them available in an efficient way for the <code>ssl</code> and <code>httpc</code> modules. The intention with this is to make it unnecessary to depend on for example <code>certifi</code> in many packages.</p>

<p>On Windows and macOS the certificate store is not an ordinary file, so the information is fetched via an API using a NIF (Windows) or with an external program (macOS).</p>

<p>Example with <code>ssl</code>:</p>

<pre><code class="language-erlang">%% makes the certificates available without copying
CaCerts = public_key:cacerts_get(), 
% use the certificates when establishing a connection
{ok,Socket} = ssl:connect("erlang.org",443,[{cacerts,CaCerts}, {verify,verify_peer}]), 
...
</code></pre>

<p>We also plan to update the http client (<code>httpc</code>) to use this soon.</p>

<h1 id="a-new-fast-pseudo-random-generator">A new fast Pseudo Random Generator</h1>

<p>A new custom designed Pseudo Random Generator <a href="/doc/man/rand.html#mwc59-1"><code>rand:mwc59</code></a>
has been implemented. It is probably the fastest possible
generator with good quality that can be written in Erlang.
To achieve this it barely avoids bignums and heap data allocation,
and uses only a minimal number of fast operations.</p>

<p>Under the “right” circumstances, a number that takes 60 ns to generate
with the default generator can be generated in 4 ns with <code>rand:mwc59</code>.</p>

<p>It is intended for applications in dire need of speed
in PRNG numbers, but not any of the comfort features
that <a href="/doc/man/rand.html"><code>rand</code></a> otherwise offers.</p>]]></content><author><name>Kenneth Lundin</name></author><category term="erlang" /><category term="otp" /><category term="25" /><category term="release" /><summary type="html"><![CDATA[OTP 25 is finally here. This post will introduce the new features that I am most excited about.]]></summary></entry><entry><title type="html">Fast random integers</title><link href="https://www.erlang.org/blog/faster-rand/" rel="alternate" type="text/html" title="Fast random integers" /><published>2022-05-12T00:00:00+00:00</published><updated>2022-05-12T00:00:00+00:00</updated><id>https://www.erlang.org/blog/faster-rand</id><content type="html" xml:base="https://www.erlang.org/blog/faster-rand/"><![CDATA[<p>When you need “random” integers, and it is essential
to generate them fast and cheap; then maybe the full featured
Pseudo Random Number Generators in the <code>rand</code> module are overkill.
This blog post will dive into new additions to the
said module, how the Just-In-Time compiler optimizes them,
known tricks, and tries to compare these apples and potatoes.</p>

<h4 id="contents">Contents</h4>
<ul>
  <li><a href="#speed-over-quality">Speed over quality?</a></li>
  <li><a href="#suggested-solutions">Suggested solutions</a></li>
  <li><a href="#quality">Quality</a></li>
  <li><a href="#storing-the-state">Storing the state</a></li>
  <li><a href="#seeding">Seeding</a></li>
  <li><a href="#jit-optimizations">JIT optimizations</a></li>
  <li><a href="#implementing-a-prng">Implementing a PRNG</a></li>
  <li><a href="#rand_suitemeasure1"><code>rand_SUITE:measure/1</code></a></li>
  <li><a href="#measurement-results">Measurement results</a></li>
  <li><a href="#summary">Summary</a></li>
</ul>

<h2 id="speed-over-quality">Speed over quality?</h2>

<p>The Pseudo Random Number Generators implemented in
the <code>rand</code> module offer many useful features such as
repeatable sequences, non-biased range generation,
any size range, non-overlapping sequences,
generating floats, normal distribution floats, etc.
Many of those features are implemented through
a plug-in framework, with a performance cost.</p>

<p>The different algorithms offered by the <code>rand</code> module are selected
to have excellent statistical quality and to perform well
in serious PRNG tests (see section <a href="#prng-tests">PRNG tests</a>).</p>

<p>Most of these algorithms are designed for machines with
64-bit arithmetic (unsigned), but in Erlang such integers
become bignums and almost an order of magnitude slower
to handle than immediate integers.</p>

<p>Erlang terms in the 64-bit VM are tagged 64-bit words.
The tag for an immediate integer is 4 bit, leaving 60 bits
for the signed integer value.  The largest positive
immediate integer value is therefore 2<sup>59</sup>-1.</p>

<p>Many algorithms work on unsigned integers, so we have
59 bits useful for that.  It would theoretically be
possible to pretend 60 bits unsigned using split code paths
for negative and positive values, but that would be extremely impractical.</p>

<p>We decided to choose 58 bit unsigned integers in this context
since then we can for example add two integers, and check
for overflow or simply mask back to 58 bit, without
the intermediate result becoming a bignum.  To work with
59 bit integers would require having to check for overflow
before even doing an addition so the code that avoids
bignums would eat up much of the speed gained from
avoiding bignums.  So 58-bit integers it is!</p>
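<p>As an illustrative sketch (not code from the <code>rand</code> module): with 58-bit operands an addition can at most reach 59 bits, which is still an immediate value, so the result can simply be masked back afterwards:</p>

<pre><code class="language-erlang">%% A and B are assumed to be 58-bit unsigned integers.  Their sum
%% fits in 59 bits, still an immediate integer, so one mask brings
%% it back to 58 bits without any bignum appearing.
add58(A, B) -&gt;
    (A + B) band ((1 bsl 58) - 1).
</code></pre>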

<p>The algorithms that perform well in Erlang are the ones
that have been redesigned to work on 58-bit integers.
But still, when executed in Erlang, they are far from
as fast as their C origins.  Achieving good PRNG quality
costs much more in Erlang than in C.  In the section
<a href="#measurement-results">Measurement results</a> we see that the algorithm <code>exsp</code>
that boasts sub-ns speed in C needs 17 ns in Erlang.</p>

<p>32-bit Erlang is a sad story in this regard.  The bignum limit
on such an Erlang system is so low (calculations would have
to use 26-bit integers) that a PRNG designed
not to use bignums must be so small in period and value size
that it becomes too weak to be useful.
The known trick <code>erlang:phash2(erlang:unique_integer(), Range)</code>
is still fairly fast, but all <code>rand</code> generators work exactly the same
as on a 64-bit system, hence operate on bignums and are much slower.</p>

<p>If your application needs a “random” integer for a non-critical
purpose such as selecting a worker, choosing a route, etc,
and performance is much more important than repeatability
and statistical quality, what are then the options?</p>

<h2 id="suggested-solutions">Suggested solutions</h2>

<ul>
  <li><a href="#use-rand-anyway">Use <code>rand</code> anyway</a></li>
  <li><a href="#write-a-bif">Write a BIF</a></li>
  <li><a href="#write-a-nif">Write a NIF</a></li>
  <li><a href="#use-the-system-time">Use the system time</a></li>
  <li><a href="#hash-a-unique-value">Hash a “unique” value</a></li>
  <li><a href="#write-a-simple-prng">Write a simple PRNG</a></li>
</ul>

<p>Reasoning and measurement results are in the following sections,
but, in short:</p>

<ul>
  <li>Writing a NIF, we deemed, does not achieve a performance worth the effort.</li>
  <li>Neither does writing a BIF.  But, … a BIF (and a NIF, maybe) could
implement a combination of performance and quality that cannot
be achieved in any other way.  If a high demand on this combination
would emerge, we could reconsider this decision.</li>
  <li>Using the system time is a bad idea.</li>
  <li><code>erlang:phash2(erlang:unique_integer(), Range)</code> has its use cases.</li>
  <li>We have implemented a simple PRNG to fill the niche of non-critical
but very fast number generation: <code>mwc59</code>.</li>
</ul>

<h3 id="use-rand-anyway">Use <code>rand</code> anyway</h3>

<p>Is <code>rand</code> slow, really?  Well, perhaps not considering what it does.</p>

<p>The <a href="#measurement-results">Measurement results</a> at the end of this text
show that generating a good quality random number using
the <code>rand</code> module’s default algorithm takes 45 ns.</p>

<p>Generating a number as fast as possible (<code>rand:mwc59/1</code>) can be done
in less than 4 ns, but that algorithm has problems with the
statistical quality.  See section <a href="#prng-tests">PRNG tests</a> and <a href="#implementing-a-prng">Implementing a PRNG</a>.</p>

<p>Using a good quality algorithm instead (<code>rand:exsp_next/1</code>) takes 16 ns,
if you can store the generator’s state in a loop variable.</p>

<p>If you cannot store the generator state in a loop variable,
there will be more overhead; see section <a href="#storing-the-state">Storing the state</a>.</p>

<p>Now, if you also need a number in an awkward range, as in not much smaller
than the generator’s size, you might have to implement a reject-and-resample
loop, or even concatenate numbers.</p>

<p>The overhead of code that has to re-implement this many of the features
that the <code>rand</code> module already offers will easily approach
the module’s own 26 ns overhead, so often there is no point in
re-inventing this wheel…</p>

<h3 id="write-a-bif">Write a BIF</h3>

<p>There has been a discussion thread on Erlang Forums:
<a href="https://erlangforums.com/t/looking-for-a-faster-rng/">Looking for a faster RNG</a>.  Triggered by this Andrew Bennett
(aka <a href="https://github.com/potatosalad/">potatosalad</a>) wrote an <a href="https://erlangforums.com/t/looking-for-a-faster-rng/1163/17">experimental BIF</a>.</p>

<p>The suggested BIF <code>erlang:random_integer(Range)</code> offered
no repeatability, generator state per scheduler, guaranteed
sequence separation between schedulers, and high generator
quality.  All this thanks to using one of the good generators from
the <code>rand</code> module, but now written in its original
programming language, C, in the BIF.</p>

<p>The performance was a bit slower than the <code>mwc59</code> generator state update,
but with top of the line quality. See section <a href="#measurement-results">Measurement results</a>.</p>

<p>Questions arose regarding the maintenance burden, what more to implement, etc.
For example, we would probably also need <code>erlang:random_integer/0</code>,
<code>erlang:random_float/0</code>, and some system info
to get the generator bit size…</p>

<p>A BIF could achieve good performance on a 32-bit system too, if it
would return a 27-bit integer there, which became another open question.
Should a BIF generator be platform independent with respect to
generated numbers or with respect to performance?</p>

<h3 id="write-a-nif">Write a NIF</h3>

<p><a href="https://github.com/potatosalad/">potatosalad</a> also wrote <a href="https://erlangforums.com/t/looking-for-a-faster-rng/1163/23">a NIF</a>, since we (The Erlang/OTP team)
suggested that it could have good enough performance.</p>

<p>Measurements, however, showed that the overhead is significantly larger
than for a BIF.  Although the NIF used the same trick as the BIF to store
the state in thread specific data it ended up with the same
performance as <code>erlang:phash2(erlang:unique_integer(), Range)</code>,
which is about 2 to 3 times slower than the BIF.</p>

<p>As a speed improvement we tried to have the NIF generate
a list of numbers, and use that list as a cache in Erlang.
The performance with such a cache was as fast as the BIF’s,
but it introduced problems: you would have to decide
on a cache size, the application would have to keep the cache on the heap,
and when generating in a number range the whole cache would have
to be generated in the same range.</p>

<p>A NIF could like a BIF also achieve good performance on a 32-bit system,
with the same open question — platform independent numbers or performance?</p>

<h3 id="use-the-system-time">Use the system time</h3>

<p>One suggested trick is to use <code>os:system_time(microseconds)</code> to get
a number.  The trick has some peculiarities:</p>
<ul>
  <li>When called repeatedly you might get the same number several times.</li>
  <li>The resolution is system dependent, so on some systems you get
the same number even more often.</li>
  <li>Time can jump backwards and repeat in some cases.</li>
  <li>Historically it has been a bottleneck, especially on virtualized
platforms.  Getting the OS time is harder than expected.</li>
</ul>

<p>See section <a href="#measurement-results">Measurement results</a> for the performance of this “solution”.</p>

<h3 id="hash-a-unique-value">Hash a “unique” value</h3>

<p>The best combination would most certainly be
<code>erlang:phash2(erlang:unique_integer(), Range)</code> or
<code>erlang:phash2(erlang:unique_integer())</code> which is slightly faster.</p>

<p><code>erlang:unique_integer/0</code> is designed to return a unique integer
with a very small overhead.  It is hard to find a better candidate
for an integer to hash.</p>

<p><code>erlang:phash2/1,2</code> is the current generic hash function for Erlang terms.
It has a default return size well suited for 32-bit Erlang systems,
and it has a <code>Range</code> argument.  The range capping is done with a simple
<code>rem</code> in C (<code>%</code>), which is much faster than in Erlang.  This works well
only for ranges much smaller than 32 bits: if the range is larger
than 16 bits, the bias introduced by the range capping starts to be noticeable.</p>

<p>Alas this solution does not perform well in <a href="#prng-tests">PRNG tests</a>.</p>

<p>See section <a href="#measurement-results">Measurement results</a> for the performance of this solution.</p>

<h3 id="write-a-simple-prng">Write a simple PRNG</h3>

<p>To be fast, the implementation of a PRNG algorithm cannot
execute many operations.  The operations have to be
on immediate values (not bignums), and the return
value from a function has to be an immediate value
(a compound term would burden the garbage collector).
This seriously limits how powerful the algorithms can be.</p>

<p>We wrote one and named it <code>mwc59</code> because it has a 59-bit
state, and the most thorough scrambling function returns
a 59-bit value.  There is also a faster, intermediate scrambling
function, that returns a 32-bit value, which is the “digit” size
of the MWC generator.  It is also possible to directly
use the low 16 bits of the state without scrambling.
See section <a href="#implementing-a-prng">Implementing a PRNG</a> for how this generator
was designed and why.</p>
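<p>For example (a sketch), drawing a 16-bit value directly from the state:</p>

<pre><code class="language-erlang">%% T0 is an mwc59 state, for example from rand:mwc59_seed().
T1 = rand:mwc59(T0),            %% update the state
V = T1 band ((1 bsl 16) - 1).   %% the low 16 bits need no scrambling
</code></pre>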

<p>As another gap filler between really fast with low quality,
and full featured, an internal function in <code>rand</code> has been exported:
<code>rand:exsp_next/1</code>.  This function implements Xoroshiro116+ that exists
within the <code>rand</code> plug-in framework as algorithm <code>exsp</code>.
It has been exported so it is possible to get good quality without
the plug-in framework overhead, for applications that do not
need any framework features.</p>

<p>See section <a href="#measurement-results">Measurement results</a> for speed comparisons.</p>

<h2 id="quality">Quality</h2>

<p>There are many different aspects of a PRNG’s quality.
Here are some.</p>

<h3 id="period">Period</h3>

<p><code>erlang:phash2(erlang:unique_integer(), Range)</code> has, conceptually,
an infinite period, since the time it will take for it to repeat
is assumed to be longer than the Erlang node will survive.</p>

<p>For the new fast <code>mwc59</code> generator the period is about 2<sup>59</sup>.
For the regular ones in <code>rand</code> it is at least 2<sup>116</sup> - 1,
which is a huge difference.  It might be possible to consume
2<sup>59</sup> numbers during an Erlang node’s lifetime,
but not 2<sup>116</sup>.</p>

<p>There are also generators in <code>rand</code> with a period of
2<sup>928</sup> - 1 which might seem ridiculously long,
but this facilitates generating very many parallel sub-sequences
guaranteed to not overlap.</p>

<p>In, for example, a physical simulation it is common practice to use only
a fraction of the generator’s period, both regarding how many numbers
you generate and how large a range you generate in; otherwise it may affect
the simulation, for example in that specific numbers do not reoccur.
If you have pulled 3 aces from a deck you know there is only one left.</p>

<p>Some applications may be sensitive to the generator period,
while others are not, and this needs to be considered.</p>

<h3 id="size">Size</h3>

<p>The value size of the new fast <code>mwc59</code> generators is 59, 32, or 16 bits,
depending on the scrambling function that is used.
Most of the regular generators in the <code>rand</code> module have got
a value size of 58 bits.</p>

<p>If you need numbers in a power of 2 range then you can
simply mask out the low bits:</p>

<pre><code class="language-erlang">V = X band ((1 bsl RangeBits) - 1).
</code></pre>

<p>Or shift down the required number of bits:</p>

<pre><code class="language-erlang">V = X bsr (GeneratorBits - RangeBits).
</code></pre>

<p>Which to use depends on whether the generator is known to have weak high or low bits.</p>

<p>If the range you need is not a power of 2, but still
much smaller than the generator’s size you can use <code>rem</code>:</p>

<pre><code class="language-erlang">V = X rem Range.
</code></pre>

<p>The rule of thumb is that <code>Range</code> should be less than
the square root of the generator’s size.  This is much slower
than bit-wise operations, and the operation propagates low bits,
which can be a problem if the generator is known to have weak low bits.</p>

<p>Another way is to use truncated multiplication:</p>

<pre><code class="language-erlang">V = (X * Range) bsr GeneratorBits
</code></pre>

<p>The rule of thumb here is that <code>Range</code> should be less than
the square root of 2<sup>GeneratorBits</sup>, that is,
2<sup>GeneratorBits/2</sup>.  Also, <code>X * Range</code>
should not create a bignum, so not more than 59 bits.
This method propagates high bits, which can be a problem
if the generator is known to have weak high bits.</p>

<p>Other tricks are possible, for example if you need numbers
in the range 0 through 999 you may use bit-wise operations to get
a number 0 through 1023, and if too high re-try, which actually
may be faster on average than using <code>rem</code>.  This method is also
completely free from bias in the generated numbers.  The rules of thumb
for the previous methods are there to keep the bias so small
that it becomes hard to notice.</p>
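<p>A hedged sketch of that re-try trick, where <code>next/1</code> is a placeholder for any state-updating generator whose new state has good low bits, for example <code>rand:mwc59/1</code>:</p>

<pre><code class="language-erlang">%% Unbiased number in 0..999: take 10 bits, re-try when too high.
uniform1000(T0) -&gt;
    T1 = next(T0),
    case T1 band 1023 of
        V when V &lt; 1000 -&gt; {V, T1};  %% accept
        _ -&gt; uniform1000(T1)         %% re-try, about 2.3% of the draws
    end.
</code></pre>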

<h3 id="spectral-score">Spectral score</h3>

<p>The spectral score of a generator measures how unrelated the numbers
in a sequence from the generator are.  A sequence of N numbers is interpreted
as an N-dimensional vector, and the spectral score for dimension N is a measure
of how evenly these vectors are distributed in an N-dimensional (hyper)cube.</p>

<p><code>os:system_time(microseconds)</code> simply increments so it should have
a lousy spectral score.</p>

<p><code>erlang:phash2(erlang:unique_integer(), Range)</code> has got an unknown
spectral score, since that is not part of the math behind a hash function.
But a hash function is designed to distribute the hash value well
for any input, so one can hope that the statistical
distribution of the numbers is decent and “random” anyway.
Unfortunately this does not seem to hold in <a href="#prng-tests">PRNG tests</a>.</p>

<p>All regular PRNG:s in the <code>rand</code> module have got good spectral scores.
The new <code>mwc59</code> generator mostly has too, but not in 2 and 3 dimensions,
due to its unbalanced design and power of 2 multiplier.
Scramblers are used to compensate for those flaws.</p>

<h3 id="prng-tests">PRNG tests</h3>

<p>There are test frameworks that tests the statistical properties
of PRNG:s, such as the <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a> framework, or <a href="http://pracrand.sourceforge.net/">PractRand</a>.</p>

<p>The regular generators in the <code>rand</code> module perform well
in such tests, and pass thorough test suites.</p>

<p>Although the <code>mwc59</code> generator passes <a href="http://pracrand.sourceforge.net/">PractRand</a> 2 TB
and <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a> with its low 16 bits without any scrambling,
its statistical problems show when the test parameters
are tweaked just a little.  To perform well in more cases,
and with more bits, scrambling functions are needed.
Still, the small state space and the flaws of the base generator
makes it hard to pass all tests with flying colors.
With the thorough double Xorshift scrambler it gets very good, though.</p>

<p><code>erlang:phash2(N, Range)</code> over an incrementing sequence does not do well
in <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a>, which suggests that a hash function has got different
design criteria from PRNG:s.</p>

<p>However, these kind of tests may be completely irrelevant
for your application.</p>

<h3 id="predictability">Predictability</h3>

<p>For some applications, a generated number may even have to be
cryptographically unpredictable, while for others there are
no strict requirements.</p>

<p>There is a grey-zone for “non-critical” applications where for example
a rogue party may be able to affect input data, and if it knows the PRNG
sequence can steer all data to a hash table slot, overload one particular
worker process, or something similar, and in this way attack an application.
And, an application that starts out as “non-critical” may one day
silently have become business critical…</p>

<p>This is an aspect that needs to be considered.</p>

<h2 id="storing-the-state">Storing the state</h2>

<p>If the state of a PRNG can be kept in a loop variable, the cost
can be almost nothing.  But as soon as it has to be stored in
a heap variable it will cost performance due to heap data
allocation, term building, and garbage collection.</p>

<p>In the section <a href="#measurement-results">Measurement results</a> we see that the fastest PRNG
can generate a new state that is also the generated integer
in just under 4 ns.  Unfortunately, just to return both
the value and the new state in a 2-tuple adds roughly 10 ns.</p>

<p>The application state in which the PRNG state must be stored
is often more complex, so the cost for updating it will
probably be even larger.</p>
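<p>A sketch of the cheap case, where the state lives in a loop variable of a tail-recursive function:</p>

<pre><code class="language-erlang">%% Generate and consume N numbers; the mwc59 state is threaded
%% through the loop, so no heap data is built for it.
loop(0, _T, Acc) -&gt; Acc;
loop(N, T0, Acc) -&gt;
    T1 = rand:mwc59(T0),
    loop(N - 1, T1, Acc + (T1 band 1)).
</code></pre>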

<h2 id="seeding">Seeding</h2>

<p>Seeding is related to predictability.  If you can guess
the seed you know the generator output.</p>

<p>The seed is generator dependent, and creating a good
seed usually takes much longer than generating a number.
Sometimes the seed and its predictability are so unimportant
that a constant can be used.  If a generator instance
generates just a few numbers per seeding, then seeding
can be the harder problem.</p>

<p><code>erlang:phash2(erlang:unique_integer(), Range)</code> is pre-seeded,
or rather cannot be seeded, so it has no seeding cost, but can
on the other hand be rather predictable, if it is possible to estimate
how many unique integers that have been generated since node start.</p>

<p>The default seeding in the <code>rand</code> module uses a combination
of a hash value of the node name, the system time,
and <code>erlang:unique_integer()</code>, to create a seed,
which is hopefully sufficiently unpredictable.</p>
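<p>For example (a sketch), an explicit constant seed gives a repeatable, predictable sequence, while the dedicated <code>mwc59</code> seed function creates a reasonably unpredictable state:</p>

<pre><code class="language-erlang">%% Repeatable by design: a constant seed for the default generator.
_ = rand:seed(exsss, {1, 2, 3}),
%% Unpredictable-ish: seed the mwc59 generator from varying sources.
T0 = rand:mwc59_seed().
</code></pre>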

<p>The suggested NIF and BIF solutions would also need
a way to create a good enough seed, where “good enough”
is hard to put a number on.</p>

<h2 id="jit-optimizations">JIT optimizations</h2>

<p>The speed of the newly implemented <code>mwc59</code> generator
is partly thanks to the recent <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/">type-based optimizations</a> in the compiler
and the Just-In-Time compiling BEAM code loader.</p>

<h3 id="with-no-type-based-optimization">With no type-based optimization</h3>

<p>This is the Erlang code for the <code>mwc59</code> generator:</p>

<pre><code class="language-erlang">mwc59(CX) -&gt;
    C = CX band ((1 bsl 32)-1),
    X = CX bsr 32,
    16#7fa6502 * X + C.
</code></pre>

<p>The code compiles to this Erlang BEAM assembler (<code>erlc -S rand.erl</code>),
using the <code>no_type_opt</code> flag to disable type-based optimizations:</p>

<pre><code class="language-text">    {gc_bif,'bsr',{f,0},1,[{x,0},{integer,32}],{x,1}}.
    {gc_bif,'band',{f,0},2,[{x,0},{integer,4294967295}],{x,0}}.
    {gc_bif,'*',{f,0},2,[{x,0},{integer,133850370}],{x,0}}.
    {gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
</code></pre>

<p>When loaded by the JIT (x86) (<code>erl +JDdump true</code>)
the machine code becomes:</p>

<pre><code class="language-nasm"># i_bsr_ssjd
    mov rsi, qword ptr [rbx]
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L2271
</code></pre>

<p>Above is a test of whether <code>{x,0}</code> is a small integer; if not,
the fallback at <code>L2271</code> is called to handle any term.</p>

<p>Then follows the machine code for right shift, Erlang <code>bsr 32</code>,
x86 <code>sar rax, 32</code>, and a skip over the fallback code:</p>

<pre><code class="language-nasm">    mov rax, rsi
    sar rax, 32
    or rax, 15
    short jmp L2272
L2271:
    mov eax, 527
    call 140439031217336
L2272:
    mov qword ptr [rbx+8], rax
# line_I
</code></pre>

<p>Here follows <code>band</code> with similar test and fallback code:</p>

<pre><code class="language-nasm"># i_band_ssjd
    mov rsi, qword ptr [rbx]
    mov rax, 68719476735
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L2273
    and rax, rsi
    short jmp L2274
L2273:
    call 140439031216768
L2274:
    mov qword ptr [rbx], rax
</code></pre>

<p>Below comes <code>*</code> with test, fallback code, and overflow check:</p>

<pre><code class="language-nasm"># line_I
# i_times_jssd
    mov rsi, qword ptr [rbx]
    mov edx, 2141605935
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L2276
# mul with overflow check, imm RHS
    mov rax, rsi
    mov rcx, 133850370
    and rax, -16
    imul rax, rcx
    short jo L2276
    or rax, 15
    short jmp L2275
L2276:
    call 140439031220000
L2275:
    mov qword ptr [rbx], rax
</code></pre>

<p>The following is <code>+</code> with tests, fallback code, and overflow check:</p>

<pre><code class="language-nasm"># i_plus_ssjd
    mov rsi, qword ptr [rbx]
    mov rdx, qword ptr [rbx+8]
# are both operands small?
    mov eax, esi
    and eax, edx
    and al, 15
    cmp al, 15
    short jne L2278
# add with overflow check
    mov rax, rsi
    mov rcx, rdx
    and rcx, -16
    add rax, rcx
    short jno L2277
L2278:
    call 140439031219296
L2277:
    mov qword ptr [rbx], rax
</code></pre>

<h3 id="with-type-based-optimization">With type-based optimization</h3>

<p>When the compiler can figure out type information about the arguments
it can emit more efficient code.  One would like to add a guard
that restricts the argument to a 59 bit integer, but unfortunately
the compiler cannot yet make use of such a guard test.</p>

<p>But adding a redundant input bit mask to the Erlang code puts the compiler
on the right track.  This is a kludge, and will only be used
until the compiler has been improved to deduce the same information
from a guard instead.</p>

<p>The Erlang code now has a first redundant mask to 59 bits:</p>

<pre><code class="language-erlang">mwc59(CX0) -&gt;
    CX = CX0 band ((1 bsl 59)-1),
    C = CX band ((1 bsl 32)-1),
    X = CX bsr 32,
    16#7fa6502 * X + C.
</code></pre>

<p>With the default type-based optimizations in the compiler
of the OTP 25.0 release, the BEAM assembler then becomes:</p>

<pre><code class="language-text">    {gc_bif,'band',{f,0},1,[{x,0},{integer,576460752303423487}],{x,0}}.
    {gc_bif,'bsr',{f,0},1,[{tr,{x,0},{t_integer,{0,576460752303423487}}},
             {integer,32}],{x,1}}.
    {gc_bif,'band',{f,0},2,[{tr,{x,0},{t_integer,{0,576460752303423487}}},
             {integer,4294967295}],{x,0}}.
    {gc_bif,'*',{f,0},2,[{tr,{x,0},{t_integer,{0,4294967295}}},
             {integer,133850370}],{x,0}}.
    {gc_bif,'+',{f,0},2,[{tr,{x,0},{t_integer,{0,572367635452168875}}},
             {tr,{x,1},{t_integer,{0,134217727}}}],{x,0}}.
</code></pre>

<p>Note that after the initial input <code>band</code> operation,
type information <code>{tr,{x,_},{t_integer,Range}}</code> has been propagated
all the way down.</p>

<p>Now the JIT:ed code becomes noticeably shorter.</p>

<p>The input mask operation knows nothing about the value so it has
the operand test and the fallback to any term code:</p>

<pre><code class="language-nasm"># i_band_ssjd
    mov rsi, qword ptr [rbx]
    mov rax, 9223372036854775807
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L1816
    and rax, rsi
    short jmp L1817
L1816:
    call 139812177115776
L1817:
    mov qword ptr [rbx], rax
</code></pre>

<p>For all the following operations, operand tests and fallback code
has been optimized away to become a straight sequence of machine code:</p>

<pre><code class="language-nasm"># line_I
# i_bsr_ssjd
    mov rsi, qword ptr [rbx]
# skipped test for small left operand because it is always small
    mov rax, rsi
    sar rax, 32
    or rax, 15
L1818:
L1819:
    mov qword ptr [rbx+8], rax
# line_I
# i_band_ssjd
    mov rsi, qword ptr [rbx]
    mov rax, 68719476735
# skipped test for small operands since they are always small
    and rax, rsi
    mov qword ptr [rbx], rax
# line_I
# i_times_jssd
# multiplication without overflow check
    mov rax, qword ptr [rbx]
    mov esi, 2141605935
    and rax, -16
    sar rsi, 4
    imul rax, rsi
    or rax, 15
    mov qword ptr [rbx], rax
# i_plus_ssjd
# add without overflow check
    mov rax, qword ptr [rbx]
    mov rsi, qword ptr [rbx+8]
    and rax, -16
    add rax, rsi
    mov qword ptr [rbx], rax
</code></pre>

<p>The execution time goes down from 3.7 ns to 3.3 ns, which is
10% faster, just by avoiding redundant checks and tests,
despite the added, strictly unnecessary, initial input mask operation.</p>

<p>And there is room for improvement.  The values are moved back and forth
to BEAM <code>{x,_}</code> registers (<code>qword ptr [rbx]</code>) between operations.
Moving back from the <code>{x,_}</code> register could be avoided by the JIT
since it is possible to know that the value is in a process register.
Moving out to the <code>{x,_}</code> register could be optimized away if the compiler
would emit the information that the value will not be used
from the <code>{x,_}</code> register after the operation.</p>

<h2 id="implementing-a-prng">Implementing a PRNG</h2>

<p>To create a really fast PRNG in Erlang there are some
limitations coming with the language implementation:</p>

<ul>
  <li>If the generator state is a complex term, that is, a heap term,
instead of an immediate value, state updates get much slower.
Therefore the state should be a max 59-bit integer.</li>
  <li>If an intermediate result creates a bignum, that is,
overflows 59 bits, arithmetic operations get much slower,
so intermediate results must produce values that fit in 59 bits.</li>
  <li>If the generator returns both a generated value
and a new state in a compound term, then, again,
updating heap data makes it much slower.  Therefore
a generator should only return an immediate integer state.</li>
  <li>If the returned state integer cannot be used as a generated number,
then a separate value function that operates on the state
can be used.  Two calls, however, double the call overhead.</li>
</ul>

<h3 id="lcg-and-mcg">LCG and MCG</h3>

<p>The first attempt was to try a classical power of 2
Linear Congruential Generator:</p>

<pre><code class="language-erlang">X1 = (A * X0 + C) band (P-1)
</code></pre>

<p>And a Multiplicative Congruential Generator:</p>

<pre><code class="language-erlang">X1 = (A * X0) rem P
</code></pre>

<p>To avoid bignum operations the product <code>A * X0</code>
must fit in 59 bits. The classical paper “Tables of
Linear Congruential Generators of Different Sizes and
Good Lattice Structure” by Pierre L’Ecuyer lists two generators
that are 35 bit, that is, an LCG with <code>P</code> = 2<sup>35</sup>
and an MCG with <code>P</code> being a prime number just below 2<sup>35</sup>.
These were the largest generators to be found for which
the multiplication did not overflow 59 bits.</p>

<p>The speed of the LCG is very good.  The MCG is less so, since it has
to do an integer division with <code>rem</code>, but thanks to <code>P</code> being
close to 2<sup>35</sup> that could be optimized so that it ended up
only about 50% slower than the LCG.</p>

<p>The short period and known quirks of a power of 2 LCG unfortunately
showed in <a href="#prng-tests">PRNG tests</a>.</p>

<p>They failed miserably.</p>

<h3 id="mwc">MWC</h3>

<p><a href="https://vigna.di.unimi.it/">Sebastiano Vigna</a> of the University of Milano, who also helped
design our current 58-bit Xorshift family generators,
suggested using a Multiply With Carry generator instead:</p>

<pre><code class="language-erlang">T  = A * X0 + C0,
X1 = T band ((1 bsl Bits)-1),
C1 = T bsr Bits.
</code></pre>

<p>This generator operates on “digits” of size <code>Bits</code>, and if a digit
is half a machine word then the multiplication does not overflow.
Instead of having the state as a digit <code>X</code> and a carry <code>C</code>, these
can be merged into a single state <code>T</code>.  We get:</p>

<pre><code class="language-erlang">X  = T0 band ((1 bsl Bits)-1),
C  = T0 bsr Bits,
T1 = A * X + C
</code></pre>

<p>An MWC generator is actually a different form of an MCG generator
with a power of 2 multiplier, so this is an equivalent generator:</p>

<pre><code class="language-erlang">T0 = (T1 bsl Bits) rem ((A bsl Bits) - 1)
</code></pre>

<p>In this form the generator updates the state in the reverse order,
hence <code>T0</code> and <code>T1</code> are swapped.  The modulus <code>(A bsl Bits) - 1</code>
has to be a safe prime number or else the generator
does not have maximum period.</p>

<h4 id="the-base-generator">The base generator</h4>

<p>Because the multiplier (or its multiplicative inverse) is a power of 2,
the MWC generator gets bad <a href="#spectral-score">Spectral score</a> in 3 dimensions,
so using a scrambling function on the state to get a number would
be necessary to improve the quality.</p>

<p>A search for a suitable digit size and multiplier started,
mostly done by using programs that try multipliers for
safe prime numbers and estimate spectral scores, such as <a href="https://github.com/vigna/CPRNG/">CPRNG</a>.</p>

<p>When the generator is balanced, that is, the multiplier <code>A</code>
has got close to <code>Bits</code> bits, the spectral scores are the best,
apart from the known problem in 3 dimensions.  But since a scrambling
function would be needed anyway there was an opportunity to
try to generate a comfortable 32-bit digit using a 27-bit multiplier.
With these sizes the product <code>A * X0</code> does not create a bignum,
and with a 32-bit digit it becomes possible to use standard
<a href="#prng-tests">PRNG tests</a> to test the generator during development.</p>

<p>Because of using such slightly unbalanced parameters, unfortunately
the spectral scores for 2 dimensions also get bad, but the scrambler
could solve that too…</p>

<p>The final generator is:</p>

<pre><code class="language-erlang">mwc59(T) -&gt;
    C = T bsr 32,
    X = T band ((1 bsl 32)-1),
    16#7fa6502 * X + C.
</code></pre>

<p>The 32-bit digits of this base generator do not perform very
well in <a href="#prng-tests">PRNG tests</a>, but actually the low 16 bits pass
2 TB in <a href="http://pracrand.sourceforge.net/">PractRand</a> and 1 TB with the bits reversed,
which is surprisingly good.  The problem of bad spectral scores
for 2 and 3 dimensions lies in the higher bits of the MWC digit.</p>

<h4 id="scrambling">Scrambling</h4>

<p>The scrambler has to be fast, that is, use only a few
fast operations.  For an arithmetic generator like this,
Xorshift is a suitable scrambler.  We looked at single
Xorshift, double Xorshift and double XorRot.  Double XorRot
was slower than double Xorshift but not better,
probably since the generator has got good low bits, so they
need to be shifted up to improve the high bits.
Rotating high bits down to the low ones is no improvement.</p>

<p>This is a single Xorshift scrambler:</p>

<pre><code class="language-erlang">V = T bxor (T bsl Shift)
</code></pre>

<p>When trying <code>Shift</code> constants it turned out that with a large
shift constant the generator performed better in <a href="http://pracrand.sourceforge.net/">PractRand</a>,
and with a small one it performed better in birthday spacing tests
(such as in <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a> BigCrush) and collision tests.
Alas, it was not possible to find a constant good for both.</p>

<p>The chosen single Xorshift constant is <code>8</code>, which passes
4 TB in <a href="http://pracrand.sourceforge.net/">PractRand</a> and BigCrush in <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a>, but fails
more thorough birthday spacing tests.  The failures are few,
such as the lowest bit in 8 and 9 dimensions,
and some intermediate bits in 2 and 3 dimensions.
This is unlikely to affect most applications,
and if the high bits of the 32 generated bits are used,
these imperfections should stay under the rug.</p>

<p>The final scrambler has to avoid bignum operations
and masks the value to 32 bits so it looks like this:</p>

<pre><code class="language-erlang">mwc59_value32(T) -&gt;
    V0 = T  band ((1 bsl 32)-1),
    V1 = V0 band ((1 bsl (32-8))-1),
    V0 bxor (V1 bsl 8).
</code></pre>

<p>A better scrambler would be a double Xorshift that can
have both a small shift and a large shift.
Using the small shift <code>4</code> makes the combined generator
do very well in birthday spacings and collision tests,
and following up with a large shift <code>27</code> shifts the
whole improved 32-bit MWC digit all the way up
to the top bit of the generator’s 59-bit state.
That was the idea, and it turned out to work fine.</p>

<p>The double Xorshift scrambler produces a 59-bit
number where the low, the high, reversed low,
reversed high, etc… all perform very well in <a href="http://pracrand.sourceforge.net/">PractRand</a>,
<a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a> BigCrush, and in exhaustive birthday spacing
and collision tests.  It is also not terribly much slower
than the single Xorshift scrambler.</p>

<p>Here is a double Xorshift scrambler 4 then 27:</p>

<pre><code class="language-erlang">V1 = T bxor (T bsl 4),
V  = V1 bxor (V1 bsl 27).
</code></pre>

<p>Which, avoiding bignum operations and producing a 59-bit value,
becomes the final scrambler:</p>

<pre><code class="language-erlang">mwc59_value(T) -&gt;
    V0 = T  band ((1 bsl (59-4))-1),
    V1 = T  bxor (V0 bsl 4),
    V2 = V1 band ((1 bsl (59-27))-1),
    V1 bxor (V2 bsl 27).
</code></pre>

<p>Many thanks to <a href="https://vigna.di.unimi.it/">Sebastiano Vigna</a> who has done most of
(practically all) the parameter searching and extensive testing
of the generator and scramblers, backed by knowledge of what could work.
Using an MWC generator in this particular way is rather uncharted
territory regarding the math, so extensive testing is
the way to trust the quality of the generator.</p>

<h2 id="rand_suitemeasure1"><code>rand_SUITE:measure/1</code></h2>

<p>The test suite for the <code>rand</code> module — <a href="https://github.com/erlang/otp/blob/master/lib/stdlib/test/rand_SUITE.erl"><code>rand_SUITE</code></a>,
in the Erlang/OTP source tree, contains a test case <a href="https://github.com/erlang/otp/blob/08f343bed4f75bf345b04b4c1fac7e1026a50ab3/lib/stdlib/test/rand_SUITE.erl#L1064"><code>measure/1</code></a>.
This test case is a micro-benchmark of all the algorithms
in the <code>rand</code> module, and some more.  It measures the execution
time in nanoseconds per generated number, and presents the
times both absolute and relative to the default algorithm
<code>exsss</code> that is considered to be 100%.  See <a href="#measurement-results">Measurement Results</a>.</p>

<p><a href="https://github.com/erlang/otp/blob/08f343bed4f75bf345b04b4c1fac7e1026a50ab3/lib/stdlib/test/rand_SUITE.erl#L1064"><code>measure/1</code></a> is also runnable without a test framework.
As long as <code>rand_SUITE.beam</code> is in the code path
<code>rand_SUITE:measure(N)</code> will run the benchmark with <code>N</code>
as an effort factor.  <code>N = 1</code> is the default and
for example <code>N = 5</code> gives a slower
and more thorough measurement.</p>

<p>The test case is divided in sections where each first runs
a warm-up with the default generator, then runs an empty
benchmark generator to measure the benchmark overhead,
and after that runs all generators for the specific section.
The benchmark overhead is subtracted from the presented
results after the overhead run.</p>

<p>The warm-up and overhead measurement &amp; compensation are
recent improvements to the <code>measure/1</code> test case.
Overhead has also been reduced by in-lining 10 PRNG iterations
per test case loop iteration, which got the overhead down to
one third of what it was without such in-lining, and the overhead is now
about as large as the fastest generator itself, approaching the
function call overhead in Erlang.</p>

<p>The different <code>measure/1</code> sections are different use cases such as
“uniform integer half range + 1”, etc.  Many of these test the performance
of plug-in framework features.  The test sections that are interesting
for this text are “uniform integer range 10000”, “uniform integer 32-bit”,
and “uniform integer full range”.</p>

<h2 id="measurement-results">Measurement results</h2>

<p>Here are some selected results from the author’s laptop
from running <code>rand_SUITE:measure(20)</code>:</p>

<p>The <code>{mwc59,Tag}</code> generator is <code>rand:mwc59/1</code>, where
<code>Tag</code> indicates whether the <code>raw</code> generator, the <code>rand:mwc59_value32/1</code>,
or the <code>rand:mwc59_value/1</code> scrambler was used.</p>

<p>The <code>{exsp,_}</code> generator is <code>rand:exsp_next/1</code> which
is a newly exported internal function that does not use
the plug-in framework.  When called from the plug-in
framework it is called <code>exsp</code> below.</p>

<p><code>unique_phash2</code> is <code>erlang:phash2(erlang:unique_integer(), Range)</code>.</p>

<p><code>system_time</code> is <code>os:system_time(microsecond)</code>.</p>

<pre><code class="language-text">RNG uniform integer range 10000 performance
                   exsss:     57.5 ns (warm-up)
                overhead:      3.9 ns      6.8%
                   exsss:     53.7 ns    100.0%
                    exsp:     49.2 ns     91.7%
         {mwc59,raw_mod}:      9.8 ns     18.2%
       {mwc59,value_mod}:     18.8 ns     35.0%
              {exsp,mod}:     22.5 ns     41.9%
          {mwc59,raw_tm}:      3.5 ns      6.5%
      {mwc59,value32_tm}:      8.0 ns     15.0%
        {mwc59,value_tm}:     11.7 ns     21.8%
               {exsp,tm}:     18.1 ns     33.7%
           unique_phash2:     23.6 ns     44.0%
             system_time:     30.7 ns     57.2%
</code></pre>

<p>The first two are the warm-up and overhead measurements.
The measured overhead is subtracted from all measurements
after the “overhead:” line.  The measured overhead here
is 3.9 ns which matches well that <code>exsss</code> measures
3.8 ns more during the warm-up run than after <code>overhead</code>.
The warm-up run is, however, a bit unpredictable.</p>

<p><code>{_,*mod}</code> and <code>system_time</code> all use <code>(X rem 10000) + 1</code>
to achieve the desired range.  The <code>rem</code> operation is expensive,
which we will see when comparing with the next section.</p>

<p><code>{_,*tm}</code> use truncated multiplication to achieve the range,
that is <code>((X * 10000) bsr GeneratorBits) + 1</code>,
which is much faster than using <code>rem</code>.</p>

<p><code>erlang:phash2/2</code> has got a range argument that performs
the <code>rem 10000</code> operation in the BIF, which is fairly cheap,
as we also will see when comparing with the next section.</p>

<pre><code class="language-text">RNG uniform integer 32 bit performance
                   exsss:     55.3 ns    100.0%
                    exsp:     51.4 ns     93.0%
        {mwc59,raw_mask}:      2.7 ns      4.9%
         {mwc59,value32}:      6.6 ns     12.0%
     {mwc59,value_shift}:      8.6 ns     15.5%
            {exsp,shift}:     16.6 ns     30.0%
           unique_phash2:     22.1 ns     40.0%
             system_time:     23.5 ns     42.6%
</code></pre>

<p>In this section, to generate a number in a 32-bit range,
<code>{mwc59,raw_mask}</code> and <code>system_time</code> use a bit mask
<code>X band 16#ffffffff</code>, <code>{_,*shift}</code> use <code>bsr</code>
to shift out the low bits, and <code>{mwc59,value32}</code> produces
the right range by construction.  Here we see that bit operations
are up to 10 ns faster than the <code>rem</code> operation in the previous section.
<code>{mwc59,raw_*}</code> is more than 3 times faster.</p>

<p>Compared to the truncated multiplication variants in the previous section,
the bit operations here are up to 3 ns faster.</p>
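<p>Expressed in Erlang, the bit-operation variants look like this
(a sketch; the shift count assumes a 59-bit raw value):</p>

<pre><code class="language-erlang">%% Keep the low 32 bits:
mask_32bit(V) -&gt; V band 16#ffffffff.

%% Keep the high 32 bits of a 59-bit value:
shift_32bit(V) -&gt; V bsr 27.
</code></pre>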

<p><code>unique_phash2</code> still uses BIF coded integer division to achieve
the range, which gives it about the same speed as in the previous section,
but it seems integer division with a power of 2 is a bit faster.</p>

<pre><code class="language-text">RNG uniform integer full range performance
                   exsss:     45.1 ns    100.0%
                    exsp:     39.8 ns     88.3%
                   dummy:     25.5 ns     56.6%
             {mwc59,raw}:      3.7 ns      8.3%
         {mwc59,value32}:      6.9 ns     15.2%
           {mwc59,value}:      8.5 ns     18.8%
             {exsp,next}:     16.8 ns     37.2%
       {splitmix64,next}:    331.1 ns    734.3%
           unique_phash2:     21.1 ns     46.8%
                procdict:     75.2 ns    166.7%
        {mwc59,procdict}:     16.6 ns     36.8%
</code></pre>

<p>In this section no range capping is done.  The raw generator output is used.</p>

<p>Here we have the <code>dummy</code> generator, which is an undocumented generator
within the <code>rand</code> plug-in framework that only does a minimal state
update and returns a constant.  It is used here to measure
plug-in framework overhead.</p>

<p>The plug-in framework overhead is measured at 25.5 ns, which matches
<code>exsp</code> - <code>{exsp,next}</code> = 23.0 ns fairly well;
that is the difference between running the same algorithm within and
outside the plug-in framework, giving another measure of the framework overhead.</p>

<p><code>procdict</code> is the default algorithm <code>exsss</code> but makes the plug-in
framework store the generator state in the process dictionary,
which here costs 30 ns.</p>

<p><code>{mwc59,procdict}</code> stores the generator state in the process dictionary,
which here costs 12.9 ns. The state term that is stored is much smaller
than for the plug-in framework.  Compare to <code>procdict</code>
in the previous paragraph.</p>
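<p>For reference, here is roughly what the two process dictionary
variants do, written out by hand.  This is a sketch using the
documented <code>rand</code> API; the dictionary key <code>mwc59_state</code> is made up
for the example:</p>

<pre><code class="language-erlang">%% Plug-in framework state in the process dictionary
%% (seeded automatically on first use):
procdict_uniform() -&gt;
    rand:uniform(10000).

%% Hand-rolled process dictionary storage for mwc59; the stored state
%% is a bare integer, much smaller than the framework's state term:
mwc59_procdict_uniform() -&gt;
    CX0 = case get(mwc59_state) of
              undefined -&gt; rand:mwc59_seed();
              CX -&gt; CX
          end,
    CX1 = rand:mwc59(CX0),
    put(mwc59_state, CX1),
    (rand:mwc59_value(CX1) rem 10000) + 1.
</code></pre>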

<h2 id="summary">Summary</h2>

<p>The <a href="#write-a-simple-prng">new fast</a> generator’s functions in the <code>rand</code> module
fills a niche for speed over quality where the type-based
<a href="#jit-optimizations">JIT optimizations</a> have elevated the performance.</p>

<p>The combination of high speed and high quality can only
be fulfilled with a <a href="#write-a-bif">BIF implementation</a>, but we hope that
is a combination we will not need to address…</p>

<p><a href="#implementing-a-prng">Implementing a PRNG</a> is tricky business.</p>

<p>Recent improvements in <a href="#rand_suitemeasure1"><code>rand_SUITE:measure/1</code></a>
highlight what the precious CPU cycles are used for.</p>]]></content><author><name>Raimo Niskanen</name></author><category term="BEAM" /><category term="JIT," /><category term="PRNG," /><category term="rand," /><category term="random" /><summary type="html"><![CDATA[When you need “random” integers, and it is essential to generate them fast and cheap, then maybe the full-featured Pseudo Random Number Generators in the rand module are overkill. This blog post dives into new additions to the said module, how the Just-In-Time compiler optimizes them, known tricks, and tries to compare these apples and potatoes.]]></summary></entry><entry><title type="html">Type-Based Optimizations in the JIT</title><link href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/" rel="alternate" type="text/html" title="Type-Based Optimizations in the JIT" /><published>2022-04-26T00:00:00+00:00</published><updated>2022-04-26T00:00:00+00:00</updated><id>https://www.erlang.org/blog/type-based-optimizations-in-the-jit</id><content type="html" xml:base="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/"><![CDATA[<p>This post explores the new type-based optimizations in Erlang/OTP 25 where
the compiler embeds type information in the BEAM files to help the
<a href="https://www.erlang.org/blog/a-first-look-at-the-jit/">JIT (Just-In-Time compiler)</a> to generate better code.</p>

<h3 id="the-best-of-both-worlds">The best of both worlds</h3>

<p>The <a href="https://blog.erlang.org/ssa-history/">SSA-based compiler passes</a> introduced in OTP 22 does a
sophisticated type analysis, which allows for more optimizations and
better code generation. There are, however, limits to what kind of
optimizations the Erlang compiler can do because a BEAM file must be
possible to load on any BEAM machine running on a 32-bit or 64-bit
computer. Therefore, the compiler cannot do optimizations that depend
on the size of integers that fit in a machine word or on how
Erlang terms are represented.</p>

<p>The JIT (introduced in OTP 24) knows that it is running on a 64-bit
computer and knows how Erlang terms are represented. The JIT is still
limited in how much optimization it can do because it translates a
single BEAM instruction at a time. For example, the <code>+</code> operator can
add floats or integers of any size or any combination
thereof. Previously executed BEAM instructions might have made it
clear that the operands can only be small integers, but the JIT does
not know that since it only looks at one instruction at a time, and
therefore it must emit native code that handles all possible operands.</p>

<p>In OTP 25, the compiler has been updated to embed type information in
the BEAM file and the JIT has been extended to emit better code based
on that type information.</p>

<p>The embedded type information is versioned so that we can continue to
improve the type-based optimizations in every OTP release. The loader
will ignore versions it does not recognize so that the module can
still be loaded without the type-based optimizations.</p>

<h3 id="what-to-expect-of-the-jit-in-otp-25">What to expect of the JIT in OTP 25</h3>

<p>OTP 25 is just the beginning for type-based optimizations. We hope to
improve both the type information from the compiler and the
optimizations in the JIT in OTP 26.</p>

<p>How much better the native code emitted by the JIT will be depends
on the nature of the code in the module.</p>

<p>The most commonly applied optimization is simplified tests. For
example, a test for a tuple can frequently be reduced from 5
instructions down to 3 instructions, and a test for small integer
operands can frequently be reduced from 5 instructions down to 4
instructions.</p>

<p>Less commonly applied but more significant are the simplifications
that can be made when an integer is known to be “small” (fits in 60
bits). For example, a relational operator (such as <code>&lt;</code>) used in a
guard can be reduced from 11 instructions down to 4 if the operands
are known to be small integers. This kind of optimization is most
often applied in modules that use binary pattern matching because
integers matched out from a binary have a well-defined range.</p>

<p>In the Erlang/OTP code base, the first kind of optimization (shaving
off one or two instructions) is applied roughly ten times as often as
the second kind.</p>

<p>We will see later in this blog post that the optimizations of the
second kind applied to the <code>base64</code> module resulted in a significant
speed up.</p>

<h3 id="simplifications-of-type-tests">Simplifications of type tests</h3>

<p>Let’s dive right into some examples.</p>

<p>Consider this module:</p>

<pre><code class="language-erlang">-module(example).
-export([tuple_matching/1]).

tuple_matching(X) -&gt;
    case increment(X) of
        {ok,Result} -&gt; Result;
        error -&gt; X
    end.

increment(X) when is_integer(X) -&gt; {ok,X+1};
increment(_) -&gt; error.
</code></pre>

<p>The <a href="https://www.erlang.org/blog/a-brief-beam-primer">BEAM code</a> for the <code>tuple_matching/1</code> function emitted
by the compiler in OTP 24 is (somewhat simplified):</p>

<pre><code>    {allocate,1,1}.
    {move,{x,0},{y,0}}.
    {call,1,{f,5}}.
    {test,is_tuple,{f,3},[{x,0}]}.
    {get_tuple_element,{x,0},1,{x,0}}.
    {deallocate,1}.
    return.
  {label,3}.
    {move,{y,0},{x,0}}.
    {deallocate,1}.
    return.
</code></pre>

<p>The compiler has figured out that <code>increment/1</code> returns either the
atom <code>error</code> or a two-tuple with <code>ok</code> as the first element. Therefore,
to distinguish between those two possible return values, a single
instruction suffices:</p>

<pre><code>    {test,is_tuple,{f,3},[{x,0}]}.
</code></pre>

<p>There is no need to explicitly test for the value <code>error</code> because it
<strong>must</strong> be <code>error</code> if it is not a tuple. Similarly, there is no need
to test that the first element of the tuple is <code>ok</code> because it must be.</p>

<p>In OTP 24, the JIT translates that instruction to a sequence of 5 native
instructions for x86_64:</p>

<pre><code class="language-nasm"># i_is_tuple_fs
    mov rsi, qword ptr [rbx]
    rex test sil, 1
    jne L2
    test byte ptr [rsi-2], 63
    jne L2
</code></pre>

<p>(Lines starting with <code>#</code> are comments.)</p>

<p>The <code>mov</code> instruction fetches the value of the BEAM register <code>{x,0}</code>
to the CPU register <code>rsi</code>. The next two instructions test whether the
term is a pointer to an object on the heap. If it is, the header word
for the heap object is tested to make sure it is a tuple. The second
test is needed because the heap object could be some other Erlang term,
such as a binary, a map, or an integer that does not fit in a machine
word.</p>

<p>Now let’s see what the compiler and the JIT in OTP 25 do with this
instruction. The BEAM code is now:</p>

<pre><code class="language-text">    {test,is_tuple,
          {f,3},
          [{tr,{x,0},
               {t_union,{t_atom,[error]},
                        none,none,
                        [{{2,{t_atom,[ok]}},
                          {t_tuple,2,true,
                                   #{1 =&gt; {t_atom,[ok]},
                                     2 =&gt; {t_integer,any}}}}],
                        none}}]}.
</code></pre>

<p>The operand that was <code>{x,0}</code> in OTP 24 is now a tuple:</p>

<pre><code class="language-erlang">{tr,Register,Type}
</code></pre>

<p>That is, it is a three-tuple with <code>tr</code> as the first element. <code>tr</code>
stands for <strong>typed register</strong>. The second element is the BEAM register
(<code>{x,0}</code> in this case), and the third element is the type of the
register in the compiler’s internal type representation. The type
is equivalent to the following type spec:</p>

<pre><code class="language-erlang">'error' | {'ok', integer()}
</code></pre>

<p>The JIT cannot take advantage of that level of detail in the types,
so the compiler embeds a <a href="https://github.com/erlang/otp/blob/de5bb49320db22159de52e677c5f7499b763b0cd/lib/compiler/src/beam_types.erl#L1153-L1241">simplified
version</a>
of that type into the BEAM file. The embedded type is equivalent to:</p>

<pre><code class="language-erlang">atom() | tuple()
</code></pre>

<p>By knowing that <code>{x,0}</code> must be an atom or a tuple, the JIT in OTP 25
emits the following simplified native code:</p>

<pre><code class="language-nasm"># i_is_tuple_fs
    mov rsi, qword ptr [rbx]
# simplified tuple test since the source is always a tuple when boxed
    rex test sil, 1
    jne label_3
</code></pre>

<p>(The JIT generally emits a comment when type information made a simplification
possible.)</p>

<p>Only the first test is now necessary, because if the term is a pointer
to a heap object, according to the type information, it <strong>must</strong> be a tuple.</p>

<h3 id="simplification-of-relational-operators">Simplification of relational operators</h3>

<p>As another example, let’s look at how the relational operators in
guards are translated. Consider this function:</p>

<pre><code class="language-erlang">my_less_than(A, B) -&gt;
    if
        A &lt; B -&gt; smaller;
        true -&gt; larger_or_equal
    end.
</code></pre>

<p>The BEAM code looks like this:</p>

<pre><code>    {test,is_lt,{f,9},[{x,0},{x,1}]}.
    {move,{atom,smaller},{x,0}}.
    return.
  {label,9}.
    {move,{atom,larger_or_equal},{x,0}}.
    return.
</code></pre>

<p>When relational operators are used as guard tests, the compiler rewrites
them as special instructions. Thus, the <code>&lt;</code> operator is rewritten to an
<code>is_lt</code> instruction.</p>

<p>The <code>&lt;</code> operator can compare any Erlang terms. It would be impractical
for the JIT to emit the code to handle all kinds of terms. Therefore, the
JIT emits code that directly handles the most common case and
calls a generic routine to handle everything else:</p>

<pre><code class="language-nasm"># is_lt_fss
    mov rsi, qword ptr [rbx+8]
    mov rdi, qword ptr [rbx]
    mov eax, edi
    and eax, esi
    and al, 15
    cmp al, 15
    short jne L39
    cmp rdi, rsi
    short jmp L40
L39:
    call 5447639136
L40:
    jge label_9
</code></pre>

<p>Let’s walk through the code. The first two instructions:</p>

<pre><code class="language-nasm">    mov rsi, qword ptr [rbx+8]
    mov rdi, qword ptr [rbx]
</code></pre>

<p>fetch the BEAM registers <code>{x,1}</code> and <code>{x,0}</code> into CPU registers.</p>

<p>The most common comparison is between two integers. Depending on the
magnitude, integers can be represented in two different ways. On a 64-bit
computer, signed integers that fit in 60 bits are stored directly
in a 64-bit word. The remaining 4 bits in the word are used for the
<a href="http://www.it.uu.se/research/publications/reports/2000-029/2000-029-nc.pdf">tag</a>, which for a small integer is <code>15</code>. If the integer does
not fit, it is represented as a <strong>bignum</strong>, which is a pointer to
an object on the heap.</p>
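<p>As an illustration of the tagging scheme just described (a sketch,
not VM code), a non-negative small integer <code>N</code> is stored as the word
<code>(N bsl 4) bor 15</code>:</p>

<pre><code class="language-erlang">%% Tagged word for a non-negative small integer (sketch):
tag_small(N) when 0 =&lt; N, N &lt; (1 bsl 59) -&gt;
    (N bsl 4) bor 15.
</code></pre>

<p>For example, <code>tag_small(1023)</code> is <code>16383</code>, which is exactly the
tagged immediate for <code>16#3FF</code> that we will see the JIT load in a
later example in this post.</p>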

<p>Here is the native code for testing that both operands are small:</p>

<pre><code class="language-nasm">    mov eax, edi
    and eax, esi
    and al, 15
    cmp al, 15
    short jne L39
</code></pre>

<p>If one or both of the operands have another tag than <code>15</code> (are not
small integers), control is transferred to code at label <code>L39</code> that
handles all other types of terms.</p>

<p>The next lines do the comparison of the small integers. The code is
written in a slightly convoluted way so that the conditional jump
(<code>jge label_9</code>) that transfers control to the failure label can be
shared with the generic code:</p>

<pre><code class="language-nasm">    cmp rdi, rsi
    short jmp L40
L39:
    call 5447639136
L40:
    jge label_9
</code></pre>

<p>Thus, without type information, 11 instructions are needed to implement
<code>is_lt</code>.</p>

<p>Now let’s see what happens when types are available:</p>

<pre><code class="language-erlang">my_less_than(A, B) when is_integer(A), is_integer(B) -&gt;
    .
    .
    .
</code></pre>

<p>When compiled by the compiler in OTP 25, the BEAM code is:</p>

<pre><code>    {test,is_integer,{f,7},[{x,0}]}.
    {test,is_integer,{f,7},[{x,1}]}.
    {test,is_lt,{f,9},[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}]}.
    {move,{atom,smaller},{x,0}}.
    return.
  {label,9}.
    {move,{atom,larger_or_equal},{x,0}}.
    return.
</code></pre>

<p>The operands for the <code>is_lt</code> instruction now have types. The BEAM
registers <code>{x,0}</code> and <code>{x,1}</code> have the type <code>{t_integer,any}</code>, which
means an integer with an unknown range.</p>

<p>Having that knowledge of the types, the JIT can emit a slightly
shorter test for a small integer:</p>

<pre><code class="language-nasm"># simplified small test since all other types are boxed
    mov eax, edi
    and eax, esi
    test al, 1
    short je L39
</code></pre>

<p>To do a better job, the JIT will need better type information. For example:</p>

<pre><code class="language-erlang">map_size_less_than(Map1, Map2) -&gt;
    if
        map_size(Map1) &lt; map_size(Map2) -&gt; smaller;
        true -&gt; larger_or_equal
    end.
</code></pre>

<p>The BEAM code looks like this:</p>

<pre><code>    {gc_bif,map_size,{f,12},2,[{x,0}],{x,0}}.
    {gc_bif,map_size,{f,12},2,[{x,1}],{x,1}}.
    {test,is_lt,
          {f,12},
          [{tr,{x,0},{t_integer,{0,288230376151711743}}},
           {tr,{x,1},{t_integer,{0,288230376151711743}}}]}.
    {move,{atom,smaller},{x,0}}.
    return.
  {label,12}.
    {move,{atom,larger_or_equal},{x,0}}.
    return.
</code></pre>

<p>Both operands for <code>is_lt</code> now have the type
<code>{t_integer,{0,288230376151711743}}</code>, meaning an integer in the range
0 through 288230376151711743 (that is, <code>(1 bsl 58) - 1</code>). There is no
documented upper limit for the number of elements in a map, but for
the foreseeable future, there is no way that the number of elements in
a map will exceed or even get close to 288230376151711743.</p>

<p>Since both the lower and upper bounds for <code>{x,0}</code> and <code>{x,1}</code> fit in
60 bits, there is no need to test the type of the operands:</p>

<pre><code class="language-nasm"># is_lt_fss
    mov rsi, qword ptr [rbx+8]
    mov rdi, qword ptr [rbx]
# skipped test for small operands since they are always small
    cmp rdi, rsi
L42:
L43:
    jge label_12
</code></pre>

<p>Since the operands are always small, the call to the generic routine
(following label <code>L42</code>) has been omitted.</p>

<h3 id="simplification-of-addition">Simplification of addition</h3>

<p>Looking at arithmetic instructions, we will see the potential for nice
simplifications by the JIT, but unfortunately we will also see the
limitations of the type analysis done by the Erlang compiler in
OTP 25.</p>

<p>Let’s look at the generated code for this function:</p>

<pre><code class="language-erlang">add1(X, Y) -&gt;
    X + Y.
</code></pre>

<p>The BEAM code looks like this:</p>

<pre><code>    {gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
    return.
</code></pre>

<p>The JIT translates the <code>+</code> instruction to the following native instructions:</p>

<pre><code class="language-nasm"># i_plus_ssjd
    mov rsi, qword ptr [rbx]
    mov rdx, qword ptr [rbx+8]
# are both operands small?
    mov eax, esi
    and eax, edx
    and al, 15
    cmp al, 15
    short jne L15
# add with overflow check
    mov rax, rsi
    mov rcx, rdx
    and rcx, -16
    add rax, rcx
    short jno L14
L15:
    call 4328985696
L14:
    mov qword ptr [rbx], rax
</code></pre>

<p>The first two instructions:</p>

<pre><code class="language-nasm">    mov rsi, qword ptr [rbx]
    mov rdx, qword ptr [rbx+8]
</code></pre>
<p>load the operands for the <code>+</code> operation from the BEAM registers into CPU registers.</p>

<p>The next 5 instructions test for small operands:</p>

<pre><code class="language-nasm"># are both operands small?
    mov eax, esi
    and eax, edx
    and al, 15
    cmp al, 15
    short jne L15
</code></pre>

<p>The code is almost identical to the code in the <code>is_lt</code> instruction
that we examined earlier. The only difference is that other CPU
registers are used. If one or both of the operands are not small
integers, a jump is made to label <code>L15</code>, which looks like this:</p>

<pre><code class="language-nasm">L15:
    call 4328985696
</code></pre>

<p>This code calls a generic routine that can add any combination of
smalls, bignums, or floats. The generic routine also handles
non-number operands by raising a <code>badarith</code> exception.</p>

<p>If both operands indeed are smalls, the following code adds them and
checks for overflow:</p>

<pre><code class="language-nasm"># add with overflow check
    mov rax, rsi
    mov rcx, rdx
    and rcx, -16
    add rax, rcx
    short jno L14
</code></pre>

<p>If the addition overflowed, the generic addition routine is
called. Otherwise, control is transferred to the following
instruction:</p>

<pre><code class="language-nasm">    mov qword ptr [rbx], rax
</code></pre>

<p>which stores the result in <code>{x,0}</code>.</p>

<p>To summarize, the addition itself (including dealing with the <a href="http://www.it.uu.se/research/publications/reports/2000-029/2000-029-nc.pdf">tags</a>) requires
4 instructions. However, 10 more instructions are needed to:</p>

<ul>
  <li>Fetch the operands from BEAM registers.</li>
  <li>Check that the operands are small integers (at most 60 bits).</li>
  <li>Call the generic addition routine.</li>
  <li>Store the result to a BEAM register.</li>
</ul>

<p>Now let’s see what happens if types are introduced.</p>

<p>Consider:</p>

<pre><code class="language-erlang">add2(X0, Y0) -&gt;
    X = 2 * X0,
    Y = 2 * Y0,
    X + Y.
</code></pre>

<p>The BEAM code looks like:</p>

<pre><code>    {gc_bif,'*',{f,0},2,[{x,0},{integer,2}],{x,0}}.
    {gc_bif,'*',{f,0},2,[{x,1},{integer,2}],{x,1}}.
    {gc_bif,'+',{f,0},2,[{tr,{x,0},number},{tr,{x,1},number}],{x,0}}.
    return.
</code></pre>

<p>Types are propagated from arithmetic instructions to other arithmetic
instructions. Because the result of <code>*</code> (if it succeeds) is a number
(integer or float), the operands for the <code>+</code> instruction now have the
type <code>number</code>.</p>

<p>Based on our experience of adding types to the <code>&lt;</code> operator, we might
guess that we would save only one instruction in the type test. We
would be right:</p>

<pre><code class="language-nasm"># simplified test for small operands since both are numbers
    mov eax, esi
    and eax, edx
    test al, 1
    short je L22
</code></pre>

<p>Returning to the simpler example with addition and no multiplication,
let’s add a guard to ensure that <code>X</code> and <code>Y</code> are integers:</p>

<pre><code class="language-erlang">add3(X, Y) when is_integer(X), is_integer(Y) -&gt;
    X + Y.
</code></pre>

<p>That results in the following BEAM code:</p>

<pre><code>    {test,is_integer,{f,5},[{x,0}]}.
    {test,is_integer,{f,5},[{x,1}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
            {x,0}}.
    return.
</code></pre>

<p>The types for both operands are now <code>{t_integer,any}</code>. However, that
will still result in the same simplified four-instruction sequence for
testing small integers, because the integers might not fit in 60 bits.</p>

<p>Clearly, based on our experience with <code>is_lt</code>, we will need to establish
a range for <code>X</code> and <code>Y</code>. A reasonable way to do that would be:</p>

<pre><code class="language-erlang">add4(X, Y) when is_integer(X), 0 =&lt; X, X &lt; 16#400,
                is_integer(Y), 0 =&lt; Y, Y &lt; 16#400 -&gt;
    X + Y.
</code></pre>

<p>However, because of limitations in the compiler’s value range analysis,
the types for the <code>+</code> operator will <strong>not</strong> improve:</p>

<pre><code>    {test,is_integer,{f,19},[{x,0}]}.
    {test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
    {test,is_lt,{f,19},[{tr,{x,0},{t_integer,any}},{integer,1024}]}.
    {test,is_integer,{f,19},[{x,1}]}.
    {test,is_ge,{f,19},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
    {test,is_lt,{f,19},[{tr,{x,1},{t_integer,any}},{integer,1024}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
            {x,0}}.
    return.
</code></pre>

<p>To add insult to injury, the first 6 instructions cannot be simplified
by the JIT because there is not sufficient type information. That is,
the <code>is_lt</code> and <code>is_ge</code> instructions will comprise 11 instructions each.</p>

<p>We aim to improve the type analysis and optimizations in OTP 26 and
generate better code for this example. We are also considering adding
a new guard BIF in OTP 26 for testing that a term is an integer in a
given range.</p>

<p>In the meantime, while we wait for OTP 26, there is a way in
OTP 25 to write an equivalent guard that results in
much more efficient code <strong>and</strong> establishes known ranges for <code>X</code> and
<code>Y</code>:</p>

<pre><code class="language-erlang">add5(X, Y) when X =:= X band 16#3FF,
                Y =:= Y band 16#3FF -&gt;
    X + Y.
</code></pre>

<p>We are showing this way of writing guards for illustrative purposes
only; we don’t recommend rewriting your guards in this way.</p>

<p>The <code>band</code> operator fails unless both of its operands are integers, so
no <code>is_integer/1</code> test is needed. The <code>=:=</code> comparison returns
<code>false</code> if the corresponding variable is outside the range <code>0</code> through
<code>16#3FF</code>.</p>
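<p>To see the guard in action, here is a quick sanity check (assuming
<code>add5/2</code> from above is defined in the same module):</p>

<pre><code class="language-erlang">add5_examples() -&gt;
    3 = add5(1, 2),
    %% 16#400 (1024) is outside 0..16#3FF, so the guard fails:
    ok = try add5(1, 16#400) catch error:function_clause -&gt; ok end,
    %% `band` fails on non-integers, so the guard fails here too:
    ok = try add5(a, 2) catch error:function_clause -&gt; ok end.
</code></pre>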

<p>Writing the guard this way results in the following BEAM code, where
the compiler has been able to figure out the possible ranges for the
operands of the <code>+</code> operator:</p>

<pre><code>    {gc_bif,'band',{f,21},2,[{x,0},{integer,1023}],{x,2}}.
    {test,is_eq_exact,
          {f,21},
          [{tr,{x,0},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
    {gc_bif,'band',{f,21},2,[{x,1},{integer,1023}],{x,2}}.
    {test,is_eq_exact,
          {f,21},
          [{tr,{x,1},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
            {x,0}}.
    return.
</code></pre>

<p>Also, the 4 instructions that precede the <code>+</code> instruction are now
relatively efficient.</p>

<p>The <code>band</code> instruction needs to test the operands and be prepared to handle
integers that don’t fit in 60 bits:</p>

<pre><code class="language-nasm"># i_band_ssjd
    mov rsi, qword ptr [rbx]
    mov eax, 16383
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L97
    and rax, rsi
    short jmp L98
L97:
    call 4456532680
    short je label_25
L98:
    mov qword ptr [rbx+16], rax
</code></pre>

<p>The <code>is_eq_exact</code> instruction benefits from type information derived from
executing the <code>band</code> instruction. Since the right-hand side operand is known
to be a small integer that fits in a machine word, a simple comparison is
sufficient with no need for fallback code to handle other Erlang terms:</p>

<pre><code class="language-nasm"># is_eq_exact_fss
# simplified check since one argument is an immediate
    mov rdi, qword ptr [rbx+16]
    cmp qword ptr [rbx], rdi
    short jne label_25
</code></pre>

<p>The JIT generates the following code for the <code>+</code> operator:</p>

<pre><code class="language-nasm"># i_plus_ssjd
# add without overflow check
    mov rax, qword ptr [rbx]
    mov rsi, qword ptr [rbx+8]
    and rax, -16
    add rax, rsi
    mov qword ptr [rbx], rax
</code></pre>

<h3 id="simplifications-for-base64">Simplifications for <code>base64</code></h3>

<p>As far as we know, <code>base64</code> is the module in OTP that has benefited
the most from the improvements in OTP 25.</p>

<p>Here are benchmark results for a benchmark included in a <a href="https://github.com/erlang/otp/issues/5639">Github
issue</a>. First, the results
for OTP 24 on my computer:</p>

<pre><code>== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 19805 ms: 50 it/sec
fun base64:decode/1: 1000 iterations in 20075 ms: 49 it/sec
</code></pre>

<p>The results for OTP 25 on the same computer:</p>

<pre><code>== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 16024 ms: 62 it/sec
fun base64:decode/1: 1000 iterations in 18306 ms: 54 it/sec
</code></pre>

<p>In OTP 25, the encoding is done in 80 percent of the time that OTP 24 needs.
Decoding is also more than a second faster.</p>

<p>The <code>base64</code> module has not been modified in OTP 25, so the improvements
are entirely down to improvements in the compiler and the JIT.</p>

<p>Here is the clause of <code>encode_binary/2</code> in the <code>base64</code> module that does
most of the work of encoding a binary to Base64:</p>

<pre><code class="language-erlang">encode_binary(&lt;&lt;B1:8, B2:8, B3:8, Ls/bits&gt;&gt;, A) -&gt;
    BB = (B1 bsl 16) bor (B2 bsl 8) bor B3,
    encode_binary(Ls,
                  &lt;&lt;A/bits,(b64e(BB bsr 18)):8,
                    (b64e((BB bsr 12) band 63)):8,
                    (b64e((BB bsr 6) band 63)):8,
                    (b64e(BB band 63)):8&gt;&gt;).
</code></pre>

<p>The binary matching in the function head establishes ranges for
the variables <code>B1</code>, <code>B2</code>, and <code>B3</code>. (The types for all three variables
will be <code>{t_integer,{0,255}}</code>.)</p>

<p>Because of the ranges, all of the <code>bsl</code>, <code>bsr</code>, <code>band</code>, and <code>bor</code>
operations that follow do not need any type checks. Also, in the
creation of the binary, there is no need to test whether the binary
creation succeeded because all values are known to be small integers.</p>

<p>The 4 calls to the <code>b64e/1</code> function are inlined. The function
looks like this:</p>

<pre><code class="language-erlang">-compile({inline, [{b64e, 1}]}).
b64e(X) -&gt;
    element(X+1,
	    {$A, $B, $C, $D, $E, $F, $G, $H, $I, $J, $K, $L, $M, $N,
	     $O, $P, $Q, $R, $S, $T, $U, $V, $W, $X, $Y, $Z,
	     $a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n,
	     $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z,
	     $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $+, $/}).
</code></pre>

<p>In OTP 25, the JIT will optimize calls to <code>element/2</code> where the
position argument is an integer and the tuple argument is a literal
tuple. For the way <code>element/2</code> is used in <code>b64e/1</code>, all type tests
and range checks will be removed:</p>

<pre><code class="language-nasm"># bif_element_jssd
# skipped tuple test since source is always a literal tuple
L302:
    long mov rsi, 9223372036854775807
    mov rdi, qword ptr [rbx+24]
    lea rcx, qword ptr [rsi-2]
# skipped test for small position since it is always small
    mov rax, rdi
    sar rax, 4
# skipped check for position =:= 0 since it is always &gt;= 1
# skipped check for negative position and position beyond tuple
    mov rax, qword ptr [rcx+rax*8]
L300:
L301:
    mov qword ptr [rbx+24], rax
</code></pre>

<p>That is 7 instructions with no conditional branches.</p>

<h3 id="please-try-this-at-home">Please try this at home!</h3>

<p>If you want to follow along and examine the native code for loaded
modules, start the runtime system like this:</p>

<pre><code class="language-bash">erl +JDdump true
</code></pre>

<p>The native code for all modules that are loaded will be dumped to files with the
extension <code>.asm</code>.</p>

<p>To find code that has been simplified by the JIT, use this command:</p>

<pre><code class="language-bash">egrep "simplified|skipped|without overflow" *.asm
</code></pre>

<p>To examine the BEAM code for a module, use the <code>-S</code> option. For example:</p>

<pre><code class="language-bash">erlc -S base64.erl
</code></pre>

<h3 id="pull-requests">Pull requests</h3>

<p>Here are the main pull requests that implement type-based optimizations:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/5316">jit: Optimize instructions based on operand types</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5664">JIT: Strengthen type-based optimizations</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5688">Further strengthen the type-based optimizations</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5727">jit: Fix integer ranges</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5849">JIT: Optimize bsl and bxor with known small operands</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5855">Compiler: Improve bounds calculation for bitwise operators</a></li>
</ul>]]></content><author><name>Björn Gustavsson</name></author><category term="BEAM" /><category term="JIT" /><summary type="html"><![CDATA[This post explores the new type-based optimizations in Erlang/OTP 25 where the compiler embeds type information in the BEAM files to help the JIT (Just-In-Time compiler) to generate better code.]]></summary></entry><entry><title type="html">The Many-to-One Parallel Signal Sending Optimization</title><link href="https://www.erlang.org/blog/parallel-signal-sending-optimization/" rel="alternate" type="text/html" title="The Many-to-One Parallel Signal Sending Optimization" /><published>2021-11-05T00:00:00+00:00</published><updated>2021-11-05T00:00:00+00:00</updated><id>https://www.erlang.org/blog/parallel-signal-sending-optimization</id><content type="html" xml:base="https://www.erlang.org/blog/parallel-signal-sending-optimization/"><![CDATA[<p>This blog post discusses <a href="https://github.com/erlang/otp/pull/5020">the parallel signal sending
optimization</a> that recently got merged into the
master branch (scheduled to be included in Erlang/OTP 25). The
optimization improves signal sending throughput when several processes
send signals to a single process simultaneously on multicore
machines. At the moment, the optimization is only active when one
configures the receiving process with the <code>{message_queue_data,
off_heap}</code> <a href="https://erlang.org/doc/man/erlang.html#spawn_opt-4">setting</a>. The following figure gives an
idea of what type of scalability improvement the optimization can give
in extreme scenarios (number of Erlang processes sending signals on
the x-axis and throughput on the y-axis):</p>

<p><img src="/blog/images/parallel_siq_q/benchmark_peek.png" alt="alt text" title="Send Benchmark Result Peek" /></p>

<p>This blog post aims to give you an understanding of how signal sending
on a single node is implemented in Erlang and how the new optimization
can yield the impressive scalability improvement illustrated in the
figure above. Let us begin with a brief introduction to what Erlang
signals are.</p>

<h2 id="erlang-signals">Erlang Signals</h2>

<p>All concurrently executing entities (processes, ports, etc.)  in an
Erlang system <a href="https://erlang.org/doc/apps/erts/communication.html">communicate using asynchronous signals</a>. The
most common signal is normal messages that are typically sent between
processes with the bang (!) operator. As Erlang takes pride in being a
concurrent programming language, it is, of course, essential that
signals are sent efficiently between different entities. Let us now
discuss what guarantees Erlang programmers get about signal sending
ordering, as this will help when learning how the new optimization works.</p>

<h3 id="the-signal-ordering-guarantee">The Signal Ordering Guarantee</h3>

<p>The signal ordering guarantee is described in the <a href="https://erlang.org/doc/reference_manual/processes.html#signal-delivery">Erlang
documentation like this</a>:</p>

<blockquote>
  <p>“The only signal ordering guarantee given is the following: if an
entity sends multiple signals to the same destination entity, the
order is preserved; that is, if <code>A</code> sends a signal <code>S1</code> to <code>B</code>, and later
sends signal <code>S2</code> to <code>B</code>, <code>S1</code> is guaranteed not to arrive after <code>S2</code>.”</p>
</blockquote>

<p>This guarantee means that if multiple processes send signals to a
single process, all signals from the same process are received in the
send order in the receiving process. Still, there is no ordering
guarantee for two signals coming from two distinct processes. One
should not think about signal sending as instantaneous. There can be
an arbitrary delay after a signal has been sent until it has reached
its destination, but all signals from <code>A</code> to <code>B</code> travel on the same path
and cannot pass each other.</p>
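<p>A minimal sketch of what the guarantee means in practice: with two
messages from the same sender, a plain receive always sees <code>s1</code>
first, no matter how long delivery takes:</p>

<pre><code class="language-erlang">ordering_demo() -&gt;
    Main = self(),
    Receiver = spawn(fun() -&gt;
                             First = receive Msg -&gt; Msg end,
                             Main ! {first_received, First}
                     end),
    Receiver ! s1,
    Receiver ! s2,
    receive {first_received, First} -&gt; First end.  %% always returns s1
</code></pre>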

<p>The guarantee has deliberately been designed to allow for efficient
implementations and allow for future optimizations. However, as we
will see in the next section, before the optimization presented in
this blog post, the implementation did not take advantage of the
permissive ordering guarantee for signals sent between processes
running on the same node.</p>

<h3 id="single-node-process-to-process-implementation-before-the-optimization">Single-Node Process-to-Process Implementation before the Optimization</h3>

<p>Conceptually, the Erlang VM organized the data structure for an Erlang
process as in the following figure before the optimization:</p>

<p><img src="/blog/images/parallel_siq_q/before_process_struct.png" alt="alt text" title="Process struct before optimization" /></p>

<p>Of course, this is an extreme simplification of the Erlang process
structure, but it is enough for our explanation. When a process has
the <code>{message_queue_data, off_heap}</code> setting activated, the following
algorithm is executed to send a signal:</p>

<ol>
  <li>Allocate a new linked list node containing the signal data</li>
  <li>Acquire the <code>OuterSignalQueueLock</code> in the receiving process</li>
  <li>Insert the new node at the end of the <code>OuterSignalQueue</code></li>
  <li>Release the <code>OuterSignalQueueLock</code></li>
</ol>

<p>When a receiving process has run out of signals in its
<code>InnerSignalQueue</code> and/or wants to check if there are more signals in
the outer queue, the following algorithm is executed:</p>

<ol>
  <li>Acquire the <code>OuterSignalQueueLock</code></li>
  <li>Append the <code>OuterSignalQueue</code> at the end of the <code>InnerSignalQueue</code></li>
  <li>Release the <code>OuterSignalQueueLock</code></li>
</ol>

<p>How signal sending works when the receiving process is configured with
<code>{message_queue_data, on_heap}</code> is not so relevant for the main topic
of this blog post. Still, understanding how <code>{message_queue_data,
on_heap}</code> works will also give you an understanding of why the parallel
signal queue optimization is not enabled when a process is configured
with <code>{message_queue_data, on_heap}</code> (which is the default setting),
so here is the algorithm for sending a signal to such a process:</p>

<ol>
  <li>Try to acquire the <code>MainProcessLock</code> with a <code>try_lock</code> call
    <ul>
      <li>If the <code>try_lock</code> call succeeded:
        <ol>
          <li>Allocate space for the signal data on the process’ main heap
area and copy the signal data there</li>
          <li>Allocate a linked list node containing a pointer to the
process heap-allocated signal data</li>
          <li>Acquire the <code>OuterSignalQueueLock</code></li>
          <li>Insert the linked list node at the end of the
<code>OuterSignalQueue</code></li>
          <li>Release the <code>OuterSignalQueueLock</code></li>
          <li>Release the <code>MainProcessLock</code></li>
        </ol>
      </li>
      <li>Else:
        <ol>
          <li>Allocate a new linked list node containing the signal data</li>
          <li>Acquire the <code>OuterSignalQueueLock</code></li>
          <li>Insert the new node at the end of the <code>OuterSignalQueue</code></li>
          <li>Release the <code>OuterSignalQueueLock</code></li>
        </ol>
      </li>
    </ul>
  </li>
</ol>

<p>The advantage of <code>{message_queue_data, on_heap}</code> compared to
<code>{message_queue_data, off_heap}</code> is that the signal data is copied
directly to the receiving process’ main heap (when the <code>try_lock</code> call
for the <code>MainProcessLock</code> succeeds). The disadvantage of
<code>{message_queue_data, on_heap}</code> is that the sender creates extra
contention on the receiver’s <code>MainProcessLock</code>. Notice that we cannot
simply release the <code>MainProcessLock</code> directly after allocating the
data on the receiver’s process heap. If a garbage collection happens
before the signal has been inserted into the process’ heap, the
signal data would be lost (holding the <code>MainProcessLock</code> prevents a
garbage collection from happening). Therefore, <code>{message_queue_data,
off_heap}</code> provides much better scalability than <code>{message_queue_data,
on_heap}</code> when multiple processes send signals to the same process
concurrently on a multicore system.</p>
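<p>For reference, the setting under discussion is an ordinary
<code>spawn_opt</code> option; for example, a receiver with off-heap message
queue data can be started like this:</p>

<pre><code class="language-erlang">start_receiver() -&gt;
    spawn_opt(fun receive_loop/0, [{message_queue_data, off_heap}]).

receive_loop() -&gt;
    receive
        _Any -&gt; receive_loop()
    end.
</code></pre>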

<p>However, even though <code>{message_queue_data, off_heap}</code> scales better
than <code>{message_queue_data, on_heap}</code> with the old implementation,
signal senders still had to acquire the <code>OuterSignalQueueLock</code> for a
short time. This lock can become a scalability bottleneck and a
contended hot-spot when there are enough parallel senders. This is why
we saw very poor scalability and even a slowdown for the old
implementation in the benchmark figure above. Now, we are ready to
look at the new optimization.</p>

<h2 id="the-parallel-signal-sending-optimization">The Parallel Signal Sending Optimization</h2>

<p>The optimization takes advantage of Erlang’s permissive signal
ordering guarantee discussed above. It is enough to keep the order of
signals coming from the same entity to ensure that the signal ordering
guarantee holds. So there is no need for different senders to
synchronize with each other! In theory, signal sending could therefore
be parallelized perfectly. In practice, however, there is only one
thread of execution that handles incoming signals, so we also have to
keep in mind that we don’t want to slow down the receiver and ideally
make receiving signals faster. As signal queue data is stored outside
the process main heap area when the <code>{message_queue_data, off_heap}</code>
setting is enabled, the garbage collector does not need to go through
the whole signal queue, giving better performance for processes with a
lot of signals in their signal queue. Therefore, it is also important
for the optimization not to add unnecessary overhead when the
<code>OuterSignalQueueLock</code> is uncontended, so that we do not slow down
existing use cases for <code>{message_queue_data, off_heap}</code> too much.</p>

<h3 id="data-structure-and-birds-eye-view-of-optimized-implementation">Data Structure and Birds-Eye-View of Optimized Implementation</h3>

<p>We decided to go for a design that enables the parallel signal sending
optimization on demand when the contention on the <code>OuterSignalQueueLock</code>
seems to be high to avoid as much overhead as possible when the
optimization is unnecessary. Here is a conceptual view of the process
structure when the optimization is not active (which is the initial
state when creating a process with <code>{message_queue_data, off_heap}</code>):</p>

<p><img src="/blog/images/parallel_siq_q/after_opt_not_active_process_struct.png" alt="alt text" title="Process struct after optimization but when the optimization is inactive" /></p>

<p>The following figure shows a conceptual view of the process structure
when the parallel signal sending optimization is turned on. The only
difference between this and the previous figure is that the
<code>OuterSignalQueueBufferArray</code> field now points to a structure
containing an array with buffers.</p>

<p><img src="/blog/images/parallel_siq_q/after_opt_active_process_sturct.png" alt="alt text" title="Process struct after optimization when the optimization is active" /></p>

<p>When the parallel signal sending optimization is active, senders do
not need to acquire the <code>OuterSignalQueueLock</code> anymore. Senders are
mapped to a slot in the <code>OuterSignalQueueBufferArray</code> by a simple hash
function that is applied to the process ID (senders without a process
ID are currently mapped to the same slot). Before a sender takes the
<code>OuterSignalQueueLock</code> in the receiving process’ structure, the sender
tries to enqueue in its slot in the <code>OuterSignalQueueBufferArray</code> (if
it exists). If the enqueue attempt succeeds, the sender can continue
without even touching the <code>OuterSignalQueueLock</code>! The order of signals
coming from the same sender is maintained because the same sender is
always mapped to the same slot in the buffer array. Now, you have
probably got an idea of why the signal sending throughput can increase
so much with the new optimization, as we saw in the benchmark figure
presented earlier. Essentially, the contention on the
<code>OuterSignalQueueLock</code> gets distributed among the slots in the
<code>OuterSignalQueueBufferArray</code>. The rest of the subsections in this
section cover details of the implementation, so you can skip those
if you do not want to dig deeper.</p>
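<p>As a hypothetical Erlang model of the sender-to-slot mapping (the
real mapping is internal C code in the VM), think of it as hashing the
sender’s process ID into one of the buffer slots:</p>

<pre><code class="language-erlang">%% Illustrative only: map a sender to one of 64 buffer slots.
slot_for(SenderPid) when is_pid(SenderPid) -&gt;
    erlang:phash2(SenderPid, 64).
</code></pre>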

<h3 id="adaptively-activating-the-outer-signal-queue-buffers">Adaptively Activating the Outer Signal Queue Buffers</h3>

<p>As the figure above tries to illustrate, the <code>OuterSignalQueueLock</code> carries
a statistics counter. When that statistics counter reaches a certain
threshold, the new parallel signal sending optimization is activated
by installing the <code>OuterSignalQueueBufferArray</code> in the process
structure. The statistics counter for the lock is updated in a simple
way. When a thread tries to acquire the <code>OuterSignalQueueLock</code> and the lock
is already taken, the counter is increased, and otherwise, it is
decreased, as the following code snippet illustrates:</p>

<pre><code class="language-c">void erts_proc_sig_queue_lock(Process* proc)
{
    if (EBUSY == erts_proc_trylock(proc, ERTS_PROC_LOCK_MSGQ)) {
        erts_proc_lock(proc, ERTS_PROC_LOCK_MSGQ);
        proc-&gt;sig_inq_contention_counter += 1;
    } else if(proc-&gt;sig_inq_contention_counter &gt; 0) {
        proc-&gt;sig_inq_contention_counter -= 1;
    }
}
</code></pre>

<h3 id="the-outer-signal-queue-buffer-array-structure">The Outer Signal Queue Buffer Array Structure</h3>

<p>Currently, the number of slots in the <code>OuterSignalQueueBufferArray</code> is
fixed at 64. Sixty-four slots should go a long way to reduce signal
queue contention in most practical applications that exist today. Few
servers have more than 100 cores, and typical applications spend a lot
of time doing other things than sending signals. Using 64 slots also
allows us to implement a very efficient atomically updatable bitset
containing information about which slots are currently non-empty (the
<code>NonEmptySlots</code> field in the figure above). This bitset makes flushing
the buffer array into the <code>OuterSignalQueue</code> more efficient
since only the non-empty slots in the buffer array need to be visited
and updated to perform the flush.</p>

<h3 id="sending-signals-with-the-optimization-activated">Sending Signals with the Optimization Activated</h3>

<p>Pseudo-code for the algorithm that is executed when a process is
sending a signal to another process that has the
<code>OuterSignalQueueBufferArray</code> installed can be seen below:</p>

<ol>
  <li>Allocate a new linked list node containing the signal data</li>
  <li>Map the process ID of the sender to the right slot <code>I</code> with the hash function</li>
  <li>Acquire the <code>SlotLock</code> for the slot <code>I</code></li>
  <li>Check the <code>IsAlive</code> field for slot <code>I</code>
    <ul>
      <li>If the <code>IsAlive</code> field’s value is <code>true</code>:
        <ol>
          <li>Set the appropriate bit in the <code>NonEmptySlots</code> field, if the buffer is empty</li>
          <li>Insert the allocated signal node at the end of the <code>BufferQueue</code> for slot <code>I</code></li>
          <li>Increase the <code>NumberOfEnqueues</code> in slot <code>I</code> by 1</li>
          <li>Release <code>SlotLock</code> for slot <code>I</code></li>
          <li>The signal is enqueued, and the thread can continue with the next task</li>
        </ol>
      </li>
      <li>Else (the <code>OuterSignalQueueBufferArray</code> has been deactivated):
        <ol>
          <li>Release the lock for slot <code>I</code></li>
          <li>Do the insert into the <code>OuterSignalQueue</code> in the same way as
the signal sending algorithm did it prior to the optimization</li>
        </ol>
      </li>
    </ul>
  </li>
</ol>

<h3 id="fetching-signals-from-the-outer-signal-queue-buffer-array-and-deactivation-of-the-optimization">Fetching Signals from the Outer Signal Queue Buffer Array and Deactivation of the Optimization</h3>

<p>The algorithm for fetching signals from the outer signal queue uses
the <code>NonEmptySlots</code> field in the <code>OuterSignalQueueBufferArray</code>, so it
only needs to check slots that are guaranteed to be non-empty. At a
high level, the routine works according to the following pseudo-code:</p>

<ol>
  <li>Acquire the <code>OuterSignalQueueLock</code></li>
  <li>For each non-empty slot in the buffer array:
    <ol>
      <li>Lock the slot</li>
      <li>Append the signals in the slot to the end of <code>OuterSignalQueue</code></li>
      <li>Add the value of the slot’s <code>NumberOfEnqueues</code> field to the
<code>TotNumberOfEnqueues</code> field in the <code>OuterSignalQueueBufferArray</code></li>
      <li>Reset the slot’s <code>BufferQueue</code> and <code>NumberOfEnqueues</code> fields</li>
      <li>Unlock the slot</li>
    </ol>
  </li>
  <li>Increase the value of the <code>NumberOfFlushes</code> field in the
<code>OuterSignalQueueBufferArray</code> by one</li>
  <li>If the value of the <code>NumberOfFlushes</code> field has reached a certain
threshold <code>T</code>:
    <ul>
      <li>Calculate the average number of enqueues per flush
(<code>EnqPerFlush</code>) during the last <code>T</code> flushes
(<code>TotNumberOfEnqueues</code> / <code>T</code>).
        <ul>
          <li>If <code>EnqPerFlush</code> is below a certain threshold <code>Q</code>:
            <ul>
              <li>Deactivate the parallel signal sending optimization:
                <ol>
                  <li>For each slot in the <code>OuterSignalQueueBufferArray</code>:
                    <ol>
                      <li>Acquire the <code>SlotLock</code></li>
                      <li>Append the signals in the slot (if any) to the end of <code>OuterSignalQueue</code></li>
                      <li>Set the slot’s <code>IsAlive</code> field to <code>false</code></li>
                      <li>Release the <code>SlotLock</code></li>
                    </ol>
                  </li>
                  <li>Set the <code>OuterSignalQueueBufferArray</code> field in the process
structure to <code>NULL</code></li>
                  <li>Schedule deallocation of the buffer array structure</li>
                </ol>
              </li>
            </ul>
          </li>
          <li>Else if the average is equal to or above the threshold <code>Q</code>:
            <ul>
              <li>Set the <code>NumberOfFlushes</code> and the <code>TotNumberOfEnqueues</code>
fields in the buffer array struct to 0</li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Append the <code>OuterSignalQueue</code> to the end of the <code>InnerSignalQueue</code></li>
  <li>Reset the <code>OuterSignalQueue</code></li>
  <li>Release the <code>OuterSignalQueueLock</code></li>
</ol>

<p>For simplicity, many details have been left out from the pseudo-code
snippets above. However, if you have understood them, you have an
excellent understanding of how signal sending in Erlang works, how the
new optimization is implemented, and how it automatically activates
and deactivates itself. Let us now dive a little bit deeper into
benchmark results for the new implementation.</p>

<h2 id="benchmark">Benchmark</h2>

<p>A configurable benchmark to measure the performance of both signal
sending processes and receiving processes has been created. The
benchmark lets <code>N</code> Erlang processes send signals (of configurable types
and sizes) to a single process during a period of <code>T</code> seconds. Both <code>N</code>
and <code>T</code> are configurable variables. A signal with size <code>S</code> has a payload
consisting of a list of length <code>S</code> with word-sized (64 bits) items. The
send throughput is calculated by dividing the number of signals that
are sent by <code>T</code>. The receive throughput is calculated by waiting until
all sent signals have been received and then dividing the total number
of signals sent by the time between when the first signal was sent and
when the last signal was received. The benchmark machine has 32 cores
and two hardware threads per core (giving 64 hardware threads). You
can find a detailed benchmark description on the <a href="http://winsh.me/bench/erlang_sig_q/sigq_bench_result.html">signal queue
benchmark page</a>.</p>
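<p>For a feel of the benchmark’s structure, here is a much-simplified
sketch (not the actual benchmark code; the module and function names
are made up):</p>

<pre><code class="language-erlang">-module(sigq_bench_sketch).
-export([run/3]).

%% N senders send Payload to a single off-heap receiver for TimeMs
%% milliseconds; returns how many messages the receiver consumed.
run(N, TimeMs, Payload) -&gt;
    Main = self(),
    Receiver = spawn_opt(fun() -&gt; recv_loop(0, Main) end,
                         [{message_queue_data, off_heap}]),
    Senders = [spawn(fun() -&gt; send_loop(Receiver, Payload) end)
               || _ &lt;- lists:seq(1, N)],
    timer:sleep(TimeMs),
    [exit(Sender, kill) || Sender &lt;- Senders],
    Receiver ! stop,
    receive {count, Count} -&gt; Count end.

recv_loop(Count, Main) -&gt;
    receive
        stop -&gt; Main ! {count, Count};
        _Msg -&gt; recv_loop(Count + 1, Main)
    end.

send_loop(Receiver, Payload) -&gt;
    Receiver ! Payload,
    send_loop(Receiver, Payload).
</code></pre>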

<p>First, let us look at the results for very small messages (a list
containing a single integer) below. The graph for the receive
throughput is the same as we saw at the beginning of this blog post. Not
surprisingly, the scalability for sending messages is much better
after the optimization. More surprising is that the performance of
receiving messages is also substantially improved. For example, with
16 processes, the receive throughput is 520 times better with the
optimization! The improved receive throughput can be explained by the
fact that in this scenario, the receiver has to fetch messages from
the outer signal queue much less often. Sending is much faster
after the optimization, so the receiver will bring more messages from
the outer signal queue to the inner queue every time it runs out of
messages. The receiver can thus process messages from the inner queue
for a longer time before it needs to fetch messages from the outer
queue again. We cannot expect any improvement for the receiver beyond
a certain point, as there is only a single hardware thread that can
work on processing messages at the same time.</p>

<p><img src="/blog/images/parallel_siq_q/small_msg_send_receive_throughput.png" alt="alt text" title="Small Messages Benchmark Result" /></p>

<p>Below are the results for larger messages (a list containing 100
integers). We do not get as large an improvement in this scenario with
the larger message size. With larger messages, the benchmark spends more
time doing other work than sending and receiving messages. Things like
the speed of the memory system and memory allocation might become
limiting factors. Still, we get decent improvement both in the send
throughput and receive throughput, as seen below.</p>

<p><img src="/blog/images/parallel_siq_q/large_msg_send_receive_throughput.png" alt="alt text" title="Large Messages Benchmark Result" /></p>

<p>You can find results for even larger messages as well as for
non-message signals on the <a href="http://winsh.me/bench/erlang_sig_q/sigq_bench_result.html">benchmark page</a>. Real
Erlang applications do much more than message and signal sending, so
this benchmark is, of course, not representative of what kind of
improvements real applications will get. However, the benchmarks show
that we have pushed the threshold for when parallel message sending to
a single process becomes a problem. Perhaps the new optimization opens
up new interesting ways of writing software that was impractical due
to previous performance reasons.</p>

<h2 id="possible-future-work">Possible Future Work</h2>

<p>Users can configure processes with <code>{message_queue_data, off_heap}</code> or
<code>{message_queue_data, on_heap}</code>. This configurability increases the
burden for Erlang programmers as it can be difficult to figure out
which one is better for a particular process. It would therefore make
sense also to have a <code>{message_queue_data, auto}</code> option that would
automatically detect lock contention even in <code>on_heap</code> mode and
seamlessly switch between <code>on_heap</code> and <code>off_heap</code> based on how much
contention is detected.</p>

<p>As discussed previously, 64 slots in the signal queue buffer array is
a good start but might not be enough when servers have thousands of
cores. A possible way to make the implementation even more scalable
would be to make the signal queue buffer array expandable. For
example, one could have contention detecting locks for each slot in
the array. If the contention is high in a particular slot, one could
expand this slot by creating a link to a subarray with buffers where
senders can use another hash function (similar to how the <a href="https://en.wikipedia.org/wiki/Hash_array_mapped_trie">HAMT data
structure</a> works).</p>

<h2 id="conclusion">Conclusion</h2>

<p>The new parallel signal queue optimization that affects processes
configured with <code>{message_queue_data, off_heap}</code> yields much better
scalability when multiple processes send signals to the same process
in parallel. The optimization has a very low overhead when the
contention is low as it is only activated when its contention
detection mechanism indicates that the contention is high.</p>]]></content><author><name>Kjell Winblad</name></author><category term="message," /><category term="signal," /><category term="signal" /><category term="queue," /><category term="message" /><category term="queue," /><category term="parallel" /><summary type="html"><![CDATA[This blog post discusses the parallel signal sending optimization that recently got merged into the master branch (scheduled to be included in Erlang/OTP 25). The optimization improves signal sending throughput when several processes send signals to a single process simultaneously on multicore machines. At the moment, the optimization is only active when one configures the receiving process with the {message_queue_data, off_heap} setting. The following figure gives an idea of what type of scalability improvement the optimization can give in extreme scenarios (number of Erlang processes sending signals on the x-axis and throughput on the y-axis):]]></summary></entry><entry><title type="html">Decentralized ETS Counters for Better Scalability</title><link href="https://www.erlang.org/blog/scalable-ets-counters/" rel="alternate" type="text/html" title="Decentralized ETS Counters for Better Scalability" /><published>2021-08-03T00:00:00+00:00</published><updated>2021-08-03T00:00:00+00:00</updated><id>https://www.erlang.org/blog/scalable-ets-counters</id><content type="html" xml:base="https://www.erlang.org/blog/scalable-ets-counters/"><![CDATA[<p>A shared <a href="https://erlang.org/doc/man/ets.html">Erlang Term Storage
(ETS)</a> table is often an
excellent place to store data that is frequently updated and read by
multiple Erlang processes. ETS provides key-value stores to
Erlang processes. When the
<a href="https://erlang.org/doc/man/ets.html#new-2">write_concurrency</a> option
is activated, ETS tables use fine-grained locking
internally. Therefore, a scenario where multiple processes insert and
remove different items in an ETS table should scale well with the
number of utilized cores. However, in practice the scalability
for such scenarios is not yet perfect. This blog post will explore
how the <code>decentralized_counters</code> option brings us one step closer to
perfect scalability.</p>

<p>The ETS table option
<a href="https://erlang.org/doc/man/ets.html#new-2"><code>decentralized_counters</code></a>
(introduced in Erlang/OTP 22 for <code>ordered_set</code> tables and in
Erlang/OTP 23 for the other table types) can make the scalability much
better. A table with <code>decentralized_counters</code> activated uses
decentralized counters instead of centralized counters to track the
number of items in the table and the memory
consumption. Unfortunately, for tables with <code>decentralized_counters</code>
activated, the operations that get the table size and memory usage
(<a href="https://erlang.org/doc/man/ets.html#info-2"><code>ets:info(Table, size)</code></a>
and
<a href="https://erlang.org/doc/man/ets.html#info-2"><code>ets:info(Table, memory)</code></a>)
are slow, so whether it is beneficial to turn
<code>decentralized_counters</code> on or off depends on your use
case. This blog post will give you a better understanding of when one
should activate the <code>decentralized_counters</code> option and of how
the decentralized counters work.</p>
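<p>For concreteness, here is a small example (table and key names
assumed) of creating a table with <code>decentralized_counters</code> and
calling the two operations that become slow:</p>

<pre><code class="language-erlang">%% The option is typically combined with write_concurrency, which is
%% what makes the centralized counters a bottleneck in the first place.
T = ets:new(example, [set, public,
                      {write_concurrency, true},
                      {decentralized_counters, true}]),
true = ets:insert(T, {some_key, some_value}),
1 = ets:info(T, size),      %% slow: must sum the decentralized counter
Mem = ets:info(T, memory).  %% likewise slow
</code></pre>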

<h2 id="scalability-with-decentralized-ets-counters">Scalability with Decentralized ETS Counters</h2>

<p>The following figure shows the throughput (operations/second) achieved
when processes are doing inserts (<code>ets:insert/2</code>) and deletes
(<code>ets:delete/2</code>) in an ETS table of the <code>set</code> type on a machine with
64 hardware threads, both when the <code>decentralized_counters</code> option is
activated and when it is deactivated. The table types <code>bag</code> and
<code>duplicate_bag</code> have similar scalability behavior, as their
implementation is based on the same hash table.</p>

<p><img src="/blog/images/ets_scalable_counters/bench_set_50_ins_50_del_nospread.png" alt="alt text" title="Throughput of inserts and deletes on a table of type set with and without the decentralized_counters activated" /></p>

<p>The following figure shows the results for the same benchmark but with
a table of type <code>ordered_set</code>:</p>

<p><img src="/blog/images/ets_scalable_counters/bench_ordset_50_ins_50_del_nospread.png" alt="alt text" title="Throughput of inserts and deletes on a table of type ordered_set with and without the decentralized_counters activated" /></p>

<p>The interested reader can find more information about the benchmark at
the <a href="http://winsh.me/ets_catree_benchmark/decent_ctrs_hash.html">benchmark website for
<code>decentralized_counters</code></a>. The
benchmark results above show that both <code>set</code> and <code>ordered_set</code> tables
get a significant scalability boost when the <code>decentralized_counters</code>
option is activated. The <code>ordered_set</code> type receives a more
substantial scalability improvement than the <code>set</code> type. Tables of the
<code>set</code> type have a fixed number of locks for the hash table buckets. The
<code>ordered_set</code> table type is implemented with a <a href="https://doi.org/10.1016/j.jpdc.2017.11.007">contention adapting
search tree</a> that
dynamically changes the locking granularity based on how much
contention is detected. This implementation difference explains the
difference in scalability between <code>set</code> and <code>ordered_set</code>. Details
about the <code>ordered_set</code> implementation can be found in an
<a href="/blog/the-new-scalable-ets-ordered_set/">earlier blog post</a>.</p>

<p>It is also worth noting that the Erlang VM that ran the benchmarks was
compiled with the configure option “<code>./configure
--with-ets-write-concurrency-locks=256</code>”, which changes the number
of locks for hash-based ETS tables from the current default of 64 to
256 (currently the maximum value for this option). Changing the
implementation of the hash-based tables so that the number of locks can
be set per table instance, or so that the lock granularity is adjusted
automatically, seems like an excellent future improvement, but that is
not what this blog post is about.</p>

<p>A centralized counter consists of a single memory word that is
incremented and decremented with atomic instructions. The problem with
a centralized counter is that modifications of the counter
by multiple cores are serialized. This problem is amplified because
frequent modifications of a single memory word by multiple cores cause
a lot of expensive traffic in the <a href="https://en.wikipedia.org/wiki/Cache_coherence">cache
coherence</a>
system. However, reading from a centralized counter is quite efficient
as the reader only has to read a single memory word.</p>
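<p>Erlang exposes this kind of centralized counter to Erlang code
through the
<a href="https://erlang.org/doc/man/atomics.html"><code>atomics</code></a>
module, which makes the trade-off easy to see:</p>

<pre><code class="language-erlang">%% A centralized counter: one 64-bit word updated with atomic
%% instructions. Updates from many cores are serialized, but reading
%% the counter is a single cheap memory load.
Ref = atomics:new(1, [{signed, true}]),
ok = atomics:add(Ref, 1, 1),
Value = atomics:get(Ref, 1).   %% 1
</code></pre>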

<p>When designing the decentralized counters for ETS, we have tried to
optimize for update performance and scalability, as most applications
rarely need to get the size of an ETS table. However, since
there may be applications out in the wild that frequently call
<a href="https://erlang.org/doc/man/ets.html#info-2"><code>ets:info(Table, size)</code></a>
and <a href="https://erlang.org/doc/man/ets.html#info-2"><code>ets:info(Table, memory)</code></a>,
we have chosen to make decentralized counters optional.</p>
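<p>The same trade-off is visible in the standalone
<a href="https://erlang.org/doc/man/counters.html"><code>counters</code></a>
module, where the <code>write_concurrency</code> option selects a
decentralized representation:</p>

<pre><code class="language-erlang">%% Updates scale with the number of schedulers, but reads have to
%% visit several memory words and are not atomic with respect to
%% concurrent updates.
C = counters:new(1, [write_concurrency]),
ok = counters:add(C, 1, 1),
Val = counters:get(C, 1).
</code></pre>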

<p>Another thing worth keeping in mind is that hash-based tables that
use decentralized counters tend to use slightly more hash table buckets
than the corresponding tables without decentralized counters. The
reason is that, with decentralized counters activated, the resizing
decision is based on an estimate of the number of items in the table
rather than an exact count, and the resizing heuristics trigger an
increase of the number of buckets more eagerly than a decrease.</p>

<h2 id="implementation">Implementation</h2>

<p>You will now learn how the decentralized counters in ETS work. The
<a href="https://github.com/erlang/otp/blob/ce7dbe8742e66f4632b5d39a9b4d7aa461e4f164/erts/emulator/beam/erl_flxctr.h">decentralized counter implementation exports an
API</a>
that makes it easy to swap between a decentralized counter and a
centralized one. ETS uses this API to support both centralized and
decentralized counters. The data structure for the
decentralized counter is illustrated in the following picture. When
<code>is_decentralized = false</code>, the counter field represents the current
count instead of a pointer to an array of cache-line-padded counters.</p>

<p><img src="/blog/images/ets_scalable_counters/structure.png" alt="alt text" title="An image
showing the structure of a decentralized counter" /></p>

<p>When <code>is_decentralized = true</code>, processes that update (increment or
decrement) the counter follow the pointer to the array of counters and
increment the counter at the slot in the array that the current
scheduler maps to (one takes the scheduler identifier modulo the
number of slots in the array to get the appropriate slot). Updates do
not need to do anything else, so they are very efficient and can scale
perfectly with the number of cores as long as there are as many slots
as schedulers. One can configure the maximum number of slots in the
array of counters with the
<a href="https://erlang.org/doc/man/erl.html"><code>+dcg</code></a> option.</p>

<p>To implement the <code>ets:info(Table, size)</code> and <code>ets:info(Table, memory)</code>
operations, one also needs to read the current counter value. Reading
the current counter value can be implemented by taking the sum of the
values in the counter array. However, if this summation is done
concurrently with updates to the array of counters, we could get
strange results. For example, we could end up in a situation where
<code>ets:info(Table, size)</code> returns a negative number, which is not
exactly what we want. On the other hand, we want to make counter
updates as fast as possible, so protecting the counters in the counter
array with locks is not a good option. We opted for a solution that
lets readers swap out the entire counter array and wait (using the
<a href="https://github.com/erlang/otp/blob/7c06ca6231b812965305522284dd9f2653ced98d/erts/emulator/internal_doc/ThreadProgress.md">Erlang VM’s thread progress
system</a>)
until no updates can occur in the swapped-out array before the sum is
calculated. The following steps illustrate this approach (a code
sketch summarizing the steps follows the list):</p>

<ul>
  <li>
    <p><strong>[Step 1]</strong></p>

    <p>A thread is going to read the counter value.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_1.png" alt="alt text" title="Step 1" /></p>
  </li>
  <li>
    <p><strong>[Step 2]</strong></p>

    <p>The reader starts by creating a new counter array.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_1_b.png" alt="alt text" title="Step 2" /></p>
  </li>
  <li>
    <p><strong>[Step 3]</strong></p>

    <p>The pointer to the old counter array is changed to point to the new
 one with the <code>snapshot_ongoing</code> field set to <code>true</code>. This
 change can only be done when the <code>snapshot_ongoing</code> field is set to
 <code>false</code> in the old counter array.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_2.png" alt="alt text" title="Step 3" /></p>
  </li>
  <li>
    <p><strong>[Step 4]</strong></p>

    <p>Now, the reader has to wait until all other threads that might
 still be updating a counter in the old array have completed their updates. As
 mentioned, this can be done using the <a href="https://github.com/erlang/otp/blob/7c06ca6231b812965305522284dd9f2653ced98d/erts/emulator/internal_doc/ThreadProgress.md">Erlang VM’s thread progress
 system</a>. After
 that, the reader can safely calculate the sum of counters in the
 old counter array (the sum is 1406). The calculated sum is also
 given to the process that requested the count so that it can
 continue execution.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_3.png" alt="alt text" title="Step 4" /></p>
  </li>
  <li>
    <p><strong>[Step 5]</strong></p>

    <p>The read operation is not done yet, even though we have successfully
 calculated a count. The calculated sum from the old array must be added
 to the new array so that no part of the count is lost.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_4.png" alt="alt text" title="Step 5" /></p>
  </li>
  <li>
    <p><strong>[Step 6]</strong></p>

    <p>Finally, the <code>snapshot_ongoing</code> field in the new counter array is
 set to <code>false</code> so that other read operations can swap out the new
 counter array.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_5.png" alt="alt text" title="Step 6" /></p>
  </li>
</ul>
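<p>The following toy model (all names assumed; the real implementation
is the C code in <code>erl_flxctr.c</code>) summarizes the steps above. It
stores the “pointer” to the current counter array in a small ETS table
and simply skips the thread progress wait, which cannot be expressed at
the Erlang level:</p>

<pre><code class="language-erlang">-module(flxctr_sketch).
-export([new/1, add/2, read/1]).

%% The ETS table plays the role of the pointer to the current array.
new(NumSlots) ->
    T = ets:new(holder, [public, set]),
    ets:insert(T, {array, counters:new(NumSlots, [atomics]), NumSlots}),
    T.

%% Updaters increment a slot, chosen from the scheduler id, in whatever
%% array the pointer currently refers to.
add(T, Incr) ->
    [{array, Ctr, N}] = ets:lookup(T, array),
    Slot = (erlang:system_info(scheduler_id) rem N) + 1,
    counters:add(Ctr, Slot, Incr).

read(T) ->
    [{array, Old, N}] = ets:lookup(T, array),
    New = counters:new(N, [atomics]),   %% Step 2: create a new array
    ets:insert(T, {array, New, N}),     %% Step 3: swap the pointer
    %% Step 4: the real code waits here, via the thread progress
    %% system, until no thread can still be updating Old.
    Sum = lists:sum([counters:get(Old, I) || I <- lists:seq(1, N)]),
    counters:add(New, 1, Sum),          %% Step 5: carry the sum over
    %% Step 6: clearing snapshot_ongoing is likewise omitted here.
    Sum.
</code></pre>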

<p>You should now have a basic understanding of how ETS’
decentralized counters work. You are also welcome to look at the
source code in
<a href="https://github.com/erlang/otp/blob/ce7dbe8742e66f4632b5d39a9b4d7aa461e4f164/erts/emulator/beam/erl_flxctr.c">erl_flxctr.c</a>
and
<a href="https://github.com/erlang/otp/blob/ce7dbe8742e66f4632b5d39a9b4d7aa461e4f164/erts/emulator/beam/erl_flxctr.h">erl_flxctr.h</a>
if you are interested in the details of the implementation.</p>

<p>As you can imagine, reading the value of a decentralized counter with,
for example, <code>ets:info(Table, size)</code> is extremely slow compared
to reading a centralized counter. Fortunately, most of the time spent
reading the value of a decentralized counter is spent waiting for the
thread progress system to report that it is safe to read the
swapped-out array; during this wait, the read operation does not block
any scheduler and does not consume any CPU time. On the other hand, the
decentralized counter can be updated in a very efficient and scalable
way, so decentralized counters are most likely preferable if you seldom
need to get the size of, or the memory consumed by, your shared ETS
table.</p>

<h2 id="concluding-remarks">Concluding Remarks</h2>

<p>This blog post has described the implementation of the decentralized
counter option for ETS tables. ETS tables with decentralized counters
scale much better with the number of cores than ETS tables with
centralized counters. However, as decentralized counters make
<code>ets:info(Table, size)</code> and <code>ets:info(Table, memory)</code> very slow, one
should not use them if either of these two operations needs to be
performed frequently.</p>]]></content><author><name>Kjell Winblad</name></author><category term="ETS," /><category term="erlang" /><category term="term" /><category term="storage," /><category term="scalability," /><category term="multicore" /><summary type="html"><![CDATA[A shared Erlang Term Storage (ETS) table is often an excellent place to store data that is updated and read from multiple Erlang processes frequently. ETS provides key-value stores to Erlang processes. When the write_concurrency option is activated, ETS tables use fine-grained locking internally. Therefore, a scenario where multiple processes insert and remove different items in an ETS table should scale well with the number of utilized cores. However, in practice the scalability for such scenarios is not yet perfect. This blog post will explore how the decentralized_counters option brings us one step closer to perfect scalability.]]></summary></entry></feed>