<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://www.erlang.org/blog.xml" rel="self" type="application/atom+xml" /><link href="https://www.erlang.org/" rel="alternate" type="text/html" /><updated>2026-04-13T02:13:20+00:00</updated><id>https://www.erlang.org/blog.xml</id><title type="html">Erlang/OTP</title><subtitle>The official home of the Erlang Programming Language</subtitle><entry><title type="html">Erlang/OTP 28 Highlights</title><link href="https://www.erlang.org/blog/highlights-otp-28/" rel="alternate" type="text/html" title="Erlang/OTP 28 Highlights" /><published>2025-05-20T00:00:00+00:00</published><updated>2025-05-20T00:00:00+00:00</updated><id>https://www.erlang.org/blog/highlights-otp-28</id><content type="html" xml:base="https://www.erlang.org/blog/highlights-otp-28/"><![CDATA[<p>Erlang/OTP 28 is finally here. This blog post will introduce the new
features that we are most excited about.</p>

<p>A list of all changes is found in <a href="https://erlang.org/patches/OTP-28.0">Erlang/OTP 28 Readme</a>.
Or, as always, look at the release notes of the application you are interested in.
For instance:
<a href="https://www.erlang.org/doc/apps/erts/notes.html#erts-16.0">Erlang/OTP 28 - Erts Release Notes - Version 16.0</a>.</p>

<p>This year’s highlights mentioned in this blog post are:</p>

<ul>
  <li><a href="#priority-messages">Priority Messages</a></li>
  <li><a href="#improvements-of-comprehensions">Improvements of Comprehensions</a></li>
  <li><a href="#smarter-error-suggestions">Smarter Error Suggestions</a></li>
  <li><a href="#improvements-to-the-shell">Improvements to the Shell</a></li>
  <li><a href="#new-erlanghibernate0">New <code>erlang:hibernate/0</code></a></li>
  <li><a href="#warnings-for-use-of-old-style-catch">Warnings for Use of Old-style Catch</a></li>
  <li><a href="#pcre2">PCRE2</a></li>
  <li><a href="#optimizations-to-tls-13">Optimizations to TLS 1.3</a></li>
  <li><a href="#based-floating-point-literals">Based Floating Point Literals</a></li>
  <li><a href="#nominal-types">Nominal Types</a></li>
  <li><a href="#new-emacs-erlang-mode">New Emacs Erlang Mode</a></li>
</ul>

<h1 id="priority-messages">Priority Messages</h1>

<p>Sometimes, it is important for urgent messages to <em>skip the queue</em> and
be read by the receiving process as soon as possible. Erlang/OTP 28 introduces
priority messages, an opt-in mechanism that allows the receiving process
to let certain messages get priority status.</p>

<p>By default, all messages are inserted at the end of the message queue of
a process. This becomes a problem when the queue is long and an urgent
message needs to be read as soon as possible.</p>

<p>For example, the current message overload protection mechanism for <a href="https://www.erlang.org/doc/apps/kernel/logger.html"><code>logger</code></a>
polls its message queue length in order to know when it should start shedding
messages. It would have benefitted from using the <a href="https://www.erlang.org/doc/apps/erts/erlang.html#system_monitor/2"><code>long_message_queue</code></a>
monitoring functionality introduced in Erlang/OTP 27, but the only way to
get information like that is via a message, which would be inserted at the
end of the very long queue.</p>

<p>Priority messages solve this problem by letting selected messages be inserted
before all ordinary messages, but still in the order they are received.</p>

<p>A receiver process can allow other processes to send priority messages
to it in two simple steps. The first step is to create a process alias
using <a href="https://erlang.org/doc/apps/erts/erlang.html#alias/1"><code>alias/1</code></a>:</p>

<pre><code class="language-erlang">PrioAlias = alias([priority])
</code></pre>

<p>This alias can then be distributed to other processes that should be able
to send priority messages to the receiver process. A sender process can
send a priority message by using <a href="https://erlang.org/doc/apps/erts/erlang.html#send/3"><code>erlang:send/3</code></a>,
passing the <code>PrioAlias</code> as the first argument, and the option <code>priority</code> in
the option list as the third argument:</p>

<pre><code class="language-erlang">erlang:send(PrioAlias, Message, [priority])
</code></pre>

<p>In this way, messages sent to a priority alias with the <code>priority</code> flag will
be inserted before ordinary messages in the message queue. Other processes
can still send ordinary messages to the priority alias by omitting the
<code>priority</code> flag; such messages are treated as ordinary messages.</p>
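<p>The effect on ordering can be illustrated with a minimal sketch, where a
process sends both kinds of messages to itself (the message terms are
illustrative):</p>

<pre><code class="language-erlang">PrioAlias = alias([priority]),
self() ! ordinary,
erlang:send(PrioAlias, urgent, [priority]),
receive First -&gt; First end.
%% First =:= urgent: the priority message was inserted
%% before the ordinary message already in the queue.
</code></pre>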

<p>It is also possible to send an exit signal as a priority signal, like this:</p>

<pre><code class="language-erlang">exit(PrioAlias, Message, [priority])
</code></pre>

<p>If the receiver process wants to stop receiving priority messages, it
can do so by deactivating its priority alias:</p>

<pre><code class="language-erlang">true = unalias(PrioAlias)
</code></pre>

<p>After this, no priority messages can be sent to the receiver process, because
the priority alias is no longer active. The receiver process can create a new
priority alias and deactivate it again at any time.</p>

<p>Priority message reception can also be enabled for exit signals due to broken
links and messages triggered due to monitors. Since these signals are not
sent when a process calls a specific function for sending a signal, but
when specific events occur in the system, a priority alias cannot be used
for this. In order to enable such priority messages, you can pass the
<code>priority</code> option to either <a href="https://erlang.org/doc/apps/erts/erlang.html#monitor/3"><code>erlang:monitor/3</code></a>
or <a href="https://erlang.org/doc/apps/erts/erlang.html#link/2"><code>erlang:link/2</code></a>.</p>

<p>Priority messages respect Erlang’s existing guarantee: Signals still arrive
in the same order as they are sent, if they arrive at all. This change
only affects where messages are inserted in the queue. Performance-wise,
this feature preserves Erlang’s selective receive optimization. There is
no performance penalty for ordinary messages or priority messages.</p>

<p>For more details, see <a href="https://www.erlang.org/doc/system/ref_man_processes.html#priority-messages">the documentation of <em>priority messages</em></a>
and <a href="https://www.erlang.org/eeps/eep-0076">EEP-76</a>.</p>

<h1 id="improvements-of-comprehensions">Improvements of Comprehensions</h1>

<p>Erlang/OTP 28 introduces many useful updates to comprehensions. All
of them are new language features that were proposed as <a href="https://www.erlang.org/eeps">EEPs</a>.
Between the releases of Erlang/OTP 27 and 28, four EEPs related to
comprehensions were accepted. The features described by two of them are
included in Erlang/OTP 28; the other two are postponed to a later release.
The <a href="https://www.erlang.org/doc/system/expressions.html#comprehensions">documentation for comprehensions</a>
contains an up-to-date overview of all relevant features.</p>

<h2 id="strict-generators">Strict Generators</h2>

<p>Strict generators, as described in <a href="https://www.erlang.org/eeps/eep-0070">EEP 70</a>,
aim to improve the expressiveness and safety of comprehensions.</p>

<p>In OTP 27 and earlier, when the right-hand side expression does not match
the left-hand side pattern in a comprehension generator, the term is ignored
and evaluation continues. In the following example, the element
<code>error</code> is silently skipped in the comprehension.</p>

<pre><code class="language-erlang">1&gt; [X ||{ok, X} &lt;- [{ok, 1}, error, {ok, 3}]].
[1,3]
</code></pre>

<p>This behavior can hide the presence of unexpected elements in the input
data. In the example above, what if the list should not contain anything
other than 2-tuples with the first element being <code>ok</code>? By using a strict
generator, the comprehension crashes when the pattern-matching fails with
the element <code>error</code>.</p>

<pre><code class="language-erlang">2&gt; [X ||{ok, X} &lt;:- [{ok, 1}, error, {ok, 3}]].
** exception error: no match of right hand side value error
</code></pre>

<p>Strict generators can be used in list generators (<code>&lt;:-</code>), binary generators
(<code>&lt;:=</code>), and map generators (<code>&lt;:-</code>). In contrast, the previously existing
generators are called <em>relaxed</em> generators.</p>

<p>Strict generators and relaxed generators can convey different intentions from
the programmer. The following example is rewritten from a comprehension in
the Erlang linter. It finds all NIFs in an abstract form and outputs them.
Obviously, not all forms are NIFs, and we want to ignore the forms that are
not. Using a relaxed generator here is correct.</p>

<pre><code class="language-erlang">Nifs = [Args || {attribute, _Anno, nifs, Args} &lt;- Forms].
</code></pre>

<p>More examples about strict and relaxed generators can be found in
<a href="https://www.erlang.org/doc/system/list_comprehensions.html">List Comprehensions</a>.</p>

<p>Sometimes, either a strict or a relaxed generator will do. When the
left-hand side pattern is a fresh variable, pattern matching cannot fail,
so both kinds behave the same. While preferences and use cases vary, it is
recommended to use strict generators whenever either would do: defaulting
to strict generators aligns with Erlang’s “Let it crash” philosophy.</p>
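<p>For example, because the pattern <code>X</code> below always matches, the
relaxed and strict versions produce the same result:</p>

<pre><code class="language-erlang">1&gt; [X + 1 || X &lt;- [1, 2, 3]].
[2,3,4]
2&gt; [X + 1 || X &lt;:- [1, 2, 3]].
[2,3,4]
</code></pre>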

<p>Now you can pick a more fitting tool for the job, without losing the brevity
of comprehensions. It is also a good time to review old code, and see if
strict generators are more fitting in certain places. The compiler team in
OTP has done <a href="https://github.com/erlang/otp/pull/9004">that</a>. Take a look
if you are curious.</p>

<h2 id="zip-generators">Zip Generators</h2>

<p>Zip generators as described in <a href="https://www.erlang.org/eeps/eep-0073">EEP 73</a>
makes it easier to iterate over multiple lists, binaries, or maps in parallel.</p>

<p>By default, multiple generators in an Erlang comprehension are combined
in a nested, Cartesian way:</p>

<pre><code class="language-erlang">1&gt; [{X, Y} || X &lt;- [1, 2], Y &lt;- [a, b]].
[{1,a},{1,b},{2,a},{2,b}]
</code></pre>

<p>Using zip generators <code>&amp;&amp;</code>, we can change the default behavior and “zip”
generators together as if using <a href="https://erlang.org/doc/apps/stdlib/lists.html#zip/2"><code>lists:zip/2</code></a>:</p>

<pre><code class="language-erlang">2&gt; [{X, Y} || X &lt;- [1, 2] &amp;&amp; Y &lt;- [a, b]].
[{1,a},{2,b}]
</code></pre>

<p>Zip generators can be used with lists, binaries, and maps, and can be
mixed freely with all existing generators and filters. Unlike <a href="https://erlang.org/doc/apps/stdlib/lists.html#zip/2"><code>lists:zip/2</code></a>
and <a href="https://erlang.org/doc/apps/stdlib/lists.html#zip/3"><code>lists:zip/3</code></a>, you
can zip any number of generators together using <code>&amp;&amp;</code>. The compiler avoids
creating intermediate tuples while preserving the same error behavior as
these helper functions.</p>
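<p>For example, a sketch zipping three list generators:</p>

<pre><code class="language-erlang">3&gt; [{X, Y, Z} || X &lt;- [1, 2] &amp;&amp; Y &lt;- [a, b] &amp;&amp; Z &lt;- ["u", "v"]].
[{1,a,"u"},{2,b,"v"}]
</code></pre>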

<h1 id="smarter-error-suggestions">Smarter Error Suggestions</h1>

<p>The Erlang/OTP 28 compiler has levelled up its ability to spot typos.
It now suggests fixes whenever possible.</p>

<p>For example, the following code exports an undefined function <code>bar/1</code>.</p>

<pre><code class="language-erlang">-export([bar/1]).
baz(X) -&gt; X.
</code></pre>

<p>The Erlang/OTP 27 compiler correctly points out the undefined function.</p>

<pre><code class="language-erlang">t.erl:3:2: function bar/1 undefined
%   3| -export([bar/1]).
%    |  ^
</code></pre>

<p>The Erlang/OTP 28 compiler goes one step further. It suggests a possible
correction, based on the functions defined in the module.</p>

<pre><code class="language-erlang">t.erl:3:2: function bar/1 undefined, did you mean baz/1?
%   3| -export([bar/1]).
%    |  ^
</code></pre>

<p>This applies to common error types, like <code>undefined_nif</code>, <code>unbound_var</code>,
<code>undefined_function</code>, <code>undefined_record</code>, and so on.</p>

<p>It also works for wrong arity. If you call a function with the wrong number
of arguments, the compiler will suggest available arities, like the following:</p>

<pre><code class="language-erlang">t.erl:6:12: function bar/2 undefined, did you mean bar/1,3,4?
</code></pre>

<p>This makes compilation errors easier to understand, and small mistakes
faster to fix. Try it out and you’ll notice the change!</p>

<h1 id="improvements-to-the-shell">Improvements to the Shell</h1>

<p>Erlang/OTP 28 brings several improvements to the shell interface, making
it more flexible, interactive and powerful than before.</p>

<h2 id="lazy-reads-from-stdin">Lazy Reads from <code>stdin</code></h2>

<p>Previously, Erlang’s <code>stdin</code> greedily read all input data, which could cause
problems with special characters. This was changed by <a href="https://github.com/erlang/otp/pull/8962">PR-8962</a>.
In Erlang/OTP 28, reads from <code>stdin</code> happen on demand, that is,
only when <a href="https://www.erlang.org/doc/apps/stdlib/io.html#get_line/2"><code>io:get_line/2</code></a>
or an equivalent function is called. This removes the need for the <code>-noinput</code> flag
and resolves issues like <a href="https://github.com/erlang/otp/issues/8113">Issue-8113</a>.</p>
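<p>For example, the following escript (a minimal sketch) echoes <code>stdin</code>
line by line; each line is read only when <code>io:get_line/1</code> is called:</p>

<pre><code class="language-erlang">#!/usr/bin/env escript
%% echo.es: read stdin line by line, on demand.
main(_Args) -&gt;
    case io:get_line("") of
        eof -&gt; ok;
        Line -&gt; io:put_chars(Line), main([])
    end.
</code></pre>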

<h2 id="raw-and-cooked-modes-for-noshell">Raw and Cooked Modes for <code>noshell</code></h2>

<p>The <code>noshell</code> mode now supports two “submodes”:</p>
<ul>
  <li><code>cooked</code> is the default behavior, same as before.</li>
  <li><code>raw</code> is a new option that bypasses the line editing support of the
native terminal.</li>
</ul>

<p>In <code>raw</code> mode, you can build more interactive applications. It makes
it possible to read keystrokes as they happen, without the user pressing Enter,
while disabling line editing and echoing to <code>stdout</code>. The
following example is an escript that reads raw input (and immediately
prints it back out) without requiring the user to press Enter:</p>

<pre><code class="language-erlang">#!/usr/bin/env escript
%% t.es

main(_Args) -&gt;
    shell:start_interactive({noshell, raw}),
    io:format("Press any key, or press q to quit.\n"),
    loop().

loop() -&gt;
    case io:get_chars("", 1024) of
        "q" -&gt;
            io:format("Exit now.\n");
        {error, Reason} -&gt;
            %% This clause must come before the catch-all
            %% Chars clause, or it could never match.
            io:format("Error reason: ~p~n", [Reason]);
        eof -&gt;
            ok;
        Chars -&gt;
            io:format("~p", [Chars]),
            loop()
    end.
</code></pre>

<p>With this, Erlang’s shell becomes a platform for building interactive
terminal applications.
The <a href="https://www.erlang.org/doc/apps/stdlib/custom_shell.html">custom shell</a>
documentation shows how to create a custom shell. The <a href="https://www.erlang.org/doc/apps/stdlib/terminal_interface.html">terminal interface</a>
documentation shows how to implement a tic-tac-toe game.</p>

<p>Try it out. We look forward to seeing more interactive applications created
using this feature.</p>

<h2 id="using-fun-namearity-to-create-funs-in-shell">Using <code>fun Name/Arity</code> to create funs in shell</h2>

<p>Thanks to <a href="https://github.com/erlang/otp/pull/8987">PR-8987</a>, you can
now use <code>fun Name/Arity</code> to create funs in the shell. The fun can be created
from an auto-imported BIF, such as <a href="https://www.erlang.org/doc/apps/erts/erlang.html#is_atom/1"><code>is_atom/1</code></a>, as in the example below.</p>

<pre><code class="language-erlang">1&gt; F = fun is_atom/1.
fun erlang:is_atom/1
2&gt; F(a).
true
3&gt; F(42).
false
</code></pre>

<p>The fun can also be created from a local function defined in the shell. Note
that the fun can be created before the function itself is defined; the function
is looked up when the fun is called, as in the following example.</p>

<pre><code class="language-erlang">1&gt; I = fun id/1.
#Fun&lt;erl_eval.42.18682967&gt;
2&gt; I(42).
** exception error: undefined shell command id/1
3&gt; id(I) -&gt; I.
ok
4&gt; I(42).
42
</code></pre>

<h1 id="new-erlanghibernate0">New <code>erlang:hibernate/0</code></h1>

<p>Erlang/OTP 28 introduces a new <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>
function. This built-in function puts the calling process into a wait state
where its memory footprint is reduced as much as possible. When the process
receives its next message, it will wake up. Unlike the existing <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/3"><code>erlang:hibernate/3</code></a>,
it does not discard the call stack.</p>

<p>This makes <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>
useful for processes that expect long idle times but want a simpler form
of hibernation.</p>
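<p>For example, a server loop could hibernate after a period of inactivity
without restructuring its code. A minimal sketch, where <code>handle/2</code>
is a hypothetical helper:</p>

<pre><code class="language-erlang">loop(State) -&gt;
    receive
        Msg -&gt; loop(handle(Msg, State))
    after 60_000 -&gt;
        %% Idle for a minute: shrink the memory footprint and
        %% wait. hibernate/0 returns when the next message
        %% arrives, and the call stack is kept.
        erlang:hibernate(),
        loop(State)
    end.
</code></pre>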

<h2 id="memory-usage-experiment">Memory Usage Experiment</h2>

<p>To demonstrate how efficient <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>
is, we can write a benchmark that spawns different numbers of processes
(starting from 1, going up to 1 million), lets them wait for a message
either in a plain <code>receive</code> or in <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>,
and then compares memory usage.</p>

<p>Here’s the test code for the first scenario, which uses <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>:</p>

<pre><code class="language-erlang">-module(benchmark_hibernate).
-export([worker/0, spawn_all/1]).

worker() -&gt;
    erlang:hibernate().

spawn_all(0) -&gt;
    timer:sleep(1000),
    io:format("Memory usage: ~p~n", [erlang:memory()]),
    timer:sleep(1000),
    io:format("Memory usage after 1s: ~p~n", [erlang:memory()]),
    ok;
spawn_all(N) -&gt;
    spawn(?MODULE, worker, []),
    spawn_all(N-1).
</code></pre>

<p>Here’s the test code for the second scenario. Processes stay idle but they
do not hibernate:</p>

<pre><code class="language-erlang">-module(benchmark_receive).
-export([worker2/0, spawn_all/1]).

worker2() -&gt;
    receive
        _  -&gt; ok
    end.

spawn_all(0) -&gt;
    timer:sleep(1000),
    io:format("Memory usage: ~p~n", [erlang:memory()]),
    timer:sleep(1000),
    io:format("Memory usage after 1s: ~p~n", [erlang:memory()]),
    ok;
spawn_all(N) -&gt;
    spawn(?MODULE, worker2, []),
    spawn_all(N-1).
</code></pre>

<p>Memory usage is measured by <a href="https://www.erlang.org/doc/apps/erts/erlang.html#memory/0"><code>erlang:memory()</code></a>
after all processes have been spawned. For the final result, we take
the average of the two measurements.</p>

<p>We spawn 1, 10 thousand, 100 thousand, and 1 million processes for both
scenarios. Results are summarized in the following table:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Number of Processes</th>
      <th>Memory Used (MB)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Hibernated</td>
      <td>1</td>
      <td>44.8</td>
    </tr>
    <tr>
      <td>Without <code>hibernate/0</code></td>
      <td>1</td>
      <td>47.0</td>
    </tr>
    <tr>
      <td>Hibernated</td>
      <td>10,000</td>
      <td>55.5</td>
    </tr>
    <tr>
      <td>Without <code>hibernate/0</code></td>
      <td>10,000</td>
      <td>73.4</td>
    </tr>
    <tr>
      <td>Hibernated</td>
      <td>100,000</td>
      <td>130.3</td>
    </tr>
    <tr>
      <td>Without <code>hibernate/0</code></td>
      <td>100,000</td>
      <td>307.1</td>
    </tr>
    <tr>
      <td>Hibernated</td>
      <td>1,000,000</td>
      <td>827.9</td>
    </tr>
    <tr>
      <td>Without <code>hibernate/0</code></td>
      <td>1,000,000</td>
      <td>2687.1</td>
    </tr>
  </tbody>
</table>

<p>With only 1 process, the memory usage reduction is not yet apparent.
With 1 million mostly idle processes, using <a href="https://www.erlang.org/doc/apps/erts/erlang.html#hibernate/0"><code>erlang:hibernate/0</code></a>
reduces memory usage by close to 70%!</p>

<h1 id="warnings-for-use-of-old-style-catch">Warnings for Use of Old-Style Catch</h1>

<p>Erlang/OTP 28 introduces a warning for using the old-style <code>catch Expr</code>
instead of <code>try ... catch ... end</code>.</p>

<p>The simpler <code>catch Expr</code> is problematic in that it catches
<em>all</em> exceptions and can therefore hide bugs. For example, if the
intention is to catch exceptions raised by <a href="https://www.erlang.org/doc/apps/erts/erlang.html#throw/1"><code>throw/1</code></a>,
the old-style <code>catch</code> will also catch runtime errors. Its alternative,
<code>try ... catch ... end</code>, offers better clarity.</p>

<p>In a future release, the use of the old <code>catch</code> construct will by
default result in compiler warnings. To facilitate removing usages of
the old-style <code>catch</code>, the compiler now has an option
<code>warn_deprecated_catch</code>. It can be enabled at the project level or the
module level to prevent new uses of the old-style catch.</p>

<p>If you have added <code>warn_deprecated_catch</code> at the project level, the
warning can be suppressed in individual modules that have not yet been
updated by adding <code>-compile(nowarn_deprecated_catch)</code> to them.</p>
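<p>For example, the option can be enabled on the command line or per
module (a sketch):</p>

<pre><code class="language-erlang">%% On the command line:
%%     erlc +warn_deprecated_catch *.erl

%% In a module that should warn for old-style catch:
-compile(warn_deprecated_catch).

%% In a module that has not yet been updated:
-compile(nowarn_deprecated_catch).
</code></pre>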

<p>Here are some common uses of the old-style <code>catch Expr</code>. We will show how
to replace them with <code>try ... catch ... end</code> and briefly explain why that is
a better solution.</p>

<p><em>Example 1</em>: Using <code>catch Expr</code> to handle a possible <code>throw</code></p>

<p><a href="https://www.erlang.org/doc/apps/erts/erlang.html#throw/1"><code>throw/1</code></a> is
often used to quickly return from a deep recursion. If <code>tree_walker/1</code> is
a function that traverses a tree and sometimes throws a value, the old-style
catch could be used like this:</p>

<pre><code class="language-erlang">Result = catch maybe_throw().
</code></pre>

<p>It can be refactored to the following code:</p>

<pre><code class="language-erlang">Result = try tree_walker(Tree) of
             Value -&gt; Value
         catch
             throw:Reason -&gt; Reason
         end.
</code></pre>

<p>This is a bit longer, but it is also safer. For example, if the caller
of <code>tree_walker/1</code> passes in an invalid tree (such as <code>not_a_tree</code>),
the <code>try/catch</code> will not catch the resulting crash, allowing the
bug to be noticed and fixed early.</p>

<p>To have the same assurance that crashes are not hidden when using the
old-style <code>catch</code>, you would have to write the following, which is as
much code as the new <code>try</code>/<code>catch</code>:</p>

<pre><code class="language-erlang">Result = case catch tree_walker(Tree) of
            {'EXIT',Error} -&gt;
                 error(Error);
            Value -&gt;
                 Value
         end.
</code></pre>

<p><em>Example 2</em>: Using <code>catch Expr</code> to match a specific error in a test case</p>

<pre><code class="language-erlang">test_bad_argument(Term) -&gt;
    {'EXIT',{badarg,_}} = (catch list_to_atom(Term)).
</code></pre>

<p>It can be refactored to the following code:</p>

<pre><code class="language-erlang">test_bad_argument(Term) -&gt;
    try list_to_atom(Term) of
        _Value -&gt; error(not_supposed_to_succeed)
    catch
        error:badarg -&gt; ok
    end.
</code></pre>

<p>An easier way is to include the following header file:</p>

<pre><code class="language-erlang">-include_lib("stdlib/include/assert.hrl").
</code></pre>

<p>With that in place, you can simply write:</p>

<pre><code class="language-erlang">test_bad_argument(Term) -&gt;
    ?assertError(badarg, list_to_atom(Term)).
</code></pre>

<p>That will also result in more information being given if the test case
fails:</p>

<pre><code class="language-erlang">1&gt; t:test_bad_argument("ok").
** exception error: {assertException,[{module,t},
                                      {line,6},
                                      {expression,"list_to_atom ( Term )"},
                                      {pattern,"{ error , badarg , [...] }"},
                                      {unexpected_success,ok}]}
     in function  t:test_bad_argument/1 (t.erl:6)
</code></pre>

<p>It is likely that the compiler will start generating warnings for the
old-style <code>catch</code> in Erlang/OTP 29 or 30. If you are still using the
old-style <code>catch Expr</code> in your code, now is a good time to start
refactoring.</p>

<h1 id="based-floating-point-literals">Based Floating Point Literals</h1>

<p>Erlang/OTP 28 extends the floating point syntax to support floating point
literals in any base, similar to Ada and C99/C++17. This is based on
<a href="https://www.erlang.org/eeps/eep-0075">EEP-75</a>.</p>

<p>In Erlang, you can already write integers in different bases:</p>

<pre><code class="language-erlang">1&gt; 2#100.
4
2&gt; 3#100.
9
</code></pre>

<p>Now, you can do the same with floating point numbers:</p>

<pre><code class="language-erlang">1&gt; 2#0.011.
0.375
2&gt; 3#0.011.
0.14814814814814814
3&gt; 16#0.011#e5.
4352.0
</code></pre>

<p>Such an exact representation of floating point numbers is especially useful
in code-generating tools. With only the base 10 representation, it is difficult
to convert floats from and to other bases without precision loss. With
based literals, you can even preserve bit-level precision. For example,
<code>2#0.10101#e8</code> represents the exact layout of a binary float.</p>
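<p>To check the arithmetic: <code>2#0.10101</code> is 0.65625, and the exponent
scales it by 2^8 = 256:</p>

<pre><code class="language-erlang">4&gt; 2#0.10101#e8.
168.0
</code></pre>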

<h1 id="pcre2">PCRE2</h1>

<p>Erlang/OTP 28 uplifts the <a href="https://www.erlang.org/doc/apps/stdlib/re.html"><code>re</code></a>
module to use PCRE2 instead of the PCRE library. This change is mostly
backward compatible with PCRE with respect to regular expression syntax,
but it also introduces some behavioral differences.</p>

<p>The full documentation about breaking changes and incompatibilities can
be found in <a href="https://www.erlang.org/doc/apps/stdlib/re_incompat.html">PCRE2 Migration</a>.</p>

<h2 id="why-pcre2-instead-of-pcre">Why PCRE2 instead of PCRE?</h2>

<p>PCRE2 is more in line with modern standards, especially Perl regular
expressions. It is stricter about pattern syntax and catches invalid patterns
early. This makes your regex code safer, at the cost of breaking some old
regex patterns.</p>

<h2 id="notable-changes">Notable Changes:</h2>

<ul>
  <li>Stricter Syntax Validation: For example, <code>\i</code>, <code>\M</code>, and <code>\8</code> all result
in errors.</li>
</ul>

<pre><code class="language-erlang">% Erlang/OTP 27
1&gt; re:run("AMM", ~S"\M").
{match,[{1,1}]}

% Erlang/OTP 28
1&gt; re:run("AMM", ~S"\M").
** exception error: bad argument
     in function  re:run/2
        called as re:run("AMM",~S"\M")
        *** argument 2: could not parse regular expression
                        unrecognized character follows \ on character 1
</code></pre>

<ul>
  <li>
    <p>Unicode Property Updates: Characters matched by properties using <code>\p{...}</code>
may have changed, according to the updated Unicode character property data.</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/apps/stdlib/re.html#split/3"><code>re:split/3</code></a>
with Branch Reset Groups (<code>(?|...)</code>): The following example may evaluate to
<code>[[],"abc",[],[]]</code> in some interpretations of PCRE and Perl versions,
differing from PCRE2’s result.</p>
  </li>
</ul>

<pre><code class="language-erlang">1&gt; re:split("abcabc", ~S"(?|(abc)|(xyz))\1", [{return, list}]).
[[],"abc",[]]
</code></pre>

<p>It is worth noting that the internal format produced by <a href="https://www.erlang.org/doc/apps/stdlib/re.html#compile/2"><code>re:compile/2</code></a>
has changed in Erlang/OTP 28. It cannot be reused across nodes or OTP versions.</p>
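<p>If compiled patterns are cached or stored, a safe approach is to keep the
pattern source and recompile it locally (a minimal sketch):</p>

<pre><code class="language-erlang">%% Store or transmit the pattern source, not the term
%% returned by re:compile/1.
Source = "[a-z]+",
{ok, MP} = re:compile(Source),
{match, _} = re:run("hello", MP).
</code></pre>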

<p>This upgrade offers better long-term maintainability, but you may need to
test your existing regex code before upgrading.</p>

<h1 id="optimizations-to-tls-13">Optimizations to TLS 1.3</h1>

<p>The performance of <a href="https://www.erlang.org/doc/apps/ssl/ssl.html">SSL</a>
with TLS 1.3 has been optimized. The optimization reduces the general
overhead for application data transmission. To measure the improvement
from Erlang/OTP 27.1 to Erlang/OTP 28, we ran a small message echo benchmark
and measured the time for roundtrips.</p>

<p>Results are shown in the following table:</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Samples</th>
      <th>Average</th>
      <th>Std Dev</th>
      <th>Median</th>
      <th>P99</th>
      <th>Time per Iteration</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Erlang/OTP 28</td>
      <td>25</td>
      <td>65186</td>
      <td>5.87%</td>
      <td>66828</td>
      <td>68749</td>
      <td>38352 ns</td>
    </tr>
    <tr>
      <td>Erlang/OTP 27</td>
      <td>25</td>
      <td>51730</td>
      <td>4.64%</td>
      <td>51418</td>
      <td>57296</td>
      <td>48328 ns</td>
    </tr>
  </tbody>
</table>

<p>In general, you can expect a 15% - 25% speed-up in Erlang/OTP 28 if you
are using TLS 1.3. No changes are needed in your code. If your application
uses TLS 1.3, this is a good reason to upgrade to Erlang/OTP 28.</p>

<h1 id="nominal-types">Nominal Types</h1>

<p>Nominal type-checking as described in <a href="https://www.erlang.org/eeps/eep-0069">EEP 69</a>
adds an alternative type system to Dialyzer. Nominal types can be declared
using the syntax <code>-nominal</code>. The main use case of nominal types is to prevent
accidental misuse of types with the same structure.</p>

<p>To start with, we can declare two nominal types <code>meter/0</code> and <code>foot/0</code>
like the following:</p>

<pre><code class="language-erlang">-nominal meter() :: integer().
-nominal foot() :: integer().
</code></pre>

<p>Because <code>meter/0</code> and <code>foot/0</code> have different names and they are both nominal
types, they are not compatible. Dialyzer performs nominal type-checking
on input and output types of functions and specifications. For example,
we can define functions <code>int_to_meter/1</code> and <code>foo/0</code> like the following:</p>

<pre><code class="language-erlang">-spec int_to_meter(integer()) -&gt; meter().
int_to_meter(X) -&gt; X.

-spec foo() -&gt; foot().
foo() -&gt; int_to_meter(24).
</code></pre>

<p>The specification of <code>int_to_meter/1</code> declares the function’s return type
to be <code>meter()</code>, so the result of <code>int_to_meter(24)</code> has type <code>meter()</code>.
However, the specification of <code>foo/0</code> declares the function’s return type
to be <code>foot()</code>. The two nominal types are not compatible. Therefore, Dialyzer
raises the following warning for our example:</p>

<pre><code class="language-erlang">Invalid type specification for function foo/0.
The success typing is foo() -&gt; (meter() :: integer())
But the spec is foo() -&gt; foot()
The return types do not overlap
</code></pre>

<p>On the other hand, a nominal type is compatible with a non-opaque, non-nominal
type with the same structure. We can define the function <code>return_integer/0</code>
like this:</p>

<pre><code class="language-erlang">-spec return_integer() -&gt; integer().
return_integer() -&gt; int_to_meter(24).
</code></pre>

<p>The specification says that <code>return_integer/0</code> returns an <code>integer()</code>.
The result of <code>int_to_meter(24)</code> has type <code>meter()</code>, so
<code>return_integer/0</code> actually returns a <code>meter()</code>. Since <code>integer()</code>
is not a nominal type and the structure of <code>meter()</code> is compatible
with <code>integer()</code>, Dialyzer can analyze the function above without
raising a warning.</p>

<p>There are exceptions to the nominal type-checking rules shown above. For more
details, see <a href="https://www.erlang.org/doc/system/nominals.html">Nominals</a> in the
reference manual.</p>

<h1 id="new-emacs-erlang-mode">New Emacs Erlang Mode</h1>

<p>Although this is not included in the Erlang/OTP 28 release, members of the OTP
team are developing a new Emacs Erlang mode using tree-sitter. If you are an
Emacs user, you can get it from <a href="https://github.com/erlang/emacs-erlang-ts">GitHub</a>
or <a href="https://melpa.org/#/erlang-ts">MELPA</a> and try it out.</p>

<p>The new Erlang mode handles strings and documentation a lot better than the
old one. See the screenshot below for an example:</p>

<p><img src="/blog/images/28-emacs.png" alt="Source Code of `ssl:send/2` in the New Emacs Mode" /></p>

<p>If you are interested in contributing to this project, all help is appreciated.</p>]]></content><author><name>Isabell Huang</name></author><category term="erlang" /><category term="otp" /><category term="28" /><category term="release" /><summary type="html"><![CDATA[Erlang/OTP 28 is finally here. This blog post will introduce the new features that we are most excited about.]]></summary></entry><entry><title type="html">Erlang/OTP 27 Highlights</title><link href="https://www.erlang.org/blog/highlights-otp-27/" rel="alternate" type="text/html" title="Erlang/OTP 27 Highlights" /><published>2024-05-20T00:00:00+00:00</published><updated>2024-05-20T00:00:00+00:00</updated><id>https://www.erlang.org/blog/highlights-otp-27</id><content type="html" xml:base="https://www.erlang.org/blog/highlights-otp-27/"><![CDATA[<p>Erlang/OTP 27 is finally here. This blog post will introduce the new
features that we are most excited about.</p>

<p>A list of all changes is found in <a href="https://erlang.org/patches/OTP-27.0">Erlang/OTP 27 Readme</a>.
Or, as always, look at the release notes of the application you are interested in.
For instance:
<a href="https://www.erlang.org/doc/apps/erts/notes.html#erts-15.0">Erlang/OTP 27 - Erts Release Notes - Version 15.0</a>.</p>

<p>This year’s highlights mentioned in this blog post are:</p>

<ul>
  <li><a href="#overhauled-documentation-system">Overhauled documentation system</a></li>
  <li><a href="#triple-quoted-strings">Triple-Quoted strings</a></li>
  <li><a href="#sigils">Sigils</a></li>
  <li><a href="#no-need-to-enable-feature-maybe">No need to enable feature <code>maybe</code></a></li>
  <li><a href="#the-new-json-module">The new <code>json</code> module</a></li>
  <li><a href="#process-labels">Process labels</a></li>
  <li><a href="#new-functionality-in-stdlib">New functionality in STDLIB</a></li>
  <li><a href="#new-ssl-client-side-stapling-support">New SSL client-side stapling support</a></li>
  <li><a href="#tprof-yet-another-profiling-tool"><code>tprof</code>: Yet another profiling tool</a></li>
  <li><a href="#multiple-trace-sessions">Multiple trace sessions</a></li>
  <li><a href="#native-coverage-support">Native coverage support</a></li>
  <li><a href="#deprecating-archives">Deprecating archives</a></li>
</ul>

<h1 id="overhauled-documentation-system">Overhauled documentation system</h1>

<p>The Erlang/OTP documentation before Erlang/OTP 27 was authored in
<a href="https://en.wikipedia.org/wiki/XML">XML</a>, from which the
<a href="https://www.erlang.org/docs/26/apps/erl_docgen/">Erl_Docgen</a>
application could generate HTML web pages, PDFs, or Unix man pages.
The reason for generating PDFs is that the documentation used to be
printed as
<a href="https://erlangforums.com/t/old-printed-otp-documentation-cover/1989/2">actual paper books</a>.
The last time the books were printed was for Erlang/OTP R7, released in 2000.</p>

<p>As an example, here is the XML code for
<a href="https://www.erlang.org/docs/26/man/lists#duplicate-2"><code>lists:duplicate/2</code></a>
from Erlang/OTP 26:</p>

<pre><code class="language-xml">    &lt;func&gt;
      &lt;name name="duplicate" arity="2" since=""/&gt;
      &lt;fsummary&gt;Make &lt;c&gt;N&lt;/c&gt; copies of element.&lt;/fsummary&gt;
      &lt;desc&gt;
        &lt;p&gt;Returns a list containing &lt;c&gt;&lt;anno&gt;N&lt;/anno&gt;&lt;/c&gt; copies of term
          &lt;c&gt;&lt;anno&gt;Elem&lt;/anno&gt;&lt;/c&gt;.&lt;/p&gt;
        &lt;p&gt;&lt;em&gt;Example:&lt;/em&gt;&lt;/p&gt;
        &lt;pre&gt;
&gt; &lt;input&gt;lists:duplicate(5, xx).&lt;/input&gt;
[xx,xx,xx,xx,xx]&lt;/pre&gt;
      &lt;/desc&gt;
    &lt;/func&gt;
</code></pre>

<p>The XML code was stored in separate files, not in the source
code. When building the documentation, the function specs from the
source code would be combined with the text from the documentation
file. It was the responsibility of the writer to ensure that variables
mentioned in the documentation body matched the names in the function
spec.</p>

<p>One thing never said about Erl_Docgen and the old documentation system
was that it made writing documentation enjoyable and effortless. That
was one thing we wanted to change with the new documentation system.
We wanted to make it fun to write documentation, or at least to
require less attention to tedious details such as using XML tags
correctly.</p>

<p>In Erlang/OTP 27, the documentation is written in
<a href="https://en.wikipedia.org/wiki/Markdown">Markdown</a> and is placed in
the source code before the function spec and implementation. Here is
the documentation and implementation of
<a href="https://www.erlang.org/doc/man/lists#duplicate/2"><code>lists:duplicate/2</code></a>
in Erlang/OTP 27:</p>

<pre><code>-doc """
Returns a list containing `N` copies of term `Elem`.

_Example:_

```erlang
&gt; lists:duplicate(5, xx).
[xx,xx,xx,xx,xx]
```
""".

-spec duplicate(N, Elem) -&gt; List when
      N :: non_neg_integer(),
      Elem :: T,
      List :: [T],
      T :: term().

duplicate(N, X) when is_integer(N), N &gt;= 0 -&gt; duplicate(N, X, []).

duplicate(0, _, L) -&gt; L;
duplicate(N, X, L) -&gt; duplicate(N-1, X, [X|L]).
</code></pre>

<p>The documentation is placed in a
<a href="#triple-quoted-strings">triple-quoted string</a>
following
the <a href="https://www.erlang.org/eeps/eep-0059"><code>-doc</code> attribute</a>.</p>

<p>Having the documentation near the spec makes it easy to ensure that
the text refers to variables defined in the function spec.</p>

<p>Another goal we had was to replace Erl_Docgen with a tool more widely
used so that we wouldn’t have to carry the entire burden for
maintaining it. We did that by using
<a href="https://hexdocs.pm/ex_doc/readme.html">ExDoc</a>, which is also used by
the <a href="https://elixir-lang.org">Elixir</a> language and most, if not all,
Elixir projects.</p>

<p>An issue that arose is whether it’s advisable to include user
documentation within the source code. Wouldn’t this make it much harder
to maintain the code?</p>

<p>I don’t claim to have a universal response to that concern, but in the
case of Erlang/OTP, most actively developed code exists within modules
lacking documentation. Typically, OTP applications consist of one or
a few modules containing the documented API, while the bulk of the
implementation is found in other modules.</p>

<p>For example, the interface to the Erlang compiler is found in the
<a href="https://www.erlang.org/doc/man/compile">compile</a> module, while most
of the code being executed resides in one of the other 59 modules
of the Compiler application. Similarly, the <a href="https://www.erlang.org/doc/apps/ssl">SSL
application</a> comprises 76 modules,
of which merely four contain documentation.</p>

<p>Another application that is frequently updated is
<a href="https://www.erlang.org/doc/apps/erts">ERTS</a>. However, most of ERTS is
implemented in C (and some C++), while much of the actual
Erlang code within ERTS is located in modules without documentation.</p>

<p>There are, of course, some exceptions to how applications are
structured, for example the STDLIB application, where most modules are
documented. However, STDLIB is a mature application that is updated
relatively infrequently.</p>

<h1 id="triple-quoted-strings">Triple-Quoted strings</h1>

<p>To facilitate writing documentation attributes containing many lines
of text, triple-quoted strings as described in <a href="https://www.erlang.org/eeps/eep-0064">EEP
64</a> have been
implemented. Triple-quoted strings come in handy whenever one needs
to include multiple lines of text in Erlang source code. For example,
assume that we want to define a function that outputs some
quotations:</p>

<pre><code>1&gt; t:quotes().
"I always have a quotation for everything -
it saves original thinking." - Dorothy L. Sayers

"Real stupidity beats artificial intelligence every time."
- Terry Pratchett
ok
</code></pre>

<p>In Erlang/OTP 26, there are several different ways to do that, but none
of them is particularly satisfying. For example, the text can be put into a
single string:</p>

<pre><code class="language-erlang">quotes() -&gt;
    S = "\"I always have a quotation for everything -
it saves original thinking.\" - Dorothy L. Sayers

\"Real stupidity beats artificial intelligence every time.\"
- Terry Pratchett\n",
    io:put_chars(S).
</code></pre>

<p>This works, but is ugly. We must also remember to escape every quote
character.</p>

<p>A cleaner way is to use multiple strings, one for each line, letting
the compiler combine them:</p>

<pre><code class="language-erlang">quotes() -&gt;
    S = "\"I always have a quotation for everything -\n"
        "it saves original thinking.\" - Dorothy L. Sayers\n"
        "\n"
        "\"Real stupidity beats artificial intelligence every time.\"\n"
        "- Terry Pratchett\n",
    io:put_chars(S).
</code></pre>

<p>That is a little bit nicer, but we’ll need to type more quote characters
and we must not forget to add <code>\n</code> at the end of each string. To
make sure that we don’t forget to insert the newlines, we could delegate
that mundane chore to the computer:</p>

<pre><code class="language-erlang">quotes() -&gt;
    S = ["\"I always have a quotation for everything -",
         "it saves original thinking.\" - Dorothy L. Sayers",
         "",
         "\"Real stupidity beats artificial intelligence every time.\"",
         "- Terry Pratchett"],
    io:put_chars(lists:join("\n", S)),
    io:nl().
</code></pre>

<p>In Erlang/OTP 27, we can use a triple-quoted string:</p>

<pre><code>quotes() -&gt;
    S = """
        "I always have a quotation for everything -
        it saves original thinking." - Dorothy L. Sayers

        "Real stupidity beats artificial intelligence every time."
        - Terry Pratchett
        """,
    io:put_chars(S),
    io:nl().
</code></pre>

<p>The ending <code>"""</code> determines how much each line in the string should be
indented. The same characters that precede <code>"""</code> are deleted from all
lines between the beginning and terminating delimiters. For this
particular example, all space characters are removed since all have
the same indentation as the terminating <code>"""</code>.  Neither quote
characters nor backslashes are special in the lines enclosed by the
triple-quotes, so there is no need to escape anything.</p>

<p>Here is another example to show the versatility of triple-quoted
strings:</p>

<pre><code>effect_warning() -&gt;
    """
    f() -&gt;
        %% Test that the compiler warns for useless tuple building.
        {a,b,c},
        ok.
    """.
</code></pre>

<p>The function returns a string containing a short Erlang function.</p>

<p>Assuming that <code>effect_warning/0</code> is defined in module <code>t</code>, it can be
called like so:</p>

<pre><code>1&gt; io:format("~ts\n", [t:effect_warning()]).
f() -&gt;
    %% Test that the compiler warns for useless tuple building.
    {a,b,c},
    ok.
</code></pre>

<p>Note that indentation of the Erlang code for function <code>f/0</code> is retained.</p>

<p>For more information, see section <a href="https://www.erlang.org/doc/reference_manual/data_types#string">String</a>
in the Reference Manual.</p>

<h1 id="sigils">Sigils</h1>

<p>Sigils for string literals as described in <a href="https://www.erlang.org/eeps/eep-0066">EEP 66</a>
have been implemented.</p>

<p>Continuing with the theme of quotes, let’s explore why sigils were
introduced into Erlang, drawing inspiration from the wisdom of ancient
Greek philosophers:</p>

<pre><code class="language-erlang">1&gt; t:greek_quote().
"Know thyself" (Greek: Γνῶθι σαυτόν)
ok
</code></pre>

<p>In Erlang/OTP 26, this can be implemented as follows:</p>

<pre><code class="language-erlang">greek_quote() -&gt;
    S = "\"Know thyself\" (Greek: Γνῶθι σαυτόν)",
    io:format("~ts\n", [S]).
</code></pre>

<p>At this point, we get some customer feedback indicating that the
modules containing all the quotes are consuming an excessive amount of
memory. Each character in a string consumes 16 bytes of memory (on a
64-bit computer). That could be reduced to one byte for each character
if a binary were to be used instead of a string.  (Actually, one byte
for each US ASCII character and two bytes for each Greek letter.)</p>

<p>That change should be really easy. Let’s try:</p>

<pre><code class="language-erlang">greek_quote() -&gt;
    S = &lt;&lt;"\"Know thyself\" (Greek: Γνῶθι σαυτόν)"&gt;&gt;,
    io:format("~ts\n", [S]).
</code></pre>

<p>That works for the English text, but not for the Greek characters:</p>

<pre><code class="language-erlang">2&gt; t:greek_quote().
"Know thyself" (Greek: ½ö¸¹ Ã±ÅÄÌ½)
</code></pre>

<p>What’s wrong?</p>

<p>Strings in binary expressions are by default assumed to be sequences
of byte-sized characters. Therefore, this expression:</p>

<pre><code class="language-erlang">1&gt; &lt;&lt;"Γνῶθι"&gt;&gt;.
&lt;&lt;147,189,246,184,185&gt;&gt;
</code></pre>

<p>is <a href="https://en.wikipedia.org/wiki/Syntactic_sugar">syntactic sugar</a> for:</p>

<pre><code class="language-erlang">2&gt; &lt;&lt;$Γ:8, $ν:8, $ῶ:8, $θ:8, $ι:8&gt;&gt;.
&lt;&lt;147,189,246,184,185&gt;&gt;
</code></pre>

<p>It is necessary to specify that the characters are to be encoded as
<a href="https://en.wikipedia.org/wiki/UTF-8">UTF-8</a>
encoded characters by appending an <code>/utf8</code> suffix:</p>

<pre><code class="language-erlang">greek_quote() -&gt;
    S = &lt;&lt;"\"Know thyself\" (Greek: Γνῶθι σαυτόν)"/utf8&gt;&gt;,
    io:format("~ts\n", [S]).
</code></pre>

<p>That works because <code>&lt;&lt;"Γνῶθι"/utf8&gt;&gt;</code> is syntactic sugar for
<code>&lt;&lt;$Γ/utf8, $ν/utf8, $ῶ/utf8, $θ/utf8, $ι/utf8&gt;&gt;</code>.</p>

<p>Enter sigils.</p>

<pre><code>greek_quote() -&gt;
    S = ~B["Know thyself" (Greek: Γνῶθι σαυτόν)],
    io:format("~ts\n", [S]).
</code></pre>

<p>The <code>~</code> character begins a sigil. It is usually followed by a letter that
indicates how the characters in the string should be interpreted or encoded.</p>

<p>In this case the character <code>B</code> means that the characters should be put into a binary in UTF-8 encoding,
and also that no escape characters are allowed.</p>

<p>After <code>B</code> follows the start delimiter, in this case <code>[</code>.  Since no escape characters
are allowed, it is necessary to choose delimiters that don’t occur in the string
contents. After the contents follows the end delimiter, in this case <code>]</code>.</p>

<p><code>~b</code> creates a binary in the same way as <code>~B</code>, except that backslashes
will be interpreted as escape characters. This can be useful if one
wants to insert control characters such as TAB (<code>\t</code>) into a binary:</p>

<pre><code>1&gt; ~b"abc\txyz".
&lt;&lt;"abc\txyz"&gt;&gt;
</code></pre>

<p>Here we used the <code>"</code> character as delimiters as it is not used within
the string.</p>

<p>If we omit the letter after <code>~</code>, we will get the same result:</p>

<pre><code>2&gt; ~"abc\txyz".
&lt;&lt;"abc\txyz"&gt;&gt;
</code></pre>

<p>The default sigil (no letter following <code>~</code>) creates a binary, just
like <code>~b</code> and <code>~B</code>, but whether escape characters are interpreted
depends on the form of the string. Triple-quoted strings do not by
default interpret escape sequences such as <code>\n</code>, but plain inline
strings do, so <code>~"abc\ndef"</code> works as you might expect, and you can
always prefix an existing string like <code>"abc\ndef"</code> with a <code>~</code> to turn
it into a binary without fear of changing its content.</p>
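<p>A minimal sketch of the difference (the comments show the resulting
binaries):</p>

<pre><code class="language-erlang">S1 = ~"abc\ndef",   %% escapes interpreted: &lt;&lt;"abc\ndef"&gt;&gt;
S2 = ~"""
     abc\ndef
     """.           %% escapes not interpreted: &lt;&lt;"abc\\ndef"&gt;&gt;
</code></pre>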

<p>Returning to the quotations example from the previous section, let’s
see how a binary literal can be created by inserting <code>~</code> before the
leading <code>"""</code>:</p>

<pre><code>quotes() -&gt;
    S = ~"""
         "I always have a quotation for everything -
         it saves original thinking." - Dorothy L. Sayers

         "Real stupidity beats artificial intelligence every time."
         - Terry Pratchett
         """,
    io:put_chars(S),
    io:nl().
</code></pre>

<p>For a triple-quoted string, the default sigil and <code>~B</code> always produce
the same binary. The <code>~b</code> sigil can be used when escape characters
must be supported.</p>

<p><code>~s</code> creates a string in the usual way. The only useful way it differs
from a plain quoted string is that the delimiters can be switched. That
way, one can avoid the hassle of escaping quote characters and still
get to use control characters such as TAB:</p>

<pre><code>3&gt; ~s{"abc\txyz"}.
"\"abc\txyz\""
</code></pre>

<p>Used for a triple-quoted string it enables the use of escape characters:</p>

<pre><code>4&gt; ~s"""
    \tabc
    \tdef
    """.
"\tabc\n\tdef"
</code></pre>

<p><code>~S</code> creates a string, but does not support escaping of characters
within the string, similar to <code>~B</code>.</p>

<p>For more information, see section <a href="https://www.erlang.org/doc/reference_manual/data_types#sigil">Sigil</a>
in the Reference Manual.</p>

<p>(<strong>UPDATE</strong>: The description of the default sigil has been corrected. Thanks
to Richard Carlsson for pointing out this error.)</p>

<h1 id="no-need-to-enable-feature-maybe">No need to enable feature <code>maybe</code></h1>

<p>The <a href="https://www.erlang.org/doc/reference_manual/expressions#maybe">maybe expression</a>
was introduced as a <a href="https://www.erlang.org/doc/reference_manual/features.html">feature</a>
in Erlang/OTP 25. In that release, it was necessary to enable it both in
the compiler and the runtime system.</p>

<p>Erlang/OTP 26 lifted the necessity to enable <code>maybe</code> in the runtime system.</p>

<p>Now in Erlang/OTP 27, <code>maybe</code> is enabled by default in the compiler.
In the example from
<a href="https://www.erlang.org/blog/otp-26-highlights/#no-need-to-enable-feature-maybe-in-the-runtime-system">last year’s blog post</a>,
the line <code>-feature(maybe_expr, enable).</code> can now be removed:</p>

<pre><code>$ cat t.erl
-module(t).
-export([listen_port/2]).
listen_port(Port, Options) -&gt;
    maybe
        {ok, ListenSocket} ?= inet_tcp:listen(Port, Options),
        {ok, Address} ?= inet:sockname(ListenSocket),
        {ok, {ListenSocket, Address}}
    end.
$ erlc t.erl
$ erl
Erlang/OTP 27 . . .

Eshell V15.0  (abort with ^G)
1&gt; t:listen_port(50000, []).
{ok,{#Port&lt;0.5&gt;,{{0,0,0,0},50000}}}
</code></pre>

<p>When <code>maybe</code> is used as an atom, it needs to be quoted. For example:</p>

<pre><code class="language-erlang">will_succeed(. . .) -&gt; yes;
will_succeed(. . .) -&gt; no;
   .
   .
   .
will_succeed(_) -&gt; 'maybe'.
</code></pre>

<p>Alternatively, it is still possible to disable the <code>maybe_expr</code> feature. With
the feature disabled, <code>maybe</code> can be used as an atom without quotes.</p>

<p>One way to disable <code>maybe</code> is to use the <code>-disable-feature</code> option when compiling.
For example:</p>

<pre><code>erlc -disable-feature maybe_expr *.erl
</code></pre>

<p>Another way to disable <code>maybe</code> is to add the following directive to
the source code:</p>

<pre><code>-feature(maybe_expr, disable).
</code></pre>

<h1 id="the-new-json-module">The new <code>json</code> module</h1>

<p>There is a new module <a href="https://www.erlang.org/doc/man/json"><code>json</code></a> in
STDLIB for generating and parsing
<a href="https://en.wikipedia.org/wiki/JSON">JSON (JavaScript Object Notation)</a>.</p>

<p>It is implemented by <a href="https://github.com/michalmuskala">Michał
Muskała</a> who has also implemented
the <a href="https://github.com/michalmuskala/jason"><code>Jason</code></a> library for
Elixir. <code>Jason</code> is known for being faster than other pure Erlang or
Elixir JSON libraries. The <code>json</code> module is not a pure translation of
the Elixir code for Jason, but a re-implementation with even better
performance than <code>Jason</code>.</p>

<p>As an example, imagine that we have this file <code>quotes.json</code> with
quotes from the film <a href="https://en.wikipedia.org/wiki/Jason_and_the_Argonauts_(1963_film)">Jason and the
Argonauts</a>:</p>

<pre><code class="language-json">[
    {"quote": "The gods are best served by those who need their help the least.",
     "attribution": "Zeus",
     "verified": true},
    {"quote": "Now the voyage is over, I don't want any trouble to begin.",
     "attribution": "Jason",
     "verified": true}
]
</code></pre>

<p>The JSON contents of the file can be decoded by calling
<a href="https://www.erlang.org/doc/man/json#decode/1">json:decode/1</a>:</p>

<pre><code class="language-erlang">1&gt; {ok,JSON} = file:read_file("quotes.json").
{ok,&lt;&lt;"[\n   {\"quote\": \"The gods are best served by those who need their help the least.\",\n    \"attribution\": \"Zeus\""...&gt;&gt;}
2&gt; json:decode(JSON).
[#{&lt;&lt;"attribution"&gt;&gt; =&gt; &lt;&lt;"Zeus"&gt;&gt;,
   &lt;&lt;"quote"&gt;&gt; =&gt;
       &lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;,
   &lt;&lt;"verified"&gt;&gt; =&gt; true},
 #{&lt;&lt;"attribution"&gt;&gt; =&gt; &lt;&lt;"Jason"&gt;&gt;,
   &lt;&lt;"quote"&gt;&gt; =&gt;
       &lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;,
   &lt;&lt;"verified"&gt;&gt; =&gt; true}]
</code></pre>

<p>By default, for safety, the keys of objects are translated to binaries. Using atoms
could open the door to
<a href="https://en.wikipedia.org/wiki/Denial-of-service_attack">denial-of-service attacks</a>
if a malicious JSON object were to define millions of unique keys.</p>

<p>For convenience, it is still possible to convert keys to atoms in
a safe way by using a <em>decoder callback</em>. Here is an example:</p>

<pre><code class="language-erlang">1&gt; Push = fun(Key, Value, Acc) -&gt; [{binary_to_existing_atom(Key), Value} | Acc] end.
#Fun&lt;erl_eval.40.39164016&gt;
</code></pre>

<p>This fun converts the key for a JSON object to an <strong>existing</strong> atom,
or raises an exception if no such atom exists.</p>

<p>Since this example is run from the shell, we’ll need to make sure that all possible keys
are known atoms:</p>

<pre><code class="language-erlang">2&gt; {quote,attribution,verified}.
{quote,attribution,verified}
</code></pre>

<p>This would normally not be necessary when JSON decoding is done in an Erlang module,
because the atoms to be used as keys would presumably be defined naturally by being used
when processing the decoded JSON objects.</p>

<p>With this preparation done, the JSON decoder can be called using the <code>Push</code> fun
as an <code>object_push</code> decoder callback:</p>

<pre><code class="language-erlang">3&gt; {Qs,_,&lt;&lt;&gt;&gt;} = json:decode(JSON, [], #{object_push =&gt; Push}), Qs.
[#{quote =&gt;
       &lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;,
   attribution =&gt; &lt;&lt;"Zeus"&gt;&gt;,verified =&gt; true},
 #{quote =&gt;
       &lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;,
   attribution =&gt; &lt;&lt;"Jason"&gt;&gt;,verified =&gt; true}]
</code></pre>

<p>The <a href="https://www.erlang.org/doc/man/json#encode/1">json:encode/1</a> function encodes
an Erlang term to JSON:</p>

<pre><code class="language-erlang">4&gt; io:format("~ts\n", [json:encode(Qs)]).
[{"quote":"The gods are best served by those who need their help the least.","attribution":"Zeus","verified":true},{"quote":"Now the voyage is over, I don't want any trouble to begin.","attribution":"Jason","verified":true}]
ok
</code></pre>

<p>The encoder accepts binaries, atoms, and integers as keys for objects,
so there is no need to customize encoding for this particular example.</p>

<p>However, when necessary, it is possible to customize the encoding. For
example, assume that we want to store each quotation in a three-tuple
instead of in a map:</p>

<pre><code class="language-erlang">1&gt; Q = [{~"The gods are best served by those who need their help the least.",
~"Zeus",true},
{~"Now the voyage is over, I don't want any trouble to begin.",
~"Jason",true}].
[{&lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;,
  &lt;&lt;"Zeus"&gt;&gt;,true},
 {&lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;,
  &lt;&lt;"Jason"&gt;&gt;,true}]
</code></pre>

<p>The <code>json:encode/1</code> function does not handle that format by default, but it can be
handled by defining an <em>encoder function</em>:</p>

<pre><code class="language-erlang">quote_encoder({Q, A, V}, Encode)
  when is_binary(Q), is_binary(A), is_boolean(V) -&gt;
    json:encode_map(#{quote =&gt; Q,
                      attribution =&gt; A,
                      verified =&gt; V},
                    Encode);
quote_encoder(Other, Encode) -&gt;
    json:encode_value(Other, Encode).
</code></pre>

<p>The first clause matches a tuple of size three that looks like a
quotation. If it matches, it is converted to the map representation
for a JSON object, which is then converted to JSON by the utility function
<a href="https://www.erlang.org/doc/man/json#encode_map/2">json:encode_map/2</a>.</p>

<p>The second clause handles all other Erlang terms by calling the
default encoding function
<a href="https://www.erlang.org/doc/man/json#encode_value/2">json:encode_value/2</a>
for converting a term to JSON.</p>

<p>Assuming that this function is defined in module <code>t</code>, the conversion to JSON
is invoked as follows:</p>

<pre><code class="language-erlang">2&gt; io:format("~ts\n", [json:encode(Q, fun t:quote_encoder/2)]).
[{"quote":"The gods are best served by those who need their help the least.","attribution":"Zeus","verified":true},{"quote":"Now the voyage is over, I don't want any trouble to begin.","attribution":"Jason","verified":true}]
</code></pre>

<p>The JSON encoder calls the callback recursively for each term. That can
be clearly seen if we modify the second clause of <code>quote_encoder/2</code> to also
print the value of <code>Other</code>:</p>
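<p>A minimal sketch of such a modified clause (assuming the printout is done
with <code>io:format/2</code>; the exact formatting is unimportant):</p>

<pre><code class="language-erlang">quote_encoder(Other, Encode) -&gt;
    io:format("-- ~p\n", [Other]),
    json:encode_value(Other, Encode).
</code></pre>

<p>With that change, the printout looks like this:</p>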

<pre><code class="language-erlang">3&gt; json:encode(Q, fun t:quote_encoder/2), ok.
-- [{&lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;,
     &lt;&lt;"Zeus"&gt;&gt;,true},
    {&lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;,
     &lt;&lt;"Jason"&gt;&gt;,true}]
-- &lt;&lt;"quote"&gt;&gt;
-- &lt;&lt;"The gods are best served by those who need their help the least."&gt;&gt;
-- &lt;&lt;"attribution"&gt;&gt;
-- &lt;&lt;"Zeus"&gt;&gt;
-- &lt;&lt;"verified"&gt;&gt;
-- true
-- &lt;&lt;"quote"&gt;&gt;
-- &lt;&lt;"Now the voyage is over, I don't want any trouble to begin."&gt;&gt;
-- &lt;&lt;"attribution"&gt;&gt;
-- &lt;&lt;"Jason"&gt;&gt;
-- &lt;&lt;"verified"&gt;&gt;
-- true
</code></pre>

<h1 id="process-labels">Process labels</h1>

<p>As a help for debugging and observability in general, labels can now
be set on non-registered processes using
<a href="https://www.erlang.org/doc/man/proc_lib#set_label/1"><code>proc_lib:set_label/1</code></a>.</p>

<p>The label is an arbitrary term. It is shown by the shell
command <code>i/0</code> and by <a href="https://www.erlang.org/doc/man/observer"><code>observer</code></a>.
Labels can also be found in the dictionary section of a
<a href="https://www.erlang.org/doc/man/crashdump_viewer">crash dump</a>.</p>

<p>Here is an example where five labeled quote-handler processes are started and
inspected:</p>

<pre><code class="language-erlang">1&gt; F = fun(I) -&gt;
   spawn_link(fun() -&gt;
     proc_lib:set_label({quote_handler, I}),
     receive _ -&gt; ok end
   end)
   end.
#Fun&lt;erl_eval.42.39164016&gt;
2&gt; Ps = [F(I) || I &lt;- lists:seq(1, 5)].
[&lt;0.91.0&gt;,&lt;0.92.0&gt;,&lt;0.93.0&gt;,&lt;0.94.0&gt;,&lt;0.95.0&gt;]
3&gt; proc_lib:get_label(hd(Ps)).
{quote_handler,1}
4&gt; i().
Pid                   Initial Call                          Heap     Reds Msgs
Registered            Current Function                     Stack
&lt;0.0.0&gt;               erl_init:start/2                       987     5347    0
init                  init:loop/1                              2
   .
   .
   .
{quote_handler,1}     prim_eval:'receive'/2                    9
&lt;0.92.0&gt;              erlang:apply/2                         233     4006    0
{quote_handler,2}     prim_eval:'receive'/2                    9
&lt;0.93.0&gt;              erlang:apply/2                         233     4006    0
{quote_handler,3}     prim_eval:'receive'/2                    9
&lt;0.94.0&gt;              erlang:apply/2                         233     4006    0
{quote_handler,4}     prim_eval:'receive'/2                    9
&lt;0.95.0&gt;              erlang:apply/2                         233     4006    0
{quote_handler,5}     prim_eval:'receive'/2                    9
Total                                                     642876  1156835    0
                                                             438
ok
</code></pre>

<p>The SSH and SSL applications have been updated to label the processes they
create.</p>

<h1 id="new-functionality-in-stdlib">New functionality in STDLIB</h1>

<h2 id="new-utility-functions-for-set-modules">New utility functions for set modules</h2>

<p>The three sets modules in STDLIB —
<a href="https://www.erlang.org/doc/man/sets"><code>sets</code></a>,
<a href="https://www.erlang.org/doc/man/gb_sets"><code>gb_sets</code></a>, and
<a href="https://www.erlang.org/doc/man/ordsets"><code>ordsets</code></a> —
have new functions <code>is_equal/2</code>, <code>map/2</code>, and <code>filtermap/2</code>.</p>

<p>The <code>is_equal/2</code> function is useful when one needs to find out whether two
sets contain the same elements. Comparing with <code>==</code> or <code>=:=</code> is not always
reliable. For example:</p>

<pre><code class="language-erlang">1&gt; Seq = lists:seq(1, 20, 2).
[1,3,5,7,9,11,13,15,17,19]
2&gt; gb_sets:from_list(Seq) == gb_sets:delete(10, gb_sets:from_list([10|Seq])).
false
3&gt; gb_sets:is_equal(gb_sets:from_list(Seq), gb_sets:delete(10, gb_sets:from_list([10|Seq]))).
true
</code></pre>

<p>The <code>map/2</code> function maps the elements of a set, producing a new set:</p>

<pre><code class="language-erlang">4&gt; Seq = lists:seq(1, 20, 2).
[1,3,5,7,9,11,13,15,17,19]
#Fun&lt;erl_eval.42.39164016&gt;
5&gt; ordsets:to_list(ordsets:map(fun(N) -&gt; N div 4 end, ordsets:from_list(Seq))).
[0,1,2,3,4]
</code></pre>

<p>The <code>filtermap/2</code> function can map and filter at the same time. Here is an example
showing how to multiply each integer in a set by 100 and remove non-integers:</p>

<pre><code class="language-erlang">1&gt; Mixed = [1,2,3,a,b,c].
[1,2,3,a,b,c]
2&gt; F = fun(N) when is_integer(N) -&gt; {true,N * 100};
   (_) -&gt; false
   end.
#Fun&lt;erl_eval.42.39164016&gt;
3&gt; sets:to_list(sets:filtermap(F, sets:from_list(Mixed))).
[300,200,100]
</code></pre>

<h2 id="new-timer-convenience-functions-that-take-funs">New <code>timer</code> convenience functions that take funs</h2>

<p>In Erlang/OTP 26, the functions in the
<a href="https://www.erlang.org/doc/man/timer"><code>timer</code></a> module don’t accept funs.
It is certainly possible to pass a fun in the argument list for
<a href="https://www.erlang.org/doc/man/erlang#apply/2"><code>erlang:apply/2</code></a>,
but if one makes a mistake it will only be noticed when the
timer expires:</p>

<pre><code class="language-erlang">1&gt; timer:apply_after(10, erlang, apply, [fun() -&gt; io:put_chars("now!\n") end]).
{ok,{once,#Ref&lt;0.2380540714.1485570051.86513&gt;}}
=ERROR REPORT==== 10-Apr-2024::05:56:43.894073 ===
Error in process &lt;0.109.0&gt; with exit value:
{undef,[{erlang,apply,[#Fun&lt;erl_eval.43.105768164&gt;],[]}]}
</code></pre>

<p>Here the empty argument list for the fun was forgotten. It should have been:</p>

<pre><code class="language-erlang">2&gt; timer:apply_after(10, erlang, apply, [fun() -&gt; io:put_chars("now!\n") end, []]).
{ok,{once,#Ref&lt;0.2380540714.1485570051.86522&gt;}}
now!
</code></pre>

<p>In Erlang/OTP 27, using a fun is much easier:</p>

<pre><code class="language-erlang">1&gt; timer:apply_after(10, fun() -&gt; io:put_chars("now!\n") end).
{ok,{once,#Ref&lt;0.3845681669.1215561736.51634&gt;}}
now!
</code></pre>

<p>In systems that use hot code updating, using a local fun for a long-running
timer is not ideal. The code that defines the fun could have been replaced,
and when the timer finally expires the call will fail. Therefore, it is also
possible to pass a fun together with its arguments, making it possible to
use a remote fun that will survive hot code updating:</p>

<pre><code class="language-erlang">2&gt; timer:apply_after(10, fun io:put_chars/1, ["now\n"]).
{ok,{once,#Ref&lt;0.3845681669.1215561736.51650&gt;}}
now
</code></pre>

<p>The <code>apply_interval/*</code> and <code>apply_repeatedly/*</code> functions now also accept
funs.</p>
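
<p>For example, a periodic timer using a fun could look like this (a sketch;
the interval and the printout are arbitrary):</p>

<pre><code class="language-erlang">{ok, TRef} = timer:apply_interval(1000, fun() -&gt; io:put_chars("tick\n") end),
%% ... and later, to stop the periodic timer:
timer:cancel(TRef).
</code></pre>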

<h2 id="new-ets-functions">New <code>ets</code> functions</h2>

<p>The new functions
<a href="https://www.erlang.org/doc/man/ets#first_lookup/1"><code>ets:first_lookup/1</code></a>
and
<a href="https://www.erlang.org/doc/man/ets#next_lookup/2"><code>ets:next_lookup/2</code></a>
simplify and speed up traversing an ETS table:</p>

<pre><code class="language-erlang">1&gt; T = ets:new(example, [ordered_set]).
#Ref&lt;0.1968915180.2077884419.247786&gt;
2&gt; ets:insert(T, [{I,I*I} || I &lt;- lists:seq(1, 10)]).
true
3&gt; {K1,_} = ets:first_lookup(T).
{1,[{1,1}]}
4&gt; {K2,_} = ets:next_lookup(T, K1).
{2,[{2,4}]}
5&gt; {K3,_} = ets:next_lookup(T, K2).
{3,[{3,9}]}
6&gt; {K4,_} = ets:next_lookup(T, K3).
{4,[{4,16}]}
</code></pre>

<p>Similarly,
<a href="https://www.erlang.org/doc/man/ets#last_lookup/1"><code>ets:last_lookup/1</code></a>
and
<a href="https://www.erlang.org/doc/man/ets#prev_lookup/2"><code>ets:prev_lookup/2</code></a>
can be used to traverse a table in reverse order.</p>
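
<p>Here is a sketch of how these can be used to collect all objects in
descending key order (assuming the same <code>'$end_of_table'</code> return
convention as for <code>first_lookup/next_lookup</code>):</p>

<pre><code class="language-erlang">reverse_objects(T) -&gt;
    case ets:last_lookup(T) of
        '$end_of_table' -&gt; [];
        {Key, Objs} -&gt; Objs ++ reverse_objects(T, Key)
    end.

reverse_objects(T, Key) -&gt;
    case ets:prev_lookup(T, Key) of
        '$end_of_table' -&gt; [];
        {Prev, Objs} -&gt; Objs ++ reverse_objects(T, Prev)
    end.
</code></pre>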

<p>The new function
<a href="https://www.erlang.org/doc/man/ets#update_element/4"><code>ets:update_element/4</code></a>
is similar to
<a href="https://www.erlang.org/doc/man/ets#update_element/3"><code>ets:update_element/3</code></a>,
but makes it possible to supply a default object when there is no existing
object with the given key:</p>

<pre><code class="language-erlang">1&gt; T = ets:new(example, []).
#Ref&lt;0.878413430.1983512583.205850&gt;
2&gt; ets:update_element(T, a, {2, true}, {a, true}).
true
3&gt; ets:lookup(T, a).
[{a,true}]
</code></pre>
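
<p>If an object with the key already exists, the default object is ignored
and the call behaves like <code>ets:update_element/3</code>. Continuing the
session above:</p>

<pre><code class="language-erlang">4&gt; ets:update_element(T, a, {2, false}, {a, true}).
true
5&gt; ets:lookup(T, a).
[{a,false}]
</code></pre>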

<h1 id="new-ssl-client-side-stapling-support">New SSL client-side stapling support</h1>

<p>A new feature in the SSL client in Erlang/OTP 27 is support for <a href="https://en.wikipedia.org/wiki/OCSP_stapling">OCSP
stapling</a> for easier and
faster verification of the revocation status of server
certificates.</p>

<p>With OCSP stapling, the SSL client can streamline the validation of
revocation status. Normally the client would have to query the
<a href="https://en.wikipedia.org/wiki/Certificate_authority">CA (Certificate
Authority)</a> using
<a href="https://en.wikipedia.org/wiki/Online_Certificate_Status_Protocol">OCSP (Online Certificate Status
Protocol)</a>
to ensure that the server’s certificate has not been
<a href="https://en.wikipedia.org/wiki/Certificate_revocation">revoked</a>.</p>

<p>The basic idea behind OCSP stapling is that the server itself will
proactively query the CA regarding the revocation status for its own
certificate and “staple” the time-stamped OCSP response from the CA to
the certificate. When a client connects, the server passes along
its OCSP-stapled certificate to the client. To verify the revocation
status, the client only needs to check that the OCSP response was
signed by the CA.</p>

<p>Here follows an example showing how OCSP stapling can be enabled in the
SSL client:</p>

<pre><code class="language-erlang">1&gt; ssl:start().
ok
2&gt; {ok, Socket} = ssl:connect("duckduckgo.com", 443,
                              [{cacerts, public_key:cacerts_get()},
                               {stapling, staple}]).
{ok,{sslsocket,{gen_tcp,#Port&lt;0.5&gt;,tls_connection,undefined},
               [&lt;0.122.0&gt;,&lt;0.121.0&gt;]}}
</code></pre>

<h1 id="tprof-yet-another-profiling-tool"><code>tprof</code>: Yet another profiling tool</h1>

<p>In Erlang/OTP 27, the new profiling tool
<a href="https://www.erlang.org/doc/man/tprof"><code>tprof</code></a>
joins the existing profiling tools
<a href="https://www.erlang.org/doc/man/cprof"><code>cprof</code></a>,
<a href="https://www.erlang.org/doc/man/eprof"><code>eprof</code></a>,
and <a href="https://www.erlang.org/doc/man/fprof"><code>fprof</code></a>.</p>

<p>Why introduce a new profiling tool?</p>

<p>One reason is that <code>cprof</code> and <code>eprof</code> perform similar profiling
tasks, but the naming of the API functions is different. It is quite
easy to mix up the names when running one tool after the other, and
running them after each other is not uncommon.  For example, when
trying to find a
<a href="https://en.wikipedia.org/wiki/Bottleneck_(software)">bottleneck</a> in a
complex running Erlang system, one approach is to first use
<code>cprof</code> to get a rough idea of the general part of the system where a
bottleneck could be located. After that, <code>eprof</code> is run on a limited
part of the system trying to narrow it down. Directly running <code>eprof</code>
on a large Erlang application could overload it and bring it down.</p>

<p>Using <code>tprof</code>, the same function is used for both counting calls and
measuring the time for each call. Here is how to count calls when
<code>lists:seq(1, 1000)</code> is called:</p>

<pre><code>1&gt; tprof:profile(lists, seq, [1, 1000], #{type =&gt; call_count}).
FUNCTION          CALLS  [    %]
lists:seq/2           1  [ 0.40]
lists:seq_loop/3    251  [99.60]
                         [100.0]
ok
</code></pre>

<p>Note that call counting is always done for all processes.</p>

<p>The bulk of the work for <code>lists:seq/2</code> is done in <code>lists:seq_loop/3</code>,
which was called 251 times. Since we asked for 1000 integers, we
reach the conclusion that each tail-recursive call to <code>seq_loop/3</code>
creates four list elements at once. That can be confirmed by
looking at the
<a href="https://github.com/erlang/otp/blob/ca50a5d73703f74e2eae1ca40bbe6c4f027f9f98/lib/stdlib/src/lists.erl#L409-L416">source code</a>.</p>

<p>To measure the time for each call, we only need to replace
<code>call_count</code> with <code>call_time</code>:</p>

<pre><code>2&gt; tprof:profile(lists, seq, [1, 1000], #{type =&gt; call_time}).

****** Process &lt;0.94.0&gt;  --  100.00% of total ***
FUNCTION          CALLS  TIME (μs)  PER CALL  [     %]
lists:seq/2           1          0      0.00  [  0.00]
lists:seq_loop/3    251         50      0.20  [100.00]
                                50            [ 100.0]
ok
</code></pre>

<p>Call time is only measured for the process that called
<a href="https://erlang.org/doc/man/tprof#profile/4"><code>tprof:profile/4</code></a>
and any processes spawned by that process.</p>

<p>By replacing <code>call_time</code> with <code>call_memory</code> the amount of memory consumed
by each call will be measured:</p>

<pre><code>3&gt; tprof:profile(lists, seq, [1, 1000], #{type =&gt; call_memory}).

****** Process &lt;0.97.0&gt;  --  100.00% of total ***
FUNCTION          CALLS  WORDS  PER CALL  [     %]
lists:seq_loop/3    251   2000      7.97  [100.00]
                          2000            [ 100.0]
ok
</code></pre>

<p>The total number of words created is 2000, which makes sense since each
list element needs 2 words. The number of words consumed per call is
<code>2000 / 251</code>, which is approximately 7.97, or almost 8. That also makes
sense since each tail-recursive call creates 4 list elements, or 8
words, and there are 250 such calls. The remaining call creates the final
empty list (<code>[]</code>).</p>
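
<p>The two-words-per-element figure can be double-checked with
<code>erts_debug:size/1</code>, which returns the size of a term in words:</p>

<pre><code class="language-erlang">4&gt; erts_debug:size([a]).
2
5&gt; erts_debug:size(lists:seq(1, 1000)).
2000
</code></pre>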

<p><code>call_memory</code> tracing was introduced in the runtime system in
Erlang/OTP 26, but was not exposed in any existing profiling tool
because it didn’t really fit in any of them. It made more sense to enable
support for it in a new tool.</p>

<h1 id="multiple-trace-sessions">Multiple trace sessions</h1>

<p>Tracing makes it possible to observe, debug, analyse, and measure the
performance of a running Erlang system. Over the years, numerous tools
based on tracing have been developed. In Erlang/OTP alone, several tools
leverage tracing for different purposes:</p>

<ul>
  <li>
    <p><a href="https://www.erlang.org/doc/man/dbg"><code>dbg</code></a>, <a href="https://www.erlang.org/doc/man/ttb"><code>ttb</code></a> -
general tracing tools</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/man/etop"><code>etop</code></a> - similar to <code>top</code> in Unix</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/man/eprof"><code>eprof</code></a>,
<a href="https://www.erlang.org/doc/man/cprof"><code>cprof</code></a>,
<a href="https://www.erlang.org/doc/man/fprof"><code>fprof</code></a>,
<a href="https://www.erlang.org/doc/man/tprof"><code>tprof</code></a> - profiling tools</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/apps/et"><code>et</code></a> - event tracer</p>
  </li>
  <li>
    <p><a href="https://www.erlang.org/doc/man/debugger"><code>debugger</code></a> - uses tracing
internally when evaluating <code>receive</code> expressions</p>
  </li>
</ul>

<p>In Erlang/OTP 26 and earlier, tracing had some limitations:</p>

<ul>
  <li>
    <p>There could only be a single tracer per traced process.</p>
  </li>
  <li>
<p>The configuration for which processes and functions to trace was
global within the runtime system.</p>
  </li>
</ul>

<p>Those limitations meant that different tracing tools could easily step
on each other’s toes. The treacherous part was that using multiple tracing
tools at the same time would seem to work for a while… until it didn’t.</p>

<p>In Erlang/OTP 27, multiple trace sessions can be created. Each trace
session has its own tracer process and configuration for which
processes and functions to trace.</p>

<p>To create a trace session and set up tracing, there is the new
<a href="https://www.erlang.org/doc/man/trace"><code>trace</code></a> module in the Kernel
application. Tools that set up tracing using that module will no longer
interfere with each other. Tools that use the
<a href="https://www.erlang.org/doc/man/erlang#trace/3">old API</a>
will share a single global trace session.</p>

<p>In the initial Erlang/OTP 27 release, some of the tools using tracing
have been updated to use trace sessions. Other tools will be updated in
upcoming maintenance releases.</p>

<p>We have tried to design the new API in a way that makes it relatively
easy for maintainers of external tools to migrate their code.  Apart
from the names of the functions and the first argument (the session
argument), the other arguments and their semantics are almost entirely
identical to the old API.</p>
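
<p>As a rough illustration of the similarity, here is the same call tracing
set up with both APIs (a sketch; <code>Tracer</code> and <code>Pid</code> are
placeholders):</p>

<pre><code class="language-erlang">%% Old API: settings go into the single global trace session.
erlang:trace(Pid, true, [call]),
erlang:trace_pattern({some_module, '_', '_'}, [], [local]),

%% New API: the same settings, confined to an isolated session.
Session = trace:session_create(my_session, Tracer, []),
trace:process(Session, Pid, true, [call]),
trace:function(Session, {some_module, '_', '_'}, [], [local]).
</code></pre>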

<h2 id="quick-trace-session-example">Quick trace session example</h2>

<p>Here is an example to show how the new API is used. First we’ll need
a tracer process that prints all trace messages it receives:</p>

<pre><code class="language-erlang">1&gt; Tracer = spawn(fun F() -&gt; receive M -&gt; io:format("== ~p ==\n", [M]), F() end end).
&lt;0.90.0&gt;
</code></pre>

<p>Having a tracer process, we can create a trace session:</p>

<pre><code class="language-erlang">2&gt; Session = trace:session_create(my_session, Tracer, []).
{#Ref&lt;0.179442114.3923902468.103849&gt;,{my_session,0}}
</code></pre>

<p>Next we turn on call tracing on the current process:</p>

<pre><code class="language-erlang">3&gt; trace:process(Session, self(), true, [call]).
1
</code></pre>

<p>Make sure that module <code>array</code> is loaded and trace all calls in it:</p>

<pre><code class="language-erlang">4&gt; l(array).
{module,array}
5&gt; trace:function(Session, {array,'_','_'}, [], [local]).
89
</code></pre>

<p>Next create a new array:</p>

<pre><code class="language-erlang">6&gt; array:new(10).
== {trace,&lt;0.88.0&gt;,call,{array,new,"\n"}} ==
{array,10,0,undefined,10}
== {trace,&lt;0.88.0&gt;,call,{array,new_0,[10,0,false]}} ==
== {trace,&lt;0.88.0&gt;,call,{array,new_1,["\n",0,false,undefined]}} ==
== {trace,&lt;0.88.0&gt;,call,{array,new_1,[[],10,true,undefined]}} ==
== {trace,&lt;0.88.0&gt;,call,{array,new,[10,true,undefined]}} ==
== {trace,&lt;0.88.0&gt;,call,{array,find_max,"\t\n"}} ==
</code></pre>

<p>Note that trace messages are randomly intermingled with the return value
of the call.</p>

<p>When we are done, we can destroy the session:</p>

<pre><code class="language-erlang">7&gt; trace:session_destroy(Session).
</code></pre>

<p>If we don’t destroy the session, it will be automatically destroyed when
the last reference to it goes away.</p>

<h1 id="native-coverage-support">Native coverage support</h1>

<p>The <a href="https://www.erlang.org/doc/man/cover">Cover</a> tool for determining
<a href="https://en.wikipedia.org/wiki/Code_coverage">code coverage</a> has long been
part of Erlang/OTP.</p>

<p>Traditionally, Cover collected its coverage metrics without the
help of any specialized functionality in the runtime system. To count how
many times each line in a module was executed, Cover
<a href="https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)">instrumented</a>
the abstract code for the module by inserting calls to
<a href="https://www.erlang.org/doc/man/ets#update_counter/3"><code>ets:update_counter/3</code></a>
on each executable line.</p>

<p>That worked, but the cover-instrumented Erlang code would always run
slower. How much slower depended on the nature of the code being
tested.</p>

<p>In Erlang/OTP 27, runtime systems supporting the
<a href="https://www.erlang.org/blog/a-first-look-at-the-jit/">JIT (just-in-time compiler)</a>
can now collect coverage metrics in the runtime system with minimal
performance overhead.</p>

<p>The Cover tool has been updated to automatically take advantage of
native coverage support if supported by the runtime system. When
running the test suites for most OTP applications, there is no
noticeable difference in execution time running with and without
Cover.</p>

<p>The native coverage support can also be used directly for performing
measurements that Cover cannot accomplish, such as collecting metrics
for code that is executed while the Erlang runtime system is starting.</p>

<p>Here is a quick example showing how we can collect coverage metrics
for <code>init</code>, which is the first module executed when starting up the
runtime system. First we need to instruct the runtime system to
instrument all functions in all modules with extra code to count the
number of times each function is called:</p>

<pre><code>$ bin/erl +JPcover function_counters
</code></pre>

<p>The runtime system starts normally. We can now read out the counters
for the <code>init</code> module:</p>

<pre><code>1&gt; lists:reverse(lists:keysort(2, code:get_coverage(function, init))).
[{{archive_extension,0},392},
 {{get_argument1,2},198},
 {{objfile_extension,0},101},
 {{boot_loop,2},64},
 {{request,1},55},
 {{to_strings,1},44},
 {{do_handle_msg,2},38},
 {{handle_msg,2},38},
 {{b2s,1},38},
 {{get_argument,2},33},
 {{get_argument,1},31},
 {{'-load_modules/2-lc$^0/1-0-',1},30},
 {{'-load_modules/2-lc$^1/1-2-',1},30},
 {{'-load_modules/2-lc$^2/1-3-',1},30},
 {{'-load_modules/2-lc$^3/1-4-',1},30},
 {{extract_var,2},30},
 {{'-prepare_loading_fun/0-fun-0-',3},29},
 {{eval_script,2},23},
 {{append,1},18},
 {{get_arguments,1},18},
 {{reverse,1},17},
 {{check,2},17},
 {{ensure_loaded,2},16},
 {{ensure_loaded,1},16},
 {{do_load_module,2},14},
 {{do_ensure_loaded,2},14},
 {{get_flag_args,...},12},
 {{...},...},
 {...}|...]
</code></pre>

<p>The returned list of counter values for each function is sorted in
descending order on the number of times each function was executed.</p>

<p>For more information, see
<a href="https://www.erlang.org/doc/man/code#module-native-coverage-support">Native Coverage Support</a>
in the documentation for the <code>code</code> module.</p>

<h1 id="deprecating-archives">Deprecating archives</h1>

<p><a href="https://www.erlang.org/doc/man/code#module-loading-of-code-from-archive-files">Archives</a>
is experimental functionality that has existed in Erlang/OTP for a
long time. Part of the support for archives is deprecated in Erlang/OTP 27.</p>

<p>The reason is that the performance of code loading from archives has
never been great. Even worse is that the very existence of the archive
functionality degrades the performance of code loading even when no
archives are used, and complicates or prevents optimizations aimed at
reducing startup time.</p>

<p>In Erlang/OTP 27, the following functionality is deprecated:</p>

<ul>
  <li>
    <p>Using archives for packaging a single application or parts of a single application
into an archive file that is included in the code path. This functionality will
likely be removed in Erlang/OTP 28.</p>
  </li>
  <li>
    <p>The <a href="https://www.erlang.org/doc/man/code#lib_dir/2"><code>code:lib_dir/2</code></a>
function. This function was introduced to allow reading files
inside archives. In Erlang/OTP 28, the function itself will not be
removed, but it will most likely no longer support looking into
archives.</p>
  </li>
  <li>
    <p>All functionality to handle archives in module
<a href="https://www.erlang.org/doc/man/erl_prim_loader"><code>erl_prim_loader</code></a>.
That same functionality is likely to be removed in Erlang/OTP 28.</p>
  </li>
  <li>
    <p>The <code>-code_path_choice</code> flag for <code>erl</code>. In Erlang/OTP 27, the default
has changed from <code>relaxed</code> to <code>strict</code>. This flag is likely to be removed
in Erlang/OTP 28.</p>
  </li>
</ul>

<p>In order to use archives in Erlang/OTP 27, it is necessary to use the flag
<code>-code_path_choice relaxed</code>.</p>
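
<p>For example (the archive name and the path into it are illustrative):</p>

<pre><code>$ erl -code_path_choice relaxed -pa my_app.ez/my_app/ebin
</code></pre>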

<h2 id="using-a-single-archive-in-an-escript-is-not-deprecated">Using a single archive in an Escript is <strong>not</strong> deprecated</h2>

<p>An archive can still be used to hold all files needed by an
<a href="https://www.erlang.org/doc/apps/erts/escript_cmd.html">Escript</a>.
However, to access files in the archive (for example, to read templates or other
data files), the only supported way guaranteed to work in future
releases is to use the
<a href="https://www.erlang.org/doc/man/escript#extract/2"><code>escript:extract/2</code></a>
function.</p>]]></content><author><name>Björn Gustavsson</name></author><category term="erlang" /><category term="otp" /><category term="27" /><category term="release" /><summary type="html"><![CDATA[Erlang/OTP 27 is finally here. This blog post will introduce the new features that we are most excited about.]]></summary></entry><entry><title type="html">The Optimizations in Erlang/OTP 27</title><link href="https://www.erlang.org/blog/optimizations/" rel="alternate" type="text/html" title="The Optimizations in Erlang/OTP 27" /><published>2024-04-23T00:00:00+00:00</published><updated>2024-04-23T00:00:00+00:00</updated><id>https://www.erlang.org/blog/optimizations</id><content type="html" xml:base="https://www.erlang.org/blog/optimizations/"><![CDATA[<p>This post explores the new optimizations for record updates as well as
some of the other improvements. It also gives a brief historic
overview of recent optimizations leading up to Erlang/OTP 27.</p>

<h3 id="a-brief-history-of-recent-optimizations">A brief history of recent optimizations</h3>

<p>The modern history of optimizations for Erlang begins in
January 2018. We had realized that we had reached the limit of the
optimizations that were possible working on <a href="https://www.erlang.org/blog/a-brief-beam-primer/">BEAM
code</a> in the Erlang
compiler.</p>

<ul>
  <li>
    <p>Erlang/OTP 22 introduced a new <a href="https://en.wikipedia.org/wiki/Static_single-assignment_form">SSA-based intermediate
representation</a>
in the compiler. Read the full story in <a href="https://www.erlang.org/blog/ssa-history/">SSA
History</a>.</p>
  </li>
  <li>
    <p>Erlang/OTP 24 introduced the <a href="https://www.erlang.org/blog/a-first-look-at-the-jit/">JIT (Just In Time
compiler)</a>,
which improved performance by emitting native code for BEAM instructions
at load-time.</p>
  </li>
  <li>
<p>Erlang/OTP 25 introduced <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/">type-based optimization in the
JIT</a>,
which allowed the Erlang compiler to pass type information to the
JIT to help it emit better native code. While that improved the
native code emitted by the JIT, limitations in both the compiler and
the JIT prevented the JIT from taking full advantage of the type information.</p>
  </li>
  <li>
    <p>Erlang/OTP 26 <a href="https://www.erlang.org/blog/more-optimizations/">improved the type-based
optimizations</a>.
The most noticeable performance improvements were matching and
construction of binaries using the bit syntax. Those improvements,
combined with changes to the <code>base64</code> module itself, made encoding
to Base64 about 4 times as fast and decoding from Base64 more than 3
times as fast.</p>
  </li>
</ul>

<h3 id="what-to-expect-of-the-jit-in-erlangotp-27">What to expect of the JIT in Erlang/OTP 27</h3>

<p>The major compiler and JIT improvement in Erlang/OTP 27 is
optimization of record operations, but there are also many smaller
optimizations that make the code smaller and/or faster.</p>

<h3 id="please-try-this-at-home">Please try this at home!</h3>

<p>While this blog post will show many examples of generated code, I have
attempted to explain the optimizations in English as well. Feel free
to skip the code examples.</p>

<p>On the other hand, if you want more code examples…</p>

<p>To examine the native code for loaded modules, start the runtime system like this:</p>

<pre><code class="language-bash">erl +JDdump true
</code></pre>

<p>The native code for all modules that are loaded will be dumped to files with the
extension <code>.asm</code>.</p>

<p>To examine the BEAM code for a module, use the <code>-S</code> option when
compiling. For example:</p>

<pre><code class="language-bash">erlc -S base64.erl
</code></pre>

<h3 id="a-simple-record-optimization">A simple record optimization</h3>

<p>To get started, let’s look at a simple record optimization that was not done
in Erlang/OTP 26 and earlier. Suppose we have this module:</p>

<pre><code class="language-erlang">-record(foo, {a,b,c,d,e}).

update(N) -&gt;
    R0 = #foo{},
    R1 = R0#foo{a=N},
    R2 = R1#foo{b=2},
    R2#foo{c=3}.
</code></pre>

<p>Here is <a href="https://www.erlang.org/blog/a-brief-beam-primer/">BEAM code</a> for the
record operations:</p>

<pre><code>    {update_record,{atom,reuse},
                   6,
                   {literal,{foo,undefined,undefined,undefined,undefined,
                                 undefined}},
                   {x,0},
                   {list,[2,{x,0}]}}.
    {update_record,{atom,copy},6,{x,0},{x,0},{list,[3,{integer,2}]}}.
    {update_record,{atom,copy},6,{x,0},{x,0},{list,[4,{integer,3}]}}.
</code></pre>

<p>That is, all three record update operations have been retained as separate
<a href="https://www.erlang.org/blog/more-optimizations/#updating-records-in-otp-26"><code>update_record</code></a>
instructions. Each operation creates a new record by copying the unchanged parts of the
record and filling in the new values in the correct position.</p>

<p>The compiler in Erlang/OTP 27 will essentially rewrite <code>update/1</code> to:</p>

<pre><code class="language-erlang">update(N) -&gt;
    #foo{a=N,b=2,c=3}.
</code></pre>

<p>which will produce the following BEAM code for the record creation:</p>

<pre><code>    {put_tuple2,{x,0},
                {list,[{atom,foo},
                       {x,0},
                       {integer,2},
                       {integer,3},
                       {atom,undefined},
                       {atom,undefined}]}}.
</code></pre>

<p>Those optimizations were implemented in the following pull requests:</p>

<ul>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7491">#7491: Merge consecutive record updates</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/8086">#8086: Combine creation of a record with subsequent record updates</a></p>
  </li>
</ul>

<h3 id="updating-records-in-place">Updating records in place</h3>

<p>To explore the more sophisticated record optimization introduced in Erlang/OTP 27,
consider this example:</p>

<pre><code class="language-erlang">-module(count1).
-export([count/1]).

-record(s, {atoms=0,other=0}).

count(L) -&gt;
    count(L, #s{}).

count([X|Xs], #s{atoms=C}=S) when is_atom(X) -&gt;
    count(Xs, S#s{atoms=C+1});
count([_|Xs], #s{other=C}=S) -&gt;
    count(Xs, S#s{other=C+1});
count([], S) -&gt;
    S.
</code></pre>

<p><code>count(List)</code> counts the number of atoms and the number of other terms in the
given list. For example:</p>

<pre><code class="language-erlang">1&gt; -record(s, {atoms=0,other=0}).
ok
2&gt; count1:count([a,b,c,1,2,3,4,5]).
#s{atoms = 3,other = 5}
</code></pre>

<p>Here follows the BEAM code emitted for <code>count/2</code>:</p>

<pre><code>    {test,is_nonempty_list,{f,6},[{x,0}]}.
    {get_list,{x,0},{x,2},{x,0}}.
    {test,is_atom,{f,5},[{x,2}]}.
    {get_tuple_element,{x,1},1,{x,2}}.
    {gc_bif,'+',{f,0},3,[{tr,{x,2},{t_integer,{0,'+inf'}}},{integer,1}],{x,2}}.
    {test_heap,4,3}.
    {update_record,{atom,inplace},
                   3,
                   {tr,{x,1},
                       {t_tuple,3,true,
                                #{1 =&gt; {t_atom,[s]},
                                  2 =&gt; {t_integer,{0,'+inf'}},
                                  3 =&gt; {t_integer,{0,'+inf'}}}}},
                   {x,1},
                   {list,[2,{tr,{x,2},{t_integer,{1,'+inf'}}}]}}.
    {call_only,2,{f,4}}. % count/2
  {label,5}.
    {get_tuple_element,{x,1},2,{x,2}}.
    {gc_bif,'+',{f,0},3,[{tr,{x,2},{t_integer,{0,'+inf'}}},{integer,1}],{x,2}}.
    {test_heap,4,3}.
    {update_record,{atom,inplace},
                   3,
                   {tr,{x,1},
                       {t_tuple,3,true,
                                #{1 =&gt; {t_atom,[s]},
                                  2 =&gt; {t_integer,{0,'+inf'}},
                                  3 =&gt; {t_integer,{0,'+inf'}}}}},
                   {x,1},
                   {list,[3,{tr,{x,2},{t_integer,{1,'+inf'}}}]}}.
    {call_only,2,{f,4}}. % count/2
  {label,6}.
    {test,is_nil,{f,3},[{x,0}]}.
    {move,{x,1},{x,0}}.
    return.
</code></pre>

<p>The first two instructions test whether the first argument in <code>{x,0}</code> is a non-empty list
and if so extracts the first element of the list:</p>

<pre><code>    {test,is_nonempty_list,{f,6},[{x,0}]}.
    {get_list,{x,0},{x,2},{x,0}}.
</code></pre>

<p>The next instruction tests whether the first element is an atom. If not, a jump
is made to the code for the second clause.</p>

<pre><code>    {test,is_atom,{f,5},[{x,2}]}.
</code></pre>

<p>Next the counter for the number of atoms seen is fetched from the record and
incremented by one:</p>

<pre><code>    {get_tuple_element,{x,1},1,{x,2}}.
    {gc_bif,'+',{f,0},3,[{tr,{x,2},{t_integer,{0,'+inf'}}},{integer,1}],{x,2}}.
</code></pre>

<p>Next follows allocation of heap space and the updating of the record:</p>

<pre><code>    {test_heap,4,3}.
    {update_record,{atom,inplace},
                   3,
                   {tr,{x,1},
                       {t_tuple,3,true,
                                #{1 =&gt; {t_atom,[s]},
                                  2 =&gt; {t_integer,{0,'+inf'}},
                                  3 =&gt; {t_integer,{0,'+inf'}}}}},
                   {x,1},
                   {list,[2,{tr,{x,2},{t_integer,{1,'+inf'}}}]}}.
</code></pre>

<p>The <code>test_heap</code> instruction ensures that there is sufficient room on the heap
for copying the record (4 words).</p>

<p>The <code>update_record</code> instruction was introduced in Erlang/OTP 26. Its
first operand is an atom that is a hint from the compiler to help the
JIT emit better code. In Erlang/OTP 26 the hints <code>reuse</code> and <code>copy</code>
are used. For more about those hints, see
<a href="https://www.erlang.org/blog/more-optimizations/#updating-records-in-otp-26">Updating records in OTP 26</a>.</p>

<p>In Erlang/OTP 27, there is a new hint called <code>inplace</code>. The compiler
emits that hint when it has determined that nowhere in the runtime
system is there another reference to the tuple except for the
reference used for the <code>update_record</code> instruction. In other
words, from the <strong>compiler’s</strong> point of view, if the runtime system
were to directly update the existing record without first copying it,
the observable behavior of the program would not change. As soon will
be seen, from the <strong>runtime system’s</strong> point of view, directly updating
the record is not always safe.</p>

<p>This new optimization was implemented by Frej Drejhammar. It builds
on and extends the compiler passes added in Erlang/OTP 26 for
<a href="https://www.erlang.org/blog/more-optimizations/#appending-to-binaries-in-otp-26">appending to a binary</a>.</p>

<p>Now let’s see what the JIT will do when an <code>update_record</code> instruction
has an <code>inplace</code> hint. Here is the complete native code for the
instruction:</p>

<pre><code># update_record_in_place_IsdI
    mov rax, qword ptr [rbx+8]
    mov rcx, qword ptr [rbx+16]
    test cl, 1
    short je L38           ; Update directly if small integer.

    ; The new value is a bignum.
    ; Test whether the tuple is in the safe part of the heap.

    mov rdi, [r13+480]     ; Get the high water mark
    cmp rax, r15           ; Compare tuple pointer to heap top
    short jae L39          ; Jump and copy if above
    cmp rax, rdi           ; Compare tuple pointer to high water
    short jae L38          ; Jump and overwrite if above high water

    ; The tuple is not in the safe part of the heap.
    ; Fall through to the copy code.

L39:                       ; Copy the current record
    vmovups ymm0, [rax-2]
    vmovups [r15], ymm0
    lea rax, qword ptr [r15+2] ; Set up tagged pointer to copy
    add r15, 32            ; Advance heap top past the copy

L38:
    mov rdi, rcx           ; Get new value for atoms field
    mov qword ptr [rax+22], rdi
    mov qword ptr [rbx+8], rax
</code></pre>

<p>(Lines starting with <code>#</code> are comments emitted by the JIT, while the text
that follows <code>;</code> is a comment added by me for clarification.)</p>

<p>The BEAM loader renames an <code>update_record</code> instruction with an <code>inplace</code> hint
to <code>update_record_in_place</code>.</p>

<p>The first two instructions load the tuple to be updated into CPU register <code>rax</code> and
the new counter value (<code>C + 1</code>) into <code>rcx</code>.</p>

<pre><code>    mov rax, qword ptr [rbx+8]
    mov rcx, qword ptr [rbx+16]
</code></pre>

<p>The next two instructions test whether the new counter value is a
small integer that fits into a word. The test has been simplified to a
more efficient test that is only safe when the value is known to be an
integer. If it is a small integer, it is always safe to jump to the code
that updates the existing tuple:</p>

<pre><code>    test cl, 1
    short je L38           ; Update directly if small integer.
</code></pre>

<p>If it is not a small integer, it must be a <strong>bignum</strong>, that is, a
signed integer that does not fit in 60 bits and therefore has to be
stored on the heap, with <code>rcx</code> containing a tagged pointer to the bignum on
the heap.</p>

<p>If <code>rcx</code> is a pointer to a term on the heap, it is not always safe to
directly update the existing tuple. That is because of the way the
Erlang <a href="https://www.erlang.org/doc/apps/erts/garbagecollection#generational-garbage-collection">generational garbage
collector</a>
works. Each Erlang process has two heaps for keeping Erlang terms:
the young heap and the old heap. Terms on the young heap are allowed to
reference terms on the old heap, but not vice versa. That means that
if the tuple to be updated resides on the old heap, it is not safe to
update one of its elements so that it will reference a term on the young
heap.</p>

<p>Therefore, the JIT needs to emit code to ensure that the pointer to
the tuple resides in the “safe part” of the young heap:</p>

<pre><code>    mov rdi, [r13+480]     ; Get the high water mark
    cmp rax, r15           ; Compare tuple pointer to heap top
    short jae L39          ; Jump and copy if above
    cmp rax, rdi           ; Compare tuple pointer to high water
    short jae L38          ; Jump and overwrite if above high water
</code></pre>

<p>The safe part of the heap is between the high water mark and the heap
top. If the tuple is below the high water mark and still alive, it
will be copied to the old heap in the next garbage collection.</p>

<p>If the tuple is in the safe part, the copy code is skipped by jumping
to the code that stores the new value into the existing tuple.</p>

<p>If not, the next part will copy the existing record to the heap.</p>

<pre><code>L39:                       ; Copy the current record
    vmovups ymm0, [rax-2]
    vmovups [r15], ymm0
    lea rax, qword ptr [r15+2] ; Set up tagged pointer to copy
    add r15, 32            ; Advance heap top past the copy
</code></pre>

<p>The copying is done using <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX instructions</a>.</p>

<p>Next follows the code that writes the new value into the tuple:</p>

<pre><code>L38:
    mov rdi, rcx           ; Get new value for atoms field
    mov qword ptr [rax+22], rdi
    mov qword ptr [rbx+8], rax
</code></pre>

<p>If all the new values being written into the existing record are known
never to be tagged pointers, the native instructions can be
simplified. Consider this module:</p>

<pre><code class="language-erlang">-module(whatever).
-export([main/1]).

-record(bar, {bool,pid}).

main(Bool) when is_boolean(Bool) -&gt;
    flip_state(#bar{bool=Bool,pid=self()}).

flip_state(R) -&gt;
    R#bar{bool=not R#bar.bool}.
</code></pre>

<p>The <code>update_record</code> instruction looks like this:</p>

<pre><code>    {update_record,{atom,inplace},
                   3,
                   {tr,{x,0},
                       {t_tuple,3,true,
                                #{1 =&gt; {t_atom,[bar]},
                                  2 =&gt; {t_atom,[false,true]},
                                  3 =&gt; pid}}},
                   {x,0},
                   {list,[2,{tr,{x,1},{t_atom,[false,true]}}]}}.
</code></pre>

<p>Based on the type for the new value, <code>{t_atom,[false,true]}</code>, the
JIT is able to generate much shorter code than for the previous example:</p>

<pre><code># update_record_in_place_IsdI
    mov rax, qword ptr [rbx]
# skipped copy fallback because all new values are safe
    mov rdi, qword ptr [rbx+8]
    mov qword ptr [rax+14], rdi
    mov qword ptr [rbx], rax
</code></pre>

<p>References to literals (such as <code>[1,2,3]</code>) are also safe, because literals
are stored in a special literal area, and the garbage collector
handles them specially. Consider this code:</p>

<pre><code class="language-erlang">-record(state, {op, data}).

update_state(R0, Op0, Data) -&gt;
    R = R0#state{data=Data},
    case Op0 of
        add -&gt; R#state{op=fun erlang:'+'/2};
        sub -&gt; R#state{op=fun erlang:'-'/2}
    end.
</code></pre>

<p>Both of the record updates in the <code>case</code> can be done in place. Here
is the BEAM code for the record update in the first clause:</p>

<pre><code>    {update_record,{atom,inplace},
                   3,
                   {tr,{x,0},{t_tuple,3,true,#{1 =&gt; {t_atom,[state]}}}},
                   {x,0},
                   {list,[2,{literal,fun erlang:'+'/2}]}}.
</code></pre>

<p>Since the value to be written is a literal, the JIT emits simpler
code without the copy fallback:</p>

<pre><code># update_record_in_place_IsdI
    mov rax, qword ptr [rbx]
# skipped copy fallback because all new values are safe
    long mov rdi, 9223372036854775807  ; Placeholder for address to fun
    mov qword ptr [rax+14], rdi
    mov qword ptr [rbx], rax
</code></pre>
<p>The large integer <code>9223372036854775807</code> is a placeholder that will be
patched later, when the address of the literal fun is known.</p>

<p>Here is the pull request for updating tuples in place:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/8090">#8090: Destructive tuple update</a></li>
</ul>

<h3 id="optimizing-by-generating-less-garbage">Optimizing by generating less garbage</h3>

<p>When updating a record in place, omitting the copying of the existing
record should be a clear win, except perhaps for very small records.</p>

<p>What is less clear is the effect on garbage collection. Updating a
tuple in place is an example of optimizing by generating less
garbage. By creating less garbage, the expectation is that garbage
collections should occur less often, which should improve the
performance of the program.</p>

<p>Because of the highly variable execution time for doing a garbage
collection, it is notoriously difficult to benchmark optimizations
that reduce the amount of garbage created. Often the outcomes of
benchmarks do not apply to performing the same tasks in a real
application.</p>

<p>My own <a href="https://en.wikipedia.org/wiki/Anecdotal_evidence">anecdotal evidence</a>
suggests that in most cases there are no measurable performance wins by
producing less garbage.</p>

<p>I also remember when an optimization that reduced the size of an
Erlang term resulted in a benchmark being consistently slower. It took
the author of that optimization several days of investigation to confirm
that the slowdown in the benchmark was not the fault of his optimization:
by creating less garbage, garbage collection happened at a later time,
when it happened to be much more expensive.</p>

<p>On average we expect that this optimization should improve performance,
especially for large records.</p>

<h3 id="optimization-of-funs">Optimization of funs</h3>

<p>The internal representation of funs in the runtime system has
changed in Erlang/OTP 27, making possible several new optimizations.</p>

<p>As an example, consider this function:</p>

<pre><code class="language-erlang">madd(A, C) -&gt;
    fun(B) -&gt; A * B + C end.
</code></pre>

<p>In Erlang/OTP 26, the native code for creating the fun looks like so:</p>

<pre><code># i_make_fun3_FStt
L38:
    long mov rsi, 9223372036854775807 ; Placeholder for dispatch table
    mov edx, 1
    mov ecx, 2
    mov qword ptr [r13+80], r15
    mov rbp, rsp
    lea rsp, qword ptr [rbx-128]
    vzeroupper
    mov rdi, r13
    call 4337160320       ; Call helper function in runtime system
    mov rsp, rbp
    mov r15, qword ptr [r13+80]
# Move fun environment
    mov rdi, qword ptr [rbx]
    mov qword ptr [rax+40], rdi
    mov rdi, qword ptr [rbx+8]
    mov qword ptr [rax+48], rdi
# Create boxed ptr
    or al, 2
    mov qword ptr [rbx], rax
</code></pre>
<p>The large integer <code>9223372036854775807</code> is a placeholder
for a value that will be filled in later.</p>

<p>Most of the work of actually creating the fun object is done by
calling a helper function (the <code>call 4337160320</code> instruction) in the
runtime system.</p>

<p>In Erlang/OTP 27, the part of a fun that resides on the heap of the
calling process has been simplified so that it is now smaller than in
Erlang/OTP 26, and most importantly does not contain anything that is
too tricky to initialize in inline code.</p>

<p>The code for creating the fun is not only shorter, but it also doesn’t
need to call any function in the runtime system:</p>

<pre><code># i_make_fun3_FStt
L38:
    long mov rax, 9223372036854775807 ; Placeholder for dispatch table
# Create fun thing
    mov qword ptr [r15], 196884
    mov qword ptr [r15+8], rax
# Move fun environment
# (moving two items)
    vmovups xmm0, xmmword ptr [rbx]
    vmovups xmmword ptr [r15+16], xmm0
L39:
    long mov rdi, 9223372036854775807 ; Placeholder for fun reference
    mov qword ptr [r15+32], rdi
# Create boxed ptr
    lea rax, qword ptr [r15+2]
    add r15, 40
    mov qword ptr [rbx], rax
</code></pre>

<p>The difference from Erlang/OTP 26 is that the parts of the fun that are only
needed when loading and unloading code are no longer stored on the heap.
Instead those parts are stored in the literal pool area belonging to the loaded
code for the module, and are shared by all instances of the same fun.</p>

<p>The part of the fun that resides on the process heap is two words smaller
compared to Erlang/OTP 26.</p>

<p>The creation of the fun environment has also been optimized. In Erlang/OTP 26,
four instructions were needed:</p>

<pre><code># Move fun environment
    mov rdi, qword ptr [rbx]
    mov qword ptr [rax+40], rdi
    mov rdi, qword ptr [rbx+8]
    mov qword ptr [rax+48], rdi
</code></pre>

<p>In Erlang/OTP 27, using <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX instructions</a> both variables (<code>A</code> and <code>C</code>)
can be moved using only two instructions:</p>

<pre><code># Move fun environment
# (moving two items)
    vmovups xmm0, xmmword ptr [rbx]
    vmovups xmmword ptr [r15+16], xmm0
</code></pre>

<p>Another optimization made possible by the changed fun representation
is testing for a fun having a specific arity (the number of expected
arguments when calling it). For example:</p>

<pre><code class="language-erlang">ensure_fun_0(F) when is_function(F, 0) -&gt; ok.
</code></pre>

<p>Here is the native code emitted by the JIT in Erlang/OTP 26:</p>

<pre><code># is_function2_fss
    mov rdi, qword ptr [rbx]   ; Fetch `F` from {x,0}.

    rex test dil, 1            ; Test whether the term is a tagged pointer...
    short jne label_3          ; ... otherwise fail.

    mov eax, dword ptr [rdi-2] ; Pick up the header word.
    cmp eax, 212               ; Test whether it is a fun...
    short jne label_3          ; ... otherwise fail.

    cmp byte ptr [rdi+22], 0   ; Test whether the arity is 0...
    short jne label_3          ; ... otherwise fail.
</code></pre>

<p>In Erlang/OTP 27, the arity for the fun (the number of expected arguments) is
stored in the header word of the fun term, which means that the test
for a fun can be combined with the test for its arity:</p>

<pre><code># is_function2_fss
    mov rdi, qword ptr [rbx]   ; Fetch `F` from {x,0}.

    rex test dil, 1            ; Test whether the term is a tagged pointer...
    short jne label_3          ; ... otherwise fail.

    cmp word ptr [rdi-2], 20   ; Test whether this is a fun with arity 0...
    short jne label_3          ; ... otherwise fail.
</code></pre>

<p>All external funs are now literals stored outside all process heaps. As an
example, consider the following functions:</p>

<pre><code class="language-erlang">my_fun() -&gt;
    fun ?MODULE:some_function/0.

mfa(M, F, A) -&gt;
    fun M:F/A.
</code></pre>

<p>In Erlang/OTP 26, the external fun returned by <code>my_fun/0</code> would not occupy
any room on the heap of the calling process, while the dynamic external fun
returned by <code>mfa/3</code> would need 5 words on the heap of the calling process.</p>

<p>In Erlang/OTP 27, neither of the funs will require any room on the heap of
the calling process.</p>

<p>Those optimizations were implemented in the following pull requests:</p>

<ul>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7948">#7948: Optimize reference counting of local funs</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7314">#7314: Shrink and optimize funs (again)</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7894">#7894: Share external funs globally</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7713/commits/ae127203ac2423d057e1ef151d4ca8b114740b84">x86_64: Optimize creation of fun
environment</a>
(part of <a href="https://github.com/erlang/otp/pull/7713">#7713</a>)</p>
  </li>
</ul>

<h3 id="integer-arithmetic-improvements">Integer arithmetic improvements</h3>

<p>In the end of June last year, we released the <a href="https://www.erlang.org/patches/otp-26.0.2">OTP 26.0.2 patch</a>
for Erlang/OTP 26 that made
<a href="https://www.erlang.org/doc/man/erlang#binary_to_integer-1"><code>binary_to_integer/1</code></a> faster.</p>

<p>To find out how much faster, run this benchmark:</p>

<pre><code class="language-erlang">bench() -&gt;
    Size = 1_262_000,
    String = binary:copy(&lt;&lt;"9"&gt;&gt;, Size),
    {Time, _Val} = timer:tc(erlang, binary_to_integer, [String]),
    io:format("Size: ~p, seconds: ~p\n", [Size, Time / 1_000_000]).
</code></pre>

<p>It measures the time to convert a binary holding 1,262,000 digits to an integer.</p>

<p>Running an unpatched Erlang/OTP 26 on my Intel-based iMac from 2017,
the benchmark finishes in about 10 seconds.</p>

<p>The same benchmark run using Erlang/OTP 26.0.2 finishes in about 0.4 seconds.</p>

<p>The speed-up was achieved by three separate optimizations:</p>

<ul>
  <li>
    <p><code>binary_to_integer/1</code> was implemented as a BIF in C using a naive
algorithm that didn’t scale well.  It was replaced with a
<a href="https://en.wikipedia.org/wiki/Divide-and-conquer_algorithm">divide-and-conquer
algorithm</a>
implemented in Erlang; a sketch of the idea is shown after this list.
(Implementing the new algorithm as a BIF wasn’t
faster than the Erlang version.)</p>
  </li>
  <li>
    <p>The runtime system’s function for doing multiplication of large
integers was modified to use the <a href="https://en.wikipedia.org/wiki/Karatsuba_algorithm">Karatsuba
algorithm</a>, which
is a divide-and-conquer multiplication algorithm invented in the 1960s.</p>
  </li>
  <li>
    <p>Some of the low-level helper functions for arithmetic with large
integers (bignums) were modified to take advantage of a 128-bit integer data
type on 64-bit CPUs when supported by the C compiler.</p>
  </li>
</ul>
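
<p>Here is a minimal sketch of the divide-and-conquer idea mentioned above
(not the actual OTP implementation):</p>

<pre><code class="language-erlang">%% Convert a binary of decimal digits to an integer by splitting the
%% digit string in half, converting each half recursively, and
%% combining the results with one large multiplication (which benefits
%% from the Karatsuba-based bignum multiplication).
to_integer(Bin) when byte_size(Bin) &lt; 32 -&gt;
    binary_to_integer(Bin);
to_integer(Bin) -&gt;
    Half = byte_size(Bin) div 2,
    &lt;&lt;Upper:Half/binary, Lower/binary&gt;&gt; = Bin,
    to_integer(Upper) * pow10(byte_size(Lower)) + to_integer(Lower).

pow10(N) -&gt; ipow(10, N).

%% Exponentiation by squaring.
ipow(_, 0) -&gt; 1;
ipow(B, N) when N rem 2 =:= 1 -&gt; B * ipow(B, N - 1);
ipow(B, N) -&gt; H = ipow(B, N div 2), H * H.
</code></pre>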

<p>Those improvements were implemented in the following pull request:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/7426">#7426: Optimize binary_to_integer/1 and friends</a></li>
</ul>

<p>In Erlang/OTP 27, some additional improvements of integer arithmetic
were implemented.  That reduced the execution time for the
<code>binary_to_integer/1</code> benchmark to about 0.3 seconds.</p>

<p>Those improvements are found in the following pull request:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/7553">#7553: Optimize integer arithmetic</a></li>
</ul>

<p>Those arithmetic enhancements improve the running times for the
<a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/pidigits-erlang-2.html">pidigits
benchmark</a>:</p>

<table>
  <thead>
    <tr>
      <th>Version</th>
      <th> </th>
      <th>Seconds</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>26.0</td>
      <td> </td>
      <td><code>7.635</code></td>
    </tr>
    <tr>
      <td>26.2.1</td>
      <td> </td>
      <td><code>2.959</code></td>
    </tr>
    <tr>
      <td>27.0</td>
      <td> </td>
      <td><code>2.782</code></td>
    </tr>
  </tbody>
</table>

<p>(Run on my M1 MacBook Pro.)</p>

<h3 id="numerous-miscellaneous-enhancements">Numerous miscellaneous enhancements</h3>

<p>Numerous enhancements have been made to the code generation for many
instructions, as well as a few to the Erlang compiler.  Here follows a
single example to show one of the improvements to the <code>=:=</code> operator:</p>

<pre><code class="language-erlang">ensure_empty_map(Map) when Map =:= #{} -&gt;
    ok.
</code></pre>

<p>Here is the BEAM code for the <code>=:=</code> operator as used in this example:</p>

<pre><code>    {test,is_eq_exact,{f,1},[{x,0},{literal,#{}}]}.
</code></pre>

<p>Here is the native code for Erlang/OTP 26:</p>

<pre><code># is_eq_exact_fss
L45:
    long mov rsi, 9223372036854775807
    mov rdi, qword ptr [rbx]
    cmp rdi, rsi
    short je L44                  ; Succeeded if the same term.

    rex test dil, 1
    short jne label_1             ; Fail quickly if not a tagged pointer.

    ; Call the general runtime function for comparing two terms.
    mov rbp, rsp
    lea rsp, qword ptr [rbx-128]
    vzeroupper
    call 4549723200
    mov rsp, rbp

    test eax, eax
    short je label_1               ; Fail if unequal.
L44:
</code></pre>

<p>The code begins with a few tests to quickly succeed or fail, but in
practice those are unlikely to trigger for this example, which means
that the general routine in the runtime system for comparing two terms
will almost always be called.</p>

<p>In Erlang/OTP 27, the JIT emits special code for testing whether a
term is an empty map:</p>

<pre><code># is_eq_exact_fss
# optimized equality test with empty map
    mov rdi, qword ptr [rbx]
    rex test dil, 1
    short jne label_1              ; Fail if not a tagged pointer.

    cmp dword ptr [rdi-2], 300
    short jne label_1              ; Fail if not a map.

    cmp dword ptr [rdi+6], 0
    short jne label_1              ; Fail if size is not zero.
</code></pre>

<p>Here follows the main pull requests for miscellaneous enhancements in Erlang/OTP 27:</p>

<ul>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7563">#7563: Enhance type analysis</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7956">#7956: Do some minor enhancements of the code generation in the JIT</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/7713">#7713: Improve code generation for the JIT</a></p>
  </li>
  <li>
    <p><a href="https://github.com/erlang/otp/pull/8040">#8040: Improve caching of BEAM registers</a></p>
  </li>
</ul>]]></content><author><name>Björn Gustavsson</name></author><category term="BEAM" /><category term="JIT" /><summary type="html"><![CDATA[This post explores the new optimizations for record updates as well as some of the other improvements. It also gives a brief historic overview of recent optimizations leading up to Erlang/OTP 27.]]></summary></entry><entry><title type="html">Erlang/OTP 26 Highlights</title><link href="https://www.erlang.org/blog/otp-26-highlights/" rel="alternate" type="text/html" title="Erlang/OTP 26 Highlights" /><published>2023-05-16T00:00:00+00:00</published><updated>2023-05-16T00:00:00+00:00</updated><id>https://www.erlang.org/blog/otp-26-highlights</id><content type="html" xml:base="https://www.erlang.org/blog/otp-26-highlights/"><![CDATA[<p>Erlang/OTP 26 is finally here. This blog post will introduce the new
features that we are most excited about.</p>

<p>A list of all changes is found in <a href="/patches/OTP-26.0">Erlang/OTP 26 Readme</a>.
Or, as always, look at the release notes of the application you are interested in.
For instance: <a href="https://www.erlang.org/doc/apps/erts/notes.html#erts-14.0">Erlang/OTP 26 - Erts Release Notes - Version 14.0</a>.</p>

<p>This year’s highlights mentioned in this blog post are:</p>

<ul>
  <li><a href="#the-shell">The shell</a></li>
  <li><a href="#improvements-of-maps">Improvements of maps</a></li>
  <li><a href="#improvements-of-the-lists-module">Improvements of the <code>lists</code> module</a></li>
  <li><a href="#no-need-to-enable-feature-maybe-in-the-runtime-system">No need to enable feature <code>maybe</code> in the runtime system</a></li>
  <li><a href="#improvements-in-the-Erlang-compiler-and-jit">Improvements in the Erlang compiler and JIT</a></li>
  <li><a href="#incremental-mode-for-dialyzer">Incremental mode for Dialyzer</a></li>
  <li><a href="#argparse-a-command-line-parser-for-erlang">argparse: A command line parser for Erlang</a></li>
  <li><a href="#ssl-safer-defaults">SSL: Safer defaults</a></li>
  <li><a href="#ssl-improved-checking-of-options">SSL: Improved checking of options</a></li>
</ul>

<h1 id="the-shell">The shell</h1>

<p>OTP 26 brings many improvements to the experience of using the Erlang shell.</p>

<p>For example, functions can now be defined directly in the shell:</p>

<pre><code>1&gt; factorial(N) -&gt; factorial(N, 1).
ok
2&gt; factorial(N, F) when N &gt; 1 -&gt; factorial(N - 1, F * N);
.. factorial(_, F) -&gt; F.
ok
3&gt; factorial(5).
120
</code></pre>

<p>The shell prompt changes to <code>..</code> when the previous line is not a
complete Erlang construct.</p>

<p>Functions defined in this way are evaluated using the
<a href="https://www.erlang.org/doc/man/erl_eval.html">erl_eval</a> module, not
compiled by the Erlang compiler. That means that the performance will
not be comparable to compiled Erlang code.</p>

<p>It is also possible to define types, specs, and records, making it
possible to paste code from a module directly into the shell for
testing. For example:</p>

<pre><code>1&gt; -record(coord, {x=0.0 :: float(), y=0.0 :: float()}).
ok
2&gt; -type coord() :: #coord{}.
ok
3&gt; -spec add(coord(), coord()) -&gt; coord().
ok
4&gt; add(#coord{x=X1, y=Y1}, #coord{x=X2, y=Y2}) -&gt;
..     #coord{x=X1+X2, y=Y1+Y2}.
ok
5&gt; Origin = #coord{}.
#coord{x = 0.0,y = 0.0}
6&gt; add(Origin, #coord{y=10.0}).
#coord{x = 0.0,y = 10.0}
</code></pre>

<p>The auto-completion feature in the shell has been vastly improved,
supporting auto-completion of variables, record names, record field
names, map keys, function parameter types, and file names.</p>

<p>For example, instead of typing the variable name <code>Origin</code>, I can just
type <code>O</code> and press TAB to expand it to <code>Origin</code> since the only
variable defined in the shell with the initial letter <code>O</code> is
<code>Origin</code>. That is a little bit difficult to illustrate in a blog post,
so let’s introduce another variable starting with <code>O</code>:</p>

<pre><code>7&gt; Oxford = #coord{x=51.752022, y=-1.257677}.
#coord{x = 51.752022,y = -1.257677}
</code></pre>

<p>If I now press <code>O</code> and TAB, the shell shows the possible completions:</p>

<pre><code>8&gt; O
bindings
Origin    Oxford
</code></pre>

<p>(The word <code>bindings</code> is shown in bold and underlined.)</p>

<p>If I press <code>x</code> and TAB, the word is completed to <code>Oxford</code>:</p>

<pre><code>8&gt; Oxford.
#coord{x = 51.752022,y = -1.257677}
</code></pre>

<p>To type <code>#coord{</code> it is sufficient to type <code>#</code> and TAB (because there is
only one record currently defined in the shell):</p>

<pre><code>9&gt; #coord{
</code></pre>

<p>Pressing TAB one more time causes the field names in the record to be
printed:</p>

<pre><code>9&gt; #coord{
fields
x=    y=
</code></pre>

<p>When trying to complete something which has many possible expansions,
the shell attempts to show the most likely completions first.  For
example, if I type <code>l</code> and press TAB, the shell shows a list of BIFs
beginning with the letter <code>l</code>:</p>

<pre><code>10&gt; l
bifs
length(                   link(                     list_to_atom(
list_to_binary(           list_to_bitstring(        list_to_existing_atom(
list_to_float(            list_to_integer(          list_to_pid(
list_to_port(             list_to_ref(              list_to_tuple(
Press tab to see all 37 expansions
</code></pre>

<p>Pressing TAB again, more BIFs are shown, as well as possible shell commands
and modules:</p>

<pre><code>10&gt; l
bifs
length(                   link(                     list_to_atom(
list_to_binary(           list_to_bitstring(        list_to_existing_atom(
list_to_float(            list_to_integer(          list_to_pid(
list_to_port(             list_to_ref(              list_to_tuple(
load_module(
commands
l(     lc(    lm(    ls(
modules
lcnt:                      leex:                      lists:
local_tcp:                 local_udp:                 log_mf_h:
logger:                    logger_backend:            logger_config:
logger_disk_log_h:         logger_filters:            logger_formatter:
logger_h_common:           logger_handler_watcher:    logger_olp:
logger_proxy:              logger_server:             logger_simple_h:
logger_std_h:              logger_sup:
</code></pre>

<p>Typing <code>ists:</code> (to complete the word <code>lists</code>) and pressing TAB, a
partial list of functions in the <code>lists</code> module is shown:</p>

<pre><code>10&gt; lists:
functions
all(            any(            append(         concat(         delete(
droplast(       dropwhile(      duplicate(      enumerate(      filter(
filtermap(      flatlength(     flatmap(        flatten(        foldl(
foldr(          foreach(        join(           keydelete(      keyfind(
Press tab to see all 72 expansions
</code></pre>

<p>Typing <code>m</code> and pressing TAB, the list of functions is narrowed down to
just those beginning with the letter <code>m</code>:</p>

<pre><code>10&gt; lists:m
functions
map(            mapfoldl(       mapfoldr(       max(            member(
merge(          merge3(         min(            module_info(
</code></pre>

<h2 id="animations-showing-shell-features">Animations showing shell features</h2>

<ul>
  <li>
    <p><a href="https://asciinema.org/a/iLU2CVuH7kOHFLaCxe6GLI2D5">Local functions in the shell</a></p>
  </li>
  <li>
    <p><a href="https://asciinema.org/a/iZTr7Wz2HBbDUOikhkplT2VUS">Multi-line editing in the shell</a></p>
  </li>
  <li>
    <p><a href="https://asciinema.org/a/RmBrWarb1wiUUBg6Rz0Ylqii8">File name and function name completion</a></p>
  </li>
  <li>
    <p><a href="https://asciinema.org/a/I2DsfnEaeXijVGiW8aI6YzJNT">Bindings and records in the shell</a></p>
  </li>
</ul>

<h1 id="improvements-of-maps">Improvements of maps</h1>

<h2 id="changed-ordering-of-atom-keys">Changed ordering of atom keys</h2>

<p>OTP 25 and earlier releases printed small maps (up to 32 elements)
with atom keys according to the term order of their keys:</p>

<pre><code class="language-erlang">1&gt; AM = #{a =&gt; 1, b =&gt; 2, c =&gt; 3}.
#{a =&gt; 1,b =&gt; 2,c =&gt; 3}
2&gt; maps:to_list(AM).
[{a,1},{b,2},{c,3}]
</code></pre>

<p>In OTP 26, as an optimization for certain map operations, such as
<code>maps:from_list/1</code>, maps with atom keys are now sorted in a different
order. The new order is undefined and may change between different
invocations of the Erlang VM. On my computer at the time of writing,
I got the following order:</p>

<pre><code class="language-erlang">1&gt; AM = #{a =&gt; 1, b =&gt; 2, c =&gt; 3}.
#{c =&gt; 3,a =&gt; 1,b =&gt; 2}
2&gt; maps:to_list(AM).
[{c,3},{a,1},{b,2}]
</code></pre>

<p>There is a new modifier <code>k</code> for format strings to specify that maps should
be sorted according to the term order of their keys before printing:</p>

<pre><code class="language-erlang">3&gt; io:format("~kp\n", [AM]).
#{a =&gt; 1,b =&gt; 2,c =&gt; 3}
ok
</code></pre>

<p>It is also possible to use a <a href="https://www.erlang.org/doc/man/io.html#format-1">custom ordering
fun</a>.  For example,
to order the map elements in reverse order based on their keys:</p>

<pre><code class="language-erlang">4&gt; io:format("~Kp\n", [fun(A, B) -&gt; A &gt; B end, AM]).
#{c =&gt; 3,b =&gt; 2,a =&gt; 1}
ok
</code></pre>

<p>There is also a new
<a href="https://www.erlang.org/doc/man/maps.html#iterator-2">maps:iterator/2</a>
function that supports iterating over the elements of the map in a more
intuitive order. Examples will be shown in the next section.</p>

<h2 id="map-comprehensions">Map comprehensions</h2>

<p>In OTP 25 and earlier, it was common to combine <code>maps:from_list/1</code> and
<code>maps:to_list/1</code> with list comprehensions. For example:</p>

<pre><code class="language-erlang">1&gt; M = maps:from_list([{I,I*I} || I &lt;- lists:seq(1, 5)]).
#{1 =&gt; 1,2 =&gt; 4,3 =&gt; 9,4 =&gt; 16,5 =&gt; 25}
</code></pre>

<p>In OTP 26, that can be written more succinctly with a <a href="https://www.erlang.org/doc/reference_manual/expressions.html#comprehensions"><strong>map comprehension</strong></a>:</p>

<pre><code class="language-erlang">1&gt; M = #{I =&gt; I*I || I &lt;- lists:seq(1, 5)}.
#{1 =&gt; 1,2 =&gt; 4,3 =&gt; 9,4 =&gt; 16,5 =&gt; 25}
</code></pre>

<p>With a <strong>map generator</strong>, a comprehension can now iterate over the
elements of a map. For example:</p>

<pre><code class="language-erlang">2&gt; [K || K := V &lt;- M, V &lt; 10].
[1,2,3]
</code></pre>

<p>Using a map comprehension with a map generator, here is an example
showing how keys and values can be swapped:</p>

<pre><code class="language-erlang">3&gt; #{V =&gt; K || K := V &lt;- M}.
#{1 =&gt; 1,4 =&gt; 2,9 =&gt; 3,16 =&gt; 4,25 =&gt; 5}
</code></pre>

<p>Map generators accept map iterators as well as maps. Especially useful
are the ordered iterators returned from the new
<a href="https://www.erlang.org/doc/man/maps.html#iterator-2">maps:iterator/2</a>
function:</p>

<pre><code class="language-erlang">4&gt; AM = #{a =&gt; 1, b =&gt; 2, c =&gt; 1}.
#{c =&gt; 1,a =&gt; 1,b =&gt; 2}
5&gt; [{K,V} || K := V &lt;- maps:iterator(AM, ordered)].
[{a,1},{b,2},{c,1}]
6&gt; [{K,V} || K := V &lt;- maps:iterator(AM, reversed)].
[{c,1},{b,2},{a,1}]
7&gt; [{K,V} || K := V &lt;- maps:iterator(AM, fun(A, B) -&gt; A &gt; B end)].
[{c,1},{b,2},{a,1}]

</code></pre>

<p>Map comprehensions were first suggested in <a href="https://www.erlang.org/eeps/eep-0058">EEP 58</a>.</p>

<h2 id="inlined-mapsget3">Inlined <code>maps:get/3</code></h2>

<p>In OTP 26, the compiler will inline calls to
<a href="https://www.erlang.org/doc/man/maps.html#get-3">maps:get/3</a>, making them slightly
more efficient.</p>
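
<p>As a reminder, <code>maps:get/3</code> returns its third argument as a
default when the key is absent, so the calls that benefit from the
inlining look like this:</p>

<pre><code class="language-erlang">1&gt; maps:get(x, #{x =&gt; 1}, 0).
1
2&gt; maps:get(y, #{x =&gt; 1}, 0).
0
</code></pre>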

<h2 id="improved-mapsmerge2">Improved <code>maps:merge/2</code></h2>

<p>When merging two maps, the
<a href="https://www.erlang.org/doc/man/maps.html#merge-2">maps:merge/2</a>
function will now try to reuse the <a href="https://www.erlang.org/doc/efficiency_guide/maps.html#how-small-maps-are-implemented">key
tuple</a>
from one of the maps in order to reduce the memory usage for maps.</p>

<p>For example:</p>

<pre><code class="language-erlang">1&gt; maps:merge(#{x =&gt; 13, y =&gt; 99, z =&gt; 100}, #{x =&gt; 0, z =&gt; -7}).
#{y =&gt; 99,x =&gt; 0,z =&gt; -7}
</code></pre>

<p>The resulting map has the same three keys as the first map, so it can reuse the
key tuple from the first map.</p>

<p>This optimization is not possible if one of the maps has any key not present
in the other map. For example:</p>

<pre><code class="language-erlang">2&gt; maps:merge(#{x =&gt; 1000}, #{y =&gt; 2000}).
#{y =&gt; 2000,x =&gt; 1000}
</code></pre>

<h2 id="improved-map-updates">Improved map updates</h2>

<p>Updating of a map using the <code>=&gt;</code> operator has been improved to avoid
updates that don’t change the value of the map or its <a href="https://www.erlang.org/doc/efficiency_guide/maps.html#how-small-maps-are-implemented">key
tuple</a>.
For example:</p>

<pre><code class="language-erlang">1&gt; M = #{a =&gt; 42}.
#{a =&gt; 42}
2&gt; M#{a =&gt; 42}.
#{a =&gt; 42}
</code></pre>

<p>The update operation does not change the value of the map, so in order
to save memory, the original map is returned.</p>

<p>(A <a href="https://github.com/erlang/otp/pull/1889">similar optimization for the <code>:=</code>
operator</a> was implemented 5
years ago.)</p>

<p>When updating the values of keys that already exist in a map using the
<code>=&gt;</code> operator, the key tuple will now be re-used. For example:</p>

<pre><code class="language-erlang">3&gt; M#{a =&gt; 100}.
#{a =&gt; 100}
</code></pre>
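
<p>Both behaviors can be observed with <code>erts_debug:same/2</code>, an
internal debugging aid (not something to rely on in production code)
that tests whether two terms are represented by the very same memory:</p>

<pre><code class="language-erlang">4&gt; erts_debug:same(M, M#{a =&gt; 42}).
true
5&gt; erts_debug:same(M, M#{a =&gt; 100}).
false
</code></pre>

<p>The same-value update hands back the original map, while the update
to <code>100</code> builds a new map (which shares the key tuple, although
<code>erts_debug:same/2</code> cannot show that).</p>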

<h2 id="the-pull-requests-for-map-improvements">The pull requests for map improvements</h2>

<p>For anyone who wants to dig deeper, here are the main pull requests
for maps for OTP 26:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/6727">Implement map comprehensions (EEP-58)</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6151">Use in-memory atom ordering for map ordering</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6718">Add maps:iterator/2 with ~k and ~K format options for printing ordered maps</a></li>
  <li><a href="https://github.com/erlang/otp/pull/7003">sys_core_fold: Inline maps:get/3</a></li>
  <li><a href="https://github.com/erlang/otp/pull/7004">Optimize maps:merge/2 of small maps</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6267">Inline creation of small maps with literal keys</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6178">Enhance creation of maps with literal keys</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6657">Do not allocate a new map when the value is the same encore</a></li>
</ul>

<h1 id="improvements-of-the-lists-module">Improvements of the <code>lists</code> module</h1>

<h2 id="new-function-listsenumerate3">New function <code>lists:enumerate/3</code></h2>

<p>In OTP 25, <a href="https://erlang.org/doc/man/lists.html#enumerate-1">lists:enumerate/1</a>
and <code>lists:enumerate/2</code> were introduced. For example:</p>

<pre><code class="language-erlang">1&gt; lists:enumerate([a,b,c]).
[{1,a},{2,b},{3,c}]
2&gt; lists:enumerate(0, [a,b,c]).
[{0,a},{1,b},{2,c}]
</code></pre>

<p>In OTP 26, <a href="https://erlang.org/doc/man/lists.html#enumerate-3">lists:enumerate/3</a>
completes the family of functions by allowing an increment to be specified:</p>

<pre><code class="language-erlang">3&gt; lists:enumerate(0, 10, [a,b,c]).
[{0,a},{10,b},{20,c}]
4&gt; lists:enumerate(0, -1, [a,b,c]).
[{0,a},{-1,b},{-2,c}]
</code></pre>

<h2 id="new-options-for-the-zip-family-of-functions">New options for the <code>zip</code> family of functions</h2>

<p>The <code>zip</code> family of functions in the <code>lists</code> module combines two or three lists
into a single list of tuples. For example:</p>

<pre><code class="language-erlang">1&gt; lists:zip([a,b,c], [1,2,3]).
[{a,1},{b,2},{c,3}]

</code></pre>

<p>The existing <code>zip</code> functions fail if the lists don’t have the same length:</p>

<pre><code class="language-erlang">2&gt; lists:zip([a,b,c,d], [1,2,3]).
** exception error: no function clause matching . . .
</code></pre>

<p>In OTP 26, the <a href="https://www.erlang.org/doc/man/lists.html#zip-2"><code>zip</code>
functions</a> now take
an extra <code>How</code> parameter that determines what should happen when the
lists are of unequal length.</p>

<p>For some use cases for <code>zip</code>, ignoring the superfluous elements in the
longer list or lists can make sense. That can be done using the <code>trim</code>
option:</p>

<pre><code class="language-erlang">3&gt; lists:zip([a,b,c,d], [1,2,3], trim).
[{a,1},{b,2},{c,3}]
</code></pre>

<p>For other use cases it could make more sense to extend the shorter
list or lists to the length of the longest list. That can be done
using the <code>{pad, Defaults}</code> option, where <code>Defaults</code> should be a tuple
having the same number of elements as the number of lists. For
<code>lists:zip/3</code>, that means that the <code>Defaults</code> tuple should have two
elements:</p>

<pre><code class="language-erlang">4&gt; lists:zip([a,b,c,d], [1,2,3], {pad, {zzz, 999}}).
[{a,1},{b,2},{c,3},{d,999}]
5&gt; lists:zip([a,b,c], [1,2,3,4,5], {pad, {zzz, 999}}).
[{a,1},{b,2},{c,3},{zzz,4},{zzz,5}]
</code></pre>

<p>For <code>lists:zip3/3</code> the <code>Defaults</code> tuple should have three elements:</p>

<pre><code class="language-erlang">6&gt; lists:zip3([], [a], [1,2,3], {pad, {0.0, zzz, 999}}).
[{0.0,a,1},{0.0,zzz,2},{0.0,zzz,3}]
</code></pre>
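
<p>To make the semantics of <code>How</code> concrete, here is a small
sketch (an illustration, not the actual <code>lists</code> source) of how
<code>trim</code> and <code>{pad, Defaults}</code> can be interpreted for two
lists (the failing default behavior is omitted):</p>

<pre><code class="language-erlang">%% How only matters once one of the lists runs out.
zip2([X | Xs], [Y | Ys], How) -&gt; [{X, Y} | zip2(Xs, Ys, How)];
zip2([], [], _How) -&gt; [];
zip2(_Xs, _Ys, trim) -&gt; [];
zip2([], Ys, {pad, {Dx, _}}) -&gt; [{Dx, Y} || Y &lt;- Ys];
zip2(Xs, [], {pad, {_, Dy}}) -&gt; [{X, Dy} || X &lt;- Xs].
</code></pre>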

<h1 id="no-need-to-enable-feature-maybe-in-the-runtime-system">No need to enable feature <code>maybe</code> in the runtime system</h1>

<p>In OTP 25, the <a href="https://www.erlang.org/doc/reference_manual/features.html">feature
concept</a>
and the <a href="https://www.erlang.org/doc/reference_manual/expressions.html#maybe">maybe
feature</a>
were introduced. In order to use <code>maybe</code> in OTP 25, it is necessary to
enable it in both the compiler and the runtime system. For example:</p>

<pre><code>$ cat t.erl
-module(t).
-feature(maybe_expr, enable).
-export([listen_port/2]).
listen_port(Port, Options) -&gt;
    maybe
        {ok, ListenSocket} ?= inet_tcp:listen(Port, Options),
        {ok, Address} ?= inet:sockname(ListenSocket),
        {ok, {ListenSocket, Address}}
    end.
$ erlc t.erl
$ erl
Erlang/OTP 25 . . .

Eshell V13.1.1  (abort with ^G)
1&gt; t:listen_port(50000, []).
=ERROR REPORT==== 6-Apr-2023::12:01:20.373223 ===
Loading of . . ./t.beam failed: {features_not_allowed,
                                 [maybe_expr]}

** exception error: undefined function t:listen_port/2
2&gt; q().
$ erl -enable-feature maybe_expr
Erlang/OTP 25 . . .

Eshell V13.1.1  (abort with ^G)
1&gt; t:listen_port(50000, []).
{ok,{#Port&lt;0.5&gt;,{{0,0,0,0},50000}}}
</code></pre>

<p>In OTP 26, it is no longer necessary to enable a feature in the
runtime system in order to load modules that are using it.
It is sufficient to have <code>-feature(maybe_expr, enable).</code> in the module.</p>

<p>For example:</p>

<pre><code>$ erlc t.erl
$ erl
Erlang/OTP 26 . . .

Eshell V14.0 (press Ctrl+G to abort, type help(). for help)
1&gt; t:listen_port(50000, []).
{ok,{#Port&lt;0.4&gt;,{{0,0,0,0},50000}}}
</code></pre>

<h1 id="improvements-in-the-erlang-compiler-and-jit">Improvements in the Erlang compiler and JIT</h1>

<p>OTP 26 improves on the type-based optimizations in the JIT introduced
last year, but the most noticeable improvements are for matching and
construction of binaries using the bit syntax. Those improvements,
combined with changes to the <code>base64</code> module itself, make encoding to
Base64 about 4 times faster and decoding from Base64 more than 3
times faster.</p>

<p>More details about these improvements can be found in the blog post
<a href="https://www.erlang.org/blog/more-optimizations">More Optimizations in the Compiler and JIT</a>.</p>

<p>Worth mentioning here is also the re-introduction of an optimization
that was lost when the JIT was introduced in OTP 24:</p>

<p><a href="https://github.com/erlang/otp/pull/6963">erts: Reintroduce literal fun optimization</a></p>

<p>It turns out that this optimization is important for the
<a href="https://github.com/michalmuskala/jason">jason</a> library. Without it,
<a href="https://github.com/michalmuskala/jason/pull/161">JSON decoding is 10 percent
slower</a>.</p>

<h1 id="incremental-mode-for-dialyzer">Incremental mode for Dialyzer</h1>

<p>Dialyzer has a new incremental mode implemented by Tom Davies. The
incremental mode can greatly speed up the analysis when only small
changes have been done to a code base.</p>

<p>Let’s jump straight into an example. Assuming that we want to prepare
a pull request for the <code>stdlib</code> application, here is how we can use Dialyzer’s
incremental mode to show warnings for any issues in <code>stdlib</code>:</p>

<pre><code>$ dialyzer --incremental --apps erts kernel stdlib compiler crypto --warning_apps stdlib
Proceeding with incremental analysis... done in 0m14.91s
done (passed successfully)
</code></pre>

<p>Let’s break down the command line:</p>

<ul>
  <li>
    <p>The <code>--incremental</code> option tells Dialyzer to use the incremental mode.</p>
  </li>
  <li>
    <p>The <code>--warning_apps stdlib</code> option lists the application that we want
warnings for. In this case, it’s the <code>stdlib</code> application.</p>
  </li>
  <li>
    <p>The <code>--apps erts kernel stdlib compiler crypto</code> option lists the
applications that should be analyzed, but without generating any
warnings.</p>
  </li>
</ul>

<p>Dialyzer analyzed all modules given for the <code>--apps</code> and
<code>--warning_apps</code> options. On my computer, the analysis finished in
about 15 seconds.</p>

<p>If I immediately run Dialyzer with the same arguments, it finishes pretty much
instantaneously because nothing has been changed:</p>

<pre><code>$ dialyzer --incremental --warning_apps stdlib --apps erts kernel stdlib compiler crypto
done (passed successfully)
</code></pre>

<p>If I make any change to the <code>lists</code> module (for example, by adding a new
function), Dialyzer will re-analyze all modules that depend on it
directly or indirectly:</p>

<pre><code>$ dialyzer --incremental --warning_apps stdlib --apps erts kernel stdlib compiler crypto
There have been changes to analyze
    Of the 270 files being tracked, 1 have been changed or removed,
    resulting in 270 requiring analysis because they depend on those changes
Proceeding with incremental analysis... done in 0m14.95s
done (passed successfully)
</code></pre>

<p>It turns out that all modules in the analyzed applications depend on
the <code>lists</code> module directly or indirectly.</p>

<p>If I change something in the <code>base64</code> module, the re-analysis will be
much quicker because there are fewer dependencies:</p>

<pre><code>$ dialyzer --incremental --warning_apps stdlib --apps erts kernel stdlib compiler crypto
There have been changes to analyze
    Of the 270 files being tracked, 1 have been changed or removed,
    resulting in 3 requiring analysis because they depend on those changes
Proceeding with incremental analysis... done in 0m1.07s
done (passed successfully)
</code></pre>

<p>In this case only three modules needed to be re-analyzed, which was
done in about one second.</p>

<h2 id="using-the-dialyzerconfig-file">Using the dialyzer.config file</h2>

<p>Note that all of the examples above used the same command line.</p>

<p>When running Dialyzer in the incremental mode, the list of
applications to be analyzed and the list of applications to produce
warnings for must be supplied every time Dialyzer is invoked.</p>

<p>To avoid having to supply the application lists on the command line,
they can be put into a configuration file named <code>dialyzer.config</code>.
To find out in which directory Dialyzer will look for the configuration
file, run the following command:</p>

<pre><code>$ dialyzer --help
  .
  .
  .
Configuration file:
     Dialyzer's configuration file may also be used to augment the default
     options and those given directly to the Dialyzer command. It is commonly
     used to avoid repeating options which would otherwise need to be given
     explicitly to Dialyzer on every invocation.

     The location of the configuration file can be set via the
     DIALYZER_CONFIG environment variable, and defaults to
     within the user_config location given by filename:basedir/3.

     On your system, the location is currently configured as:
       /Users/bjorng/Library/Application Support/erlang/dialyzer.config

     An example configuration file's contents might be:

       {incremental,
         {default_apps,[stdlib,kernel,erts]},
         {default_warning_apps,[stdlib]}
       }.
       {warnings, [no_improper_lists]}.
       {add_pathsa,["/users/samwise/potatoes/ebin"]}.
       {add_pathsz,["/users/smeagol/fish/ebin"]}.

  .
  .
  .

</code></pre>

<p>Near the end there is information about the configuration file and where Dialyzer
will look for it.</p>

<p>To shorten the command line for our previous examples, the following term can
be stored in the <code>dialyzer.config</code> file:</p>

<pre><code>{incremental,
 {default_apps, [erts,kernel,stdlib,compiler,crypto]},
 {default_warning_apps, [stdlib]}
}.
</code></pre>

<p>Now it is sufficient to just give the <code>--incremental</code> option to Dialyzer:</p>

<pre><code>$ dialyzer --incremental
done (passed successfully)
</code></pre>

<h2 id="running-dialyzer-on-proper">Running Dialyzer on proper</h2>

<p>As a final example, let’s run Dialyzer on
<a href="https://github.com/proper-testing/proper/">PropEr</a>.</p>

<p>To do that, the <code>default_warning_apps</code> option in the configuration
file must be changed to <code>proper</code>. It is also necessary to add the
<code>add_pathsa</code> option to prepend the path of the <code>proper</code> application to
the code path:</p>

<pre><code>{incremental,
 {default_apps, [erts,kernel,stdlib,compiler,crypto]},
 {default_warning_apps, [proper]}
}.
{add_pathsa, ["/Users/bjorng/git/proper/_build/default/lib/proper"]}.
</code></pre>

<p>Running Dialyzer:</p>

<pre><code>$ dialyzer --incremental
There have been changes to analyze
    Of the 296 files being tracked,
    26 have been changed or removed,
    resulting in 26 requiring analysis because they depend on those changes
Proceeding with incremental analysis...
proper.erl:2417:13: Unknown function cover:start/1
proper.erl:2426:13: Unknown function cover:stop/1
proper_symb.erl:249:9: Unknown function erl_syntax:atom/1
proper_symb.erl:250:5: Unknown function erl_syntax:revert/1
proper_symb.erl:250:23: Unknown function erl_syntax:application/3
proper_symb.erl:257:51: Unknown function erl_syntax:nil/0
proper_symb.erl:259:49: Unknown function erl_syntax:cons/2
proper_symb.erl:262:5: Unknown function erl_syntax:revert/1
proper_symb.erl:262:23: Unknown function erl_syntax:tuple/1
 done in 0m2.36s
done (warnings were emitted)
</code></pre>

<p>Dialyzer found 26 new files to analyze (the BEAM files in the <code>proper</code> application).
Those were analyzed in about two and a half seconds.</p>

<p>Dialyzer emitted warnings for unknown functions because <code>proper</code> calls
functions in applications that were not being analyzed. To eliminate those warnings,
the <code>tools</code> and <code>syntax_tools</code> applications can be added to the
<code>default_apps</code> list:</p>

<pre><code>{incremental,
 {default_apps, [erts,kernel,stdlib,compiler,crypto,tools,syntax_tools]},
 {default_warning_apps, [proper]}
}.
{add_pathsa, ["/Users/bjorng/git/proper/_build/default/lib/proper"]}.
</code></pre>

<p>With that change to the configuration file, no more warnings are printed:</p>

<pre><code>$ dialyzer --incremental
There have been changes to analyze
    Of the 319 files being tracked,
    23 have been changed or removed,
    resulting in 38 requiring analysis because they depend on those changes
Proceeding with incremental analysis... done in 0m6.47s
</code></pre>

<p>It is also possible to include warning options in the configuration
file, for example to disable warnings for non-proper lists or to enable
warnings for unmatched returns. Let’s enable warnings for unmatched
returns:</p>

<pre><code>{incremental,
 {default_apps, [erts,kernel,stdlib,compiler,crypto,tools,syntax_tools]},
 {default_warning_apps, [proper]}
}.
{warnings, [unmatched_returns]}.
{add_pathsa, ["/Users/bjorng/git/proper/_build/default/lib/proper"]}.
</code></pre>

<p>When warnings options are changed, Dialyzer will re-analyze all modules:</p>

<pre><code>$ dialyzer --incremental
PLT was built for a different set of enabled warnings,
so an analysis must be run for 319 modules to rebuild it
Proceeding with incremental analysis... done in 0m19.43s
done (passed successfully)
</code></pre>

<h2 id="pull-request">Pull request</h2>

<p><a href="https://github.com/erlang/otp/pull/5997">dialyzer: Add incremental analysis mode</a></p>

<h1 id="argparse-a-command-line-parser-for-erlang">argparse: A command line parser for Erlang</h1>

<p>New in OTP 26 is the
<a href="https://www.erlang.org/doc/man/argparse.html">argparse</a> module, which
simplifies parsing of the command line in
<a href="https://www.erlang.org/doc/man/escript.html">escripts</a>.  <code>argparse</code>
was implemented by Maxim Fedorov.</p>

<p>To show only a few of the features, let’s implement the command-line
parsing for an escript called <code>ehead</code>, inspired by the Unix command
<a href="https://en.wikipedia.org/wiki/Head_(Unix)">head</a>:</p>

<pre><code class="language-erlang">#!/usr/bin/env escript
%% -*- erlang -*-

main(Args) -&gt;
    argparse:run(Args, cli(), #{progname =&gt; ehead}).

cli() -&gt;
    #{
      arguments =&gt;
          [#{name =&gt; lines, type =&gt; {integer, [{min, 1}]},
             short =&gt; $n, long =&gt; "-lines", default =&gt; 10,
             help =&gt; "number of lines to print"},
           #{name =&gt; files, nargs =&gt; nonempty_list, action =&gt; extend,
             help =&gt; "lists of files"}],
      handler =&gt; fun(Args) -&gt;
                         io:format("~p\n", [Args])
                 end
     }.
</code></pre>

<p>As currently written, the <code>ehead</code> script will simply print the
arguments collected by <code>argparse</code> and quit.</p>

<p>If <code>ehead</code> is run without any arguments an error message will be
shown:</p>

<pre><code>$ ehead
error: ehead: required argument missing: files
Usage:
  ehead [-n &lt;lines&gt;] [--lines &lt;lines&gt;] &lt;files&gt;...

Arguments:
  files       lists of files

Optional arguments:
  -n, --lines number of lines to print (int &gt;= 1, 10)
</code></pre>

<p>The message tells us that at least one file name must be given:</p>

<pre><code>$ ehead foo bar baz
#{lines =&gt; 10,files =&gt; ["foo","bar","baz"]}
</code></pre>

<p>Since the command line was valid, <code>argparse</code> collected the arguments
into a map, which was then printed by the <code>handler</code> fun.</p>

<p>The number of lines to be printed from each file defaults to <code>10</code>, but
can be changed using either the <code>-n</code> or <code>--lines</code> option:</p>

<pre><code>$ ehead -n 42 foo bar baz
#{lines =&gt; 42,files =&gt; ["foo","bar","baz"]}
$ ehead foo --lines=42 bar baz
#{lines =&gt; 42,files =&gt; ["foo","bar","baz"]}
$ ehead --lines 42 foo bar baz
#{lines =&gt; 42,files =&gt; ["foo","bar","baz"]}
$ ehead foo bar --lines 42 baz
#{lines =&gt; 42,files =&gt; ["foo","bar","baz"]}
</code></pre>

<p>Attempting to give the number of lines as <code>0</code> results in an error message:</p>

<pre><code>$ ehead -n 0 foobar
error: ehead: invalid argument for lines: 0 is less than accepted minimum
Usage:
  ehead [-n &lt;lines&gt;] [--lines &lt;lines&gt;] &lt;files&gt;...

Arguments:
  files       lists of files

Optional arguments:
  -n, --lines number of lines to print (int &gt;= 1, 10)
</code></pre>
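
<p>To turn <code>ehead</code> into a working <code>head</code> clone, the
<code>handler</code> fun can consume the argument map directly. Here is one
possible way to finish the script (a sketch; <code>print_head/2</code> and
<code>do_print/2</code> are made-up helpers, not part of <code>argparse</code>):</p>

<pre><code class="language-erlang">%% Replace the io:format/2 handler in cli/0 with:
handler =&gt; fun(#{lines := N, files := Files}) -&gt;
                   lists:foreach(fun(F) -&gt; print_head(F, N) end, Files)
           end

%% ...and add these helpers to the script:
print_head(File, N) -&gt;
    {ok, Dev} = file:open(File, [read]),
    try
        do_print(Dev, N)
    after
        file:close(Dev)
    end.

%% Print at most N lines from the open device.
do_print(_Dev, 0) -&gt; ok;
do_print(Dev, N) -&gt;
    case io:get_line(Dev, "") of
        eof  -&gt; ok;
        Line -&gt; io:put_chars(Line), do_print(Dev, N - 1)
    end.
</code></pre>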

<h2 id="pull-request-1">Pull request</h2>

<p><a href="https://github.com/erlang/otp/pull/6852">[argparse] Command line parser for Erlang</a></p>

<h1 id="ssl-safer-defaults">SSL: Safer defaults</h1>

<p>In OTP 25, the default options for
<a href="https://www.erlang.org/doc/man/ssl.html#connect-3">ssl:connect/3</a>
would allow setting up a connection without verifying the
authenticity of the server (that is, without checking the server’s
certificate chain). For example:</p>

<pre><code class="language-erlang">Erlang/OTP 25 . . .

Eshell V13.1.1  (abort with ^G)
1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, []).
=WARNING REPORT==== 6-Apr-2023::12:29:20.824457 ===
Description: "Authenticity is not established by certificate path validation"
     Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"

{ok,{sslsocket,{gen_tcp,#Port&lt;0.6&gt;,tls_connection,undefined},
               [&lt;0.122.0&gt;,&lt;0.121.0&gt;]}}
</code></pre>

<p>A warning report would be generated, but a connection would be set up.</p>

<p>In OTP 26, the default value for the <code>verify</code> option is now
<code>verify_peer</code> instead of <code>verify_none</code>. Host verification
requires trusted CA certificates to be supplied using one of the options
<code>cacerts</code> or <code>cacertfile</code>. Therefore, a connection attempt with an empty
option list will fail in OTP 26:</p>

<pre><code class="language-erlang">Erlang/OTP 26 . . .

Eshell V14.0 (press Ctrl+G to abort, type help(). for help)
1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, []).
{error,{options,incompatible,
                [{verify,verify_peer},{cacerts,undefined}]}}
</code></pre>

<p>The default value for the <code>cacerts</code> option is <code>undefined</code>,
which is not compatible with the <code>{verify,verify_peer}</code> option.</p>

<p>To make the connection succeed, the recommended way is to
use the <code>cacerts</code> option to supply the CA certificates to be used
for verification. For example:</p>

<pre><code class="language-erlang">1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, [{cacerts, public_key:cacerts_get()}]).
{ok,{sslsocket,{gen_tcp,#Port&lt;0.5&gt;,tls_connection,undefined},
               [&lt;0.137.0&gt;,&lt;0.136.0&gt;]}}
</code></pre>

<p>Alternatively, host verification can be explicitly disabled. For example:</p>

<pre><code class="language-erlang">1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, [{verify,verify_none}]).
{ok,{sslsocket,{gen_tcp,#Port&lt;0.6&gt;,tls_connection,undefined},
               [&lt;0.143.0&gt;,&lt;0.142.0&gt;]}}
</code></pre>

<p>OTP 26 is also safer in that legacy algorithms such as SHA1 and
DSA are no longer allowed by default.</p>

<h1 id="ssl-improved-checking-of-options">SSL: Improved checking of options</h1>

<p>In OTP 26, the checking of options is strengthened to return errors
for incorrect options that used to be silently ignored. For example,
<code>ssl</code> now rejects the <code>fail_if_no_peer_cert</code> option if used for the
client:</p>

<pre><code class="language-erlang">1&gt; application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2&gt; ssl:connect("www.erlang.org", 443, [{fail_if_no_peer_cert, true}, {verify, verify_peer}, {cacerts, public_key:cacerts_get()}]).
{error,{option,server_only,fail_if_no_peer_cert}}
</code></pre>

<p>In OTP 25, the option would be silently ignored.</p>

<p><code>ssl</code> in OTP 26 also returns clearer error reasons. In the example in
the previous section the following connection attempt was shown:</p>

<pre><code class="language-erlang">2&gt; ssl:connect("www.erlang.org", 443, []).
{error,{options,incompatible,
                [{verify,verify_peer},{cacerts,undefined}]}}
</code></pre>

<p>In OTP 25, the corresponding error return is less clear:</p>

<pre><code class="language-erlang">2&gt; ssl:connect("www.erlang.org", 443, [{verify,verify_peer}]).
{error,{options,{cacertfile,[]}}}
</code></pre>]]></content><author><name>Björn Gustavsson</name></author><category term="erlang" /><category term="otp" /><category term="26" /><category term="release" /><summary type="html"><![CDATA[Erlang/OTP 26 is finally here. This blog post will introduce the new features that we are most excited about.]]></summary></entry><entry><title type="html">More Optimizations in the Compiler and JIT</title><link href="https://www.erlang.org/blog/more-optimizations/" rel="alternate" type="text/html" title="More Optimizations in the Compiler and JIT" /><published>2023-04-19T00:00:00+00:00</published><updated>2023-04-19T00:00:00+00:00</updated><id>https://www.erlang.org/blog/more-optimizations</id><content type="html" xml:base="https://www.erlang.org/blog/more-optimizations/"><![CDATA[<p>This post explores the enhanced type-based optimizations
and the other performance improvements in Erlang/OTP 26.</p>

<h3 id="what-to-expect-of-the-jit-in-otp-26">What to expect of the JIT in OTP 26</h3>

<p>In OTP 25, the compiler was updated to embed type information in
the BEAM file and the JIT was extended to emit better code based
on that type information. Those improvements were described in
the blog post <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/">Type-Based Optimizations in the JIT</a>.</p>

<p>As mentioned in that blog post, there were limitations in both the
compiler and the JIT that prevented many optimizations. In OTP 26, the
compiler will produce better type information and the JIT will take
better advantage of the improved type information, typically resulting
in fewer redundant type tests and smaller native code size.</p>

<p>A new BEAM instruction introduced in OTP 26 makes record updates
faster by a small but measurable amount.</p>

<p>The most noticeable performance improvements in OTP 26 are probably for
matching and construction of binaries using the bit syntax. Those
improvements, combined with changes to the <code>base64</code> module itself,
make encoding to Base64 about 4 times as fast and decoding from
Base64 more than 3 times as fast.</p>

<h3 id="please-try-this-at-home">Please try this at home!</h3>

<p>While this blog post will show many examples of generated code, I have
attempted to explain the optimizations in English as well. Feel free
to skip the code examples.</p>

<p>On the other hand, if you want more code examples…</p>

<p>To examine the native code for loaded modules, start the runtime system like this:</p>

<pre><code class="language-bash">erl +JDdump true
</code></pre>

<p>The native code for all modules that are loaded will be dumped to files with the
extension <code>.asm</code>.</p>

<p>To examine the BEAM code for a module, use the <code>-S</code> option when
compiling. For example:</p>

<pre><code class="language-bash">erlc -S base64.erl
</code></pre>

<h3 id="quick-overview-of-type-based-optimizations-in-otp-25">Quick overview of type-based optimizations in OTP 25</h3>

<p>Let’s quickly summarize the type-based optimizations in OTP 25. For more
details, see the <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/">aforementioned blog post</a>.</p>

<p>First consider an addition of two values with nothing known about
their types:</p>

<pre><code class="language-erlang">add1(X, Y) -&gt;
    X + Y.
</code></pre>

<p>The <a href="https://www.erlang.org/blog/a-brief-beam-primer">BEAM code</a> looks like this:</p>

<pre><code>    {gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
    return.
</code></pre>

<p>Without any information about the operands, the JIT must emit code
that can handle all possible types for the operands. For the x86_64
architecture, 14 native instructions are needed.</p>

<p>If the operands are known to be integers small enough
that overflow is impossible, the JIT needs to emit only 5 native
instructions for the addition.</p>

<p>Here is an example where the types and ranges of the operands for the
<code>+</code> operator are known:</p>

<pre><code class="language-erlang">add5(X, Y) when X =:= X band 16#3FF,
                Y =:= Y band 16#3FF -&gt;
    X + Y.
</code></pre>

<p>The BEAM code for this function is as follows:</p>

<pre><code>    {gc_bif,'band',{f,24},2,[{x,0},{integer,1023}],{x,2}}.
    {test,is_eq_exact,
          {f,24},
          [{tr,{x,0},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
    {gc_bif,'band',{f,24},2,[{x,1},{integer,1023}],{x,2}}.
    {test,is_eq_exact,
          {f,24},
          [{tr,{x,1},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
            {x,0}}.
    return.
</code></pre>

<p>The register operands (<code>{x,0}</code> and <code>{x,1}</code>) have now been annotated with
type information:</p>

<pre><code class="language-erlang">{tr,Register,Type}
</code></pre>

<p>That is, each register operand is a three-tuple with <code>tr</code> as the first
element. <code>tr</code> stands for <strong>typed register</strong>. The second element is the
BEAM register (<code>{x,0}</code> or <code>{x,1}</code> in this case), and the third element
is the type of the register in the compiler’s internal type
representation. <code>{t_integer,{0,1023}}</code> means that the value is an
integer in the inclusive range 0 through 1023.</p>

<p>With that type information, the JIT emits the following native code
for the <code>+</code> operator:</p>

<pre><code class="language-nasm"># i_plus_ssjd
# add without overflow check
    mov rax, qword ptr [rbx]
    mov rsi, qword ptr [rbx+8]
    and rax, -16               ; Zero the tag bits
    add rax, rsi
    mov qword ptr [rbx], rax
</code></pre>

<p>(Lines starting with <code>#</code> are comments emitted by the JIT, while the
text that follows <code>;</code> is a comment added by me for clarification.)</p>
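
<p>The reason a single <code>and</code>/<code>add</code> pair suffices follows
from the tagging scheme: a small integer <code>N</code> is stored as
<code>16 * N + 15</code>, so zeroing the tag bits of one operand makes a
plain machine addition produce a correctly tagged result. The identity
is easy to check in Erlang (the <code>Tag</code> fun is just for
illustration):</p>

<pre><code class="language-erlang">%% (16*X) + (16*Y + 15) =:= 16*(X + Y) + 15
Tag = fun(N) -&gt; 16 * N + 15 end,
true = (Tag(700) band -16) + Tag(300) =:= Tag(1000).
</code></pre>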

<p>The reduction in code size from 14 instructions down to 5 is nice, but
having to express the range check in that convoluted way using <code>band</code>
can hardly be called natural.</p>

<p>If we try to express the range checks in a more natural way:</p>

<pre><code class="language-erlang">add4(X, Y) when is_integer(X), 0 =&lt; X, X &lt; 16#400,
                is_integer(Y), 0 =&lt; Y, Y &lt; 16#400 -&gt;
    X + Y.
</code></pre>

<p>the compiler in OTP 25 will no longer be able to figure out the
ranges for the operands. Here is the BEAM code:</p>

<pre><code>    {test,is_integer,{f,22},[{x,0}]}.
    {test,is_ge,{f,22},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
    {test,is_lt,{f,22},[{tr,{x,0},{t_integer,any}},{integer,1024}]}.
    {test,is_integer,{f,22},[{x,1}]}.
    {test,is_ge,{f,22},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
    {test,is_lt,{f,22},[{tr,{x,1},{t_integer,any}},{integer,1024}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
            {x,0}}.
    return.
</code></pre>

<p>Because of that severe limitation in the compiler’s value range
analysis, I wrote:</p>

<blockquote>
  <p>We aim to improve the type analysis and optimizations in OTP 26 and
generate better code for this example.</p>
</blockquote>

<h3 id="the-enhanced-type-based-optimizations-in-otp-26">The enhanced type-based optimizations in OTP 26</h3>

<p>Compiling the same example with OTP 26, the result is:</p>

<pre><code>    {test,is_integer,{f,19},[{x,0}]}.
    {test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
    {test,is_ge,{f,19},[{integer,1023},{tr,{x,0},{t_integer,{0,'+inf'}}}]}.
    {test,is_integer,{f,19},[{x,1}]}.
    {test,is_ge,{f,19},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
    {test,is_ge,{f,19},[{integer,1023},{tr,{x,1},{t_integer,{0,'+inf'}}}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
            {x,0}}.
</code></pre>

<p>The BEAM instruction for the <code>+</code> operator now has ranges for its operands.</p>

<p>Let’s look a little closer at the first three instructions, which
correspond to the guard test <code>is_integer(X), 0 =&lt; X, X &lt; 16#400</code>.</p>

<p>First is the guard check for an integer:</p>

<pre><code>    {test,is_integer,{f,19},[{x,0}]}.
</code></pre>

<p>It is followed by the guard test <code>0 =&lt; X</code> (rewritten to <code>X &gt;= 0</code> by the compiler):</p>

<pre><code>    {test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
</code></pre>

<p>As a result of the <code>is_integer/1</code> test it is known that <code>{x,0}</code>
is an integer.</p>

<p>The third instruction corresponds to <code>X &lt; 16#400</code>, which the compiler
has rewritten to <code>16#3FF &gt;= X</code> (<code>1023 &gt;= X</code>):</p>

<pre><code>    {test,is_ge,{f,19},[{integer,1023},{tr,{x,0},{t_integer,{0,'+inf'}}}]}.
</code></pre>

<p>In the type for the <code>{x,0}</code> register there is something new for
OTP 26. It says that the range is 0 through <code>'+inf'</code>, that is, from 0 up
to positive infinity. Combining that range with the range from this
instruction, the Erlang compiler can infer that if this instruction
succeeds, the type for <code>{x,0}</code> is <code>{t_integer,{0,1023}}</code>.</p>

<h3 id="combining-guard-tests">Combining guard tests</h3>

<p>In OTP 25, the JIT would emit native code for each BEAM instruction
in the guard individually. When translated individually, the three guard
tests for one of the variables each require 11 native instructions, or 33
instructions for all three.</p>

<p>By having the BEAM loader combine the three guard tests into a
single <code>is_int_in_range</code> instruction, the JIT is capable of doing a much
better job, emitting a mere 6 native instructions.</p>

<p>How is that possible?</p>

<p>As individual BEAM instructions, each guard test needs 5 instructions
to fetch the value from <code>{x,0}</code> and test that the value is a small
integer. As a combined instruction, that only needs to be done once.
Other parts of the guard tests also become redundant in the combined
instruction and can be omitted. For example, the <code>is_integer/1</code> type
test will also succeed if its argument is a <strong>bignum</strong> (an integer
that does not fit in a machine word). Clearly, a bignum will fall well
outside the range 0 through 1023, so if the argument is not a small
integer, the combined guard test will fail immediately.</p>

<p>With those and some other simplifications, we end up with the following
native instructions:</p>

<pre><code class="language-nasm"># is_int_in_range_fScc
    mov rax, qword ptr [rbx]
    sub rax, 15
    test al, 15
    short jne label_19
    cmp rax, 16368
    short ja label_19
</code></pre>

<p>The first instruction fetches the value of <code>{x,0}</code> to the CPU
register <code>rax</code>:</p>

<pre><code class="language-nasm">    mov rax, qword ptr [rbx]
</code></pre>

<p>The next instruction subtracts the <a href="http://www.it.uu.se/research/publications/reports/2000-029/2000-029-nc.pdf">tagged value</a> for the lower
bound of the range. Since the lower bound of the range is 0 and the
tag for small integers is 15, the value that is subtracted
is <code>16 * 0 + 15</code> or simply 15. (For small integers, the runtime system
uses the 4 least significant bits of the word as tag bits.)
If the lower bound had been 1, the value to subtract would
have been <code>16 * 1 + 15</code>, or 31:</p>

<pre><code class="language-nasm">    sub rax, 15
</code></pre>

<p>The subtraction achieves two aims at once. Firstly, it simplifies the
tag test in the next two instructions because if the value of
<code>{x,0}</code> is a small integer, the 4 least significant bits will now be
zero:</p>

<pre><code class="language-nasm">    test al, 15
    short jne label_19
</code></pre>

<p>The <code>test al, 15</code> instruction does a bitwise AND operation of the
lower byte of the CPU register <code>rax</code>, discarding the result but
setting CPU flags depending on the value. The next instruction tests
whether the result was nonzero (the tag was not the tag for a small
integer), in which case the test fails and a jump to the failure
label is made.</p>

<p>The second aim for the subtraction is to simplify the range check.
If the value being tested was below the lower bound, the value
of <code>rax</code> will be negative after the subtraction.</p>

<p>Since integers are represented in <a href="https://en.wikipedia.org/wiki/Two%27s_complement">two’s complement notation</a>, a
signed negative integer interpreted as an unsigned integer will be a
very large integer. Therefore, both bounds can be checked at once
using the old trick of treating the value in <code>rax</code> as unsigned:</p>

<pre><code class="language-nasm">    cmp rax, 16368
    short ja label_19
</code></pre>

<p>The <code>cmp rax, 16368</code> instruction compares the value in <code>rax</code> with the
difference of the tagged upper bound and the tagged lower bound, that
is:</p>

<pre><code>(16 * 1023 + 15) - (16 * 0 + 15)
</code></pre>

<p><code>ja</code> stands for “Jump (if) Above”, that is, jump if the CPU flags
indicates that in previous comparison of unsigned integers the first
integer was greater than the second. Since a negative number
represented in two’s complement notation looks like a huge integer
when interpreted as an unsigned integer, <code>short ja label_19</code> will
transfer control to the failure label for values both below the lower
bound and above the upper bound.</p>
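
<p>Putting the constants together, the values in the listing are easy to
reproduce in the shell:</p>

<pre><code class="language-erlang">1&gt; Lower = 16 * 0 + 15.     % tagged lower bound
15
2&gt; Upper = 16 * 1023 + 15.  % tagged upper bound
16383
3&gt; Upper - Lower.           % the constant in "cmp rax, 16368"
16368
</code></pre>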

<h3 id="more-code-generation-improvements">More code generation improvements</h3>

<p>The JIT in OTP 26 generates better code for common combinations of
relational operators. In order to reduce the number of combinations
that the JIT will need to handle, the compiler rewrites the <code>&lt;</code>
operator to <code>&gt;=</code> if possible. In the previous example, it was shown
that the compiler rewrote <code>X &lt; 1024</code> to <code>1023 &gt;= X</code>.</p>

<p>Let’s look at a contrived example to show (off) a few more
improvements in the code generation:</p>

<pre><code class="language-erlang">add6(M) when is_map(M) -&gt;
    A = map_size(M),
    if
        9 &lt; A, A &lt; 100 -&gt;
            A + 6
    end.
</code></pre>

<p>The main part of the BEAM code looks like this:</p>

<pre><code>    {test,is_map,{f,41},[{x,0}]}.
    {gc_bif,map_size,{f,0},1,[{tr,{x,0},{t_map,any,any}}],{x,0}}.
    {test,is_ge,
          {f,43},
          [{tr,{x,0},{t_integer,{0,288230376151711743}}},{integer,10}]}.
    {test,is_ge,
          {f,43},
          [{integer,99},{tr,{x,0},{t_integer,{10,288230376151711743}}}]}.
    {gc_bif,'+',{f,0},1,[{tr,{x,0},{t_integer,{10,99}}},{integer,6}],{x,0}}.
    return.
</code></pre>

<p>In OTP 26, the JIT will inline the code for many of the most
frequently used guard BIFs. Here is the native code for the
<code>map_size/1</code> call:</p>

<pre><code class="language-nasm"># bif_map_size_jsd
    mov rax, qword ptr [rbx]      ; Fetch map from {x,0}
# skipped type check because the argument is always a map
    mov rax, qword ptr [rax+6]    ; Fetch size of map
    shl rax, 4
    or al, 15                     ; Tag as small integer
    mov qword ptr [rbx], rax      ; Store size in {x,0}
</code></pre>

<p>The two <code>is_ge</code> instructions are combined by the BEAM loader into
an <code>is_in_range</code> instruction:</p>

<pre><code class="language-nasm"># is_in_range_ffScc
# simplified fetching of BEAM register
    mov rdi, rax
# skipped test for small operand since it always small
    sub rdi, 175
    cmp rdi, 1424
    ja label_43
</code></pre>

<p>The first instruction is a new optimization in OTP 26. Normally <code>{x,0}</code> is
fetched using the instruction <code>mov rax, qword ptr [rbx]</code>. However, in this
case, the last instruction emitted for the previous BEAM instruction is
<code>mov qword ptr [rbx], rax</code>. Therefore, since it is known that the contents of
<code>{x,0}</code> are already in CPU register <code>rax</code>, the instruction can be simplified
to:</p>

<pre><code class="language-nasm"># simplified fetching of BEAM register
    mov rdi, rax
</code></pre>

<p>The size of a map that will fit in memory on a 64-bit computer is always
a small integer, so the test for a small integer is skipped:</p>

<pre><code class="language-nasm"># skipped test for small operand since it always small
    sub rdi, 175     ; Subtract 16 * 10 + 15
    cmp rdi, 1424    ; Compare with (16*99+15)-(16*10+15)
    ja label_43
</code></pre>

<p>The native code for the <code>+</code> operator looks like this:</p>

<pre><code class="language-nasm"># i_plus_ssjd
# add without overflow check
    mov rax, qword ptr [rbx]
    add rax, 96      ; 16 * 6 + 0
    mov qword ptr [rbx], rax
</code></pre>

<h3 id="new-beam-instructions-in-otp-26">New BEAM instructions in OTP 26</h3>

<p>The previous example of combining guard tests showed that the JIT can
often generate better code if multiple BEAM instructions are combined
into one. While the <a href="https://www.erlang.org/blog/beam-compiler-history/#the-ever-changing-beam-instructions">BEAM loader</a> is capable of combining
instructions, it is often more practical to let the Erlang compiler
emit combined instructions.</p>

<p>OTP 26 introduces two new instructions, each of which replaces a sequence of
any number of simpler instructions:</p>

<ul>
  <li>
    <p><code>update_record</code> for updating any number of fields in a record.</p>
  </li>
  <li>
    <p><code>bs_match</code> for matching multiple segments of fixed size.</p>
  </li>
</ul>

<p>In OTP 25, the <code>bs_create_bin</code> instruction for constructing a binary
with any number of segments was introduced, but its full potential for
generating efficient code was not leveraged in OTP 25.</p>

<h3 id="updating-records-in-otp-25">Updating records in OTP 25</h3>

<p>Consider the following example of a record definition and three functions
that update the record:</p>

<pre><code class="language-erlang">-record(r, {a,b,c,d,e}).

update_a(R) -&gt;
    R#r{a=42}.

update_ce(R) -&gt;
    R#r{c=99,e=777}.

update_bcde(R) -&gt;
    R#r{b=2,c=3,d=4,e=5}.
</code></pre>

<p>In OTP 25 and earlier, the way in which a record is updated depends on
both the number of fields being updated and the size of the record.</p>

<p>When a single field in a record is updated, as in <code>update_a/1</code>, the
<a href="https://www.erlang.org/doc/man/erlang.html#setelement-3">setelement/3</a>
BIF is called:</p>

<pre><code>    {test,is_tagged_tuple,{f,34},[{x,0},6,{atom,r}]}.
    {move,{x,0},{x,1}}.
    {move,{integer,42},{x,2}}.
    {move,{integer,2},{x,0}}.
    {call_ext_only,3,{extfunc,erlang,setelement,3}}.
</code></pre>

<p>When updating more than one field but fewer than approximately half of
the fields, as in <code>update_ce/1</code>, code similar to the following is
emitted:</p>

<pre><code>    {test,is_tagged_tuple,{f,37},[{x,0},6,{atom,r}]}.
    {allocate,0,1}.
    {move,{x,0},{x,1}}.
    {move,{integer,777},{x,2}}.
    {move,{integer,6},{x,0}}.
    {call_ext,3,{extfunc,erlang,setelement,3}}.
    {set_tuple_element,{integer,99},{x,0},3}.
    {deallocate,0}.
    return.
</code></pre>

<p>Here the <code>e</code> field is updated using <code>setelement/3</code>, followed by
<code>set_tuple_element</code> to update the <code>c</code> field destructively. Erlang does
not allow mutation of terms, but here it is done “under the hood” in a
safe way.</p>

<p>When a majority of the fields are updated, as in <code>update_bcde/1</code>, a
new tuple is built:</p>

<pre><code>    {test,is_tagged_tuple,{f,40},[{x,0},6,{atom,r}]}.
    {test_heap,7,1}.
    {get_tuple_element,{x,0},1,{x,0}}.
    {put_tuple2,{x,0},
                {list,[{atom,r},
                       {x,0},
                       {integer,2},
                       {integer,3},
                       {integer,4},
                       {integer,5}]}}.
    return.
</code></pre>

<h3 id="updating-records-in-otp-26">Updating records in OTP 26</h3>

<p>In OTP 26, all records are updated using the new BEAM instruction
<code>update_record</code>.  For example, here is the main part of the BEAM code
for <code>update_a/1</code>:</p>

<pre><code>    {test,is_tagged_tuple,{f,34},[{x,0},6,{atom,r}]}.
    {test_heap,7,1}.
    {update_record,{atom,reuse},6,{x,0},{x,0},{list,[2,{integer,42}]}}.
    return.
</code></pre>

<p>The last operand is a list of positions in the tuple and their corresponding
new values.</p>

<p>The first operand, <code>{atom,reuse}</code>, is a hint to the JIT that it is possible
that the source tuple is already up to date and does not need to be updated.
Another possible value for the hint operand is <code>{atom,copy}</code>, meaning that
the source tuple is definitely not up to date.</p>

<p>The JIT emits the following native code for the <code>update_record</code> instruction:</p>

<pre><code class="language-nasm"># update_record_aIsdI
    mov rax, qword ptr [rbx]
    mov rdi, rax
    cmp qword ptr [rdi+14], 687
    je L130
    vmovups xmm0, [rax-2]
    vmovups [r15], xmm0
    mov qword ptr [r15+16], 687
    vmovups ymm0, [rax+22]
    vmovups [r15+24], ymm0
    lea rax, qword ptr [r15+2]
    add r15, 56
L130:
    mov qword ptr [rbx], rax
</code></pre>

<p>Let’s walk through those instructions. First the value of <code>{x,0}</code> is fetched:</p>

<pre><code class="language-nasm">    mov rax, qword ptr [rbx]
</code></pre>

<p>Since the hint operand is the atom <code>reuse</code>, it is possible that it is
unnecessary to copy the tuple. Therefore, the JIT emits an instruction
sequence to test whether the <code>a</code> field (position 2 in the tuple)
already contains the value 42. If so, the source tuple can be reused:</p>

<pre><code class="language-nasm">    mov rdi, rax
    cmp qword ptr [rdi+14], 687   ; 42
    je L130                       ; Reuse source tuple
</code></pre>

<p>Next follows the copy and update sequence. First the header word for
the tuple and its first element (the <code>r</code> atom) are copied using
<a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX instructions</a>:</p>

<pre><code class="language-nasm">    vmovups xmm0, [rax-2]
    vmovups [r15], xmm0
</code></pre>

<p>Next the value 42 is stored into position 2 of the copy of the tuple:</p>

<pre><code class="language-nasm">    mov qword ptr [r15+16], 687   ; 42
</code></pre>

<p>Finally the remaining four elements of the tuple are copied:</p>

<pre><code class="language-nasm">    vmovups ymm0, [rax+22]
    vmovups [r15+24], ymm0
</code></pre>

<p>All that remains is to create a tagged pointer to the newly created
tuple and increment the heap pointer:</p>

<pre><code class="language-nasm">    lea rax, qword ptr [r15+2]
    add r15, 56
</code></pre>

<p>The last instruction stores the tagged pointer to either the original
or the updated tuple into <code>{x,0}</code>:</p>

<pre><code class="language-nasm">L130:
    mov qword ptr [rbx], rax
</code></pre>

<p>The BEAM code for <code>update_ce/1</code> is very similar to the code for <code>update_a/1</code>:</p>

<pre><code>    {test,is_tagged_tuple,{f,37},[{x,0},6,{atom,r}]}.
    {test_heap,7,1}.
    {update_record,{atom,reuse},
                   6,
                   {x,0},
                   {x,0},
                   {list,[4,{integer,99},6,{integer,777}]}}.
    return.
</code></pre>

<p>The native code looks like this:</p>

<pre><code class="language-nasm"># update_record_aIsdI
    mov rax, qword ptr [rbx]
    vmovups ymm0, [rax-2]
    vmovups [r15], ymm0
    mov qword ptr [r15+32], 1599   ; 99
    mov rdi, [rax+38]
    mov [r15+40], rdi
    mov qword ptr [r15+48], 12447  ; 777
    lea rax, qword ptr [r15+2]
    add r15, 56
    mov qword ptr [rbx], rax
</code></pre>

<p>Note that the copying and updating is done unconditionally, despite
the <code>reuse</code> hint. The JIT is free to ignore the hints. When multiple
fields are being updated, the test for whether the update is
unnecessary would be more expensive, and it is also much less likely
that all of the fields would turn out to be unchanged. Therefore,
trying to reuse the original tuple is more likely to be a
<a href="https://en.wiktionary.org/wiki/pessimization">pessimization</a>
than an optimization.</p>

<h3 id="matching-and-constructing-binaries-in-otp-25">Matching and constructing binaries in OTP 25</h3>

<p>To explore the optimizations of binaries, the following example will
be used:</p>

<pre><code class="language-erlang">bin_swap(&lt;&lt;A:8,B:24&gt;&gt;) -&gt;
    &lt;&lt;B:24,A:8&gt;&gt;.
</code></pre>
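<p>To make the semantics concrete before looking at the generated code, here is an example call (my addition, assuming the function is in a module named <code>t</code>):</p>

<pre><code class="language-erlang">1&gt; t:bin_swap(&lt;&lt;1,2,3,4&gt;&gt;).
&lt;&lt;2,3,4,1&gt;&gt;
</code></pre>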

<p>Somewhat simplified, the main part of the BEAM code as emitted by
the compiler in OTP 25 looks like this:</p>

<pre><code>    {test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
    {bs_get_position,{x,1},{x,0},2}.
    {test,bs_get_integer2,
          {f,2},
          2,
          [{x,1},
           {integer,8},
           1,
           {field_flags,[unsigned,big]}],
          {x,2}}.
    {test,bs_get_integer2,
          {f,2},
          3,
          [{x,1},
           {integer,24},
           1,
           {field_flags,[unsigned,big]}],
          {x,3}}.
    {test,bs_test_tail2,{f,2},[{x,1},0]}.
    {bs_create_bin,{f,0},
                   0,4,1,
                   {x,0},
                   {list,[{atom,integer},
                          1,1,nil,
                          {tr,{x,3},{t_integer,{0,16777215}}},
                          {integer,24},
                          {atom,integer},
                          2,1,nil,
                          {tr,{x,2},{t_integer,{0,255}}},
                          {integer,8}]}}.
    return.
</code></pre>

<p>Let’s walk through the code. The first instruction sets up a <a href="https://www.erlang.org/doc/efficiency_guide/binaryhandling.html#how-binaries-are-implemented">match
context</a>:</p>

<pre><code>    {test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
</code></pre>

<p>A match context holds several pieces of information needed for
matching a binary, such as a pointer to the binary&#8217;s data, the current
bit offset into the binary, and the size of the binary in bits.</p>

<p>The next instruction saves information that will be needed if matching
of the binary fails for some reason:</p>

<pre><code>    {bs_get_position,{x,1},{x,0},2}.
</code></pre>

<p>The next two instructions match out two segments as integers (comments added by me):</p>

<pre><code>    {test,bs_get_integer2,
          {f,2},          % Failure label
          2,              % Number of live X registers (needed for GC)
          [{x,1},         % Match context register
           {integer,8},   % Size of segment in units
           1,             % Unit value
           {field_flags,[unsigned,big]}],
          {x,2}}.         % Destination register
    {test,bs_get_integer2,
          {f,2},
          3,
          [{x,1},
           {integer,24},
           1,
           {field_flags,[unsigned,big]}],
          {x,3}}.
</code></pre>

<p>The next instruction makes sure that the end of the binary has now been
reached:</p>

<pre><code>    {test,bs_test_tail2,{f,2},[{x,1},0]}.
</code></pre>

<p>The next instruction creates the binary with the segments swapped:</p>

<pre><code>    {bs_create_bin,{f,0},
                   0,4,1,
                   {x,0},
                   {list,[{atom,integer},
                          1,1,nil,
                          {tr,{x,3},{t_integer,{0,16777215}}},
                          {integer,24},
                          {atom,integer},
                          2,1,nil,
                          {tr,{x,2},{t_integer,{0,255}}},
                          {integer,8}]}}.
</code></pre>

<p>Before OTP 25, creation of binaries was done using multiple
instructions, similar to how binary matching is still done in
OTP 25. The reason for creating the <code>bs_create_bin</code> instruction in OTP 25
was to be able to provide improved error information when construction
of a binary fails, similar to the <a href="https://www.erlang.org/blog/my-otp-24-highlights/#eep-54-improved-bif-error-information">improved BIF error
information</a>.</p>

<p>When a segment of size 8, 16, 32, or 64 is matched, specialized
instructions are used on x86_64. The specialized instructions do
everything inline, provided that the segment is byte-aligned. (The
JIT in OTP 25 for AArch64/ARM64 does not have these specialized
instructions.) Here is the instruction for matching a segment of
size 8:</p>

<pre><code class="language-nasm"># i_bs_get_integer_8_Stfd
    mov rcx, qword ptr [rbx+8]
    mov rsi, qword ptr [rcx+22]
    lea rdx, qword ptr [rsi+8]
    cmp rdx, qword ptr [rcx+30]
    ja label_25
    rex test sil, 7
    short je L91
    mov edx, 64
    call L92
    short jmp L90
L91:
    mov rdi, qword ptr [rcx+14]
    shr rsi, 3
    mov qword ptr [rcx+22], rdx
    movzx rax, byte ptr [rdi+rsi]
    shl rax, 4
    or rax, 15
L90:
    mov qword ptr [rbx+16], rax
</code></pre>

<p>The first two instructions pick up the pointer to the match context
and from the match context the current bit offset into the binary:</p>

<pre><code class="language-nasm">    mov rcx, qword ptr [rbx+8]   ; Load pointer to match context
    mov rsi, qword ptr [rcx+22]  ; Get offset in bits into binary
</code></pre>

<p>The next three instructions ensure that at least 8 bits remain in
the binary:</p>

<pre><code class="language-nasm">    lea rdx, qword ptr [rsi+8]   ; Add 8 to the offset
    cmp rdx, qword ptr [rcx+30]  ; Compare offset+8 with size of binary
    ja label_25                  ; Fail if the binary is too short
</code></pre>

<p>The next five instructions test whether the current byte in the binary
is aligned at a byte boundary. If not, a helper code fragment is
called:</p>

<pre><code class="language-nasm">    rex test sil, 7    ; Test the 3 least significant bits
    short je L91       ; Jump if 0 (meaning byte-aligned)
    mov edx, 64        ; Load size and flags
    call L92           ; Call helper fragment
    short jmp L90      ; Done
</code></pre>

<p>A <strong>helper code fragment</strong> is a shared block of code that can be
called from the native code generated for BEAM instructions, typically
to handle cases that are uncommon and/or would require more native
instructions than are practical to include inline. Each such code
fragment has its own calling convention, typically tailor-made to be
as convenient for the caller as possible. (See <a href="https://www.erlang.org/blog/jit-part-2/">Further adventures in
the JIT</a> for more information
about helper code fragments.)</p>

<p>The remaining instructions read one byte from memory, convert it to a
tagged Erlang term, store it in <code>{x,2}</code>, and advance the bit offset
in the match context:</p>

<pre><code class="language-nasm">L91:
    mov rdi, qword ptr [rcx+14]    ; Load base pointer for binary
    shr rsi, 3                     ; Convert bit offset to byte offset
    mov qword ptr [rcx+22], rdx    ; Update bit offset in match context
    movzx rax, byte ptr [rdi+rsi]  ; Read one byte from the binary
    shl rax, 4                     ; Multiply by 16...
    or rax, 15                     ; ... and add tag for a small integer

L90:
    mov qword ptr [rbx+16], rax    ; Store extracted integer
</code></pre>

<p>When matching a segment of size other than one of the special sizes
mentioned earlier, the JIT will always emit a call to a general
routine that can handle matching of any integer segment with any
alignment, endianness, and signedness.</p>
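<p>As a hypothetical illustration (my example, not code from OTP), the following clause matches a 4-bit and a 12-bit segment; neither size is one of the special ones, so the OTP 25 JIT emits calls to the general routine for both:</p>

<pre><code class="language-erlang">%% Neither segment has size 8, 16, 32, or 64, so the general
%% matching routine is called for each of them.
version_and_flags(&lt;&lt;Version:4, Flags:12, _Rest/binary&gt;&gt;) -&gt;
    {Version, Flags}.
</code></pre>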

<p>In OTP 25, the full potential for optimization of the <code>bs_create_bin</code>
instruction is not realized. The construction of each segment is done
by calling a helper routine that builds the segment. Here is the
native code for the part of the <code>bs_create_bin</code> instruction that builds the
integer segments:</p>

<pre><code class="language-nasm"># construct integer segment
    mov edx, 24
    mov rsi, qword ptr [rbx+24]
    xor ecx, ecx
    lea rdi, qword ptr [rbx-80]
    call 4387496416
# construct integer segment
    mov edx, 8
    mov rsi, qword ptr [rbx+16]
    xor ecx, ecx
    lea rdi, qword ptr [rbx-80]
    call 4387496416
</code></pre>

<h3 id="binary-pattern-matching-in-otp-26">Binary pattern matching in OTP 26</h3>

<p>In OTP 26, there is a new BEAM <code>bs_match</code> instruction used for
matching segments with sizes known at compile time. The BEAM code for
the matching code in the function head for <code>bin_swap/1</code> is as follows:</p>

<pre><code>    {test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
    {bs_get_position,{x,1},{x,0},2}.
    {bs_match,{f,2},
              {x,1},
              {commands,[{ensure_exactly,32},
                         {integer,2,{literal,[]},8,1,{x,2}},
                         {integer,3,{literal,[]},24,1,{x,3}}]}}.
</code></pre>

<p>The first two instructions are identical to their OTP 25 counterparts.</p>

<p>The first operand of the <code>bs_match</code> instruction, <code>{f,2}</code>, is the
failure label, and the second operand, <code>{x,1}</code>, is the register holding
the match context. The third operand, <code>{commands,[...]}</code>, is a list of
matching commands.</p>

<p>The first command in the <code>commands</code> list, <code>{ensure_exactly,32}</code>, tests
that the remaining number of bits in the binary being matched is
exactly 32. If not, a jump is made to the failure label.</p>

<p>The second command extracts an integer of 8 bits and stores it in
<code>{x,2}</code>. The third command extracts an integer of 24 bits and stores it
in <code>{x,3}</code>.</p>

<p>Having matching of multiple segments contained in a single BEAM
instruction makes it much easier for the JIT to generate efficient
code. Here is what the native code will do:</p>

<ul>
  <li>
    <p>Test that there are exactly 32 bits left in the binary.</p>
  </li>
  <li>
    <p>If the segment is byte-aligned, read a 4-byte word from the binary
and store it in a CPU register.</p>
  </li>
  <li>
    <p>If the segment is not byte-aligned, read an 8-byte word from the binary
and shift to extract the 32 bits needed.</p>
  </li>
  <li>
    <p>Shift and mask out 8 bits and tag as an integer. Store into <code>{x,2}</code>.</p>
  </li>
  <li>
    <p>Shift and mask out 24 bits and tag as an integer. Store into <code>{x,3}</code>.</p>
  </li>
</ul>

<p>The native code for the <code>bs_match</code> instruction (slightly simplified) is
as follows:</p>

<pre><code class="language-nasm"># i_bs_match_fS
# ensure_exactly 32
    mov rsi, qword ptr [rbx+8]
    mov rax, qword ptr [rsi+30]
    mov rcx, qword ptr [rsi+22]
    sub rax, rcx
    cmp rax, 32
    jne label_3
# read 32
    mov rdi, qword ptr [rsi+14]
    add qword ptr [rsi+22], 32
    mov rax, rcx
    shr rax, 3
    add rdi, rax
    and ecx, 7
    jnz L38
    movbe edx, dword ptr [rdi]
    add ecx, 32
    short jmp L40
L38:
    mov rdx, qword ptr [rdi-3]
    shr rdx, 24
    bswap rdx
L40:
    shl rdx, cl
# extract integer 8
    mov rax, rdx
# store extracted integer as a small
    shr rax, 52
    or rax, 15
    mov qword ptr [rbx+16], rax
    shl rdx, 8
# extract integer 24
    shr rdx, 36
    or rdx, 15
    mov qword ptr [rbx+24], rdx
</code></pre>

<p>The first part of the code ensures that there are exactly 32 bits
remaining in the binary:</p>

<pre><code class="language-nasm"># ensure_exactly 32
    mov rsi, qword ptr [rbx+8]    ; Get pointer to match context
    mov rax, qword ptr [rsi+30]   ; Get size of binary in bits
    mov rcx, qword ptr [rsi+22]   ; Get offset in bits into binary
    sub rax, rcx
    cmp rax, 32
    jne label_3
</code></pre>

<p>The next part of the code does not directly correspond to the commands
in the <code>bs_match</code> BEAM instruction. Instead, the code reads 32 bits
from the binary:</p>

<pre><code class="language-nasm"># read 32
    mov rdi, qword ptr [rsi+14]
    add qword ptr [rsi+22], 32  ; Increment bit offset in match context
    mov rax, rcx
    shr rax, 3
    add rdi, rax
    and ecx, 7                  ; Test alignment
    jnz L38                     ; Jump if segment not byte-aligned

    ; Read 32 bits (4 bytes) byte-aligned and convert to big-endian
    movbe edx, dword ptr [rdi]
    add ecx, 32
    short jmp L40

L38:
    ; Read an 8-byte word and extract the 32 bits that are needed.
    mov rdx, qword ptr [rdi-3]
    shr rdx, 24
    bswap rdx                   ; Convert to big-endian

L40:
    ; Shift the read bytes to the most significant bytes of the word
    shl rdx, cl
</code></pre>

<p>The 4 bytes read will be converted to big-endian and placed as the
most significant bytes of CPU register <code>rdx</code> with the rest of the
register zeroed.</p>

<p>The following instructions extract the 8-bit value for the first segment and
store it as a tagged integer in <code>{x,2}</code>:</p>

<pre><code class="language-nasm"># extract integer 8
    mov rax, rdx
# store extracted integer as a small
    shr rax, 52
    or rax, 15
    mov qword ptr [rbx+16], rax
    shl rdx, 8
</code></pre>

<p>The following instructions extract the 24-bit value for the second segment and
store it as a tagged integer in <code>{x,3}</code>:</p>

<pre><code class="language-nasm"># extract integer 24
    shr rdx, 36
    or rdx, 15
    mov qword ptr [rbx+24], rdx
</code></pre>

<h3 id="binary-construction-in-otp-26">Binary construction in OTP 26</h3>

<p>For binary construction in OTP 26, the compiler emits a
<code>bs_create_bin</code> BEAM instruction just as in OTP 25. However, the
native code that the JIT in OTP 26 emits for that instruction bears
little resemblance to the native code emitted by OTP 25. The native
code will do the following:</p>

<ul>
  <li>
    <p>Allocate room on the heap for a binary and initialize it with
inlined native code. A helper code fragment is called to do a garbage
collection if there is not sufficient room left on the heap.</p>
  </li>
  <li>
    <p>Read the integer from <code>{x,3}</code> and untag it.</p>
  </li>
  <li>
    <p>Read the integer from <code>{x,2}</code> and untag it. Combine the value with
the previous 24-bit value to obtain a 32-bit value.</p>
  </li>
  <li>
    <p>Write the combined 32 bits into the binary.</p>
  </li>
</ul>

<p>Here follows the complete native code for the <code>bs_create_bin</code>
instruction (somewhat simplified):</p>

<pre><code class="language-nasm"># i_bs_create_bin_jItd
# allocate heap binary
    lea rdx, qword ptr [r15+56]
    cmp rdx, rsp
    short jbe L43
    mov ecx, 4
.db 0x90
    call 4343630296
L43:
    lea rax, qword ptr [r15+2]
    mov qword ptr [rbx-120], rax
    mov qword ptr [r15], 164
    mov qword ptr [r15+8], 4
    add r15, 16
    mov qword ptr [rbx-64], r15
    mov qword ptr [rbx-56], 0
    add r15, 8
# accumulate value for integer segment
    xor r8d, r8d
    mov rdi, qword ptr [rbx+24]
    sar rdi, 4
    or r8, rdi
# accumulate value for integer segment
    shl r8, 8
    mov rdi, qword ptr [rbx+16]
    sar rdi, 4
    or r8, rdi
# construct integer segment from accumulator
    bswap r8d
    mov rdi, qword ptr [rbx-64]
    mov qword ptr [rbx-56], 32
    mov dword ptr [rdi], r8d
</code></pre>

<p>Let’s walk through it.</p>

<p>The first part of the code, starting with <code># allocate heap binary</code> and
ending before the next comment line, allocates a <strong>heap binary</strong> with
inlined native code. The only call to a helper code fragment happens when
there is not sufficient space left on the heap.</p>

<p>Next follows the construction of the segments of the binary.</p>

<p>Instead of writing the value of each segment to memory one at a time,
multiple segments are accumulated into a CPU register. Here
follows the code for the first segment to be constructed (24 bits):</p>

<pre><code class="language-nasm"># accumulate value for integer segment
    xor r8d, r8d                ; Initialize accumulator
    mov rdi, qword ptr [rbx+24] ; Fetch {x,3}
    sar rdi, 4                  ; Untag
    or r8, rdi                  ; OR into accumulator
</code></pre>

<p>Here follows the code for the second segment (8 bits):</p>

<pre><code class="language-nasm"># accumulate value for integer segment
    shl r8, 8                   ; Make room for 8 bits
    mov rdi, qword ptr [rbx+16] ; Fetch {x,2}
    sar rdi, 4                  ; Untag
    or r8, rdi                  ; OR into accumulator
</code></pre>

<p>Since there are no segments of the binary left, the accumulated
value will be written out to memory:</p>

<pre><code class="language-nasm"># construct integer segment from accumulator
    bswap r8d                   ; Make accumulator big-endian
    mov rdi, qword ptr [rbx-64] ; Get pointer into binary
    mov qword ptr [rbx-56], 32  ; Update size of binary
    mov dword ptr [rdi], r8d    ; Write 32 bits
</code></pre>

<h3 id="appending-to-binaries-in-otp-25">Appending to binaries in OTP 25</h3>

<p>The ancient OTP R12B release introduced an optimization for
<a href="https://www.erlang.org/doc/efficiency_guide/binaryhandling.html">efficiently appending to a
binary</a>. Let’s
look at an example to see the optimization in action:</p>

<pre><code class="language-erlang">-module(append).
-export([expand/1, expand_bc/1]).

expand(Bin) when is_binary(Bin) -&gt;
    expand(Bin, &lt;&lt;&gt;&gt;).

expand(&lt;&lt;B:8,T/binary&gt;&gt;, Acc) -&gt;
    expand(T, &lt;&lt;Acc/binary,B:16&gt;&gt;);
expand(&lt;&lt;&gt;&gt;, Acc) -&gt;
    Acc.

expand_bc(Bin) when is_binary(Bin) -&gt;
    &lt;&lt; &lt;&lt;B:16&gt;&gt; || &lt;&lt;B:8&gt;&gt; &lt;= Bin &gt;&gt;.
</code></pre>

<p>Both <code>append:expand/1</code> and <code>append:expand_bc/1</code> take a binary and
double its size by expanding each byte to two bytes. For example:</p>

<pre><code class="language-erlang">1&gt; append:expand(&lt;&lt;1,2,3&gt;&gt;).
&lt;&lt;0,1,0,2,0,3&gt;&gt;
2&gt; append:expand_bc(&lt;&lt;4,5,6&gt;&gt;).
&lt;&lt;0,4,0,5,0,6&gt;&gt;
</code></pre>

<p>Both functions accept only binaries:</p>

<pre><code class="language-erlang">3&gt; append:expand(&lt;&lt;1,7:4&gt;&gt;).
** exception error: no function clause matching append:expand(&lt;&lt;1,7:4&gt;&gt;,&lt;&lt;&gt;&gt;)
4&gt; append:expand_bc(&lt;&lt;1,7:4&gt;&gt;).
** exception error: no function clause matching append:expand_bc(&lt;&lt;1,7:4&gt;&gt;)
</code></pre>

<p>Before looking at the BEAM code, let’s do some benchmarking using
<a href="https://github.com/max-au/erlperf">erlperf</a> to find out which function is faster:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(10_000).' \
        'r(Bin) -&gt; append:expand(Bin).' \
        'r(Bin) -&gt; append:expand_bc(Bin).'
Code                                     ||        QPS       Time   Rel
r(Bin) -&gt; append:expand_bc(Bin).          1       7936     126 us  100%
r(Bin) -&gt; append:expand(Bin).             1       4369     229 us   55%
</code></pre>

<p>The expression for the <code>--init_runner_all</code> option uses
<a href="https://www.erlang.org/doc/man/rand.html#bytes-1">rand:bytes/1</a> to create a binary with 10,000 random
bytes, which will be passed to both expand functions.</p>

<p>From the benchmark results, it can be seen that the <code>expand_bc/1</code> function is
almost twice as fast.</p>

<p>To find out why, let’s compare the BEAM code for the two functions. Here is
the instruction that appends to the binary in <code>expand/1</code>:</p>

<pre><code>    {bs_create_bin,{f,0},
                   0,3,8,
                   {x,1},
                   {list,[{atom,append},  % Append operation
                          1,8,nil,
                          {tr,{x,1},{t_bitstring,1}}, % Source/destination
                          {atom,all},
                          {atom,integer},
                          2,1,nil,
                          {tr,{x,2},{t_integer,{0,255}}},
                          {integer,16}]}}.
</code></pre>

<p>The first segment is an <code>append</code> operation. The operand
<code>{tr,{x,1},{t_bitstring,1}}</code> denotes both source and destination of
the operation. That is, the binary referenced by <code>{x,1}</code> will be
mutated. Erlang normally does not allow mutation, but this mutation
is done under the hood in a way not observable from outside. That
makes the append operation much more efficient than it would be if the
source binary had to be copied.</p>

<p>For the binary comprehension in <code>expand_bc/1</code>, there is a similar
BEAM instruction for appending to the binary:</p>

<pre><code>    {bs_create_bin,{f,0},
                   0,3,1,
                   {x,1},
                   {list,[{atom,private_append}, % Private append operation
                          1,1,nil,
                          {x,1},
                          {atom,all},
                          {atom,integer},
                          2,1,nil,
                          {tr,{x,2},{t_integer,{0,255}}},
                          {integer,16}]}}.
</code></pre>

<p>The main difference is that the binary comprehension uses the more
efficient <code>private_append</code> operation instead of <code>append</code>.</p>

<p>The <code>append</code> operation has more overhead because it must produce the
correct result for code such as:</p>

<pre><code class="language-erlang">bins(Bin) -&gt;
    bins(Bin, &lt;&lt;&gt;&gt;).

bins(&lt;&lt;H,T/binary&gt;&gt;, Acc) -&gt;
    [Acc|bins(T, &lt;&lt;Acc/binary,H&gt;&gt;)];
bins(&lt;&lt;&gt;&gt;, Acc) -&gt;
    [Acc].
</code></pre>

<p>Running it:</p>

<pre><code class="language-erlang">1&gt; example:bins(&lt;&lt;"abcde"&gt;&gt;).
[&lt;&lt;&gt;&gt;,&lt;&lt;"a"&gt;&gt;,&lt;&lt;"ab"&gt;&gt;,&lt;&lt;"abc"&gt;&gt;,&lt;&lt;"abcd"&gt;&gt;,&lt;&lt;"abcde"&gt;&gt;]
</code></pre>

<p>In the <code>expand/1</code> function, only the final value of the binary being
appended to was needed. In <code>bins/1</code>, all of the intermediate values of
the binary are collected in a list. For correctness, the <code>append</code>
operation must ensure that the binary <code>Acc</code> is copied before <code>H</code> is
appended to it. To be able to know when it is necessary to copy the binary,
the <code>append</code> operation does some extra bookkeeping that does not come
for free.</p>

<h3 id="appending-to-binaries-in-otp-26">Appending to binaries in OTP 26</h3>

<p>In OTP 26, there is a new optimization in the compiler, implemented by
Frej Drejhammar, that replaces an <code>append</code> operation with a
<code>private_append</code> operation whenever it is correct and safe to do
so. That is, the optimization will rewrite <code>append:expand/2</code>
to use <code>private_append</code>, but not <code>example:bins/2</code>.</p>

<p>The difference between <code>append:expand/1</code> and <code>append:expand_bc/1</code> is now
much smaller:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(10_000).' \
        'r(Bin) -&gt; append:expand(Bin).' \
        'r(Bin) -&gt; append:expand_bc(Bin).'
Code                                     ||        QPS       Time   Rel
r(Bin) -&gt; append:expand_bc(Bin).          1      13164   75988 ns  100%
r(Bin) -&gt; append:expand(Bin).             1      12419   80550 ns   94%
</code></pre>

<p><code>expand_bc/1</code> is still a bit faster because the compiler emits
somewhat more efficient binary matching code for it than for the
<code>expand/1</code> function.</p>

<h3 id="the-benefit-of-is_binary1-guards">The benefit of <code>is_binary/1</code> guards</h3>

<p>The <code>expand/1</code> function has an <code>is_binary/1</code> guard test that may seem
unnecessary:</p>

<pre><code class="language-erlang">expand(Bin) when is_binary(Bin) -&gt;
    expand(Bin, &lt;&lt;&gt;&gt;).
</code></pre>

<p>The guard test is not necessary for correctness, because <code>expand/2</code>
will raise a <code>function_clause</code> exception if its argument is not a
binary. However, better code will be generated for <code>expand/2</code> with
the guard test.</p>

<p>With the guard test, the first BEAM instruction in <code>expand/2</code> is:</p>

<pre><code>    {bs_start_match4,{atom,no_fail},2,{x,0},{x,0}}.
</code></pre>

<p>Without the guard test, the first BEAM instruction is:</p>

<pre><code>    {test,bs_start_match3,{f,3},2,[{x,0}],{x,2}}.
</code></pre>

<p>The <code>bs_start_match4</code> instruction is more efficient because it does
not have to test that <code>{x,0}</code> contains a binary.</p>

<p>The benchmark results show a measurable increase in execution time for
<code>expand/1</code> if the guard test is removed:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(10_000).' \
        'r(Bin) -&gt; append:expand(Bin).' \
        'r(Bin) -&gt; append:expand_bc(Bin).'
Code                                     ||        QPS       Time   Rel
r(Bin) -&gt; append:expand_bc(Bin).          1      13273   75366 ns  100%
r(Bin) -&gt; append:expand(Bin).             1      11875   84236 ns   89%
</code></pre>

<h3 id="revisiting-the-base64-module">Revisiting the <code>base64</code> module</h3>

<p>Traditionally, up to OTP 25, the clause in the <code>base64</code> module that does
most of the work of encoding a binary to Base64 looked like this:</p>

<pre><code class="language-erlang">encode_binary(&lt;&lt;B1:8, B2:8, B3:8, Ls/bits&gt;&gt;, A) -&gt;
    BB = (B1 bsl 16) bor (B2 bsl 8) bor B3,
    encode_binary(Ls,
                  &lt;&lt;A/bits,(b64e(BB bsr 18)):8,
                    (b64e((BB bsr 12) band 63)):8,
                    (b64e((BB bsr 6) band 63)):8,
                    (b64e(BB band 63)):8&gt;&gt;).
</code></pre>

<p>The reason is that matching out segments of size 8 has always been
specially optimized and has been much faster than matching out a
segment of size 6. That is no longer true in OTP 26. With the
improvements in binary matching described in this blog post, the
clause can be written in a more natural way:</p>

<pre><code class="language-erlang">encode_binary(&lt;&lt;B1:6, B2:6, B3:6, B4:6, Ls/bits&gt;&gt;, A) -&gt;
    encode_binary(Ls,
                  &lt;&lt;A/bits,
                    (b64e(B1)):8,
                    (b64e(B2)):8,
                    (b64e(B3)):8,
                    (b64e(B4)):8&gt;&gt;);
</code></pre>

<p>(This is not the exact code in OTP 26, because of
<a href="https://github.com/erlang/otp/pull/6280">additional</a>
<a href="https://github.com/erlang/otp/pull/6711">features</a> added later.)</p>

<p>The benchmark results for encoding a random binary of 1,000,000 bytes
to Base64 on OTP 25 are:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(1_000_000).' \
        'r(Bin) -&gt; base64:encode(Bin).'
Code                                  ||        QPS       Time
r(Bin) -&gt; base64:encode(Bin).          1         61   16489 us
</code></pre>

<p>The benchmark results for encoding a random binary of 1,000,000 bytes
to Base64 on OTP 26 are:</p>

<pre><code>erlperf --init_runner_all 'rand:bytes(1_000_000).' \
        'r(Bin) -&gt; base64:encode(Bin).'
Code                                  ||        QPS       Time
r(Bin) -&gt; base64:encode(Bin).          1        249    4023 us
</code></pre>

<p>That is, encoding is about 4 times faster.</p>

<h3 id="pull-requests">Pull requests</h3>

<p>Here are the main pull requests for the optimizations mentioned in
this blog post:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/5999">compiler: Improve the type analysis</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6025">JIT: Optimise common combinations of relational operators</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6298">JIT: Minor optimizations</a>, which includes
the optimization that avoids fetching an operand that is already in a CPU register.</li>
  <li><a href="https://github.com/erlang/otp/pull/6033">compiler: Optimize record updates</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6259">JIT: Optimize binary matching for fixed-width segments</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6031">JIT: Optimize creation of binaries</a></li>
  <li><a href="https://github.com/erlang/otp/pull/6804">compiler: <code>private_append</code> optimization for binaries</a></li>
</ul>]]></content><author><name>Björn Gustavsson</name></author><category term="BEAM" /><category term="JIT" /><summary type="html"><![CDATA[This post explores the enhanced type-based optimizations and the other performance improvements in Erlang/OTP 26.]]></summary></entry><entry><title type="html">Erlang/OTP 25 Highlights</title><link href="https://www.erlang.org/blog/My-OTP-25-highlights/" rel="alternate" type="text/html" title="Erlang/OTP 25 Highlights" /><published>2022-05-18T00:00:00+00:00</published><updated>2022-05-18T00:00:00+00:00</updated><id>https://www.erlang.org/blog/My-OTP-25-highlights</id><content type="html" xml:base="https://www.erlang.org/blog/My-OTP-25-highlights/"><![CDATA[<p>OTP 25 is finally here. This post will introduce the new features that I am most excited about.</p>

<p>You can download the readme describing all the changes here:
<a href="/patches/OTP-25.0">Erlang/OTP 25 Readme</a>.
Or, as always, look at the release notes of the application you are interested in.
For instance here: <a href="https://www.erlang.org/doc/apps/erts/notes.html#erts-13.0">Erlang/OTP 25 - Erts Release Notes - Version 13.0</a>.</p>

<p>This year’s highlights are:</p>

<ul>
  <li><a href="#new-functions-in-the-maps-and-lists-modules">New functions in the <code>maps</code>and <code>lists</code> modules</a></li>
  <li><a href="#selectable-features-and-the-new-maybe_expr-feature">Selectable features and the new <code>maybe_expr</code> feature</a></li>
  <li><a href="#dialyzer">Dialyzer</a></li>
  <li><a href="#improvements-of-the-jit">Improvements of the JIT</a></li>
  <li><a href="#better-support-for-perf-and-gdb">Better support for perf and gdb</a></li>
  <li><a href="#relocatable-installation-directory">Relocatable installation directory</a></li>
  <li><a href="#ets-tables-with-adaptive-support-for-write-concurrency">ETS-tables with adaptive support for write concurrency</a></li>
  <li><a href="#new-option-short-for-erlangfloat_to_list2-and-erlangfloat_to_binary2">New option <code>short</code> for <code>erlang:float_to_list/2</code> and <code>erlang:float_to_binary/2</code></a></li>
  <li><a href="#the-new-module-peer-supersedes-the-slave-module">The new module <code>peer</code> supersedes the slave module</a></li>
  <li><a href="#gen_xxx-modules-has-got-a-new-format_status1-callback"><code>gen_xxx</code> modules has got a new <code>format_status/1</code> callback</a></li>
  <li><a href="#the-timer-module-has-been-modernized-and-made-more-efficient">The <code>timer</code> module has been modernized and made more efficient</a></li>
  <li><a href="#crypto-and-openssl-30">Crypto and OpenSSL 3.0</a></li>
  <li><a href="#ca-certificates-can-be-fetched-from-the-os-standard-place">CA-certificates can be fetched from the OS standard place</a></li>
  <li><a href="#a-new-fast-pseudo-random-generator">A new fast Pseudo Random Generator</a></li>
</ul>

<h1 id="new-functions-in-the-maps-and-lists-modules">New functions in the <code>maps</code> and <code>lists</code> modules</h1>

<p>Triggered by suggestions from users, we have introduced new functions in the <a href="/doc/man/maps.html"><code>maps</code></a> and <a href="/doc/man/lists.html"><code>lists</code></a> modules in <code>stdlib</code>.</p>

<h2 id="mapsgroups_from_list23"><code>maps:groups_from_list/2,3</code></h2>

<p>In short, this function takes a list of elements and groups them. The result is a map <code>#{Group1 =&gt; [Group1Elements], GroupN =&gt; [GroupNElements]}</code>.</p>

<p>Let us look at some examples from the shell:</p>

<pre><code class="language-erlang">&gt; maps:groups_from_list(fun(X) -&gt; X rem 2 end, [1,2,3]).
#{0 =&gt; [2], 1 =&gt; [1, 3]}
</code></pre>

<p>The provided fun calculates <code>X rem 2</code> for every element <code>X</code> in the input list and then groups the elements into a map with the result of <code>X rem 2</code> as the key and the corresponding elements as a list value for that key.</p>

<pre><code class="language-erlang">&gt; maps:groups_from_list(fun erlang:length/1, ["ant", "buffalo", "cat", "dingo"]).
#{3 =&gt; ["ant", "cat"], 5 =&gt; ["dingo"], 7 =&gt; ["buffalo"]}
</code></pre>

<p>In the example above the strings in the input list are grouped into a map based on their length.</p>

<p>There is also a variant of <code>groups_from_list</code> with an additional fun by which the values can be converted before they are put into their groups.</p>

<pre><code class="language-erlang">&gt; maps:groups_from_list(fun(X) -&gt; X rem 2 end, fun(X) -&gt; X*X end, [1,2,3]).
#{0 =&gt; [4], 1 =&gt; [1, 9]}
</code></pre>

<p>In the example above the elements <code>X</code> in the list are grouped according to the <code>X rem 2</code> calculation, but the values stored in the groups are the elements multiplied by themselves (<code>X * X</code>).</p>

<pre><code class="language-erlang">&gt; maps:groups_from_list(fun erlang:length/1, fun lists:reverse/1, ["ant", "buffalo", "cat", "dingo"]).
#{3 =&gt; ["tna","tac"],5 =&gt; ["ognid"],7 =&gt; ["olaffub"]}
</code></pre>

<p>In the example above the strings from the input list are grouped according to their length and they are reversed before they are stored in the groups.</p>

<p>For more details see the <a href="/doc/man/maps.html#groups_from_list-2"><code>maps:groups_from_list/2</code></a> documentation.</p>

<h2 id="listsenumerate12"><code>lists:enumerate/1,2</code></h2>

<p>Takes a list of elements and returns a new list of tuples of the form <code>{I, H}</code>, where <code>I</code> is the position of the element <code>H</code> in the original list. The enumeration starts with 1 and increases by 1 in each step.</p>

<p>Example:</p>

<pre><code class="language-erlang">&gt; lists:enumerate([a,b,c]).
[{1,a},{2,b},{3,c}]
</code></pre>

<p>There is also an <code>enumerate/2</code> function which can be used to set the initial number to something other than 1. See the example below:</p>

<pre><code class="language-erlang">&gt; lists:enumerate(10, [a,b,c]).
[{10,a},{11,b},{12,c}]
</code></pre>

<p>For more details see the <a href="/doc/man/lists.html#enumerate-1"><code>lists:enumerate/1</code></a> documentation.</p>

<h2 id="listsuniq12"><code>lists:uniq/1,2</code></h2>

<p>Removes duplicates from a list while preserving the order of the elements. The first occurrence of each element is kept. 
We already have <code>lists:usort</code> which also removes duplicates but returns a sorted list.</p>

<p>Examples:</p>

<pre><code class="language-erlang">&gt; lists:uniq([3,3,1,2,1,2,3]).
[3,1,2]
&gt; lists:uniq([a, a, 1, b, 2, a, 3]).
[a, 1, b, 2, 3]
</code></pre>
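<p>For comparison, here is what <code>lists:usort/1</code> returns for the same input:</p>

<pre><code class="language-erlang">&gt; lists:usort([3,3,1,2,1,2,3]).
[1,2,3]
</code></pre>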

<p><code>lists:uniq/2</code> lets the user provide a fun that computes the key by which elements are compared for equality. In the example below the provided fun returns the first element of each tuple, so two tuples are considered duplicates when their first elements are equal.</p>

<p>Examples:</p>
<pre><code class="language-erlang">&gt; lists:uniq(fun({X, _}) -&gt; X end, [{b, 2}, {a, 1}, {c, 3}, {a, 2}]).
[{b, 2}, {a, 1}, {c, 3}]
</code></pre>
<p>For more details see the <a href="/doc/man/lists.html#uniq-1"><code>lists:uniq/1</code></a> documentation.</p>

<h1 id="selectable-features-and-the-new-maybe_expr-feature">Selectable features and the new <code>maybe_expr</code> feature</h1>

<p>Selectable features is a new mechanism and concept whereby a new, potentially incompatible feature (language or runtime) can be introduced and tested without causing trouble for those who don’t use it.</p>

<p>When it comes to language features the intention is that they can be activated per module with no impact on modules where they are not activated.</p>

<p>Let’s use the new <code>maybe_expr</code> feature as an example.</p>

<p>In module <code>my_experiment</code> the feature is activated and used like this:</p>

<pre><code class="language-erlang">-module(my_experiment).
-export([foo/1]).

%% Enable the feature maybe_expr in this module only
%% Makes maybe a keyword which might be incompatible
%% in modules using maybe as a function name or an atom
-feature(maybe_expr,enable). 
foo(Foo) -&gt;
  maybe
    {ok, X} ?= f(Foo),
    [H|T] ?= g([1,2,3]),
    ...
  else
    {error, Y} -&gt;
        {ok, "default"};
    {ok, _Term} -&gt;
        {error, "unexpected wrapper"}
  end.
</code></pre>

<p>The compiler will note that the feature <code>maybe_expr</code> is enabled and will handle the maybe construct correctly. In the generated <code>.beam</code> file it will also be noted that
the module has enabled the feature.</p>

<p>When starting an Erlang node, the specific feature (or all features) must be enabled; otherwise, a <code>.beam</code> file using the feature will not be allowed to load.</p>

<pre><code class="language-text">erl -enable-feature maybe_expr
</code></pre>

<p>Or</p>

<pre><code class="language-text">erl -enable-feature all
</code></pre>

<p>For more details see the <a href="/doc/reference_manual/features.html">feature section</a> in the Erlang Reference Manual.</p>

<h2 id="the-new-maybe_expr-feature-eep-49">The new <code>maybe_expr</code> feature EEP-49</h2>

<p>The <a href="/eeps/eep-0049">EEP-49 “Value-Based Error Handling Mechanisms”</a>, was suggested by Fred Hebert already 2018 and now it has finally been implemented as the first feature within the new feature concept.</p>

<p>The <code>maybe ... end</code> construct is similar to <code>begin ... end</code> in that it is
used to group multiple distinct expressions as a single block. One
important difference is that the <code>maybe</code> block does not export its
variables, while <code>begin</code> does.</p>
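<p>A minimal sketch of the scoping difference (my example, assuming the <code>maybe_expr</code> feature is enabled in the module):</p>

<pre><code class="language-erlang">with_begin() -&gt;
    begin X = 1 end,
    X.    % ok: begin ... end exports X

with_maybe() -&gt;
    maybe Y = 2 end,
    Y.    % error: maybe ... end does not export Y
</code></pre>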

<p>A new type of expression (denoted <code>MatchOrReturnExprs</code>) is introduced,
which is only valid within a <code>maybe ... end</code> expression:</p>

<pre><code class="language-erlang">maybe
    Exprs | MatchOrReturnExprs
end
</code></pre>

<p><code>MatchOrReturnExprs</code> are defined as having the following form:</p>

<pre><code class="language-erlang">Pattern ?= Expr
</code></pre>

<p>This definition means that <code>MatchOrReturnExprs</code> are only allowed at the
top-level of <code>maybe ... end</code> expressions.</p>

<p>The <code>?=</code> operator takes the value returned by <code>Expr</code> and pattern matches
it against <code>Pattern</code>.</p>

<p>If the pattern matches, all variables from <code>Pattern</code> are bound in the local
environment, and the expression is equivalent to a successful <code>Pattern = Expr</code>
call. If the value does not match, the <code>maybe ... end</code> expression returns the
non-matching value directly.</p>
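<p>For example, when the match fails, the whole expression evaluates to the unmatched value:</p>

<pre><code class="language-erlang">maybe
    {ok, V} ?= {error, enoent},   % does not match, so ...
    V
end
%% ... the maybe expression returns {error, enoent}
</code></pre>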

<p>A special case exists in which we extend <code>maybe ... end</code> into the following form:</p>

<pre><code class="language-erlang">maybe
    Exprs | MatchOrReturnExprs
else
    Pattern -&gt; Exprs;
    ...
    Pattern -&gt; Exprs
end
</code></pre>

<p>This form exists to capture non-matching expressions in a <code>MatchOrReturnExprs</code>
to handle failed matches rather than returning their value. In such a case, an
unhandled failed match will raise an <code>else_clause</code> error, otherwise identical to
a <code>case_clause</code> error.</p>

<p>This extended form is useful to properly identify and handle successful and
unsuccessful matches within the same construct without risking confusion
between the happy and unhappy paths.</p>

<p>Given the structure described here, the final expression may look like:</p>

<pre><code class="language-erlang">maybe
    Foo = bar(),            % normal exprs still allowed
    {ok, X} ?= f(Foo),
    [H|T] ?= g([1,2,3]),
    ...
else
    {error, Y} -&gt;
        {ok, "default"};
    {ok, _Term} -&gt;
        {error, "unexpected wrapper"}
end
</code></pre>

<p>For more details see the <a href="/doc/reference_manual/expressions.html#maybe">maybe section</a> in the Erlang Reference Manual.</p>

<h3 id="motivation">Motivation</h3>

<p>With the <code>maybe</code> construct it is possible to reduce deeply nested conditional expressions and make messy patterns found in the wild unnecessary. It also provides a better separation of concerns when implementing functions.</p>

<h4 id="reducing-nesting">Reducing Nesting</h4>

<p>One common pattern that can be seen in Erlang is deep nesting of <code>case
... end</code> expressions used to check complex conditionals.</p>

<p>Take the following code taken from
<a href="https://github.com/erlang/otp/blob/a0ae44f324576104760a63fe6cf63e0ca31756fc/lib/mnesia/src/mnesia_backup.erl#L106-L126">Mnesia</a>,
for example:</p>

<pre><code class="language-erlang">commit_write(OpaqueData) -&gt;
    B = OpaqueData,
    case disk_log:sync(B#backup.file_desc) of
        ok -&gt;
            case disk_log:close(B#backup.file_desc) of
                ok -&gt;
                    case file:rename(B#backup.tmp_file, B#backup.file) of
                        ok -&gt;
                            {ok, B#backup.file};
                        {error, Reason} -&gt;
                            {error, Reason}
                    end;
                {error, Reason} -&gt;
                    {error, Reason}
            end;
        {error, Reason} -&gt;
            {error, Reason}
    end.
</code></pre>

<p>The code is nested to the extent that shorter aliases must be introduced
for variables (<code>OpaqueData</code> renamed to <code>B</code>), and half of the code just
transparently returns the exact values each function was given.</p>

<p>By comparison, the same code could be written as follows with the new
construct:</p>

<pre><code class="language-erlang">commit_write(OpaqueData) -&gt;
    maybe
        ok ?= disk_log:sync(OpaqueData#backup.file_desc),
        ok ?= disk_log:close(OpaqueData#backup.file_desc),
        ok ?= file:rename(OpaqueData#backup.tmp_file, OpaqueData#backup.file),
        {ok, OpaqueData#backup.file}
    end.
</code></pre>

<p>Or, to protect against <code>disk_log</code> calls returning something else than <code>ok |
{error, Reason}</code>, the following form could be used:</p>

<pre><code class="language-erlang">commit_write(OpaqueData) -&gt;
    maybe
        ok ?= disk_log:sync(OpaqueData#backup.file_desc),
        ok ?= disk_log:close(OpaqueData#backup.file_desc),
        ok ?= file:rename(OpaqueData#backup.tmp_file, OpaqueData#backup.file),
        {ok, OpaqueData#backup.file}
    else
        {error, Reason} -&gt; {error, Reason}
    end.
</code></pre>

<p>The semantics of these calls are identical, except that it is now
much easier to focus on the flow of individual operations and either
success or error paths.</p>

<h1 id="dialyzer">Dialyzer</h1>

<ul>
  <li>
    <p>Dialyzer now supports the <code>missing_return</code> and <code>extra_return</code> options to raise warnings when specifications differ from inferred types. These are similar to, but not quite as verbose as, <code>overspecs</code> and <code>underspecs</code>.</p>
  </li>
  <li>
    <p>Dialyzer now better understands the types for <code>min/2</code>, <code>max/2</code>, and <code>erlang:raise/3</code>. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use <code>erlang:raise/3</code> could now need a spec with a <code>no_return()</code> return type to avoid an unwanted warning; see the sketch after this list.</p>
  </li>
</ul>
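<p>A minimal sketch of such a spec (my example, not from the release notes):</p>

<pre><code class="language-erlang">%% Without the no_return() spec, Dialyzer may now warn that
%% this function has no local return.
-spec fail(term()) -&gt; no_return().
fail(Reason) -&gt;
    erlang:raise(error, Reason, []).
</code></pre>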

<h1 id="improvements-of-the-jit">Improvements of the JIT</h1>

<p>The <a href="https://www.erlang.org/blog/my-otp-24-highlights/#beamasm---the-jit-compiler-for-erlang">JIT compiler</a> introduced in Erlang/OTP 24 improved
the performance for Erlang applications.</p>

<p>Erlang/OTP 25 introduces some major improvements of the JIT:</p>

<ul>
  <li>
    <p>The JIT now supports the <a href="https://en.wikipedia.org/wiki/AArch64">AArch64 (ARM64)</a> architecture,
used by (for example) <a href="https://en.wikipedia.org/wiki/Apple_silicon">Apple Silicon</a> Macs and newer
<a href="https://en.wikipedia.org/wiki/Raspberry_Pi">Raspberry Pi</a> devices.</p>
  </li>
  <li>
    <p>Better code generated based on types provided by the Erlang compiler.</p>
  </li>
  <li>
    <p>Better support for <code>perf</code> and <code>gdb</code> with line numbers for Erlang code.</p>
  </li>
</ul>

<h3 id="support-for-aarch64-arm64">Support for AArch64 (ARM64)</h3>

<p>How much speedup one can expect from the JIT compared to the interpreter
varies from nothing to up to four times.</p>

<p>To get some more concrete figures we have run three different
benchmarks with the JIT disabled and enabled on a MacBook Pro (M1
processor; released in 2020).</p>

<p>First we ran the <a href="https://github.com/erlang/otp/blob/be860185407d6747dca32e8d328b041cc75ffdb3/erts/emulator/test/estone_SUITE.erl">EStone benchmark</a>. Without the JIT, 691,962
EStones were achieved and with the JIT 1,597,949 EStones. That is,
more than twice as many EStones with the JIT.</p>

<p>Next we tried running Dialyzer to build a small PLT:</p>

<pre><code class="language-text">dialyzer --build_plt --apps erts kernel stdlib
</code></pre>

<p>With the JIT, the time for building the PLT was reduced from 18.38 seconds
down to 9.64 seconds. That is, almost but not quite twice as fast.</p>

<p>Finally, we ran a benchmark for the <a href="https://www.erlang.org/doc/man/base64.html">base64</a> module included
in <a href="https://github.com/erlang/otp/issues/5639">this Github issue</a>.</p>

<p>With the JIT:</p>

<pre><code class="language-text">== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 11846 ms: 84 it/sec
fun base64:decode/1: 1000 iterations in 14617 ms: 68 it/sec
</code></pre>

<p>Without the JIT:</p>

<pre><code class="language-text">== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 25938 ms: 38 it/sec
fun base64:decode/1: 1000 iterations in 20603 ms: 48 it/sec
</code></pre>

<p>Encoding with the JIT is more than twice as fast, while the
decoding time with the JIT is about 70 percent of the decoding time
without the JIT.</p>

<h3 id="type-based-optimizations">Type-based optimizations</h3>

<p>The JIT translates one BEAM instruction at a time to native code
without any knowledge of previous instructions. For example, the native
code for the <code>+</code> operator must work for any operands: small integers that
fit in a 64-bit word, large integers, floats, and non-numbers that should
result in raising an exception.</p>

<p>In Erlang/OTP 25, the compiler embeds type information in the BEAM file
to help the JIT generate better native code without unnecessary type
tests.</p>
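<p>As a small sketch of the idea (my example), in the following function the guards let the compiler infer that both operands of <code>+</code> are small integers within known ranges, so the JIT can emit a native add without type tests or overflow checks:</p>

<pre><code class="language-erlang">add(X, Y) when is_integer(X), is_integer(Y),
               0 =&lt; X, X &lt; 256,
               0 =&lt; Y, Y &lt; 256 -&gt;
    X + Y.
</code></pre>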

<p>For more details, see the blog post <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit">Type-Based Optimizations in the
JIT</a>.</p>

<h1 id="better-support-for-perf-and-gdb">Better support for <code>perf</code> and <code>gdb</code></h1>

<p>It is now possible to profile Erlang systems with perf and get a mapping from the JIT code to the corresponding Erlang code. This will make it easy to find bottlenecks in the code.</p>

<p>The same goes for <code>gdb</code>, which can also show which line of Erlang code a specific address in the JIT code corresponds to.</p>

<p>Perf is a Linux command-line tool for lightweight CPU profiling; it checks CPU performance counters, trace points, uprobes, and kprobes, monitors program events, and creates reports.</p>

<p>An Erlang node running under <code>perf</code> can be started like this:</p>

<pre><code class="language-text">perf record --call-graph fp -- erl +JPperf true
</code></pre>
<p>The result from perf could then be viewed like this:</p>

<pre><code class="language-text">perf report
</code></pre>

<p>It is also possible to attach <code>perf</code> to an already running Erlang node like this:</p>

<pre><code class="language-text"># start Erlang at get the Pid
erl +JPperf true
</code></pre>

<p>Assume the pid of the node is <code>4711</code>.</p>

<p>You can then attach <code>perf</code> to the node like this:</p>

<pre><code class="language-text">sudo perf record --call-graph fp -p 4711
</code></pre>
<p>Below is an example where <code>perf</code> is run to analyze <code>dialyzer</code> building a PLT like this:</p>

<pre><code class="language-text"> ERL_FLAGS="+JPperf true +S 1" perf record --call-graph=fp \
   dialyzer --build_plt -Wunknown --apps compiler crypto erts kernel stdlib \
   syntax_tools asn1 edoc et ftp inets mnesia observer public_key \
   sasl runtime_tools snmp ssl tftp wx xmerl tools
</code></pre>

<p>The above code is run using <code>+S 1</code> to make the perf output easier to understand.
If you then run <code>perf report -f --no-children</code> you may get something similar to this:</p>

<p><img src="/blog/images/otp25/perf_callgraph.png" alt="alt text" title="perf call-graph" /></p>

<p>Frame pointers are enabled when the <code>+JPperf true</code> option is passed, so you can
use <code>perf record --call-graph=fp</code> to get more context.</p>

<p>Any Erlang function in the report is prefixed with a <code>$</code> and all C functions have
their normal names. Any Erlang function that has the prefix <code>$global::</code> refers
to a global shared fragment.</p>

<p>So in the above, we can see that we spend the most time doing <code>eq</code>, i.e. comparing two terms.
By expanding it and looking at its parents we can see that it is the function
<code>erl_types:t_is_equal/2</code> that contributes the most to this value. Go and have a look
at it in the source code to see if you can figure out why so much time is spent there.</p>

<p>After <code>eq</code> we see the function <code>erl_types:t_has_var/1</code>, in which we spend
almost 5% of the entire execution time. A bit further down you can see <code>copy_struct_x</code>
which is the function used to copy terms. If we expand it to view the parents
we find that it is mostly <code>ets:lookup_element/3</code> that contributes to this time
via the Erlang function <code>dialyzer_plt:ets_table_lookup/2</code>.</p>

<h3 id="perf-tips-and-tricks"><code>perf</code> tips and tricks</h3>

<p>You can do a lot of neat things with <code>perf</code>. Below is a list of some of the options
we have found useful:</p>

<ul>
  <li><code>perf report --no-children</code>
  Do not include the accumulation of all children in a call.</li>
  <li><code>perf report  --call-graph callee</code>
  Show the callee rather than the caller when expanding a function call.</li>
  <li><code>perf archive</code>
  Create an archive with all the artifacts needed to inspect the data
  on another host. In early versions of perf this command does not work;
  instead you can use <a href="https://github.com/torvalds/linux/blob/master/tools/perf/perf-archive.sh">this bash script</a>.</li>
  <li><code>perf report</code> gives “failed to process sample” and/or “failed to process type: 68”
  This probably means that you are running a buggy version of perf. We have
  seen this when running Ubuntu 18.04 with kernel version 4. If you update
  to Ubuntu 20.04 or use Ubuntu 18.04 with kernel version 5 the problem
  should go away.</li>
</ul>

<h1 id="improved-error-information-for-failing-binary-construction">Improved error information for failing binary construction</h1>

<p>Erlang/OTP 24 introduced <a href="https://www.erlang.org/blog/my-otp-24-highlights/#eep-54-improved-bif-error-information">improved BIF error information</a> to provide
more information when a call to a BIF failed.</p>

<p>In Erlang/OTP 25, improved error information is also given when the
creation of a binary using the <a href="https://www.erlang.org/doc/reference_manual/expressions.html#bit-syntax-expressions">bit syntax</a> fails.</p>

<p>Consider this function:</p>

<pre><code class="language-erlang">bin(A, B, C, D) -&gt;
    &lt;&lt;A/float,B:4/binary,C:16,D/binary&gt;&gt;.
</code></pre>

<p>If we call this function with incorrect arguments in past releases,
we are just told that something was wrong and given the line number:</p>

<pre><code class="language-text">1&gt; t:bin(&lt;&lt;"abc"&gt;&gt;, 2.0, 42, &lt;&lt;1:7&gt;&gt;).
** exception error: bad argument
     in function  t:bin/4 (t.erl, line 5)
</code></pre>

<p>But which part of line 5? Imagine that <code>t:bin/4</code> was called from deep
within an application and we had no idea what the actual values for
the arguments were. It could take a while to figure out exactly what
went wrong.</p>

<p>Erlang/OTP 25 gives us more information:</p>

<pre><code>1&gt; c(t).
{ok,t}
2&gt; t:bin(&lt;&lt;"abc"&gt;&gt;, 2.0, 42, &lt;&lt;1:7&gt;&gt;).
** exception error: construction of binary failed
     in function  t:bin/4 (t.erl, line 5)
        *** segment 1 of type 'float': expected a float or an integer but got: &lt;&lt;"abc"&gt;&gt;
</code></pre>

<p>Note that the module must be compiled by the compiler in Erlang/OTP 25 in
order to get the more informative error message. The old-style message will
be shown if the module was compiled by a previous release.</p>

<p>Here the message tells us that the first segment in the construction was given
the binary <code>&lt;&lt;"abc"&gt;&gt;</code> instead of a float or an integer, which is the expected
type for a <code>float</code> segment.</p>

<p>It seems that we switched the first and second arguments for <code>bin/4</code>,
so we try again:</p>

<pre><code class="language-text">3&gt; t:bin(2.0, &lt;&lt;"abc"&gt;&gt;, 42, &lt;&lt;1:7&gt;&gt;).
** exception error: construction of binary failed
     in function  t:bin/4 (t.erl, line 5)
        *** segment 2 of type 'binary': the value &lt;&lt;"abc"&gt;&gt; is shorter than the size of the segment
</code></pre>

<p>It seems that there was more than one incorrect argument. In this
case, the message tells us that the given binary is shorter than the
size of the segment.</p>

<p>Fixing that:</p>

<pre><code>4&gt; t:bin(2.0, &lt;&lt;"abcd"&gt;&gt;, 42, &lt;&lt;1:7&gt;&gt;).
** exception error: construction of binary failed
     in function  t:bin/4 (t.erl, line 5)
        *** segment 4 of type 'binary': the size of the value &lt;&lt;1:7&gt;&gt; is not a multiple of the unit for the segment
</code></pre>

<p>A <code>binary</code> segment has a default unit of 8. Therefore, passing a bit string of
size 7 will fail.</p>
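<p>If the trailing data really may have a bit size that is not a multiple of 8, a <code>bitstring</code> segment (which has unit 1) can be used instead. A small sketch (the function name is made up for illustration):</p>

<pre><code class="language-erlang">%% A bitstring segment accepts values of any bit size,
%% so bin2(42, &lt;&lt;1:7&gt;&gt;) succeeds and returns a 23-bit bitstring.
bin2(C, D) -&gt;
    &lt;&lt;C:16,D/bitstring&gt;&gt;.
</code></pre>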

<p>Finally:</p>

<pre><code class="language-text">5&gt; t:bin(2.0, &lt;&lt;"abcd"&gt;&gt;, 42, &lt;&lt;1:8&gt;&gt;).
&lt;&lt;64,0,0,0,0,0,0,0,97,98,99,100,0,42,1&gt;&gt;
</code></pre>

<h1 id="improved-error-information-for-failed-record-matching">Improved error information for failed record matching</h1>

<p>Another improvement is to the exception raised when matching of a record fails.</p>

<p>Consider this record and function:</p>

<pre><code class="language-erlang">-record(rec, {count}).

rec_add(R) -&gt;
    R#rec{count = R#rec.count + 1}.

</code></pre>

<p>In past releases, failure to match a record or retrieve an element from
a record would result in the following exception:</p>

<pre><code class="language-text">1&gt; t:rec_add({wrong,0}).
** exception error: {badrecord,rec}
     in function  t:rec_add/1 (t.erl, line 8)
</code></pre>

<p>Before Erlang/OTP 15, which introduced line numbers in exceptions, knowing
which record was expected could be useful if the error occurred in
a large function.</p>

<p>Nowadays, unless several different records are accessed on the same
line, the line number makes it obvious which record was expected.</p>

<p>Therefore, in Erlang/OTP 25 the <code>badrecord</code> exception has been changed
to show the actual incorrect value:</p>

<pre><code class="language-text">2&gt; t:rec_add({wrong,0}).
** exception error: {badrecord,{wrong,0}}
     in function  t:rec_add/1 (t.erl, line 8)
</code></pre>

<p>The new <code>badrecord</code> exceptions will show up for code that has been compiled
with Erlang/OTP 25.</p>

<h1 id="relocatable-installation-directory">Relocatable installation directory</h1>

<p>Previously, shell scripts (e.g., <code>erl</code> and <code>start</code>) and the <code>RELEASES</code> file
for an Erlang installation depended on a hard-coded absolute path to the
installation’s root directory. This made it cumbersome to move an
installation to a different directory which can be problematic for platforms
such as Android (<a href="https://github.com/erlang/otp/pull/2879">#2879</a>) where the
installation directory is unknown at compile time. This is fixed by:</p>

<ul>
  <li>
    <p>Changing the shell scripts so that they can dynamically find the
<code>ROOTDIR</code>. The dynamically found <code>ROOTDIR</code> is selected if it differs
from the hard-coded <code>ROOTDIR</code> and seems to point to a valid Erlang
installation. The <code>dyn_erl</code> program has been changed so that it can
return its absolute canonicalized path when given the <code>--realpath</code>
argument (<code>dyn_erl</code> gets its absolute canonicalized path from the
<code>realpath</code> POSIX function). The <code>dyn_erl</code> <code>--realpath</code>
functionality is used by the scripts to get the root dir dynamically.</p>
  </li>
  <li>
    <p>Changing the <code>release_handler</code> module that reads and writes the
<code>RELEASES</code> file so that it prepends <code>code:root_dir()</code> whenever it
encounters relative paths. This is necessary since the current
working directory can be changed so that it differs from
<code>code:root_dir()</code>.</p>
  </li>
</ul>

<h1 id="ets-tables-with-adaptive-support-for-write-concurrency">ETS-tables with adaptive support for write concurrency</h1>

<p>It has long been possible to optimize an ETS table for write concurrency like this:</p>

<pre><code class="language-erlang">ets:new(my_table, [{write_concurrency, true}]).
</code></pre>

<p>Now we also introduce adaptive support for write concurrency which can be configured like this:</p>

<pre><code class="language-erlang">ets:new(my_table, [{write_concurrency, auto}]).
</code></pre>

<p>This option makes tables automatically adjust the number of locks used at run-time, depending on how much concurrency is detected. When you enable automatic write concurrency, <code>decentralized_counters</code> are also activated for even more scalable ETS tables. Use this option when you know that a lot of processes will be accessing an ETS table on systems with many cores.</p>

<p>For more details you can read <a href="https://github.com/erlang/otp/pull/5208">PR 5208</a> that introduced the change and the <a href="/blog/scalable-ets-counters/">blog post about decentralized counters</a>.</p>

<h1 id="new-option-short-for-erlangfloat_to_list2-and-erlangfloat_to_binary2">New option <code>short</code> for <code>erlang:float_to_list/2</code> and <code>erlang:float_to_binary/2</code></h1>

<p>A new option called <code>short</code> has been added to the functions <code>erlang:float_to_list/2</code> and <code>erlang:float_to_binary/2</code>. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.</p>

<p>If option <code>short</code> is specified, the float is formatted
with the smallest number of digits that still guarantees that</p>

<pre><code class="language-erlang">F =:= list_to_float(float_to_list(F, [short]))
</code></pre>

<p>When the float is inside the range (-2⁵³, 2⁵³), the notation
that yields the smallest number of characters is used (scientific
notation or normal decimal notation). Floats outside the range
(-2⁵³, 2⁵³) are always formatted using scientific notation to avoid confusing
results when doing arithmetic operations.</p>
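<p>As an illustration, here is an assumed shell session comparing the default format (scientific notation with 20 digits) with the <code>short</code> option:</p>

<pre><code class="language-text">1&gt; float_to_list(0.1).
"1.00000000000000005551e-01"
2&gt; float_to_list(0.1, [short]).
"0.1"
</code></pre>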

<p>The implementation was contributed by Thomas Depierre and uses the Ryū algorithm.</p>

<p>Ryū is a new algorithm that converts binary floating point numbers to their decimal representations using only fixed-size integer operations. Ryū is simpler and approximately three times faster than the previously fastest implementation.
<a href="https://github.com/ulfjack/ryu">https://github.com/ulfjack/ryu</a></p>

<h1 id="the-new-module-peer-supersedes-the-slave-module">The new module <code>peer</code> supersedes the slave module</h1>

<p>The <a href="/doc/man/peer.html"><code>peer</code></a> module provides functions for starting linked Erlang nodes. The Erlang node spawning new “peer” nodes is called <code>origin</code>, and the newly started nodes are peers.</p>

<p>A peer node automatically terminates when it loses the control connection to the origin. This connection could be an Erlang distribution connection, or an alternative one: TCP or standard I/O. The alternative connection provides a way to execute remote procedure calls even when Erlang Distribution is not available, making it possible to test the distribution itself.</p>

<p>Peer node terminal input/output is relayed through the origin. If a standard I/O alternative connection is requested, console output also goes via the origin, allowing debugging of node startup and boot script execution (see <a href="/doc/man/erl#flags">-init_debug</a>). File I/O is not redirected, contrary to <a href="/doc/man/slave.html"><code>slave</code></a> behavior.</p>

<p>The peer node can start on the same or a different host (via ssh) or in a separate container (for example Docker). When the peer starts on the same host as the origin, it inherits the current directory and environment variables from the origin.</p>
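<p>As a hedged sketch (not taken from the documentation), starting a named peer on the same host, running a remote call on it, and stopping it could look like this:</p>

<pre><code class="language-erlang">%% Start a linked peer node with a unique name, call a function
%% on it, and stop it.  peer:call/4 runs the call on the peer node.
{ok, Pid, Node} = peer:start_link(#{name =&gt; peer:random_name()}),
Node = peer:call(Pid, erlang, node, []),  %% the call runs on the peer
peer:stop(Pid).
</code></pre>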

<h2 id="note">Note</h2>

<p>This module is designed to facilitate multi-node testing with Common Test. Use the <code>?CT_PEER()</code> macro to start a linked peer node according to Common Test conventions (crash dumps written to a specific location; node name prefixed with the module name, calling function, and origin OS process ID). Use <a href="/doc/man/peer.html#random_name-1"><code>random_name/1</code></a> to create sufficiently unique node names if you need more control.</p>

<p>A peer node started without an alternative connection behaves similarly to <code>slave(3)</code>.</p>

<h1 id="gen_xxx-modules-has-got-a-new-format_status1-callback"><code>gen_XXX</code> modules has got a new <code>format_status/1</code> callback.</h1>

<p>The <a href="/doc/man/gen_server.html#Module:format_status-2"><code>format_status/2</code></a> callback for <code>gen_server</code>, <code>gen_statem</code> and <code>gen_event</code> has been deprecated in favor of the new <a href="/doc/man/gen_server.html#Module:format_status-1"><code>format_status/1</code></a> callback.</p>

<p>The new callback adds the possibility to limit and change many more things than just the state.</p>

<p>The purpose of both the old and the new <code>format_status</code> callbacks is to let the user filter out sensitive information, and possibly very large data, from crash reports.</p>
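<p>As a minimal sketch, assuming a <code>gen_server</code> whose state is a map containing a <code>password</code> key, the new callback could redact that field like this (the status map is the one passed to the callback by <code>gen_server</code>):</p>

<pre><code class="language-erlang">%% Remove a sensitive field from the state before it appears in
%% crash reports and sys:get_status/1 output.
format_status(Status) -&gt;
    maps:map(fun(state, State) -&gt; maps:remove(password, State);
                (_Key, Value) -&gt; Value
             end, Status).
</code></pre>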

<h1 id="the-timer-module-has-been-modernized-and-made-more-efficient">The <code>timer</code> module has been modernized and made more efficient</h1>

<p>The <code>timer</code> module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The <code>timer:sleep/1</code> function now accepts an arbitrarily large integer.</p>
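<p>For example (a sketch; the value exceeds the 2<sup>32</sup> - 1 millisecond limit of a plain <code>receive ... after</code> timeout):</p>

<pre><code class="language-erlang">%% Sleep for 100 days in a single call.
timer:sleep(100 * 24 * 60 * 60 * 1000).
</code></pre>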

<h1 id="crypto-and-openssl-30">Crypto and OpenSSL 3.0</h1>

<p>Some applications in OTP like SSL/TLS and SSH need cryptography to
work. That is provided by the OTP application crypto, which interfaces
Erlang to an external cryptolib in C using NIFs. The main example of
such an external cryptolib is <a href="https://www.openssl.org">OpenSSL</a>.</p>

<p>The OpenSSL cryptolib exists in many versions. OTP/crypto supports
0.9.8c and later, although only 1.1.1 is still maintained by OpenSSL.</p>

<p>OpenSSL has released its version 3.0 series, which is its platform
for the future, completely rebuilt with a new API.  The APIs of previous
versions (1.1.1 and older) are partly deprecated, although still
available in 3.0.  Support for 1.1.1 will also end at some point in the future.</p>

<p>Since it is vital to get security patches for the cryptolib, and in the
future only the 3.0 API might be available, OTP/crypto from
OTP 25.0 now interfaces OpenSSL 3.0 using the new 3.0 API.  A few functions
from old APIs are still used, but they will be replaced as soon as
possible.</p>

<p>You as a user will hopefully not notice any difference: if you have
OpenSSL 1.1.1 (or older - not recommended) and build OTP, that one will
be used as previously.  If you have any OpenSSL 3.0 version installed,
that one will be used without the need to do anything special, except
for normal handling of dynamic loading paths in the OS.</p>

<h1 id="ca-certificates-can-be-fetched-from-the-os-standard-place">CA-certificates can be fetched from the OS standard place</h1>

<p>With the new functions <code>public_key:cacerts_load/0,1</code> and <code>public_key:cacerts_get/0</code> the CA certificates can be fetched from the standard place of the OS (or from a file).</p>

<p>They will then be cached in decoded form by use of <code>persistent_term</code> which makes them available in an efficient way for the <code>ssl</code> and <code>httpc</code> modules. The intention with this is to make it unnecessary to depend on for example <code>certifi</code> in many packages.</p>

<p>On Windows and macOS the certificate store is not an ordinary file, so the information is fetched via an API using a NIF (Windows) or with an external program (macOS).</p>

<p>Example with <code>ssl</code>:</p>

<pre><code class="language-erlang">%% makes the certificates available without copying
CaCerts = public_key:cacerts_get(), 
% use the certificates when establishing a connection
{ok,Socket} = ssl:connect("erlang.org",443,[{cacerts,CaCerts}, {verify,verify_peer}]), 
...
</code></pre>

<p>We also plan to update the http client (<code>httpc</code>) to use this soon.</p>

<h1 id="a-new-fast-pseudo-random-generator">A new fast Pseudo Random Generator</h1>

<p>A new custom designed Pseudo Random Generator <a href="/doc/man/rand.html#mwc59-1"><code>rand:mwc59</code></a>
has been implemented. It is probably the fastest possible
generator with good quality that can be written in Erlang.
To achieve this it barely avoids bignums and heap data allocation,
and uses only a minimal number of fast operations.</p>

<p>Under the “right” circumstances, a number that takes 60 ns to generate
with the default generator can be generated in 4 ns with <code>rand:mwc59</code>.</p>

<p>It is intended for applications in dire need of speed
in PRNG numbers, but not any of the comfort features
that <a href="/doc/man/rand.html"><code>rand</code></a> otherwise offers.</p>]]></content><author><name>Kenneth Lundin</name></author><category term="erlang" /><category term="otp" /><category term="25" /><category term="release" /><summary type="html"><![CDATA[OTP 25 is finally here. This post will introduce the new features that I am most excited about.]]></summary></entry><entry><title type="html">Fast random integers</title><link href="https://www.erlang.org/blog/faster-rand/" rel="alternate" type="text/html" title="Fast random integers" /><published>2022-05-12T00:00:00+00:00</published><updated>2022-05-12T00:00:00+00:00</updated><id>https://www.erlang.org/blog/faster-rand</id><content type="html" xml:base="https://www.erlang.org/blog/faster-rand/"><![CDATA[<p>When you need “random” integers, and it is essential
to generate them fast and cheap; then maybe the full featured
Pseudo Random Number Generators in the <code>rand</code> module are overkill.
This blog post will dive into new additions to the
said module, how the Just-In-Time compiler optimizes them,
known tricks, and tries to compare these apples and potatoes.</p>

<h4 id="contents">Contents</h4>
<ul>
  <li><a href="#speed-over-quality">Speed over quality?</a></li>
  <li><a href="#suggested-solutions">Suggested solutions</a></li>
  <li><a href="#quality">Quality</a></li>
  <li><a href="#storing-the-state">Storing the state</a></li>
  <li><a href="#seeding">Seeding</a></li>
  <li><a href="#jit-optimizations">JIT optimizations</a></li>
  <li><a href="#implementing-a-prng">Implementing a PRNG</a></li>
  <li><a href="#rand_suitemeasure1"><code>rand_SUITE:measure/1</code></a></li>
  <li><a href="#measurement-results">Measurement results</a></li>
  <li><a href="#summary">Summary</a></li>
</ul>

<h2 id="speed-over-quality">Speed over quality?</h2>

<p>The Pseudo Random Number Generators implemented in
the <code>rand</code> module offer many useful features such as
repeatable sequences, non-biased range generation,
any size range, non-overlapping sequences,
generating floats, normal distribution floats, etc.
Many of those features are implemented through
a plug-in framework, with a performance cost.</p>

<p>The different algorithms offered by the <code>rand</code> module are selected
to have excellent statistical quality and to perform well
in serious PRNG tests (see section <a href="#prng-tests">PRNG tests</a>).</p>

<p>Most of these algorithms are designed for machines with
64-bit arithmetic (unsigned), but in Erlang such integers
become bignums and almost an order of magnitude slower
to handle than immediate integers.</p>

<p>Erlang terms in the 64-bit VM are tagged 64-bit words.
The tag for an immediate integer is 4 bit, leaving 60 bits
for the signed integer value.  The largest positive
immediate integer value is therefore 2<sup>59</sup>-1.</p>

<p>Many algorithms work on unsigned integers, so we have
59 bits useful for that.  It would theoretically be
possible to pretend 60 bits unsigned using split code paths
for negative and positive values, but that would be extremely impractical.</p>

<p>We decided to choose 58 bit unsigned integers in this context
since then we can for example add two integers, and check
for overflow or simply mask back to 58 bit, without
the intermediate result becoming a bignum.  To work with
59 bit integers would require having to check for overflow
before even doing an addition so the code that avoids
bignums would eat up much of the speed gained from
avoiding bignums.  So 58-bit integers it is!</p>
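<p>As an illustrative sketch (not code from the <code>rand</code> module): with 58-bit operands an addition can at most reach 59 bits, which is still an immediate value, so the result can simply be masked back afterwards:</p>

<pre><code class="language-erlang">%% A and B are assumed to be 58-bit unsigned integers.  Their sum
%% fits in 59 bits, still an immediate integer, so one mask brings
%% it back to 58 bits without any bignum appearing.
add58(A, B) -&gt;
    (A + B) band ((1 bsl 58) - 1).
</code></pre>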

<p>The algorithms that perform well in Erlang are the ones
that have been redesigned to work on 58-bit integers.
But still, when executed in Erlang, they are far from
as fast as their C origins.  Achieving good PRNG quality
costs much more in Erlang than in C.  In the section
<a href="#measurement-results">Measurement results</a> we see that the algorithm <code>exsp</code>
that boasts sub-ns speed in C needs 17 ns in Erlang.</p>

<p>32-bit Erlang is a sad story in this regard.  The bignum limit
on such an Erlang system is so low (calculations would have
to use 26-bit integers) that a PRNG designed
not to use bignums must be so small in period and value size
that it becomes too weak to be useful.
The known trick <code>erlang:phash2(erlang:unique_integer(), Range)</code>
is still fairly fast, but all <code>rand</code> generators work exactly the same
as on a 64-bit system, hence operate on bignums and are much slower.</p>

<p>If your application needs a “random” integer for a non-critical
purpose such as selecting a worker, choosing a route, etc,
and performance is much more important than repeatability
and statistical quality, what are then the options?</p>

<h2 id="suggested-solutions">Suggested solutions</h2>

<ul>
  <li><a href="#use-rand-anyway">Use <code>rand</code> anyway</a></li>
  <li><a href="#write-a-bif">Write a BIF</a></li>
  <li><a href="#write-a-nif">Write a NIF</a></li>
  <li><a href="#use-the-system-time">Use the system time</a></li>
  <li><a href="#hash-a-unique-value">Hash a “unique” value</a></li>
  <li><a href="#write-a-simple-prng">Write a simple PRNG</a></li>
</ul>

<p>Reasoning and measurement results are in the following sections,
but, in short:</p>

<ul>
  <li>Writing a NIF, we deemed, does not achieve a performance worth the effort.</li>
  <li>Neither does writing a BIF.  But, … a BIF (and a NIF, maybe) could
implement a combination of performance and quality that cannot
be achieved in any other way.  If a high demand on this combination
would emerge, we could reconsider this decision.</li>
  <li>Using the system time is a bad idea.</li>
  <li><code>erlang:phash2(erlang:unique_integer(), Range)</code> has its use cases.</li>
  <li>We have implemented a simple PRNG to fill the niche of non-critical
but very fast number generation: <code>mwc59</code>.</li>
</ul>

<h3 id="use-rand-anyway">Use <code>rand</code> anyway</h3>

<p>Is <code>rand</code> slow, really?  Well, perhaps not considering what it does.</p>

<p>The <a href="#measurement-results">Measurement results</a> at the end of this text
show that generating a good quality random number using
the <code>rand</code> module’s default algorithm takes 45 ns.</p>

<p>Generating a number as fast as possible (<code>rand:mwc59/1</code>) can be done
in less than 4 ns, but that algorithm has problems with the
statistical quality.  See section <a href="#prng-tests">PRNG tests</a> and <a href="#implementing-a-prng">Implementing a PRNG</a>.</p>

<p>Using a good quality algorithm instead (<code>rand:exsp_next/1</code>) takes 16 ns,
if you can store the generator’s state in a loop variable.</p>

<p>If you cannot store the generator state in a loop variable,
there will be more overhead; see section <a href="#storing-the-state">Storing the state</a>.</p>

<p>Now, if you also need a number in an awkward range, as in not much smaller
than the generator’s size, you might have to implement a reject-and-resample
loop, or even concatenate numbers.</p>

<p>The overhead of code that has to re-implement this many of the features
that the <code>rand</code> module already offers will easily approach
the module’s own 26 ns overhead, so often there is no point in
re-inventing this wheel…</p>

<h3 id="write-a-bif">Write a BIF</h3>

<p>There has been a discussion thread on Erlang Forums:
<a href="https://erlangforums.com/t/looking-for-a-faster-rng/">Looking for a faster RNG</a>.  Triggered by this Andrew Bennett
(aka <a href="https://github.com/potatosalad/">potatosalad</a>) wrote an <a href="https://erlangforums.com/t/looking-for-a-faster-rng/1163/17">experimental BIF</a>.</p>

<p>The suggested BIF <code>erlang:random_integer(Range)</code> offered
no repeatability, generator state per scheduler, guaranteed
sequence separation between schedulers, and high generator
quality.  All this thanks to using one of the good generators from
the <code>rand</code> module, but now written in its original
programming language, C, in the BIF.</p>

<p>The performance was a bit slower than the <code>mwc59</code> generator state update,
but with top of the line quality. See section <a href="#measurement-results">Measurement results</a>.</p>

<p>Questions arose regarding the maintenance burden, what more to implement, etc.
For example, we would probably also need <code>erlang:random_integer/0</code>,
<code>erlang:random_float/0</code>, and some system info
to get the generator bit size…</p>

<p>A BIF could achieve good performance on a 32-bit system too, if it
would return a 27-bit integer there, which became another open question.
Should a BIF generator be platform independent with respect to
generated numbers or with respect to performance?</p>

<h3 id="write-a-nif">Write a NIF</h3>

<p><a href="https://github.com/potatosalad/">potatosalad</a> also wrote <a href="https://erlangforums.com/t/looking-for-a-faster-rng/1163/23">a NIF</a>, since we (The Erlang/OTP team)
suggested that it could have good enough performance.</p>

<p>Measurements, however, showed that the overhead is significantly larger
than for a BIF.  Although the NIF used the same trick as the BIF to store
the state in thread specific data it ended up with the same
performance as <code>erlang:phash2(erlang:unique_integer(), Range)</code>,
which is about 2 to 3 times slower than the BIF.</p>

<p>As a speed improvement we tried to have the NIF generate
a list of numbers, and use that list as a cache in Erlang.
The performance with such a cache was as fast as the BIF’s,
but it introduced problems: you would have to decide
on a cache size, the application would have to keep the cache on the heap,
and when generating in a number range the whole cache would have
to be generated in the same range.</p>

<p>A NIF could like a BIF also achieve good performance on a 32-bit system,
with the same open question — platform independent numbers or performance?</p>

<h3 id="use-the-system-time">Use the system time</h3>

<p>One suggested trick is to use <code>os:system_time(microseconds)</code> to get
a number.  The trick has some peculiarities:</p>
<ul>
  <li>When called repeatedly you might get the same number several times.</li>
  <li>The resolution is system dependent, so on some systems you get
the same number even more often.</li>
  <li>Time can jump backwards and repeat in some cases.</li>
  <li>Historically it has been a bottleneck, especially on virtualized
platforms.  Getting the OS time is harder than expected.</li>
</ul>

<p>See section <a href="#measurement-results">Measurement results</a> for the performance of this “solution”.</p>

<h3 id="hash-a-unique-value">Hash a “unique” value</h3>

<p>The best combination would most certainly be
<code>erlang:phash2(erlang:unique_integer(), Range)</code> or
<code>erlang:phash2(erlang:unique_integer())</code> which is slightly faster.</p>

<p><code>erlang:unique_integer/0</code> is designed to return a unique integer
with a very small overhead.  It is hard to find a better candidate
for an integer to hash.</p>

<p><code>erlang:phash2/1,2</code> is the current generic hash function for Erlang terms.
It has a default return size well suited for 32-bit Erlang systems,
and it has a <code>Range</code> argument.  The range capping is done with a simple
<code>rem</code> in C (<code>%</code>), which is much faster than in Erlang.  This works well
only for ranges much smaller than 32 bits: if the range is larger
than 16 bits, the bias introduced by the range capping starts to be noticeable.</p>

<p>Alas this solution does not perform well in <a href="#prng-tests">PRNG tests</a>.</p>

<p>See section <a href="#measurement-results">Measurement results</a> for the performance of this solution.</p>

<h3 id="write-a-simple-prng">Write a simple PRNG</h3>

<p>To be fast, the implementation of a PRNG algorithm cannot
execute many operations.  The operations have to be
on immediate values (not bignums), and the return
value from a function has to be an immediate value
(a compound term would burden the garbage collector).
This seriously limits how powerful the algorithms can be.</p>

<p>We wrote one and named it <code>mwc59</code> because it has a 59-bit
state, and the most thorough scrambling function returns
a 59-bit value.  There is also a faster, intermediate scrambling
function, that returns a 32-bit value, which is the “digit” size
of the MWC generator.  It is also possible to directly
use the low 16 bits of the state without scrambling.
See section <a href="#implementing-a-prng">Implementing a PRNG</a> for how this generator
was designed and why.</p>
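<p>For example (a sketch), drawing a 16-bit value directly from the state:</p>

<pre><code class="language-erlang">%% T0 is an mwc59 state, for example from rand:mwc59_seed().
T1 = rand:mwc59(T0),            %% update the state
V = T1 band ((1 bsl 16) - 1).   %% the low 16 bits need no scrambling
</code></pre>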

<p>As another gap filler between really fast with low quality,
and full featured, an internal function in <code>rand</code> has been exported:
<code>rand:exsp_next/1</code>.  This function implements Xoroshiro116+ that exists
within the <code>rand</code> plug-in framework as algorithm <code>exsp</code>.
It has been exported so it is possible to get good quality without
the plug-in framework overhead, for applications that do not
need any framework features.</p>

<p>See section <a href="#measurement-results">Measurement results</a> for speed comparisons.</p>

<h2 id="quality">Quality</h2>

<p>There are many different aspects of a PRNG’s quality.
Here are some.</p>

<h3 id="period">Period</h3>

<p><code>erlang:phash2(erlang:unique_integer(), Range)</code> has, conceptually,
an infinite period, since the time it will take for it to repeat
is assumed to be longer than the Erlang node will survive.</p>

<p>For the new fast <code>mwc59</code> generator the period is about 2<sup>59</sup>.
For the regular ones in <code>rand</code> it is at least 2<sup>116</sup> - 1,
which is a huge difference.  It might be possible to consume
2<sup>59</sup> numbers during an Erlang node’s lifetime,
but not 2<sup>116</sup>.</p>

<p>There are also generators in <code>rand</code> with a period of
2<sup>928</sup> - 1 which might seem ridiculously long,
but this facilitates generating very many parallel sub-sequences
guaranteed to not overlap.</p>

<p>In, for example, a physical simulation it is common practice to use only
a fraction of the generator’s period, both regarding how many numbers
you generate and how large a range you generate in; otherwise it may affect
the simulation, for example in that specific numbers do not reoccur.
If you have pulled 3 aces from a deck you know there is only one left.</p>

<p>Some applications may be sensitive to the generator period,
while others are not, and this needs to be considered.</p>

<h3 id="size">Size</h3>

<p>The value size of the new fast <code>mwc59</code> generators is 59, 32, or 16 bits,
depending on the scrambling function that is used.
Most of the regular generators in the <code>rand</code> module have got
a value size of 58 bits.</p>

<p>If you need numbers in a power of 2 range then you can
simply mask out the low bits:</p>

<pre><code class="language-erlang">V = X band ((1 bsl RangeBits) - 1).
</code></pre>

<p>Or shift down the required number of bits:</p>

<pre><code class="language-erlang">V = X bsr (GeneratorBits - RangeBits).
</code></pre>

<p>Which to use depends on whether the generator is known to have weak high or low bits.</p>

<p>If the range you need is not a power of 2, but still
much smaller than the generator’s size you can use <code>rem</code>:</p>

<pre><code class="language-erlang">V = X rem Range.
</code></pre>

<p>The rule of thumb is that <code>Range</code> should be less than
the square root of the generator’s size.  This is much slower
than bit-wise operations, and the operation propagates low bits,
which can be a problem if the generator is known to have weak low bits.</p>

<p>Another way is to use truncated multiplication:</p>

<pre><code class="language-erlang">V = (X * Range) bsr GeneratorBits
</code></pre>

<p>The rule of thumb here is that <code>Range</code> should be less than
the square root of 2<sup>GeneratorBits</sup>, that is,
2<sup>GeneratorBits/2</sup>.  Also, <code>X * Range</code>
should not create a bignum, so not more than 59 bits.
This method propagates high bits, which can be a problem
if the generator is known to have weak high bits.</p>

<p>Other tricks are possible, for example if you need numbers
in the range 0 through 999 you may use bit-wise operations to get
a number 0 through 1023, and if too high re-try, which actually
may be faster on average than using <code>rem</code>.  This method is also
completely free from bias in the generated numbers.  The rules of thumb
for the previous methods are there to keep the bias so small
that it becomes hard to notice.</p>
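<p>A hedged sketch of that re-try trick, where <code>next/1</code> is a placeholder for any state-updating generator whose new state has good low bits, for example <code>rand:mwc59/1</code>:</p>

<pre><code class="language-erlang">%% Unbiased number in 0..999: take 10 bits, re-try when too high.
uniform1000(T0) -&gt;
    T1 = next(T0),
    case T1 band 1023 of
        V when V &lt; 1000 -&gt; {V, T1};  %% accept
        _ -&gt; uniform1000(T1)         %% re-try, about 2.3% of the draws
    end.
</code></pre>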

<h3 id="spectral-score">Spectral score</h3>

<p>The spectral score of a generator measures how unrelated the numbers
in a sequence from the generator are.  A sequence of N numbers is interpreted
as an N-dimensional vector, and the spectral score for dimension N is a measure
of how evenly these vectors are distributed in an N-dimensional (hyper)cube.</p>

<p><code>os:system_time(microseconds)</code> simply increments so it should have
a lousy spectral score.</p>

<p><code>erlang:phash2(erlang:unique_integer(), Range)</code> has got an unknown
spectral score, since that is not part of the math behind a hash function.
But a hash function is designed to distribute the hash value well
for any input, so one can hope that the statistical
distribution of the numbers is decent and “random” anyway.
Unfortunately this does not seem to hold in <a href="#prng-tests">PRNG tests</a>.</p>

<p>All regular PRNG:s in the <code>rand</code> module have got good spectral scores.
The new <code>mwc59</code> generator mostly has too, but not in 2 and 3 dimensions,
due to its unbalanced design and power of 2 multiplier.
Scramblers are used to compensate for those flaws.</p>

<h3 id="prng-tests">PRNG tests</h3>

<p>There are test frameworks that tests the statistical properties
of PRNG:s, such as the <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a> framework, or <a href="http://pracrand.sourceforge.net/">PractRand</a>.</p>

<p>The regular generators in the <code>rand</code> module perform well
in such tests, and pass thorough test suites.</p>

<p>Although the <code>mwc59</code> generator passes <a href="http://pracrand.sourceforge.net/">PractRand</a> 2 TB
and <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a> with its low 16 bits without any scrambling,
its statistical problems show when the test parameters
are tweaked just a little.  To perform well in more cases,
and with more bits, scrambling functions are needed.
Still, the small state space and the flaws of the base generator
makes it hard to pass all tests with flying colors.
With the thorough double Xorshift scrambler it gets very good, though.</p>

<p><code>erlang:phash2(N, Range)</code> over an incrementing sequence does not do well
in <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a>, which suggests that a hash function has got different
design criteria from PRNG:s.</p>

<p>However, these kind of tests may be completely irrelevant
for your application.</p>

<h3 id="predictability">Predictability</h3>

<p>For some applications, a generated number may even have to be
cryptographically unpredictable, while for others there are
no strict requirements.</p>

<p>There is a grey-zone for “non-critical” applications where for example
a rogue party may be able to affect input data, and if it knows the PRNG
sequence can steer all data to a hash table slot, overload one particular
worker process, or something similar, and in this way attack an application.
And, an application that starts out as “non-critical” may one day
silently have become business critical…</p>

<p>This is an aspect that needs to be considered.</p>

<h2 id="storing-the-state">Storing the state</h2>

<p>If the state of a PRNG can be kept in a loop variable, the cost
can be almost nothing.  But as soon as it has to be stored in
a heap variable it will cost performance due to heap data
allocation, term building, and garbage collection.</p>

<p>In the section <a href="#measurement-results">Measurement results</a> we see that the fastest PRNG
can generate a new state that is also the generated integer
in just under 4 ns.  Unfortunately, just to return both
the value and the new state in a 2-tuple adds roughly 10 ns.</p>

<p>The application state in which the PRNG state must be stored
is often more complex, so the cost for updating it will
probably be even larger.</p>
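<p>A sketch of the cheap case, where the state lives in a loop variable of a tail-recursive function:</p>

<pre><code class="language-erlang">%% Generate and consume N numbers; the mwc59 state is threaded
%% through the loop, so no heap data is built for it.
loop(0, _T, Acc) -&gt; Acc;
loop(N, T0, Acc) -&gt;
    T1 = rand:mwc59(T0),
    loop(N - 1, T1, Acc + (T1 band 1)).
</code></pre>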

<h2 id="seeding">Seeding</h2>

<p>Seeding is related to predictability.  If you can guess
the seed you know the generator output.</p>

<p>The seed is generator dependent, and creating a good
seed usually takes much longer than generating a number.
Sometimes the seed and its predictability are so unimportant
that a constant can be used.  If a generator instance
generates just a few numbers per seeding, then seeding
can be the harder problem.</p>

<p><code>erlang:phash2(erlang:unique_integer(), Range)</code> is pre-seeded,
or rather cannot be seeded, so it has no seeding cost, but can
on the other hand be rather predictable, if it is possible to estimate
how many unique integers that have been generated since node start.</p>

<p>The default seeding in the <code>rand</code> module uses a combination
of a hash value of the node name, the system time,
and <code>erlang:unique_integer()</code>, to create a seed,
which is hopefully sufficiently unpredictable.</p>
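<p>For example (a sketch), an explicit constant seed gives a repeatable, predictable sequence, while the dedicated <code>mwc59</code> seed function creates a reasonably unpredictable state:</p>

<pre><code class="language-erlang">%% Repeatable by design: a constant seed for the default generator.
_ = rand:seed(exsss, {1, 2, 3}),
%% Unpredictable-ish: seed the mwc59 generator from varying sources.
T0 = rand:mwc59_seed().
</code></pre>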

<p>The suggested NIF and BIF solutions would also need
a way to create a good enough seed, where “good enough”
is hard to put a number on.</p>

<h2 id="jit-optimizations">JIT optimizations</h2>

<p>The speed of the newly implemented <code>mwc59</code> generator
is partly thanks to the recent <a href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/">type-based optimizations</a> in the compiler
and the Just-In-Time compiling BEAM code loader.</p>

<h3 id="with-no-type-based-optimization">With no type-based optimization</h3>

<p>This is the Erlang code for the <code>mwc59</code> generator:</p>

<pre><code class="language-erlang">mwc59(CX) -&gt;
    C = CX band ((1 bsl 32)-1),
    X = CX bsr 32,
    16#7fa6502 * X + C.
</code></pre>

<p>The code compiles to this Erlang BEAM assembler (<code>erlc -S rand.erl</code>),
using the <code>no_type_opt</code> flag to disable type-based optimizations:</p>

<pre><code class="language-text">    {gc_bif,'bsr',{f,0},1,[{x,0},{integer,32}],{x,1}}.
    {gc_bif,'band',{f,0},2,[{x,0},{integer,4294967295}],{x,0}}.
    {gc_bif,'*',{f,0},2,[{x,0},{integer,133850370}],{x,0}}.
    {gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
</code></pre>

<p>When loaded by the JIT (x86) (<code>erl +JDdump true</code>)
the machine code becomes:</p>

<pre><code class="language-nasm"># i_bsr_ssjd
    mov rsi, qword ptr [rbx]
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L2271
</code></pre>

<p>Above is a test of whether <code>{x,0}</code> is a small integer; if not,
the fallback at <code>L2271</code> is called to handle any term.</p>

<p>Then follows the machine code for right shift, Erlang <code>bsr 32</code>,
x86 <code>sar rax, 32</code>, and a skip over the fallback code:</p>

<pre><code class="language-nasm">    mov rax, rsi
    sar rax, 32
    or rax, 15
    short jmp L2272
L2271:
    mov eax, 527
    call 140439031217336
L2272:
    mov qword ptr [rbx+8], rax
# line_I
</code></pre>

<p>Here follows <code>band</code> with similar test and fallback code:</p>

<pre><code class="language-nasm"># i_band_ssjd
    mov rsi, qword ptr [rbx]
    mov rax, 68719476735
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L2273
    and rax, rsi
    short jmp L2274
L2273:
    call 140439031216768
L2274:
    mov qword ptr [rbx], rax
</code></pre>

<p>Below comes <code>*</code> with test, fallback code, and overflow check:</p>

<pre><code class="language-nasm"># line_I
# i_times_jssd
    mov rsi, qword ptr [rbx]
    mov edx, 2141605935
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L2276
# mul with overflow check, imm RHS
    mov rax, rsi
    mov rcx, 133850370
    and rax, -16
    imul rax, rcx
    short jo L2276
    or rax, 15
    short jmp L2275
L2276:
    call 140439031220000
L2275:
    mov qword ptr [rbx], rax
</code></pre>

<p>The following is <code>+</code> with tests, fallback code, and overflow check:</p>

<pre><code class="language-nasm"># i_plus_ssjd
    mov rsi, qword ptr [rbx]
    mov rdx, qword ptr [rbx+8]
# are both operands small?
    mov eax, esi
    and eax, edx
    and al, 15
    cmp al, 15
    short jne L2278
# add with overflow check
    mov rax, rsi
    mov rcx, rdx
    and rcx, -16
    add rax, rcx
    short jno L2277
L2278:
    call 140439031219296
L2277:
    mov qword ptr [rbx], rax
</code></pre>

<h3 id="with-type-based-optimization">With type-based optimization</h3>

<p>When the compiler can figure out type information about the arguments
it can emit more efficient code.  One would like to add a guard
that restricts the argument to a 59 bit integer, but unfortunately
the compiler cannot yet make use of such a guard test.</p>

<p>But adding a redundant input bit mask to the Erlang code puts the compiler
on the right track.  This is a kludge, and will only be used
until the compiler has been improved to deduce the same information
from a guard instead.</p>

<p>The Erlang code now has a first redundant mask to 59 bits:</p>

<pre><code class="language-erlang">mwc59(CX0) -&gt;
    CX = CX0 band ((1 bsl 59)-1),
    C = CX band ((1 bsl 32)-1),
    X = CX bsr 32,
    16#7fa6502 * X + C.
</code></pre>

<p>With the default type-based optimizations in the compiler
of the OTP 25.0 release, the BEAM assembler then becomes:</p>

<pre><code class="language-text">    {gc_bif,'band',{f,0},1,[{x,0},{integer,576460752303423487}],{x,0}}.
    {gc_bif,'bsr',{f,0},1,[{tr,{x,0},{t_integer,{0,576460752303423487}}},
             {integer,32}],{x,1}}.
    {gc_bif,'band',{f,0},2,[{tr,{x,0},{t_integer,{0,576460752303423487}}},
             {integer,4294967295}],{x,0}}.
    {gc_bif,'*',{f,0},2,[{tr,{x,0},{t_integer,{0,4294967295}}},
             {integer,133850370}],{x,0}}.
    {gc_bif,'+',{f,0},2,[{tr,{x,0},{t_integer,{0,572367635452168875}}},
             {tr,{x,1},{t_integer,{0,134217727}}}],{x,0}}.
</code></pre>

<p>Note that after the initial input <code>band</code> operation,
type information <code>{tr,{x,_},{t_integer,Range}}</code> has been propagated
all the way down.</p>

<p>Now the JIT:ed code becomes noticeably shorter.</p>

<p>The input mask operation knows nothing about the value so it has
the operand test and the fallback to any term code:</p>

<pre><code class="language-nasm"># i_band_ssjd
    mov rsi, qword ptr [rbx]
    mov rax, 9223372036854775807
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L1816
    and rax, rsi
    short jmp L1817
L1816:
    call 139812177115776
L1817:
    mov qword ptr [rbx], rax
</code></pre>

<p>For all the following operations, operand tests and fallback code
has been optimized away to become a straight sequence of machine code:</p>

<pre><code class="language-nasm"># line_I
# i_bsr_ssjd
    mov rsi, qword ptr [rbx]
# skipped test for small left operand because it is always small
    mov rax, rsi
    sar rax, 32
    or rax, 15
L1818:
L1819:
    mov qword ptr [rbx+8], rax
# line_I
# i_band_ssjd
    mov rsi, qword ptr [rbx]
    mov rax, 68719476735
# skipped test for small operands since they are always small
    and rax, rsi
    mov qword ptr [rbx], rax
# line_I
# i_times_jssd
# multiplication without overflow check
    mov rax, qword ptr [rbx]
    mov esi, 2141605935
    and rax, -16
    sar rsi, 4
    imul rax, rsi
    or rax, 15
    mov qword ptr [rbx], rax
# i_plus_ssjd
# add without overflow check
    mov rax, qword ptr [rbx]
    mov rsi, qword ptr [rbx+8]
    and rax, -16
    add rax, rsi
    mov qword ptr [rbx], rax
</code></pre>

<p>The execution time goes down from 3.7 ns to 3.3 ns, which is
10% faster, just by avoiding redundant checks and tests,
despite the added, strictly unnecessary, initial input mask operation.</p>

<p>And there is room for improvement.  The values are moved back and forth
to BEAM <code>{x,_}</code> registers (<code>qword ptr [rbx]</code>) between operations.
Moving back from the <code>{x,_}</code> register could be avoided by the JIT
since it is possible to know that the value is in a process register.
Moving out to the <code>{x,_}</code> register could be optimized away if the compiler
would emit the information that the value will not be used
from the <code>{x,_}</code> register after the operation.</p>

<h2 id="implementing-a-prng">Implementing a PRNG</h2>

<p>To create a really fast PRNG in Erlang there are some
limitations coming with the language implementation:</p>

<ul>
  <li>If the generator state is a complex term, that is, a heap term,
instead of an immediate value, state updates get much slower.
Therefore the state should be a max 59-bit integer.</li>
  <li>If an intermediate result creates a bignum, that is,
overflows 59 bits, arithmetic operations get much slower,
so intermediate results must produce values that fit in 59 bits.</li>
  <li>If the generator returns both a generated value
and a new state in a compound term, then, again,
updating heap data makes it much slower.  Therefore
a generator should only return an immediate integer state.</li>
  <li>If the returned state integer cannot be used as a generated number,
then a separate value function that operates on the state
can be used.  Two calls, however, double the call overhead.</li>
</ul>

<h3 id="lcg-and-mcg">LCG and MCG</h3>

<p>The first attempt was to try a classical power of 2
Linear Congruential Generator:</p>

<pre><code class="language-erlang">X1 = (A * X0 + C) band (P-1)
</code></pre>

<p>And a Multiplicative Congruential Generator:</p>

<pre><code class="language-erlang">X1 = (A * X0) rem P
</code></pre>

<p>To avoid bignum operations the product <code>A * X0</code>
must fit in 59 bits. The classical paper “Tables of
Linear Congruential Generators of Different Sizes and
Good Lattice Structure” by Pierre L’Ecuyer lists two generators
that are 35 bit, that is, an LCG with <code>P</code> = 2<sup>35</sup>
and an MCG with <code>P</code> being a prime number just below 2<sup>35</sup>.
These were the largest generators to be found for which
the multiplication did not overflow 59 bits.</p>

<p>The speed of the LCG is very good.  The MCG is less so, since it has
to do an integer division with <code>rem</code>, but thanks to <code>P</code> being
close to 2<sup>35</sup> that could be optimized so that it ended up
only about 50% slower than the LCG.</p>

<p>The short period and known quirks of a power of 2 LCG unfortunately
showed in <a href="#prng-tests">PRNG tests</a>.</p>

<p>They failed miserably.</p>

<h3 id="mwc">MWC</h3>

<p><a href="https://vigna.di.unimi.it/">Sebastiano Vigna</a> of the University of Milano, who also helped
design our current 58-bit Xorshift family generators,
suggested using a Multiply With Carry generator instead:</p>

<pre><code class="language-erlang">T  = A * X0 + C0,
X1 = T band ((1 bsl Bits)-1),
C1 = T bsr Bits.
</code></pre>

<p>This generator operates on “digits” of size <code>Bits</code>, and if a digit
is half a machine word then the multiplication does not overflow.
Instead of having the state as a digit <code>X</code> and a carry <code>C</code>, these
can be merged into a single state <code>T</code>.  We get:</p>

<pre><code class="language-erlang">X  = T0 band ((1 bsl Bits)-1),
C  = T0 bsr Bits,
T1 = A * X + C
</code></pre>

<p>An MWC generator is actually a different form of an MCG generator
with a power of 2 multiplier, so this is an equivalent generator:</p>

<pre><code class="language-erlang">T0 = (T1 bsl Bits) rem ((A bsl Bits) - 1)
</code></pre>

<p>In this form the generator updates the state in the reverse order,
hence <code>T0</code> and <code>T1</code> are swapped.  The modulus <code>(A bsl Bits) - 1</code>
has to be a safe prime number or else the generator
does not have maximum period.</p>

<h4 id="the-base-generator">The base generator</h4>

<p>Because the multiplier (or its multiplicative inverse) is a power of 2,
the MWC generator gets bad <a href="#spectral-score">Spectral score</a> in 3 dimensions,
so using a scrambling function on the state to get a number would
be necessary to improve the quality.</p>

<p>A search for a suitable digit size and multiplier started,
mostly done by using programs that try multipliers for
safe prime numbers and estimate spectral scores, such as <a href="https://github.com/vigna/CPRNG/">CPRNG</a>.</p>

<p>When the generator is balanced, that is, the multiplier <code>A</code>
has got close to <code>Bits</code> bits, the spectral scores are the best,
apart from the known problem in 3 dimensions.  But since a scrambling
function would be needed anyway there was an opportunity to
try to generate a comfortable 32-bit digit using a 27-bit multiplier.
With these sizes the product <code>A * X0</code> does not create a bignum,
and with a 32-bit digit it becomes possible to use standard
<a href="#prng-tests">PRNG tests</a> to test the generator during development.</p>

<p>Because of using such slightly unbalanced parameters, unfortunately
the spectral scores for 2 dimensions also get bad, but the scrambler
could solve that too…</p>

<p>The final generator is:</p>

<pre><code class="language-erlang">mwc59(T) -&gt;
    C = T bsr 32,
    X = T band ((1 bsl 32)-1),
    16#7fa6502 * X + C.
</code></pre>

<p>The 32-bit digits of this base generator do not perform very
well in <a href="#prng-tests">PRNG tests</a>, but actually the low 16 bits pass
2 TB in <a href="http://pracrand.sourceforge.net/">PractRand</a> and 1 TB with the bits reversed,
which is surprisingly good.  The problem of bad spectral scores
for 2 and 3 dimensions lies in the higher bits of the MWC digit.</p>

<h4 id="scrambling">Scrambling</h4>

<p>The scrambler has to be fast, that is, use only a few
fast operations.  For an arithmetic generator like this,
Xorshift is a suitable scrambler.  We looked at single
Xorshift, double Xorshift and double XorRot.  Double XorRot
was slower than double Xorshift but not better,
probably since the generator has got good low bits, so they
need to be shifted up to improve the high bits.
Rotating high bits down to the low ones is no improvement.</p>

<p>This is a single Xorshift scrambler:</p>

<pre><code class="language-erlang">V = T bxor (T bsl Shift)
</code></pre>

<p>When trying <code>Shift</code> constants it turned out that with a large
shift constant the generator performed better in <a href="http://pracrand.sourceforge.net/">PractRand</a>,
and with a small one it performed better in birthday spacing tests
(such as in <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a> BigCrush) and collision tests.
Alas, it was not possible to find a constant good for both.</p>

<p>The chosen single Xorshift constant is <code>8</code>, which passes
4 TB in <a href="http://pracrand.sourceforge.net/">PractRand</a> and BigCrush in <a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a>, but fails
more thorough birthday spacing tests.  The failures are few,
such as the lowest bit in 8 and 9 dimensions,
and some intermediate bits in 2 and 3 dimensions.
This is unlikely to affect most applications,
and if the high bits of the 32 generated bits are used,
these imperfections should stay under the rug.</p>

<p>The final scrambler has to avoid bignum operations
and masks the value to 32 bits so it looks like this:</p>

<pre><code class="language-erlang">mwc59_value32(T) -&gt;
    V0 = T  band ((1 bsl 32)-1),
    V1 = V0 band ((1 bsl (32-8))-1),
    V0 bxor (V1 bsl 8).
</code></pre>

<p>A better scrambler would be a double Xorshift that can
have both a small shift and a large shift.
Using the small shift <code>4</code> makes the combined generator
do very well in birthday spacings and collision tests,
and following up with a large shift <code>27</code> shifts the
whole improved 32-bit MWC digit all the way up
to the top bit of the generator’s 59-bit state.
That was the idea, and it turned out to work fine.</p>

<p>The double Xorshift scrambler produces a 59-bit
number where the low, the high, reversed low,
reversed high, etc… all perform very well in <a href="http://pracrand.sourceforge.net/">PractRand</a>,
<a href="http://simul.iro.umontreal.ca/testu01/">TestU01</a> BigCrush, and in exhaustive birthday spacing
and collision tests.  It is also not terribly much slower
than the single Xorshift scrambler.</p>

<p>Here is a double Xorshift scrambler 4 then 27:</p>

<pre><code class="language-erlang">V1 = T bxor (T bsl 4),
V  = V1 bxor (V1 bsl 27).
</code></pre>

<p>Which, avoiding bignum operations and producing a 59-bit value,
becomes the final scrambler:</p>

<pre><code class="language-erlang">mwc59_value(T) -&gt;
    V0 = T  band ((1 bsl (59-4))-1),
    V1 = T  bxor (V0 bsl 4),
    V2 = V1 band ((1 bsl (59-27))-1),
    V1 bxor (V2 bsl 27).
</code></pre>

<p>Many thanks to <a href="https://vigna.di.unimi.it/">Sebastiano Vigna</a> who has done most of
(practically all) the parameter searching and extensive testing
of the generator and scramblers, backed by knowledge of what could work.
Using an MWC generator in this particular way is rather uncharted
territory regarding the math, so extensive testing is
the way to trust the quality of the generator.</p>

<h2 id="rand_suitemeasure1"><code>rand_SUITE:measure/1</code></h2>

<p>The test suite for the <code>rand</code> module — <a href="https://github.com/erlang/otp/blob/master/lib/stdlib/test/rand_SUITE.erl"><code>rand_SUITE</code></a>,
in the Erlang/OTP source tree, contains a test case <a href="https://github.com/erlang/otp/blob/08f343bed4f75bf345b04b4c1fac7e1026a50ab3/lib/stdlib/test/rand_SUITE.erl#L1064"><code>measure/1</code></a>.
This test case is a micro-benchmark of all the algorithms
in the <code>rand</code> module, and some more.  It measures the execution
time in nanoseconds per generated number, and presents the
times both absolute and relative to the default algorithm
<code>exsss</code> that is considered to be 100%.  See <a href="#measurement-results">Measurement Results</a>.</p>

<p><a href="https://github.com/erlang/otp/blob/08f343bed4f75bf345b04b4c1fac7e1026a50ab3/lib/stdlib/test/rand_SUITE.erl#L1064"><code>measure/1</code></a> is also runnable without a test framework.
As long as <code>rand_SUITE.beam</code> is in the code path
<code>rand_SUITE:measure(N)</code> will run the benchmark with <code>N</code>
as an effort factor.  <code>N = 1</code> is the default and
for example <code>N = 5</code> gives a slower
and more thorough measurement.</p>

<p>The test case is divided in sections where each first runs
a warm-up with the default generator, then runs an empty
benchmark generator to measure the benchmark overhead,
and after that runs all generators for the specific section.
The benchmark overhead is subtracted from the presented
results after the overhead run.</p>

<p>The warm-up and overhead measurement &amp; compensation are
recent improvements to the <code>measure/1</code> test case.
Overhead has also been reduced by in-lining 10 PRNG iterations
per test case loop iteration, which got the overhead down to
one third of what it was without such in-lining, and the overhead is now
about as large as the fastest generator itself, approaching the
function call overhead in Erlang.</p>

<p>The different <code>measure/1</code> sections are different use cases such as
“uniform integer half range + 1”, etc.  Many of these test the performance
of plug-in framework features.  The test sections that are interesting
for this text are “uniform integer range 10000”, “uniform integer 32-bit”,
and “uniform integer full range”.</p>

<h2 id="measurement-results">Measurement results</h2>

<p>Here are some selected results from the author’s laptop
from running <code>rand_SUITE:measure(20)</code>:</p>

<p>The <code>{mwc59,Tag}</code> generator is <code>rand:mwc59/1</code>, where
<code>Tag</code> indicates whether the <code>raw</code> generator, the <code>rand:mwc59_value32/1</code>,
or the <code>rand:mwc59_value/1</code> scrambler was used.</p>

<p>The <code>{exsp,_}</code> generator is <code>rand:exsp_next/1</code> which
is a newly exported internal function that does not use
the plug-in framework.  When called from the plug-in
framework it is called <code>exsp</code> below.</p>

<p><code>unique_phash2</code> is <code>erlang:phash2(erlang:unique_integer(), Range)</code>.</p>

<p><code>system_time</code> is <code>os:system_time(microsecond)</code>.</p>

<pre><code class="language-text">RNG uniform integer range 10000 performance
                   exsss:     57.5 ns (warm-up)
                overhead:      3.9 ns      6.8%
                   exsss:     53.7 ns    100.0%
                    exsp:     49.2 ns     91.7%
         {mwc59,raw_mod}:      9.8 ns     18.2%
       {mwc59,value_mod}:     18.8 ns     35.0%
              {exsp,mod}:     22.5 ns     41.9%
          {mwc59,raw_tm}:      3.5 ns      6.5%
      {mwc59,value32_tm}:      8.0 ns     15.0%
        {mwc59,value_tm}:     11.7 ns     21.8%
               {exsp,tm}:     18.1 ns     33.7%
           unique_phash2:     23.6 ns     44.0%
             system_time:     30.7 ns     57.2%
</code></pre>

<p>The first two are the warm-up and overhead measurements.
The measured overhead is subtracted from all measurements
after the “overhead:” line.  The measured overhead here
is 3.9 ns which matches well that <code>exsss</code> measures
3.8 ns more during the warm-up run than after <code>overhead</code>.
The warm-up run is, however, a bit unpredictable.</p>

<p><code>{_,*mod}</code> and <code>system_time</code> all use <code>(X rem 10000) + 1</code>
to achieve the desired range.  The <code>rem</code> operation is expensive,
which we will see when comparing with the next section.</p>

<p><code>{_,*tm}</code> use truncated multiplication to achieve the range,
that is <code>((X * 10000) bsr GeneratorBits) + 1</code>,
which is much faster than using <code>rem</code>.</p>

<p><code>erlang:phash2/2</code> has got a range argument that performs
the <code>rem 10000</code> operation in the BIF, which is fairly cheap,
as we also will see when comparing with the next section.</p>

<pre><code class="language-text">RNG uniform integer 32 bit performance
                   exsss:     55.3 ns    100.0%
                    exsp:     51.4 ns     93.0%
        {mwc59,raw_mask}:      2.7 ns      4.9%
         {mwc59,value32}:      6.6 ns     12.0%
     {mwc59,value_shift}:      8.6 ns     15.5%
            {exsp,shift}:     16.6 ns     30.0%
           unique_phash2:     22.1 ns     40.0%
             system_time:     23.5 ns     42.6%
</code></pre>

<p>In this section, to generate a number in a 32-bit range,
<code>{mwc59,raw_mask}</code> and <code>system_time</code> use a bit mask
<code>X band 16#ffffffff</code>, <code>{_,*shift}</code> use <code>bsr</code>
to shift out the low bits, and <code>{mwc59,value32}</code> produces
the right range by construction.  Here we see that bit operations
are up to 10 ns faster than the <code>rem</code> operation in the previous section.
<code>{mwc59,raw_*}</code> is more than 3 times faster.</p>

<p>Compared to the truncated multiplication variants in the previous section,
the bit operations here are up to 3 ns faster.</p>
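<p>Expressed in Erlang, the bit-operation variants look like this
(a sketch; the shift count assumes a 59-bit raw value):</p>

<pre><code class="language-erlang">%% Keep the low 32 bits:
mask_32bit(V) -&gt; V band 16#ffffffff.

%% Keep the high 32 bits of a 59-bit value:
shift_32bit(V) -&gt; V bsr 27.
</code></pre>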

<p><code>unique_phash2</code> still uses BIF coded integer division to achieve
the range, which gives it about the same speed as in the previous section,
but it seems integer division with a power of 2 is a bit faster.</p>

<pre><code class="language-text">RNG uniform integer full range performance
                   exsss:     45.1 ns    100.0%
                    exsp:     39.8 ns     88.3%
                   dummy:     25.5 ns     56.6%
             {mwc59,raw}:      3.7 ns      8.3%
         {mwc59,value32}:      6.9 ns     15.2%
           {mwc59,value}:      8.5 ns     18.8%
             {exsp,next}:     16.8 ns     37.2%
       {splitmix64,next}:    331.1 ns    734.3%
           unique_phash2:     21.1 ns     46.8%
                procdict:     75.2 ns    166.7%
        {mwc59,procdict}:     16.6 ns     36.8%
</code></pre>

<p>In this section no range capping is done.  The raw generator output is used.</p>

<p>Here we have the <code>dummy</code> generator, which is an undocumented generator
within the <code>rand</code> plug-in framework that only does a minimal state
update and returns a constant.  It is used here to measure
plug-in framework overhead.</p>

<p>The plug-in framework overhead is measured at 25.5 ns, which matches
<code>exsp</code> - <code>{exsp,next}</code> = 23.0 ns fairly well;
that is the difference between running the same algorithm within and
outside the plug-in framework, giving another measure of the framework overhead.</p>

<p><code>procdict</code> is the default algorithm <code>exsss</code> but makes the plug-in
framework store the generator state in the process dictionary,
which here costs 30 ns.</p>

<p><code>{mwc59,procdict}</code> stores the generator state in the process dictionary,
which here costs 12.9 ns. The state term that is stored is much smaller
than for the plug-in framework.  Compare to <code>procdict</code>
in the previous paragraph.</p>
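<p>For reference, here is roughly what the two process dictionary
variants do, written out by hand.  This is a sketch using the
documented <code>rand</code> API; the dictionary key <code>mwc59_state</code> is made up
for the example:</p>

<pre><code class="language-erlang">%% Plug-in framework state in the process dictionary
%% (seeded automatically on first use):
procdict_uniform() -&gt;
    rand:uniform(10000).

%% Hand-rolled process dictionary storage for mwc59; the stored state
%% is a bare integer, much smaller than the framework's state term:
mwc59_procdict_uniform() -&gt;
    CX0 = case get(mwc59_state) of
              undefined -&gt; rand:mwc59_seed();
              CX -&gt; CX
          end,
    CX1 = rand:mwc59(CX0),
    put(mwc59_state, CX1),
    (rand:mwc59_value(CX1) rem 10000) + 1.
</code></pre>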

<h2 id="summary">Summary</h2>

<p>The <a href="#write-a-simple-prng">new fast</a> generator’s functions in the <code>rand</code> module
fills a niche for speed over quality where the type-based
<a href="#jit-optimizations">JIT optimizations</a> have elevated the performance.</p>

<p>The combination of high speed and high quality can only
be fulfilled with a <a href="#write-a-bif">BIF implementation</a>, but we hope that
is a combination we will not need to address…</p>

<p><a href="#implementing-a-prng">Implementing a PRNG</a> is tricky business.</p>

<p>Recent improvements in <a href="#rand_suitemeasure1"><code>rand_SUITE:measure/1</code></a>
highlight what the precious CPU cycles are used for.</p>]]></content><author><name>Raimo Niskanen</name></author><category term="BEAM" /><category term="JIT," /><category term="PRNG," /><category term="rand," /><category term="random" /><summary type="html"><![CDATA[When you need “random” integers, and it is essential to generate them fast and cheap, then maybe the full-featured Pseudo Random Number Generators in the rand module are overkill. This blog post dives into new additions to the said module, how the Just-In-Time compiler optimizes them, known tricks, and tries to compare these apples and potatoes.]]></summary></entry><entry><title type="html">Type-Based Optimizations in the JIT</title><link href="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/" rel="alternate" type="text/html" title="Type-Based Optimizations in the JIT" /><published>2022-04-26T00:00:00+00:00</published><updated>2022-04-26T00:00:00+00:00</updated><id>https://www.erlang.org/blog/type-based-optimizations-in-the-jit</id><content type="html" xml:base="https://www.erlang.org/blog/type-based-optimizations-in-the-jit/"><![CDATA[<p>This post explores the new type-based optimizations in Erlang/OTP 25 where
the compiler embeds type information in the BEAM files to help the
<a href="https://www.erlang.org/blog/a-first-look-at-the-jit/">JIT (Just-In-Time compiler)</a> to generate better code.</p>

<h3 id="the-best-of-both-worlds">The best of both worlds</h3>

<p>The <a href="https://blog.erlang.org/ssa-history/">SSA-based compiler passes</a> introduced in OTP 22 does a
sophisticated type analysis, which allows for more optimizations and
better code generation. There are, however, limits to what kind of
optimizations the Erlang compiler can do because a BEAM file must be
possible to load on any BEAM machine running on a 32-bit or 64-bit
computer. Therefore, the compiler cannot do optimizations that depend
on the size of integers that fit in a machine word or on how
Erlang terms are represented.</p>

<p>The JIT (introduced in OTP 24) knows that it is running on a 64-bit
computer and knows how Erlang terms are represented. The JIT is still
limited in how much optimization it can do because it translates a
single BEAM instruction at a time. For example, the <code>+</code> operator can
add floats or integers of any size or any combination
thereof. Previously executed BEAM instructions might have made it
clear that the operands can only be small integers, but the JIT does
not know that since it only looks at one instruction at a time, and
therefore it must emit native code that handles all possible operands.</p>

<p>In OTP 25, the compiler has been updated to embed type information in
the BEAM file and the JIT has been extended to emit better code based
on that type information.</p>

<p>The embedded type information is versioned so that we can continue to
improve the type-based optimizations in every OTP release. The loader
will ignore versions it does not recognize so that the module can
still be loaded without the type-based optimizations.</p>

<h3 id="what-to-expect-of-the-jit-in-otp-25">What to expect of the JIT in OTP 25</h3>

<p>OTP 25 is just the beginning for type-based optimizations. We hope to
improve both the type information from the compiler and the
optimizations in the JIT in OTP 26.</p>

<p>How much better the native code emitted by the JIT will be depends
on the nature of the code in the module.</p>

<p>The most commonly applied optimization is simplified tests. For
example, a test for a tuple can frequently be reduced from 5
instructions down to 3 instructions, and a test for small integer
operands can frequently be reduced from 5 instructions down to 4
instructions.</p>

<p>Less commonly applied but more significant are the simplifications
that can be made when an integer is known to be “small” (fits in 60
bits). For example, a relational operator (such as <code>&lt;</code>) used in a
guard can be reduced from 11 instructions down to 4 if the operands
are known to be small integers. This kind of optimization is most
often applied in modules that use binary pattern matching because
integers matched out from a binary have a well-defined range.</p>

<p>In the Erlang/OTP code base, the first kind of optimization (shaving
off one or two instructions) is applied roughly ten times as often as
the second kind.</p>

<p>We will see later in this blog post that the optimizations of the
second kind applied to the <code>base64</code> module resulted in a significant
speed up.</p>

<h3 id="simplifications-of-type-tests">Simplifications of type tests</h3>

<p>Let’s dive right into some examples.</p>

<p>Consider this module:</p>

<pre><code class="language-erlang">-module(example).
-export([tuple_matching/1]).

tuple_matching(X) -&gt;
    case increment(X) of
        {ok,Result} -&gt; Result;
        error -&gt; X
    end.

increment(X) when is_integer(X) -&gt; {ok,X+1};
increment(_) -&gt; error.
</code></pre>

<p>The <a href="https://www.erlang.org/blog/a-brief-beam-primer">BEAM code</a> for the <code>tuple_matching/1</code> function emitted
by the compiler in OTP 24 is (somewhat simplified):</p>

<pre><code>    {allocate,1,1}.
    {move,{x,0},{y,0}}.
    {call,1,{f,5}}.
    {test,is_tuple,{f,3},[{x,0}]}.
    {get_tuple_element,{x,0},1,{x,0}}.
    {deallocate,1}.
    return.
  {label,3}.
    {move,{y,0},{x,0}}.
    {deallocate,1}.
    return.
</code></pre>

<p>The compiler has figured out that <code>increment/1</code> returns either the
atom <code>error</code> or a two-tuple with <code>ok</code> as the first element. Therefore,
to distinguish between those two possible return values, a single
instruction suffices:</p>

<pre><code>    {test,is_tuple,{f,3},[{x,0}]}.
</code></pre>

<p>There is no need to explicitly test for the value <code>error</code> because it
<strong>must</strong> be <code>error</code> if it is not a tuple. Similarly, there is no need
to test that the first element of the tuple is <code>ok</code> because it must be.</p>

<p>In OTP 24, the JIT translates that instruction to a sequence of 5 native
instructions for x86_64:</p>

<pre><code class="language-nasm"># i_is_tuple_fs
    mov rsi, qword ptr [rbx]
    rex test sil, 1
    jne L2
    test byte ptr [rsi-2], 63
    jne L2
</code></pre>

<p>(Lines starting with <code>#</code> are comments.)</p>

<p>The <code>mov</code> instruction fetches the value of the BEAM register <code>{x,0}</code>
to the CPU register <code>rsi</code>. The next two instructions test whether the
term is a pointer to an object on the heap. If it is, the header word
for the heap object is tested to make sure it is a tuple. The second
test is needed because the heap object could be some other Erlang term,
such as a binary, a map, or an integer that does not fit in a machine
word.</p>

<p>Now let’s see what the compiler and the JIT in OTP 25 do with this
instruction. The BEAM code is now:</p>

<pre><code class="language-text">    {test,is_tuple,
          {f,3},
          [{tr,{x,0},
               {t_union,{t_atom,[error]},
                        none,none,
                        [{{2,{t_atom,[ok]}},
                          {t_tuple,2,true,
                                   #{1 =&gt; {t_atom,[ok]},
                                     2 =&gt; {t_integer,any}}}}],
                        none}}]}.
</code></pre>

<p>The operand that was <code>{x,0}</code> in OTP 24 is now a tuple:</p>

<pre><code class="language-erlang">{tr,Register,Type}
</code></pre>

<p>That is, it is a three-tuple with <code>tr</code> as the first element. <code>tr</code>
stands for <strong>typed register</strong>. The second element is the BEAM register
(<code>{x,0}</code> in this case), and the third element is the type of the
register in the compiler’s internal type representation. The type
is equivalent to the following type spec:</p>

<pre><code class="language-erlang">'error' | {'ok', integer()}
</code></pre>

<p>The JIT cannot take advantage of that level of detail in the types,
so the compiler embeds a <a href="https://github.com/erlang/otp/blob/de5bb49320db22159de52e677c5f7499b763b0cd/lib/compiler/src/beam_types.erl#L1153-L1241">simplified
version</a>
of that type into the BEAM file. The embedded type is equivalent to:</p>

<pre><code class="language-erlang">atom() | tuple()
</code></pre>

<p>By knowing that <code>{x,0}</code> must be an atom or a tuple, the JIT in OTP 25
emits the following simplified native code:</p>

<pre><code class="language-nasm"># i_is_tuple_fs
    mov rsi, qword ptr [rbx]
# simplified tuple test since the source is always a tuple when boxed
    rex test sil, 1
    jne label_3
</code></pre>

<p>(The JIT generally emits a comment when type information made a simplification
possible.)</p>

<p>Only the first test is now necessary, because if the term is a pointer
to a heap object, according to the type information, it <strong>must</strong> be a tuple.</p>

<h3 id="simplification-of-relational-operators">Simplification of relational operators</h3>

<p>As another example, let’s look at how the relational operators in
guards are translated. Consider this function:</p>

<pre><code class="language-erlang">my_less_than(A, B) -&gt;
    if
        A &lt; B -&gt; smaller;
        true -&gt; larger_or_equal
    end.
</code></pre>

<p>The BEAM code looks like this:</p>

<pre><code>    {test,is_lt,{f,9},[{x,0},{x,1}]}.
    {move,{atom,smaller},{x,0}}.
    return.
  {label,9}.
    {move,{atom,larger_or_equal},{x,0}}.
    return.
</code></pre>

<p>When relational operators are used as guard tests, the compiler rewrites
them as special instructions. Thus, the <code>&lt;</code> operator is rewritten to an
<code>is_lt</code> instruction.</p>

<p>The <code>&lt;</code> operator can compare any Erlang terms. It would be impractical
for the JIT to emit the code to handle all kinds of terms. Therefore, the
JIT emits code that directly handles the most common case and
calls a generic routine to handle everything else:</p>

<pre><code class="language-nasm"># is_lt_fss
    mov rsi, qword ptr [rbx+8]
    mov rdi, qword ptr [rbx]
    mov eax, edi
    and eax, esi
    and al, 15
    cmp al, 15
    short jne L39
    cmp rdi, rsi
    short jmp L40
L39:
    call 5447639136
L40:
    jge label_9
</code></pre>

<p>Let’s walk through the code. The first two instructions:</p>

<pre><code class="language-nasm">    mov rsi, qword ptr [rbx+8]
    mov rdi, qword ptr [rbx]
</code></pre>

<p>fetch the BEAM registers <code>{x,1}</code> and <code>{x,0}</code> into CPU registers.</p>

<p>The most common comparison is between two integers. Depending on the
magnitude, integers can be represented in two different ways. On a 64-bit
computer, signed integers that fit in 60 bits are stored directly
in a 64-bit word. The remaining 4 bits in the word are used for the
<a href="http://www.it.uu.se/research/publications/reports/2000-029/2000-029-nc.pdf">tag</a>, which for a small integer is <code>15</code>. If the integer does
not fit, it is represented as a <strong>bignum</strong>, which is a pointer to
an object on the heap.</p>
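<p>As an illustration of the tagging scheme just described (a sketch,
not VM code), a non-negative small integer <code>N</code> is stored as the word
<code>(N bsl 4) bor 15</code>:</p>

<pre><code class="language-erlang">%% Tagged word for a non-negative small integer (sketch):
tag_small(N) when 0 =&lt; N, N &lt; (1 bsl 59) -&gt;
    (N bsl 4) bor 15.
</code></pre>

<p>For example, <code>tag_small(1023)</code> is <code>16383</code>, which is exactly the
tagged immediate for <code>16#3FF</code> that we will see the JIT load in a
later example in this post.</p>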

<p>Here is the native code for testing that both operands are small:</p>

<pre><code class="language-nasm">    mov eax, edi
    and eax, esi
    and al, 15
    cmp al, 15
    short jne L39
</code></pre>

<p>If one or both of the operands have another tag than <code>15</code> (are not
small integers), control is transferred to code at label <code>L39</code> that
handles all other types of terms.</p>

<p>The next lines do the comparison of the small integers. The code is
written in a slightly convoluted way so that the conditional jump
(<code>jge label_9</code>) that transfers control to the failure label can be
shared with the generic code:</p>

<pre><code class="language-nasm">    cmp rdi, rsi
    short jmp L40
L39:
    call 5447639136
L40:
    jge label_9
</code></pre>

<p>Thus, without type information, 11 instructions are needed to implement
<code>is_lt</code>.</p>

<p>Now let’s see what happens when types are available:</p>

<pre><code class="language-erlang">my_less_than(A, B) when is_integer(A), is_integer(B) -&gt;
    .
    .
    .
</code></pre>

<p>When compiled by the compiler in OTP 25, the BEAM code is:</p>

<pre><code>    {test,is_integer,{f,7},[{x,0}]}.
    {test,is_integer,{f,7},[{x,1}]}.
    {test,is_lt,{f,9},[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}]}.
    {move,{atom,smaller},{x,0}}.
    return.
  {label,9}.
    {move,{atom,larger_or_equal},{x,0}}.
    return.
</code></pre>

<p>The operands for the <code>is_lt</code> instruction now have types. The BEAM
registers <code>{x,0}</code> and <code>{x,1}</code> have the type <code>{t_integer,any}</code>, which
means an integer with an unknown range.</p>

<p>Having that knowledge of the types, the JIT can emit a slightly
shorter test for a small integer:</p>

<pre><code class="language-nasm"># simplified small test since all other types are boxed
    mov eax, edi
    and eax, esi
    test al, 1
    short je L39
</code></pre>

<p>To do a better job, the JIT will need better type information. For example:</p>

<pre><code class="language-erlang">map_size_less_than(Map1, Map2) -&gt;
    if
        map_size(Map1) &lt; map_size(Map2) -&gt; smaller;
        true -&gt; larger_or_equal
    end.
</code></pre>

<p>The BEAM code looks like this:</p>

<pre><code>    {gc_bif,map_size,{f,12},2,[{x,0}],{x,0}}.
    {gc_bif,map_size,{f,12},2,[{x,1}],{x,1}}.
    {test,is_lt,
          {f,12},
          [{tr,{x,0},{t_integer,{0,288230376151711743}}},
           {tr,{x,1},{t_integer,{0,288230376151711743}}}]}.
    {move,{atom,smaller},{x,0}}.
    return.
  {label,12}.
    {move,{atom,larger_or_equal},{x,0}}.
    return.
</code></pre>

<p>Both operands for <code>is_lt</code> now have the type
<code>{t_integer,{0,288230376151711743}}</code>, meaning an integer in the range
0 through 288230376151711743 (that is, <code>(1 bsl 58) - 1</code>). There is no
documented upper limit for the number of elements in a map, but for
the foreseeable future, there is no way that the number of elements in
a map will exceed or even get close to 288230376151711743.</p>

<p>Since both the lower and upper bounds for <code>{x,0}</code> and <code>{x,1}</code> fit in
60 bits, there is no need to test the type of the operands:</p>

<pre><code class="language-nasm"># is_lt_fss
    mov rsi, qword ptr [rbx+8]
    mov rdi, qword ptr [rbx]
# skipped test for small operands since they are always small
    cmp rdi, rsi
L42:
L43:
    jge label_12
</code></pre>

<p>Since the operands are always small, the call to the generic routine
(following label <code>L42</code>) has been omitted.</p>

<h3 id="simplification-of-addition">Simplification of addition</h3>

<p>Looking at arithmetic instructions, we will see the potential for nice
simplifications by the JIT, but unfortunately we will also see the
limitations of the type analysis done by the Erlang compiler in
OTP 25.</p>

<p>Let’s look at the generated code for this function:</p>

<pre><code class="language-erlang">add1(X, Y) -&gt;
    X + Y.
</code></pre>

<p>The BEAM code looks like this:</p>

<pre><code>    {gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
    return.
</code></pre>

<p>The JIT translates the <code>+</code> instruction to the following native instructions:</p>

<pre><code class="language-nasm"># i_plus_ssjd
    mov rsi, qword ptr [rbx]
    mov rdx, qword ptr [rbx+8]
# are both operands small?
    mov eax, esi
    and eax, edx
    and al, 15
    cmp al, 15
    short jne L15
# add with overflow check
    mov rax, rsi
    mov rcx, rdx
    and rcx, -16
    add rax, rcx
    short jno L14
L15:
    call 4328985696
L14:
    mov qword ptr [rbx], rax
</code></pre>

<p>The first two instructions:</p>

<pre><code class="language-nasm">    mov rsi, qword ptr [rbx]
    mov rdx, qword ptr [rbx+8]
</code></pre>
<p>load the operands for the <code>+</code> operation from the BEAM registers into CPU registers.</p>

<p>The next 5 instructions test for small operands:</p>

<pre><code class="language-nasm"># are both operands small?
    mov eax, esi
    and eax, edx
    and al, 15
    cmp al, 15
    short jne L15
</code></pre>

<p>The code is almost identical to the code in the <code>is_lt</code> instruction
that we examined earlier. The only difference is that other CPU
registers are used. If one or both of the operands are not small
integers, a jump is made to label <code>L15</code>, which looks like this:</p>

<pre><code class="language-nasm">L15:
    call 4328985696
</code></pre>

<p>This code calls a generic routine that can add any combination of
smalls, bignums, or floats. The generic routine also handles
non-number operands by raising a <code>badarith</code> exception.</p>

<p>If both operands indeed are smalls, the following code adds them and
checks for overflow:</p>

<pre><code class="language-nasm"># add with overflow check
    mov rax, rsi
    mov rcx, rdx
    and rcx, -16
    add rax, rcx
    short jno L14
</code></pre>

<p>If the addition overflowed, the generic addition routine is
called. Otherwise, control is transferred to the following
instruction:</p>

<pre><code class="language-nasm">    mov qword ptr [rbx], rax
</code></pre>

<p>which stores the result in <code>{x,0}</code>.</p>

<p>To summarize, the addition itself (including dealing with the <a href="http://www.it.uu.se/research/publications/reports/2000-029/2000-029-nc.pdf">tags</a>) requires
4 instructions. However, 10 more instructions are needed to:</p>

<ul>
  <li>Fetch the operands from BEAM registers.</li>
  <li>Check that the operands are small integers (at most 60 bits).</li>
  <li>Call the generic addition routine.</li>
  <li>Store the result to a BEAM register.</li>
</ul>

<p>Now let’s see what happens if types are introduced.</p>

<p>Consider:</p>

<pre><code class="language-erlang">add2(X0, Y0) -&gt;
    X = 2 * X0,
    Y = 2 * Y0,
    X + Y.
</code></pre>

<p>The BEAM code looks like:</p>

<pre><code>    {gc_bif,'*',{f,0},2,[{x,0},{integer,2}],{x,0}}.
    {gc_bif,'*',{f,0},2,[{x,1},{integer,2}],{x,1}}.
    {gc_bif,'+',{f,0},2,[{tr,{x,0},number},{tr,{x,1},number}],{x,0}}.
    return.
</code></pre>

<p>Types are propagated from arithmetic instructions to other arithmetic
instructions. Because the result of <code>*</code> (if it succeeds) is a number
(integer or float), the operands for the <code>+</code> instruction now have the
type <code>number</code>.</p>

<p>Based on our experience of adding types to the <code>&lt;</code> operator, we might
guess that we would save only one instruction in the type test. We
would be right:</p>

<pre><code class="language-nasm"># simplified test for small operands since both are numbers
    mov eax, esi
    and eax, edx
    test al, 1
    short je L22
</code></pre>

<p>Returning to the simpler example with addition and no multiplication,
let’s add a guard to ensure that <code>X</code> and <code>Y</code> are integers:</p>

<pre><code class="language-erlang">add3(X, Y) when is_integer(X), is_integer(Y) -&gt;
    X + Y.
</code></pre>

<p>That results in the following BEAM code:</p>

<pre><code>    {test,is_integer,{f,5},[{x,0}]}.
    {test,is_integer,{f,5},[{x,1}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
            {x,0}}.
    return.
</code></pre>

<p>The types for both operands are now <code>{t_integer,any}</code>. However, that
will still result in the same simplified four-instruction sequence for
testing small integers, because the integers might not fit in 60 bits.</p>

<p>Clearly, based on our experience with <code>is_lt</code>, we will need to establish
a range for <code>X</code> and <code>Y</code>. A reasonable way to do that would be:</p>

<pre><code class="language-erlang">add4(X, Y) when is_integer(X), 0 =&lt; X, X &lt; 16#400,
                is_integer(Y), 0 =&lt; Y, Y &lt; 16#400 -&gt;
    X + Y.
</code></pre>

<p>However, because of limitations in the compiler’s value range analysis,
the types for the <code>+</code> operator will <strong>not</strong> improve:</p>

<pre><code>    {test,is_integer,{f,19},[{x,0}]}.
    {test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
    {test,is_lt,{f,19},[{tr,{x,0},{t_integer,any}},{integer,1024}]}.
    {test,is_integer,{f,19},[{x,1}]}.
    {test,is_ge,{f,19},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
    {test,is_lt,{f,19},[{tr,{x,1},{t_integer,any}},{integer,1024}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
            {x,0}}.
    return.
</code></pre>

<p>To add insult to injury, the first 6 instructions cannot be simplified
by the JIT because there is not sufficient type information. That is,
the <code>is_lt</code> and <code>is_ge</code> instructions will comprise 11 instructions each.</p>

<p>We aim to improve the type analysis and optimizations in OTP 26 and
generate better code for this example. We are also considering adding
a new guard BIF in OTP 26 for testing that a term is an integer in a
given range.</p>

<p>In the meantime, while we wait for OTP 26, there is a way in
OTP 25 to write an equivalent guard that results in
much more efficient code <strong>and</strong> establishes known ranges for <code>X</code> and
<code>Y</code>:</p>

<pre><code class="language-erlang">add5(X, Y) when X =:= X band 16#3FF,
                Y =:= Y band 16#3FF -&gt;
    X + Y.
</code></pre>

<p>We are showing this way of writing guards for illustrative purposes
only; we don’t recommend rewriting your guards in this way.</p>

<p>The <code>band</code> operator fails unless both of its operands are integers, so
no <code>is_integer/1</code> test is needed. The <code>=:=</code> comparison returns
<code>false</code> if the corresponding variable is outside the range <code>0</code> through
<code>16#3FF</code>.</p>
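<p>To see the guard in action, here is a quick sanity check (assuming
<code>add5/2</code> from above is defined in the same module):</p>

<pre><code class="language-erlang">add5_examples() -&gt;
    3 = add5(1, 2),
    %% 16#400 (1024) is outside 0..16#3FF, so the guard fails:
    ok = try add5(1, 16#400) catch error:function_clause -&gt; ok end,
    %% `band` fails on non-integers, so the guard fails here too:
    ok = try add5(a, 2) catch error:function_clause -&gt; ok end.
</code></pre>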

<p>Writing the guard this way results in the following BEAM code, where
the compiler has been able to figure out the possible ranges for the
operands of the <code>+</code> operator:</p>

<pre><code>    {gc_bif,'band',{f,21},2,[{x,0},{integer,1023}],{x,2}}.
    {test,is_eq_exact,
          {f,21},
          [{tr,{x,0},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
    {gc_bif,'band',{f,21},2,[{x,1},{integer,1023}],{x,2}}.
    {test,is_eq_exact,
          {f,21},
          [{tr,{x,1},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
    {gc_bif,'+',
            {f,0},
            2,
            [{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
            {x,0}}.
    return.
</code></pre>

<p>Also, the 4 instructions that precede the <code>+</code> instruction are now
relatively efficient.</p>

<p>The <code>band</code> instruction needs to test the operands and be prepared to handle
integers that don’t fit in 60 bits:</p>

<pre><code class="language-nasm"># i_band_ssjd
    mov rsi, qword ptr [rbx]
    mov eax, 16383
# is the operand small?
    mov edi, esi
    and edi, 15
    cmp edi, 15
    short jne L97
    and rax, rsi
    short jmp L98
L97:
    call 4456532680
    short je label_25
L98:
    mov qword ptr [rbx+16], rax
</code></pre>

<p>The <code>is_eq_exact</code> instruction benefits from type information derived from
executing the <code>band</code> instruction. Since the right-hand side operand is known
to be a small integer that fits in a machine word, a simple comparison is
sufficient with no need for fallback code to handle other Erlang terms:</p>

<pre><code class="language-nasm"># is_eq_exact_fss
# simplified check since one argument is an immediate
    mov rdi, qword ptr [rbx+16]
    cmp qword ptr [rbx], rdi
    short jne label_25
</code></pre>

<p>The JIT generates the following code for the <code>+</code> operator:</p>

<pre><code class="language-nasm"># i_plus_ssjd
# add without overflow check
    mov rax, qword ptr [rbx]
    mov rsi, qword ptr [rbx+8]
    and rax, -16
    add rax, rsi
    mov qword ptr [rbx], rax
</code></pre>

<h3 id="simplifications-for-base64">Simplifications for <code>base64</code></h3>

<p>As far as we know, <code>base64</code> is the module in OTP that has benefited
the most from the improvements in OTP 25.</p>

<p>Here are benchmark results for a benchmark included in a <a href="https://github.com/erlang/otp/issues/5639">Github
issue</a>. First, the results
for OTP 24 on my computer:</p>

<pre><code>== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 19805 ms: 50 it/sec
fun base64:decode/1: 1000 iterations in 20075 ms: 49 it/sec
</code></pre>

<p>The results for OTP 25 on the same computer:</p>

<pre><code>== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 16024 ms: 62 it/sec
fun base64:decode/1: 1000 iterations in 18306 ms: 54 it/sec
</code></pre>

<p>In OTP 25, the encoding is done in 80 percent of the time that OTP 24 needs.
Decoding is also more than a second faster.</p>

<p>The <code>base64</code> module has not been modified in OTP 25, so the improvements
are entirely down to improvements in the compiler and the JIT.</p>

<p>Here is the clause of <code>encode_binary/2</code> in the <code>base64</code> module that does
most of the work of encoding a binary to Base64:</p>

<pre><code class="language-erlang">encode_binary(&lt;&lt;B1:8, B2:8, B3:8, Ls/bits&gt;&gt;, A) -&gt;
    BB = (B1 bsl 16) bor (B2 bsl 8) bor B3,
    encode_binary(Ls,
                  &lt;&lt;A/bits,(b64e(BB bsr 18)):8,
                    (b64e((BB bsr 12) band 63)):8,
                    (b64e((BB bsr 6) band 63)):8,
                    (b64e(BB band 63)):8&gt;&gt;).
</code></pre>

<p>The binary matching in the function head establishes ranges for
the variables <code>B1</code>, <code>B2</code>, and <code>B3</code>. (The types for all three variables
will be <code>{t_integer,{0,255}}</code>.)</p>

<p>Because of the ranges, all of the <code>bsl</code>, <code>bsr</code>, <code>band</code>, and <code>bor</code>
operations that follow do not need any type checks. Also, in the
creation of the binary, there is no need to test whether the binary
creation succeeded because all values are known to be small integers.</p>

<p>The 4 calls to the <code>b64e/1</code> function are inlined. The function
looks like this:</p>

<pre><code class="language-erlang">-compile({inline, [{b64e, 1}]}).
b64e(X) -&gt;
    element(X+1,
	    {$A, $B, $C, $D, $E, $F, $G, $H, $I, $J, $K, $L, $M, $N,
	     $O, $P, $Q, $R, $S, $T, $U, $V, $W, $X, $Y, $Z,
	     $a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n,
	     $o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z,
	     $0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $+, $/}).
</code></pre>

<p>In OTP 25, the JIT will optimize calls to <code>element/2</code> where the
position argument is an integer and the tuple argument is a literal
tuple. For the way <code>element/2</code> is used in <code>b64e/1</code>, all type tests
and range checks will be removed:</p>

<pre><code class="language-nasm"># bif_element_jssd
# skipped tuple test since source is always a literal tuple
L302:
    long mov rsi, 9223372036854775807
    mov rdi, qword ptr [rbx+24]
    lea rcx, qword ptr [rsi-2]
# skipped test for small position since it is always small
    mov rax, rdi
    sar rax, 4
# skipped check for position =:= 0 since it is always &gt;= 1
# skipped check for negative position and position beyond tuple
    mov rax, qword ptr [rcx+rax*8]
L300:
L301:
    mov qword ptr [rbx+24], rax
</code></pre>

<p>That is 7 instructions with no conditional branches.</p>

<h3 id="please-try-this-at-home">Please try this at home!</h3>

<p>If you want to follow along and examine the native code for loaded
modules, start the runtime system like this:</p>

<pre><code class="language-bash">erl +JDdump true
</code></pre>

<p>The native code for all modules that are loaded will be dumped to files with the
extension <code>.asm</code>.</p>

<p>To find code that has been simplified by the JIT, use this command:</p>

<pre><code class="language-bash">egrep "simplified|skipped|without overflow" *.asm
</code></pre>

<p>To examine the BEAM code for a module, use the <code>-S</code> option. For example:</p>

<pre><code class="language-bash">erlc -S base64.erl
</code></pre>

<h3 id="pull-requests">Pull requests</h3>

<p>Here are the main pull requests that implement type-based optimizations:</p>

<ul>
  <li><a href="https://github.com/erlang/otp/pull/5316">jit: Optimize instructions based on operand types</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5664">JIT: Strengthen type-based optimizations</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5688">Further strengthen the type-based optimizations</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5727">jit: Fix integer ranges</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5849">JIT: Optimize bsl and bxor with known small operands</a></li>
  <li><a href="https://github.com/erlang/otp/pull/5855">Compiler: Improve bounds calculation for bitwise operators</a></li>
</ul>]]></content><author><name>Björn Gustavsson</name></author><category term="BEAM" /><category term="JIT" /><summary type="html"><![CDATA[This post explores the new type-based optimizations in Erlang/OTP 25 where the compiler embeds type information in the BEAM files to help the JIT (Just-In-Time compiler) to generate better code.]]></summary></entry><entry><title type="html">The Many-to-One Parallel Signal Sending Optimization</title><link href="https://www.erlang.org/blog/parallel-signal-sending-optimization/" rel="alternate" type="text/html" title="The Many-to-One Parallel Signal Sending Optimization" /><published>2021-11-05T00:00:00+00:00</published><updated>2021-11-05T00:00:00+00:00</updated><id>https://www.erlang.org/blog/parallel-signal-sending-optimization</id><content type="html" xml:base="https://www.erlang.org/blog/parallel-signal-sending-optimization/"><![CDATA[<p>This blog post discusses <a href="https://github.com/erlang/otp/pull/5020">the parallel signal sending
optimization</a> that recently got merged into the
master branch (scheduled to be included in Erlang/OTP 25). The
optimization improves signal sending throughput when several processes
send signals to a single process simultaneously on multicore
machines. At the moment, the optimization is only active when one
configures the receiving process with the <code>{message_queue_data,
off_heap}</code> <a href="https://erlang.org/doc/man/erlang.html#spawn_opt-4">setting</a>. The following figure gives an
idea of what type of scalability improvement the optimization can give
in extreme scenarios (number of Erlang processes sending signals on
the x-axis and throughput on the y-axis):</p>

<p><img src="/blog/images/parallel_siq_q/benchmark_peek.png" alt="alt text" title="Send Benchmark Result Peek" /></p>

<p>This blog post aims to give you an understanding of how signal sending
on a single node is implemented in Erlang and how the new optimization
can yield the impressive scalability improvement illustrated in the
figure above. Let us begin with a brief introduction to what Erlang
signals are.</p>

<h2 id="erlang-signals">Erlang Signals</h2>

<p>All concurrently executing entities (processes, ports, etc.)  in an
Erlang system <a href="https://erlang.org/doc/apps/erts/communication.html">communicate using asynchronous signals</a>. The
most common signal is normal messages that are typically sent between
processes with the bang (!) operator. As Erlang takes pride in being a
concurrent programming language, it is, of course, essential that
signals are sent efficiently between different entities. Let us now
discuss what guarantees Erlang programmers get about signal sending
ordering, as this will help when learning how the new optimization works.</p>

<h3 id="the-signal-ordering-guarantee">The Signal Ordering Guarantee</h3>

<p>The signal ordering guarantee is described in the <a href="https://erlang.org/doc/reference_manual/processes.html#signal-delivery">Erlang
documentation like this</a>:</p>

<blockquote>
  <p>“The only signal ordering guarantee given is the following: if an
entity sends multiple signals to the same destination entity, the
order is preserved; that is, if <code>A</code> sends a signal <code>S1</code> to <code>B</code>, and later
sends signal <code>S2</code> to <code>B</code>, <code>S1</code> is guaranteed not to arrive after <code>S2</code>.”</p>
</blockquote>

<p>This guarantee means that if multiple processes send signals to a
single process, all signals from the same process are received in the
send order in the receiving process. Still, there is no ordering
guarantee for two signals coming from two distinct processes. One
should not think about signal sending as instantaneous. There can be
an arbitrary delay after a signal has been sent until it has reached
its destination, but all signals from <code>A</code> to <code>B</code> travel on the same path
and cannot pass each other.</p>
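<p>A minimal sketch of what the guarantee means in practice: with two
messages from the same sender, a plain receive always sees <code>s1</code>
first, no matter how long delivery takes:</p>

<pre><code class="language-erlang">ordering_demo() -&gt;
    Main = self(),
    Receiver = spawn(fun() -&gt;
                             First = receive Msg -&gt; Msg end,
                             Main ! {first_received, First}
                     end),
    Receiver ! s1,
    Receiver ! s2,
    receive {first_received, First} -&gt; First end.  %% always returns s1
</code></pre>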

<p>The guarantee has deliberately been designed to allow for efficient
implementations and allow for future optimizations. However, as we
will see in the next section, before the optimization presented in
this blog post, the implementation did not take advantage of the
permissive ordering guarantee for signals sent between processes
running on the same node.</p>

<h3 id="single-node-process-to-process-implementation-before-the-optimization">Single-Node Process-to-Process Implementation before the Optimization</h3>

<p>Conceptually, the Erlang VM organized the data structure for an Erlang
process as in the following figure before the optimization:</p>

<p><img src="/blog/images/parallel_siq_q/before_process_struct.png" alt="alt text" title="Process struct before optimization" /></p>

<p>Of course, this is an extreme simplification of the Erlang process
structure, but it is enough for our explanation. When a process has
the <code>{message_queue_data, off_heap}</code> setting activated, the following
algorithm is executed to send a signal:</p>

<ol>
  <li>Allocate a new linked list node containing the signal data</li>
  <li>Acquire the <code>OuterSignalQueueLock</code> in the receiving process</li>
  <li>Insert the new node at the end of the <code>OuterSignalQueue</code></li>
  <li>Release the <code>OuterSignalQueueLock</code></li>
</ol>

<p>When a receiving process has run out of signals in its
<code>InnerSignalQueue</code> and/or wants to check if there are more signals in
the outer queue, the following algorithm is executed:</p>

<ol>
  <li>Acquire the <code>OuterSignalQueueLock</code></li>
  <li>Append the <code>OuterSignalQueue</code> at the end of the <code>InnerSignalQueue</code></li>
  <li>Release the <code>OuterSignalQueueLock</code></li>
</ol>

<p>How signal sending works when the receiving process is configured with
<code>{message_queue_data, on_heap}</code> is not so relevant for the main topic
of this blog post. Still, understanding how <code>{message_queue_data,
on_heap}</code> works will also give you an understanding of why the parallel
signal queue optimization is not enabled when a process is configured
with <code>{message_queue_data, on_heap}</code> (which is the default setting),
so here is the algorithm for sending a signal to such a process:</p>

<ol>
  <li>Try to acquire the <code>MainProcessLock</code> with a <code>try_lock</code> call
    <ul>
      <li>If the <code>try_lock</code> call succeeded:
        <ol>
          <li>Allocate space for the signal data on the process’ main heap
area and copy the signal data there</li>
          <li>Allocate a linked list node containing a pointer to the
process heap-allocated signal data</li>
          <li>Acquire the <code>OuterSignalQueueLock</code></li>
          <li>Insert the linked list node at the end of the
<code>OuterSignalQueue</code></li>
          <li>Release the <code>OuterSignalQueueLock</code></li>
          <li>Release the <code>MainProcessLock</code></li>
        </ol>
      </li>
      <li>Else:
        <ol>
          <li>Allocate a new linked list node containing the signal data</li>
          <li>Acquire the <code>OuterSignalQueueLock</code></li>
          <li>Insert the new node at the end of the <code>OuterSignalQueue</code></li>
          <li>Release the <code>OuterSignalQueueLock</code></li>
        </ol>
      </li>
    </ul>
  </li>
</ol>

<p>The advantage of <code>{message_queue_data, on_heap}</code> compared to
<code>{message_queue_data, off_heap}</code> is that the signal data is copied
directly to the receiving process’ main heap (when the <code>try_lock</code> call
for the <code>MainProcessLock</code> succeeds). The disadvantage of
<code>{message_queue_data, on_heap}</code> is that the sender creates extra
contention on the receiver’s <code>MainProcessLock</code>. Notice that we cannot
simply release the <code>MainProcessLock</code> directly after allocating the
data on the receiver’s process heap. If a garbage collection happens
before the signal has been inserted into the process’ heap, the
signal data would be lost (holding the <code>MainProcessLock</code> prevents a
garbage collection from happening). Therefore, <code>{message_queue_data,
off_heap}</code> provides much better scalability than <code>{message_queue_data,
on_heap}</code> when multiple processes send signals to the same process
concurrently on a multicore system.</p>
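<p>For reference, the setting under discussion is an ordinary
<code>spawn_opt</code> option; for example, a receiver with off-heap message
queue data can be started like this:</p>

<pre><code class="language-erlang">start_receiver() -&gt;
    spawn_opt(fun receive_loop/0, [{message_queue_data, off_heap}]).

receive_loop() -&gt;
    receive
        _Any -&gt; receive_loop()
    end.
</code></pre>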

<p>However, even though <code>{message_queue_data, off_heap}</code> scales better
than <code>{message_queue_data, on_heap}</code> with the old implementation,
signal senders still had to acquire the <code>OuterSignalQueueLock</code> for a
short time. This lock can become a scalability bottleneck and a
contended hot-spot when there are enough parallel senders. This is why
we saw very poor scalability and even a slowdown for the old
implementation in the benchmark figure above. Now, we are ready to
look at the new optimization.</p>

<h2 id="the-parallel-signal-sending-optimization">The Parallel Signal Sending Optimization</h2>

<p>The optimization takes advantage of Erlang’s permissive signal
ordering guarantee discussed above. It is enough to keep the order of
signals coming from the same entity to ensure that the signal ordering
guarantee holds. So there is no need for different senders to
synchronize with each other! In theory, signal sending could therefore
be parallelized perfectly. In practice, however, there is only one
thread of execution that handles incoming signals, so we also have to
keep in mind that we don’t want to slow down the receiver and ideally
make receiving signals faster. As signal queue data is stored outside
the process main heap area when the <code>{message_queue_data, off_heap}</code>
setting is enabled, the garbage collector does not need to go through
the whole signal queue, giving better performance for processes with a
lot of signals in their signal queue. Therefore, it is also important
for the optimization not to add unnecessary overhead when the
<code>OuterSignalQueueLock</code> is uncontended, so that we do not slow down
existing use cases for <code>{message_queue_data, off_heap}</code> too much.</p>

<h3 id="data-structure-and-birds-eye-view-of-optimized-implementation">Data Structure and Birds-Eye-View of Optimized Implementation</h3>

<p>We decided to go for a design that enables the parallel signal sending
optimization on demand when the contention on the <code>OuterSignalQueueLock</code>
seems to be high to avoid as much overhead as possible when the
optimization is unnecessary. Here is a conceptual view of the process
structure when the optimization is not active (which is the initial
state when creating a process with <code>{message_queue_data, off_heap}</code>):</p>

<p><img src="/blog/images/parallel_siq_q/after_opt_not_active_process_struct.png" alt="alt text" title="Process struct after optimization but when the optimization is inactive" /></p>

<p>The following figure shows a conceptual view of the process structure
when the parallel signal sending optimization is turned on. The only
difference between this and the previous figure is that the
<code>OuterSignalQueueBufferArray</code> field now points to a structure
containing an array with buffers.</p>

<p><img src="/blog/images/parallel_siq_q/after_opt_active_process_sturct.png" alt="alt text" title="Process struct after optimization when the optimization is active" /></p>

<p>When the parallel signal sending optimization is active, senders do
not need to acquire the <code>OuterSignalQueueLock</code> anymore. Senders are
mapped to a slot in the <code>OuterSignalQueueBufferArray</code> by a simple hash
function that is applied to the process ID (senders without a process
ID are currently mapped to the same slot). Before a sender takes the
<code>OuterSignalQueueLock</code> in the receiving process’ structure, the sender
tries to enqueue in its slot in the <code>OuterSignalQueueBufferArray</code> (if
it exists). If the enqueue attempt succeeds, the sender can continue
without even touching the <code>OuterSignalQueueLock</code>! The order of signals
coming from the same sender is maintained because the same sender is
always mapped to the same slot in the buffer array. Now, you have
probably got an idea of why the signal sending throughput can increase
so much with the new optimization, as we saw in the benchmark figure
presented earlier. Essentially, the contention on the
<code>OuterSignalQueueLock</code> gets distributed among the slots in the
<code>OuterSignalQueueBufferArray</code>. The rest of the subsections in this
section cover details of the implementation, so you can skip those
if you do not want to dig deeper.</p>
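<p>As a hypothetical Erlang model of the sender-to-slot mapping (the
real mapping is internal C code in the VM), think of it as hashing the
sender’s process ID into one of the buffer slots:</p>

<pre><code class="language-erlang">%% Illustrative only: map a sender to one of 64 buffer slots.
slot_for(SenderPid) when is_pid(SenderPid) -&gt;
    erlang:phash2(SenderPid, 64).
</code></pre>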

<h3 id="adaptively-activating-the-outer-signal-queue-buffers">Adaptively Activating the Outer Signal Queue Buffers</h3>

<p>As the figure above tries to illustrate, the <code>OuterSignalQueueLock</code> carries
a statistics counter. When that statistics counter reaches a certain
threshold, the new parallel signal sending optimization is activated
by installing the <code>OuterSignalQueueBufferArray</code> in the process
structure. The statistics counter for the lock is updated in a simple
way. When a thread tries to acquire the <code>OuterSignalQueueLock</code> and the lock
is already taken, the counter is increased, and otherwise, it is
decreased, as the following code snippet illustrates:</p>

<pre><code class="language-c">void erts_proc_sig_queue_lock(Process* proc)
{
    if (EBUSY == erts_proc_trylock(proc, ERTS_PROC_LOCK_MSGQ)) {
        erts_proc_lock(proc, ERTS_PROC_LOCK_MSGQ);
        proc-&gt;sig_inq_contention_counter += 1;
    } else if(proc-&gt;sig_inq_contention_counter &gt; 0) {
        proc-&gt;sig_inq_contention_counter -= 1;
    }
}
</code></pre>

<h3 id="the-outer-signal-queue-buffer-array-structure">The Outer Signal Queue Buffer Array Structure</h3>

<p>Currently, the number of slots in the <code>OuterSignalQueueBufferArray</code> is
fixed at 64. Sixty-four slots should go a long way to reduce signal
queue contention in most practical applications that exist today. Few
servers have more than 100 cores, and typical applications spend a lot
of time doing other things than sending signals. Using 64 slots also
allows us to implement a very efficient atomically updatable bitset
containing information about which slots are currently non-empty (the
<code>NonEmptySlots</code> field in the figure above). This bitset makes flushing
the buffer array into the <code>OuterSignalQueue</code> more efficient
since only the non-empty slots in the buffer array need to be visited
and updated to perform the flush.</p>

<h3 id="sending-signals-with-the-optimization-activated">Sending Signals with the Optimization Activated</h3>

<p>Pseudo-code for the algorithm that is executed when a process is
sending a signal to another process that has the
<code>OuterSignalQueueBufferArray</code> installed can be seen below:</p>

<ol>
  <li>Allocate a new linked list node containing the signal data</li>
  <li>Map the process ID of the sender to the right slot <code>I</code> with the hash function</li>
  <li>Acquire the <code>SlotLock</code> for the slot <code>I</code></li>
  <li>Check the <code>IsAlive</code> field for slot <code>I</code>
    <ul>
      <li>If the <code>IsAlive</code> field’s value is <code>true</code>:
        <ol>
          <li>Set the appropriate bit in the <code>NonEmptySlots</code> field, if the buffer is empty</li>
          <li>Insert the allocated signal node at the end of the <code>BufferQueue</code> for slot <code>I</code></li>
          <li>Increase the <code>NumberOfEnqueues</code> in slot <code>I</code> by 1</li>
          <li>Release <code>SlotLock</code> for slot <code>I</code></li>
          <li>The signal is enqueued, and the thread can continue with the next task</li>
        </ol>
      </li>
      <li>Else (the <code>OuterSignalQueueBufferArray</code> has been deactivated):
        <ol>
          <li>Release the lock for slot <code>I</code></li>
          <li>Do the insert into the <code>OuterSignalQueue</code> in the same way as
the signal sending algorithm did it prior to the optimization</li>
        </ol>
      </li>
    </ul>
  </li>
</ol>

<h3 id="fetching-signals-from-the-outer-signal-queue-buffer-array-and-deactivation-of-the-optimization">Fetching Signals from the Outer Signal Queue Buffer Array and Deactivation of the Optimization</h3>

<p>The algorithm for fetching signals from the outer signal queue uses
the <code>NonEmptySlots</code> field in the <code>OuterSignalQueueBufferArray</code>, so it
only needs to check slots that are guaranteed to be non-empty. At a
high level, the routine works according to the following pseudo-code:</p>

<ol>
  <li>Acquire the <code>OuterSignalQueueLock</code></li>
  <li>For each non-empty slot in the buffer array:
    <ol>
      <li>Lock the slot</li>
      <li>Append the signals in the slot to the end of <code>OuterSignalQueue</code></li>
      <li>Add the value of the slot’s <code>NumberOfEnqueues</code> field to the
<code>TotNumberOfEnqueues</code> field in the <code>OuterSignalQueueBufferArray</code></li>
      <li>Reset the slot’s <code>BufferQueue</code> and <code>NumberOfEnqueues</code> fields</li>
      <li>Unlock the slot</li>
    </ol>
  </li>
  <li>Increase the value of the <code>NumberOfFlushes</code> field in the
<code>OuterSignalQueueBufferArray</code> by one</li>
  <li>If the value of the <code>NumberOfFlushes</code> field has reached a certain
threshold <code>T</code>:
    <ul>
      <li>Calculate the average number of enqueues per flush
(<code>EnqPerFlush</code>) during the last <code>T</code> flushes
(<code>TotNumberOfEnqueues</code> / <code>T</code>).
        <ul>
          <li>If <code>EnqPerFlush</code> is below a certain threshold <code>Q</code>:
            <ul>
              <li>Deactivate the parallel signal sending optimization:
                <ol>
                  <li>For each slot in the <code>OuterSignalQueueBufferArray</code>:
                    <ol>
                      <li>Acquire the <code>SlotLock</code></li>
                      <li>Append the signals in the slot (if any) to the end of <code>OuterSignalQueue</code></li>
                      <li>Set the slot’s <code>IsAlive</code> field to <code>false</code></li>
                      <li>Release the <code>SlotLock</code></li>
                    </ol>
                  </li>
                  <li>Set the <code>OuterSignalQueueBufferArray</code> field in the process
structure to <code>NULL</code></li>
                  <li>Schedule deallocation of the buffer array structure</li>
                </ol>
              </li>
            </ul>
          </li>
          <li>Else if the average is equal to or above the threshold <code>Q</code>:
            <ul>
              <li>Set the <code>NumberOfFlushes</code> and the <code>TotNumberOfEnqueues</code>
fields in the buffer array struct to 0</li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
  </li>
  <li>Append the <code>OuterSignalQueue</code> to the end of the <code>InnerSignalQueue</code></li>
  <li>Reset the <code>OuterSignalQueue</code></li>
  <li>Release the <code>OuterSignalQueueLock</code></li>
</ol>

<p>For simplicity, many details have been left out from the pseudo-code
snippets above. However, if you have understood them, you have an
excellent understanding of how signal sending in Erlang works, how the
new optimization is implemented, and how it automatically activates
and deactivates itself. Let us now dive a little bit deeper into
benchmark results for the new implementation.</p>

<h2 id="benchmark">Benchmark</h2>

<p>A configurable benchmark to measure the performance of both signal
sending processes and receiving processes has been created. The
benchmark lets <code>N</code> Erlang processes send signals (of configurable types
and sizes) to a single process during a period of <code>T</code> seconds. Both <code>N</code>
and <code>T</code> are configurable variables. A signal with size <code>S</code> has a payload
consisting of a list of length <code>S</code> with word-sized (64 bits) items. The
send throughput is calculated by dividing the number of signals that
are sent by <code>T</code>. The receive throughput is calculated by waiting until
all sent signals have been received and then dividing the total number
of signals sent by the time between when the first signal was sent and
when the last signal was received. The benchmark machine has 32 cores
and two hardware threads per core (giving 64 hardware threads). You
can find a detailed benchmark description on the <a href="http://winsh.me/bench/erlang_sig_q/sigq_bench_result.html">signal queue
benchmark page</a>.</p>
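<p>For a feel of the benchmark’s structure, here is a much-simplified
sketch (not the actual benchmark code; the module and function names
are made up):</p>

<pre><code class="language-erlang">-module(sigq_bench_sketch).
-export([run/3]).

%% N senders send Payload to a single off-heap receiver for TimeMs
%% milliseconds; returns how many messages the receiver consumed.
run(N, TimeMs, Payload) -&gt;
    Main = self(),
    Receiver = spawn_opt(fun() -&gt; recv_loop(0, Main) end,
                         [{message_queue_data, off_heap}]),
    Senders = [spawn(fun() -&gt; send_loop(Receiver, Payload) end)
               || _ &lt;- lists:seq(1, N)],
    timer:sleep(TimeMs),
    [exit(Sender, kill) || Sender &lt;- Senders],
    Receiver ! stop,
    receive {count, Count} -&gt; Count end.

recv_loop(Count, Main) -&gt;
    receive
        stop -&gt; Main ! {count, Count};
        _Msg -&gt; recv_loop(Count + 1, Main)
    end.

send_loop(Receiver, Payload) -&gt;
    Receiver ! Payload,
    send_loop(Receiver, Payload).
</code></pre>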

<p>First, let us look at the results for very small messages (a list
containing a single integer) below. The graph for the receive
throughput is the same as we saw at the beginning of this blog post. Not
surprisingly, the scalability for sending messages is much better
after the optimization. More surprising is that the performance of
receiving messages is also substantially improved. For example, with
16 processes, the receive throughput is 520 times better with the
optimization! The improved receive throughput can be explained by the
fact that in this scenario, the receiver has to fetch messages from
the outer signal queue much less often. Sending is much faster
after the optimization, so the receiver will bring more messages from
the outer signal queue to the inner queue every time it runs out of
messages. The receiver can thus process messages from the inner queue
for a longer time before it needs to fetch messages from the outer
queue again. We cannot expect any improvement for the receiver beyond
a certain point, as there is only a single hardware thread that can
work on processing messages at the same time.</p>

<p><img src="/blog/images/parallel_siq_q/small_msg_send_receive_throughput.png" alt="alt text" title="Small Messages Benchmark Result" /></p>

<p>Below are the results for larger messages (a list containing 100
integers). We do not get as large an improvement in this scenario with
the larger message size. With larger messages, the benchmark spends more
time doing other work than sending and receiving messages. Things like
the speed of the memory system and memory allocation might become
limiting factors. Still, we get decent improvement both in the send
throughput and receive throughput, as seen below.</p>

<p><img src="/blog/images/parallel_siq_q/large_msg_send_receive_throughput.png" alt="alt text" title="Large Messages Benchmark Result" /></p>

<p>You can find results for even larger messages as well as for
non-message signals on the <a href="http://winsh.me/bench/erlang_sig_q/sigq_bench_result.html">benchmark page</a>. Real
Erlang applications do much more than message and signal sending, so
this benchmark is, of course, not representative of what kind of
improvements real applications will get. However, the benchmarks show
that we have pushed the threshold for when parallel message sending to
a single process becomes a problem. Perhaps the new optimization opens
up new interesting ways of writing software that was impractical due
to previous performance reasons.</p>

<h2 id="possible-future-work">Possible Future Work</h2>

<p>Users can configure processes with <code>{message_queue_data, off_heap}</code> or
<code>{message_queue_data, on_heap}</code>. This configurability increases the
burden for Erlang programmers as it can be difficult to figure out
which one is better for a particular process. It would therefore make
sense also to have a <code>{message_queue_data, auto}</code> option that would
automatically detect lock contention even in <code>on_heap</code> mode and
seamlessly switch between <code>on_heap</code> and <code>off_heap</code> based on how much
contention is detected.</p>

<p>As discussed previously, 64 slots in the signal queue buffer array is
a good start but might not be enough when servers have thousands of
cores. A possible way to make the implementation even more scalable
would be to make the signal queue buffer array expandable. For
example, one could have contention detecting locks for each slot in
the array. If the contention is high in a particular slot, one could
expand this slot by creating a link to a subarray with buffers where
senders can use another hash function (similar to how the <a href="https://en.wikipedia.org/wiki/Hash_array_mapped_trie">HAMT data
structure</a> works).</p>

<h2 id="conclusion">Conclusion</h2>

<p>The new parallel signal queue optimization that affects processes
configured with <code>{message_queue_data, off_heap}</code> yields much better
scalability when multiple processes send signals to the same process
in parallel. The optimization has a very low overhead when the
contention is low as it is only activated when its contention
detection mechanism indicates that the contention is high.</p>]]></content><author><name>Kjell Winblad</name></author><category term="message," /><category term="signal," /><category term="signal" /><category term="queue," /><category term="message" /><category term="queue," /><category term="parallel" /><summary type="html"><![CDATA[This blog post discusses the parallel signal sending optimization that recently got merged into the master branch (scheduled to be included in Erlang/OTP 25). The optimization improves signal sending throughput when several processes send signals to a single process simultaneously on multicore machines. At the moment, the optimization is only active when one configures the receiving process with the {message_queue_data, off_heap} setting. The following figure gives an idea of what type of scalability improvement the optimization can give in extreme scenarios (number of Erlang processes sending signals on the x-axis and throughput on the y-axis):]]></summary></entry><entry><title type="html">Decentralized ETS Counters for Better Scalability</title><link href="https://www.erlang.org/blog/scalable-ets-counters/" rel="alternate" type="text/html" title="Decentralized ETS Counters for Better Scalability" /><published>2021-08-03T00:00:00+00:00</published><updated>2021-08-03T00:00:00+00:00</updated><id>https://www.erlang.org/blog/scalable-ets-counters</id><content type="html" xml:base="https://www.erlang.org/blog/scalable-ets-counters/"><![CDATA[<p>A shared <a href="https://erlang.org/doc/man/ets.html">Erlang Term Storage
(ETS)</a> table is often an
excellent place to store data that is frequently updated and read by
multiple Erlang processes. ETS provides key-value stores to
Erlang processes. When the
<a href="https://erlang.org/doc/man/ets.html#new-2">write_concurrency</a> option
is activated, ETS tables use fine-grained locking
internally. Therefore, a scenario where multiple processes insert and
remove different items in an ETS table should scale well with the
number of utilized cores. However, in practice the scalability
for such scenarios is not yet perfect. This blog post will explore
how the <code>decentralized_counters</code> option brings us one step closer to
perfect scalability.</p>

<p>The ETS table option
<a href="https://erlang.org/doc/man/ets.html#new-2"><code>decentralized_counters</code></a>
(introduced in Erlang/OTP 22 for <code>ordered_set</code> tables and in
Erlang/OTP 23 for the other table types) can make the scalability much
better. A table with <code>decentralized_counters</code> activated uses
decentralized counters instead of centralized counters to track the
number of items in the table and the memory
consumption. Unfortunately, for tables with <code>decentralized_counters</code>
activated, the operations that get the table size and memory usage
(<a href="https://erlang.org/doc/man/ets.html#info-2"><code>ets:info(Table, size)</code></a>
and
<a href="https://erlang.org/doc/man/ets.html#info-2"><code>ets:info(Table, memory)</code></a>)
are slow, so whether it is beneficial to turn
<code>decentralized_counters</code> on or off depends on your use
case. This blog post will give you a better understanding of when one
should activate the <code>decentralized_counters</code> option and of how
the decentralized counters work.</p>
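<p>For concreteness, here is a small example (table and key names
assumed) of creating a table with <code>decentralized_counters</code> and
calling the two operations that become slow:</p>

<pre><code class="language-erlang">%% The option is typically combined with write_concurrency, which is
%% what makes the centralized counters a bottleneck in the first place.
T = ets:new(example, [set, public,
                      {write_concurrency, true},
                      {decentralized_counters, true}]),
true = ets:insert(T, {some_key, some_value}),
1 = ets:info(T, size),      %% slow: must sum the decentralized counter
Mem = ets:info(T, memory).  %% likewise slow
</code></pre>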

<h2 id="scalability-with-decentralized-ets-counters">Scalability with Decentralized ETS Counters</h2>

<p>The following figure shows the throughput (operations/second) achieved
when processes are doing inserts (<code>ets:insert/2</code>) and deletes
(<code>ets:delete/2</code>) in an ETS table of the <code>set</code> type on a machine with
64 hardware threads, both when the <code>decentralized_counters</code> option is
activated and when it is deactivated. The table types <code>bag</code> and
<code>duplicate_bag</code> have similar scalability behavior, as their
implementation is based on the same hash table.</p>

<p><img src="/blog/images/ets_scalable_counters/bench_set_50_ins_50_del_nospread.png" alt="alt text" title="Throughput of inserts and deletes on a table of type set with and without the decentralized_counters activated" /></p>

<p>The following figure shows the results for the same benchmark but with
a table of type <code>ordered_set</code>:</p>

<p><img src="/blog/images/ets_scalable_counters/bench_ordset_50_ins_50_del_nospread.png" alt="alt text" title="Throughput of inserts and deletes on a table of type ordered_set with and without the decentralized_counters activated" /></p>

<p>The interested reader can find more information about the benchmark at
the <a href="http://winsh.me/ets_catree_benchmark/decent_ctrs_hash.html">benchmark website for
<code>decentralized_counters</code></a>. The
benchmark results above show that both <code>set</code> and <code>ordered_set</code> tables
get a significant scalability boost when the <code>decentralized_counters</code>
option is activated. The <code>ordered_set</code> type receives a more
substantial scalability improvement than the <code>set</code> type. Tables of the
<code>set</code> type have a fixed number of locks for the hash table buckets. The
<code>ordered_set</code> table type is implemented with a <a href="https://doi.org/10.1016/j.jpdc.2017.11.007">contention adapting
search tree</a> that
dynamically changes the locking granularity based on how much
contention is detected. This implementation difference explains the
difference in scalability between <code>set</code> and <code>ordered_set</code>. Details
about the <code>ordered_set</code> implementation can be found in an
<a href="/blog/the-new-scalable-ets-ordered_set/">earlier blog post</a>.</p>

<p>It is also worth noting that the Erlang VM that ran the benchmarks was
compiled with the configure option “<code>./configure
--with-ets-write-concurrency-locks=256</code>”, which changes the number
of locks for hash-based ETS tables from the current default of 64 to
256 (currently the maximum value for this option). Changing the
implementation of the hash-based tables so that the number of locks can
be set per table instance, or so that the lock granularity is adjusted
automatically, seems like an excellent future improvement, but that is
not what this blog post is about.</p>

<p>A centralized counter consists of a single memory word that is
incremented and decremented with atomic instructions. The problem with
a centralized counter is that modifications of the counter
by multiple cores are serialized. This problem is amplified because
frequent modifications of a single memory word by multiple cores cause
a lot of expensive traffic in the <a href="https://en.wikipedia.org/wiki/Cache_coherence">cache
coherence</a>
system. However, reading from a centralized counter is quite efficient
as the reader only has to read a single memory word.</p>
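<p>Erlang exposes this kind of centralized counter to Erlang code
through the
<a href="https://erlang.org/doc/man/atomics.html"><code>atomics</code></a>
module, which makes the trade-off easy to see:</p>

<pre><code class="language-erlang">%% A centralized counter: one 64-bit word updated with atomic
%% instructions. Updates from many cores are serialized, but reading
%% the counter is a single cheap memory load.
Ref = atomics:new(1, [{signed, true}]),
ok = atomics:add(Ref, 1, 1),
Value = atomics:get(Ref, 1).   %% 1
</code></pre>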

<p>When designing the decentralized counters for ETS, we have tried to
optimize for update performance and scalability, as most applications
rarely need to get the size of an ETS table. However, since
there may be applications out in the wild that frequently call
<a href="https://erlang.org/doc/man/ets.html#info-2"><code>ets:info(Table, size)</code></a>
and <a href="https://erlang.org/doc/man/ets.html#info-2"><code>ets:info(Table, memory)</code></a>,
we have chosen to make decentralized counters optional.</p>
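<p>The same trade-off is visible in the standalone
<a href="https://erlang.org/doc/man/counters.html"><code>counters</code></a>
module, where the <code>write_concurrency</code> option selects a
decentralized representation:</p>

<pre><code class="language-erlang">%% Updates scale with the number of schedulers, but reads have to
%% visit several memory words and are not atomic with respect to
%% concurrent updates.
C = counters:new(1, [write_concurrency]),
ok = counters:add(C, 1, 1),
Val = counters:get(C, 1).
</code></pre>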

<p>Another thing worth keeping in mind is that hash-based tables that
use decentralized counters tend to use slightly more hash table buckets
than the corresponding tables without decentralized counters. The
reason is that, with decentralized counters activated, the resizing
decision is based on an estimate of the number of items in the table
rather than an exact count, and the resizing heuristics trigger an
increase of the number of buckets more eagerly than a decrease.</p>

<h2 id="implementation">Implementation</h2>

<p>You will now learn how the decentralized counters in ETS work. The
<a href="https://github.com/erlang/otp/blob/ce7dbe8742e66f4632b5d39a9b4d7aa461e4f164/erts/emulator/beam/erl_flxctr.h">decentralized counter implementation exports an
API</a>
that makes it easy to swap between a decentralized counter and a
centralized one. ETS uses this API to support both centralized and
decentralized counters. The data structure for the
decentralized counter is illustrated in the following picture. When
<code>is_decentralized = false</code>, the counter field represents the current
count instead of a pointer to an array of cache-line-padded counters.</p>

<p><img src="/blog/images/ets_scalable_counters/structure.png" alt="alt text" title="An image
showing the structure of a decentralized counter" /></p>

<p>When <code>is_decentralized = true</code>, processes that update (increment or
decrement) the counter follow the pointer to the array of counters and
increment the counter at the slot in the array that the current
scheduler maps to (one takes the scheduler identifier modulo the
number of slots in the array to get the appropriate slot). Updates do
not need to do anything else, so they are very efficient and can scale
perfectly with the number of cores as long as there are as many slots
as schedulers. One can configure the maximum number of slots in the
array of counters with the
<a href="https://erlang.org/doc/man/erl.html"><code>+dcg</code></a> option.</p>

<p>To implement the <code>ets:info(Table, size)</code> and <code>ets:info(Table, memory)</code>
operations, one also needs to read the current counter value. Reading
the current counter value can be implemented by taking the sum of the
values in the counter array. However, if this summation is done
concurrently with updates to the array of counters, we could get
strange results. For example, we could end up in a situation where
<code>ets:info(Table, size)</code> returns a negative number, which is not
exactly what we want. On the other hand, we want to make counter
updates as fast as possible, so protecting the counters in the counter
array with locks is not a good option. We opted for a solution that
lets readers swap out the entire counter array and wait (using the
<a href="https://github.com/erlang/otp/blob/7c06ca6231b812965305522284dd9f2653ced98d/erts/emulator/internal_doc/ThreadProgress.md">Erlang VM’s thread progress
system</a>)
until no updates can occur in the swapped-out array before the sum is
calculated. The following steps illustrate this approach (a code
sketch summarizing the steps follows the list):</p>

<ul>
  <li>
    <p><strong>[Step 1]</strong></p>

    <p>A thread is going to read the counter value.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_1.png" alt="alt text" title="Step 1" /></p>
  </li>
  <li>
    <p><strong>[Step 2]</strong></p>

    <p>The reader starts by creating a new counter array.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_1_b.png" alt="alt text" title="Step 2" /></p>
  </li>
  <li>
    <p><strong>[Step 3]</strong></p>

    <p>The pointer to the old counter array is changed to point to the new
 one with the <code>snapshot_ongoing</code> field set to <code>true</code>. This
 change can only be done when the <code>snapshot_ongoing</code> field is set to
 <code>false</code> in the old counter array.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_2.png" alt="alt text" title="Step 3" /></p>
  </li>
  <li>
    <p><strong>[Step 4]</strong></p>

    <p>Now, the reader has to wait until all other threads that might
 still be updating a counter in the old array have completed their updates. As
 mentioned, this can be done using the <a href="https://github.com/erlang/otp/blob/7c06ca6231b812965305522284dd9f2653ced98d/erts/emulator/internal_doc/ThreadProgress.md">Erlang VM’s thread progress
 system</a>. After
 that, the reader can safely calculate the sum of counters in the
 old counter array (the sum is 1406). The calculated sum is also
 given to the process that requested the count so that it can
 continue execution.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_3.png" alt="alt text" title="Step 4" /></p>
  </li>
  <li>
    <p><strong>[Step 5]</strong></p>

    <p>The read operation is not done yet, even though we have successfully
 calculated a count. The calculated sum from the old array must be added
 to the new array so that no part of the count is lost.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_4.png" alt="alt text" title="Step 5" /></p>
  </li>
  <li>
    <p><strong>[Step 6]</strong></p>

    <p>Finally, the <code>snapshot_ongoing</code> field in the new counter array is
 set to <code>false</code> so that other read operations can swap out the new
 counter array.</p>

    <p><img src="/blog/images/ets_scalable_counters/snap_ani_5.png" alt="alt text" title="Step 6" /></p>
  </li>
</ul>
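<p>The following toy model (all names assumed; the real implementation
is the C code in <code>erl_flxctr.c</code>) summarizes the steps above. It
stores the “pointer” to the current counter array in a small ETS table
and simply skips the thread progress wait, which cannot be expressed at
the Erlang level:</p>

<pre><code class="language-erlang">-module(flxctr_sketch).
-export([new/1, add/2, read/1]).

%% The ETS table plays the role of the pointer to the current array.
new(NumSlots) ->
    T = ets:new(holder, [public, set]),
    ets:insert(T, {array, counters:new(NumSlots, [atomics]), NumSlots}),
    T.

%% Updaters increment a slot, chosen from the scheduler id, in whatever
%% array the pointer currently refers to.
add(T, Incr) ->
    [{array, Ctr, N}] = ets:lookup(T, array),
    Slot = (erlang:system_info(scheduler_id) rem N) + 1,
    counters:add(Ctr, Slot, Incr).

read(T) ->
    [{array, Old, N}] = ets:lookup(T, array),
    New = counters:new(N, [atomics]),   %% Step 2: create a new array
    ets:insert(T, {array, New, N}),     %% Step 3: swap the pointer
    %% Step 4: the real code waits here, via the thread progress
    %% system, until no thread can still be updating Old.
    Sum = lists:sum([counters:get(Old, I) || I <- lists:seq(1, N)]),
    counters:add(New, 1, Sum),          %% Step 5: carry the sum over
    %% Step 6: clearing snapshot_ongoing is likewise omitted here.
    Sum.
</code></pre>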

<p>You should now have a basic understanding of how ETS’
decentralized counters work. You are also welcome to look at the
source code in
<a href="https://github.com/erlang/otp/blob/ce7dbe8742e66f4632b5d39a9b4d7aa461e4f164/erts/emulator/beam/erl_flxctr.c">erl_flxctr.c</a>
and
<a href="https://github.com/erlang/otp/blob/ce7dbe8742e66f4632b5d39a9b4d7aa461e4f164/erts/emulator/beam/erl_flxctr.h">erl_flxctr.h</a>
if you are interested in the details of the implementation.</p>

<p>As you can imagine, reading the value of a decentralized counter with,
for example, <code>ets:info(Table, size)</code> is extremely slow compared
to reading a centralized counter. Fortunately, most of the time spent
reading the value of a decentralized counter is spent waiting for the
thread progress system to report that it is safe to read the
swapped-out array; during this wait, the read operation does not block
any scheduler and does not consume any CPU time. On the other hand, the
decentralized counter can be updated in a very efficient and scalable
way, so decentralized counters are most likely preferable if you seldom
need to get the size of, or the memory consumed by, your shared ETS
table.</p>

<h2 id="concluding-remarks">Concluding Remarks</h2>

<p>This blog post has described the implementation of the decentralized
counter option for ETS tables. ETS tables with decentralized counters
scale much better with the number of cores than ETS tables with
centralized counters. However, as decentralized counters make
<code>ets:info(Table, size)</code> and <code>ets:info(Table, memory)</code> very slow, one
should not use them if either of these two operations needs to be
performed frequently.</p>]]></content><author><name>Kjell Winblad</name></author><category term="ETS," /><category term="erlang" /><category term="term" /><category term="storage," /><category term="scalability," /><category term="multicore" /><summary type="html"><![CDATA[A shared Erlang Term Storage (ETS) table is often an excellent place to store data that is updated and read from multiple Erlang processes frequently. ETS provides key-value stores to Erlang processes. When the write_concurrency option is activated, ETS tables use fine-grained locking internally. Therefore, a scenario where multiple processes insert and remove different items in an ETS table should scale well with the number of utilized cores. However, in practice the scalability for such scenarios is not yet perfect. This blog post will explore how the decentralized_counters option brings us one step closer to perfect scalability.]]></summary></entry></feed>