Benchmarking

The main purpose of benchmarking is to find out which implementation of a given algorithm or function is the fastest. Benchmarking is far from an exact science. Today's operating systems generally run background tasks that are difficult to turn off. Caches and multiple CPU cores make results harder to reproduce from one run to the next. It would be best to run UNIX computers in single-user mode when benchmarking, but that is inconvenient, to say the least, for casual testing.

Using erlperf

A useful tool for benchmarking is erlperf (see its documentation). It makes it simple to find out which code is faster. For example, here is how two methods of generating random bytes can be compared:

% erlperf 'rand:bytes(2).' 'crypto:strong_rand_bytes(2).'
Code                                 ||        QPS       Time   Rel
rand:bytes(2).                        1    7784 Ki     128 ns  100%
crypto:strong_rand_bytes(2).          1    2286 Ki     437 ns   29%

The Time column shows that, on average, a call to rand:bytes(2) executes in 128 nanoseconds, while a call to crypto:strong_rand_bytes(2) executes in 437 nanoseconds.

The QPS column shows how many calls can be made per second. For rand:bytes(2), it is 7,784,000 calls per second.

The Rel column shows the relative differences, with 100% indicating the fastest code.
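As a small check on how to read the Rel column, the sketch below recomputes it from the QPS values. It assumes that Rel is each row's QPS divided by the highest QPS in the same run, rounded to a whole percentage. That reading is consistent with the tables on this page, but it is an inference from the output, not taken from the erlperf source. The module name rel_sketch and the function rel/2 are hypothetical.

```erlang
-module(rel_sketch).
-export([rel/2]).

%% Relative throughput of one row, as a whole percentage of the
%% highest QPS in the same run (assumed meaning of the Rel column).
rel(Qps, FastestQps) when FastestQps > 0 ->
    round(Qps / FastestQps * 100).
```

For the first table, rel_sketch:rel(2286, 7784) gives 29, which matches the 29% shown for crypto:strong_rand_bytes(2).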

When generating two random bytes at a time, rand:bytes/1 is more than three times faster than crypto:strong_rand_bytes/1. Assuming that we really need strong random numbers and we need to get them as fast as possible, what can we do? One way could be to generate more than two bytes at a time.

% erlperf 'rand:bytes(100).' 'crypto:strong_rand_bytes(100).'
Code                                   ||        QPS       Time   Rel
rand:bytes(100).                        1    2124 Ki     470 ns  100%
crypto:strong_rand_bytes(100).          1    1915 Ki     522 ns   90%

rand:bytes/1 is still faster when we generate 100 bytes at a time, but the relative difference is smaller.

% erlperf 'rand:bytes(1000).' 'crypto:strong_rand_bytes(1000).'
Code                                    ||        QPS       Time   Rel
crypto:strong_rand_bytes(1000).          1    1518 Ki     658 ns  100%
rand:bytes(1000).                        1     284 Ki    3521 ns   19%

When we generate 1000 bytes at a time, crypto:strong_rand_bytes/1 is the faster of the two.

Benchmarking using Erlang/OTP functionality

Benchmarks can measure wall-clock time or CPU time.

timer:tc/3 measures wall-clock time. The advantage of wall-clock time is that I/O, swapping, and other activities in the operating system kernel are included in the measurements. The disadvantage is that the measurements often vary a lot. Usually it is best to run the benchmark several times and note the shortest time, which is the time achievable under the best of circumstances.
statistics(runtime) measures CPU time spent in the Erlang virtual machine. The advantage of CPU time is that the results are more consistent from run to run. The disadvantage is that the time spent in the operating system kernel (such as swapping and I/O) is not included. Therefore, measuring CPU time is misleading if any I/O (file or socket) is involved.

It is probably a good idea to do both wall-clock measurements and CPU time measurements.
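A minimal sketch of taking both measurements around a single call is shown below. The module name bench_sketch and the function measure/3 are hypothetical. Note that statistics(runtime) reports CPU time for the whole runtime system, so work done by other Erlang processes during the call is counted as well.

```erlang
-module(bench_sketch).
-export([measure/3]).

%% Run apply(M, F, A) once and return both wall-clock and CPU time.
measure(M, F, A) ->
    %% statistics(runtime) returns {Total, SinceLastCall} in milliseconds.
    %% Calling it here resets the "since last call" counter.
    _ = statistics(runtime),
    %% timer:tc/3 returns {Microseconds, Result} for apply(M, F, A).
    {WallMicros, _Result} = timer:tc(M, F, A),
    {_Total, CpuMillis} = statistics(runtime),
    #{wall_us => WallMicros, cpu_ms => CpuMillis}.
```

For example, bench_sketch:measure(lists, seq, [1, 1000000]) returns a map with the keys wall_us and cpu_ms. The two values use different units (microseconds and milliseconds), so convert one before comparing them.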

Some final advice:

The granularity of both measurement types can be coarse. Therefore, ensure that each individual measurement lasts for at least several seconds.
To make the test fair, each new test run should run in its own, newly created Erlang process. Otherwise, if all tests run in the same process, the later tests start out with larger heap sizes and therefore probably do fewer garbage collections. Also consider restarting the Erlang emulator between tests.
Do not assume that the fastest implementation of a given algorithm on computer architecture X is also the fastest on computer architecture Y.
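The advice about fresh processes and repeated runs can be sketched as follows. This is one possible arrangement, not a prescribed method: the module name bench_fresh and the function min_time/2 are hypothetical. Each run executes in a newly spawned process, so it starts with a default-sized heap, and the shortest wall-clock time over all runs is returned.

```erlang
-module(bench_fresh).
-export([min_time/2]).

%% Run Fun in N freshly spawned processes, one after another, and
%% return the shortest wall-clock time in microseconds.
min_time(Fun, N) when is_function(Fun, 0), is_integer(N), N > 0 ->
    lists:min([run_in_new_process(Fun) || _ <- lists:seq(1, N)]).

run_in_new_process(Fun) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MonRef} = spawn_monitor(fun() ->
        %% timer:tc/1 returns {Microseconds, Result} for Fun().
        {Micros, _Result} = timer:tc(Fun),
        Parent ! {Ref, Micros}
    end),
    receive
        {Ref, Micros} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(MonRef, [flush]),
            Micros;
        {'DOWN', MonRef, process, Pid, Reason} ->
            %% The benchmarked code crashed before reporting a time.
            error({benchmark_crashed, Reason})
    end.
```

A call such as bench_fresh:min_time(fun() -> crypto:strong_rand_bytes(1000) end, 10) returns the best of ten runs. In a real benchmark, Fun should loop over the code under test enough times for each run to last several seconds, as advised above.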
