[erlang-questions] The trouble with measurements (was: Benchmarking Erlang: ...)

Thu Oct 9 17:58:37 CEST 2008

Anders Nygren wrote:
> But since gen_server handles messages in a FIFO manner, all inserts 
> will be processed before the lookups. The only problem is that the 
> first lookup will have to wait for all inserts to finish. So the time
> for lookups will include the time for the inserts to finish, but 
> there will not be a lot of lookups before the inserts have finished.

The major problem with benchmarks, I have to say again, is to know what
is actually being measured. I thought I knew what was going on, but
Anders' comment here made me sit up and think "of course he's right!"
followed by "but how could that possibly fit in with what I'm seeing??"
because what I was seeing was that for large N, the results from test(N)
came in long before all the messages were processed by the gen_server.

The explanation turned out to be complicated. It was a dark and stormy
night...

First, this doesn't happen as easily without SMP (and since smp is on by
default these days, that was what I was using - and Joel too, I assume).
Without SMP, there is a single scheduler which time-slices all the ready
Erlang processes. A sender process, which does a large number (30000+)
of gen_server:cast() (asynchronous messages to the server), will not be
able to dispatch all of these before the gen_server process is swapped
in instead and starts handling the messages. But with SMP, a sender can
happily fill upp the server mailbox with tens of thousands of messages
and then proceed to doing gen_server:call() instead.

Second, Joel had not specified a timeout for his gen_server:call().
This means that it uses the default, which happens to be five seconds.
Now, you may remember from Joel's numbers what the lookup times for
large N were all very close to...

So, with SMP, and sufficiently large N, the test process very quickly
dispatched all its casts to ADD elements to the table, and the poor
gen_server had barely started to process them, when the test process
entered the first gen_server:call() to do a LOOKUP. Five seconds passed,
and the call timed out (still long before the server was anywhere near
getting through all the ADDs). This threw an exception, which
immediately quit the lookup(Pid, N) loop after the first call.

So why didn't anyone notice the timeout? Well, because of timer:tc(),
of course, which runs the call within a catch! So, what the timer really
measured for the lookup loop, was the time for the first gen_server:call
to time out, namely, five seconds.

    /Richard