reproduceable livelock (?) in R13B04

Thu Nov 18 16:00:04 CET 2010

Hi,

while chasing a related bug, I found that the program below can get
the beam VM into a state where it eats CPU but fails to make forward
progress.

I expect the program to print _something_ new at one second intervals,
no matter what. Here's the program:

   -module(poll).
   -export([go/0]).

   go() ->
       file:delete("/tmp/pid"),
       spawn(fun() -> os:cmd("sleep 1; echo -n $? > /tmp/pid") end),
       wait_for_ev(),
       go().

   wait_for_ev() ->
       case file:read_file("/tmp/pid") of
   	{ok, <<>>} -> io:fwrite("race!\n");
   	{ok, _} -> io:fwrite(".");
   	{error, enoent} -> wait_for_ev();
   	X -> io:fwrite("unexpected: ~p\n", [X])
       end.

On my dual-core 64-bit dual core desktop machine (SMP, the program runs as
expected for a few minutes, printing various things, but then it
produces no more output. 'top' says it's eating 100% CPU. That's happened
within three minutes three times in a row now, i.e. it's reproduceable on
that particular machine.

  Erlang R13B04 (erts-5.7.5) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]

  Linux contorpis.corelatus.se 2.6.32-5-amd64 #1 SMP Sat Oct 30 14:18:21 UTC 2010 x86_64 GNU/Linux

On my single-core 32-bit laptop, I can't provoke the same hang, i.e. the
program runs as it should.

On a single-core MIPS system, I also get the hang. On the MIPS system,
strace shows me:

read(8, "", 262144)                     = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 494810314}) = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 495844944}) = 0
poll([{fd=3, events=POLLIN|POLLRDNORM}, {fd=5, events=POLLIN|POLLRDNORM}, {fd=0,
events=POLLIN|POLLRDNORM}], 3, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 498175425}) = 0
read(8, "", 262144)                     = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 501869833}) = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 502907388}) = 0
poll([{fd=3, events=POLLIN|POLLRDNORM}, {fd=5, events=POLLIN|POLLRDNORM}, {fd=0,
events=POLLIN|POLLRDNORM}], 3, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 505240833}) = 0
read(8, "", 262144)                     = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 507228944}) = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 508265759}) = 0
poll([{fd=3, events=POLLIN|POLLRDNORM}, {fd=5, events=POLLIN|POLLRDNORM}, {fd=0,
events=POLLIN|POLLRDNORM}], 3, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 513049685}) = 0
read(8, "", 262144)                     = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 515038870}) = 0
gettimeofday({1290081697, 234682}, NULL) = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 517034759}) = 0
poll([{fd=3, events=POLLIN|POLLRDNORM}, {fd=5, events=POLLIN|POLLRDNORM}, {fd=0,
events=POLLIN|POLLRDNORM}], 3, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {56121, 519364425}) = 0

I can see that FD=8 is /tmp/pid.

I haven't tried reproducing this on R14B (yet). I can't see anything in
the R14B release notes which matches my symptoms, OTP-8773 is the closest.

Is this a known problem?

Matt