[erlang-bugs] R16B01's monitor delivery is broken?

Scott Lystig Fritchie fritchie@REDACTED
Tue Jul 23 22:44:33 CEST 2013


Hi, everyone.  Hope your summer vacations are going well.  I have some
bad news for Rickard, at least.

    SHA:	 e794251f8e54d6697e1bcc360471fd76b20c7748
    Author:	 Rickard Green <rickard@REDACTED>
    Date:	 Thu May 30 2013 07:56:31 GMT-0500 (CDT)
    Subject: Merge branch 'rickard/ptab-id-alloc/OTP-11077' into maint
    Parent:	 22685099ace9802016bf6203c525702084717d72
    Parent:	 5c039a1fb4979314912dc3af6626d8d7a1c73993
    Merge branch 'rickard/ptab-id-alloc/OTP-11077' into maint
    
    * rickard/ptab-id-alloc/OTP-11077:
      Introduce a better id allocation algorithm for PTabs

This commit appears to break monitor delivery?  And it may or may not be
causing processes to die for reasons that we cannot see or understand.

Run with R15B03-1, the example code in test6.erl is merely slow:

    https://gist.github.com/jtuple/aa4830a0ff0a94f69484/raw/02adc518e225f263a7e25d339ec7200ef2dda491/test6.erl

On my 4 core/8 HT core MacBook Pro, R15B03-1 cannot go above 200% CPU
utilization, and the execution time is correspondingly slooow.  But it
appears to work correctly.

    erl -eval '[begin io:format("Iteration ~p at ~p\n", [X,time()]), test6:go() end || X <- lists:seq(1, 240)].'

When run with R16B, it's *much* faster.  CPU utilization above 750%
confirms that it's going faster.  And it appears to work correctly.

However, when run with R16B01, we see non-deterministic hangs on both OS
X and various Linux platforms.  CPU consumption by the "beam.smp"
process drops to 0, and the next cycle of the list comprehension never
starts.

Thanks to the magic of Git, it's pretty clear that the commit above is
broken.  The commit before it appears to work well (i.e., does not
hang).

    SHA:	 22685099ace9802016bf6203c525702084717d72
    Author:	 Anders Svensson <anders@REDACTED>
    Date:	 Wed May 29 2013 11:46:10 GMT-0500 (CDT)
    Subject: Merge branch 'anders/diameter/watchdog_function_clause/OTP-11115' into maint

Using R16B01 together with the "+P legacy" flag does not hang.  But this
problem has given us at Basho enough ... caution ... that we will be
much more cautious about moving our app packaging from R15B* to R16B*.

Several seconds after CPU consumption drops to 0%, then I trigger the
creation of a "erl_crash.dump" file using erlang:halt("bummer").  If I
look at that file, then the process "Spawned as: test6:do_work2/0" says
that there are active unidirectional links (i.e., monitors), but there
is one process on that list that does not have a corresponding
"=proc:<pid.number.here>" entry in the dump ... which strongly suggests
to me that the process is dead.  Using DTrace, I've been able to
establish that the dead process is indeed alive at one time and has been
scheduled & descheduled at least once.  So there are really two
mysteries:

1. Why is one of the test6:indirect_proxy/1 processes dying
unexpectedly?  (The monitor doesn't fire, SASL isn't logging any errors,
etc.)

2. Why isn't a monitor message being delivered?

Many thanks to Joe Blomstedt, Evan Vigil-McClanahan, Andrew Thompson,
Steve Vinoski, and Sean Cribbs for their sleuthing work.

-Scott

--- snip --- snip --- snip --- snip --- snip --- 

R15B03 lock count analysis, FWIW:

        lock     id  #tries  #collisions  collisions [%]  time [us]  duration [%]
       -----    --- ------- ------------ --------------- ---------- -------------
    proc_tab      1 1280032      1266133         98.9142   60642804      557.0583
   run_queue      8 3617608        12874          0.3559     261722        2.4042
 sys_tracers      1 1280042         6445          0.5035      19365        0.1779
    pix_lock    256 4480284         1213          0.0271       9777        0.0898
   timeofday      1  709955         1187          0.1672       3216        0.0295
[......]

--- snip --- snip --- snip --- snip --- snip --- 

=proc:<0.29950.154>
State: Waiting
Spawned as: test6:do_work2/0
Spawned by: <0.48.0>
Started: Tue Jul 23 04:50:54 2013
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [{from,<0.48.0>,#Ref<0.0.19.96773>}, {to,<0.32497.154>,#Ref<0.0.19.96797>}, {to,<0.1184.155>,#Ref<0.0.19.96796>}, {to,<0.31361.154>,#Ref<0.0.19.96799>}, {to,<0.32019.154>,#Ref<0.0.19.96801>}, {to,<0.32501.154>,#Ref<0.0.19.96800>}, {to,<0.1352.155>,#Ref<0.0.19.96803>}, {to,<0.32415.154>,#Ref<0.0.19.96805>}, {to,<0.504.155>,#Ref<0.0.19.96804>}, {to,<0.87.155>,#Ref<0.0.19.96802>}, {to,<0.776.155>,#Ref<0.0.19.96798>}]
Reductions: 45
Stack+heap: 233
OldHeap: 0
Heap unused: 155
OldHeap unused: 0
Memory: 3472
Program counter: 0x000000001e1504d0 (test6:do_work2/0 + 184)
CP: 0x0000000000000000 (invalid)
arity = 0



More information about the erlang-bugs mailing list