[erlang-bugs] r15b03-1 SEGV in erts_port_task_schedule()

Mikael Pettersson mikpelinux@REDACTED
Tue Jul 29 16:30:37 CEST 2014


Mikael Pettersson writes:
 > This is a followup to my previous report in
 > <http://erlang.org/pipermail/erlang-bugs/2014-June/004451.html>,
 > but it's for a different function in erl_port_task.c.
 > 
 > We've gotten a new SEGV with r15b03-1.  This time we managed to
 > capture a truncated core dump (just threads list and registers,
 > no thread stacks or heap memory):
 > 
 > Program terminated with signal 11, Segmentation fault.
 > #0  enqueue_task (ptp=<optimized out>, 
 >     ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
 >     at beam/erl_port_task.c:327
 > 327         ptp->prev = ptqp->last;
 > (gdb) bt
 > #0  enqueue_task (ptp=<optimized out>, 
 >     ptqp=<error reading variable: Cannot access memory at address 0x7f8f02a95d08>)
 >     at beam/erl_port_task.c:327
 > #1  erts_port_task_schedule (id=<optimized out>, 
 >     id@entry=<error reading variable: Cannot access memory at address 0x7f8efdeb8318>, 
 >     pthp=<error reading variable: Cannot access memory at address 0x7f8efdeb82c0>, 
 >     type=<error reading variable: Cannot access memory at address 0x7f8efdeb82cc>, 
 >     event=<error reading variable: Cannot access memory at address 0x7f8efdeb82d0>, 
 >     event_data=<error reading variable: Cannot access memory at address 0x7f8efdeb82d8>)
 >     at beam/erl_port_task.c:615
 > (gdb) 
 > 
 > The code that faulted is
 > 
 >    0x00000000004b8203 <+419>:   mov    0x10(%r15),%rax
 >    0x00000000004b8207 <+423>:   mov    0x10(%rsp),%rbx
 >    0x00000000004b820c <+428>:   movq   $0x0,0x8(%rbx)
 > => 0x00000000004b8214 <+436>:   mov    0x8(%rax),%rcx
 >    0x00000000004b8218 <+440>:   mov    %rax,0x10(%rbx)
 >    0x00000000004b821c <+444>:   mov    %rcx,(%rbx)
 > 
 > which is enqueue_task() [line 327] as inlined in erts_port_task_schedule()
 > [line 615].  At this point, %rax is zero according to gdb's registers dump.
 > 
 > The relevant part of erts_port_task_schedule() is:
 > 
 > ==snip==
 >     if (!pp->sched.taskq)
 > 	pp->sched.taskq = port_taskq_init(port_taskq_alloc(), pp);
 > 
 >     ASSERT(ptp);
 > 
 >     ptp->type = type;
 >     ptp->event = event;
 >     ptp->event_data = event_data;
 > 
 >     set_handle(ptp, pthp);
 > 
 >     switch (type) {
 >     case ERTS_PORT_TASK_FREE:
 > 	erl_exit(ERTS_ABORT_EXIT,
 > 		 "erts_port_task_schedule(): Cannot schedule free task\n");
 > 	break;
 >     case ERTS_PORT_TASK_INPUT:
 >     case ERTS_PORT_TASK_OUTPUT:
 >     case ERTS_PORT_TASK_EVENT:
 > 	erts_smp_atomic_inc_relb(&erts_port_task_outstanding_io_tasks);
 > 	/* Fall through... */
 >     default:
 > 	enqueue_task(pp->sched.taskq, ptp);
 > 	break;
 >     }
 > ==snip==
 > 
 > The SEGV implies that pp->sched.taskq is NULL at the call to enqueue_task().
 > 
 > The erts_smp_atomic_inc_relb() and set_handle() calls do not affect *pp,
 > and I don't see any aliasing between *ptp and *pp, so the assignments to
 > *ptp do not affect *pp either.
 > 
 > So for pp->sched.taskq to be NULL at the bottom it would have to be NULL
 > after the call to port_taskq_init(), which implies that port_taskq_alloc()
 > returned NULL.
 > 
 > port_taskq_alloc() is generated via ERTS_SCHED_PREF_QUICK_ALLOC_IMPL;
 > if one expands that it becomes:
 > 
 > void erts_alloc_n_enomem(ErtsAlcType_t,Uint)
 >      __attribute__((noreturn));
 > 
 > static __inline__
 > void *erts_alloc(ErtsAlcType_t type, Uint size)
 > {
 >     void *res;
 >     res = (*erts_allctrs[(((type) >> (0)) & (15))].alloc)(
 > 	(((type) >> (7)) & (255)),
 > 	erts_allctrs[(((type) >> (0)) & (15))].extra,
 > 	size);
 >     if (!res)
 > 	erts_alloc_n_enomem((((type) >> (7)) & (255)), size);
 >     return res;
 > }
 > 
 > static __inline__ ErtsPortTaskQueue * port_taskq_alloc(void)
 > {
 >     ErtsPortTaskQueue *res = port_taskq_pre_alloc();
 >     if (!res)
 > 	res = erts_alloc((4564), sizeof(ErtsPortTaskQueue));
 >     return res;
 > }
 > 
 > But given this code, I don't see how erts_alloc() or port_taskq_alloc()
 > could ever return NULL.
 > 
 > Which leads me to suspect that there's a concurrency bug that's
 > causing *pp to be clobbered behind our backs.
 > 
 > Ideas?

I've studied erl_port_task.c and some related code, and think I
understand how the locking is _supposed_ to work: Before you can
access port->sched you have to erts_smp_runq_lock() the runq
stored (as an atomic integer) in port->run_queue.

runq = erts_port_runq(port);

does this automatically, but you then end up with two variables
that must stay coherent until you unlock.
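
For illustration, the lock-then-recheck pattern that a helper of this
kind needs can be sketched as below.  This is a minimal sketch under
stated assumptions, not the ERTS source: the run_queue and port structs,
the port_runq_locked() name, and the use of pthread mutexes and C11
atomics are all stand-ins invented for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real run queue and port types. */
typedef struct run_queue {
    pthread_mutex_t lock;
} run_queue;

typedef struct port {
    _Atomic uintptr_t run_queue; /* current owner, stored as an integer */
    void *taskq;                 /* stands in for port->sched; guarded by
                                    the owning run queue's lock */
} port;

/*
 * Return the port's current run queue with its lock held.
 * The owner is read without any lock, so it may change between the
 * read and the lock; re-read after locking and retry on mismatch.
 */
static run_queue *port_runq_locked(port *pp)
{
    for (;;) {
        run_queue *rq = (run_queue *) atomic_load(&pp->run_queue);
        assert(rq != NULL);
        pthread_mutex_lock(&rq->lock);
        if ((run_queue *) atomic_load(&pp->run_queue) == rq)
            return rq;           /* still the owner: safe to use pp->taskq */
        pthread_mutex_unlock(&rq->lock);
    }
}
```

The pair (pp, rq) returned here is only known to be coherent for as long
as rq's lock stays held.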

Q1. I see a number of places in erl_port_task.c where the code
temporarily releases the runq lock, calls out somewhere, and then
just re-locks the same runq while still holding on to the original port.
What, if anything, guarantees that these two items still belong
together?  Can't the port migrate to another runq while the runq
was unlocked?  If so, then it seems we have to reload runq (if the
port is the object of interest) or revalidate and possibly bail out
if they are no longer connected.
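
The revalidate-or-bail-out option in Q1 could look roughly like the
sketch below.  Everything here is hypothetical and only illustrates the
idea: the struct layouts, callout_and_revalidate(), and the two toy
callouts are made up for the example, and pthread mutexes stand in for
the runq lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real run queue and port types. */
typedef struct run_queue { pthread_mutex_t lock; } run_queue;
typedef struct port { _Atomic uintptr_t run_queue; } port;

typedef void (*callout_fn)(port *pp, void *arg);

/*
 * Precondition: the caller holds rq->lock and pp belonged to rq.
 * Drops the lock around the callout, retakes it, and re-checks the
 * owner.  Returns true with rq->lock held if pp still belongs to rq;
 * returns false with rq->lock released if pp migrated meanwhile.
 */
static bool callout_and_revalidate(port *pp, run_queue *rq,
                                   callout_fn callout, void *arg)
{
    pthread_mutex_unlock(&rq->lock);
    callout(pp, arg);            /* the port may migrate in this window */
    pthread_mutex_lock(&rq->lock);
    if ((run_queue *) atomic_load(&pp->run_queue) == rq)
        return true;
    pthread_mutex_unlock(&rq->lock);
    return false;
}

/* Toy callout: leaves the port where it is. */
static void callout_noop(port *pp, void *arg)
{
    (void) pp;
    (void) arg;
}

/* Toy callout: simulates another thread migrating the port to arg. */
static void callout_migrate(port *pp, void *arg)
{
    atomic_store(&pp->run_queue, (uintptr_t) arg);
}
```

A caller that gets false back must not touch the port's scheduling state
through the old runq; it either bails out or reloads the owner.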

Q2. Is erts_check_emigration_need(runq) _guaranteed_ to return a
value that is different from the given runq?  Looking at the code,
that's not obviously true.  If it ever returned the same runq it
was given, very bad things(TM) would happen...
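
One defensive way to make the assumption in Q2 explicit is to normalise
the result so that "same runq" is reported as "no migration".  The
sketch below is purely illustrative: the emigration_check_fn type, the
checked_emigration_target() wrapper, the toy policies, and the
convention that NULL means "stay put" are all assumptions made for the
example, not a description of what ERTS does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real run queue type. */
typedef struct run_queue { int id; } run_queue;

/* Hypothetical policy: returns a target queue, or NULL for "stay". */
typedef run_queue *(*emigration_check_fn)(run_queue *current);

/*
 * Wrap a policy so that callers never see the current queue as a
 * migration target; "migrate to yourself" becomes "do not migrate".
 */
static run_queue *checked_emigration_target(run_queue *current,
                                            emigration_check_fn check)
{
    assert(current != NULL);
    run_queue *target = check(current);
    if (target == current)
        return NULL;
    return target;
}

/* Toy policies used only to exercise the wrapper. */
static run_queue other_queue = { 2 };

static run_queue *policy_same(run_queue *current)
{
    return current;
}

static run_queue *policy_other(run_queue *current)
{
    (void) current;
    return &other_queue;
}

static run_queue *policy_none(run_queue *current)
{
    (void) current;
    return NULL;
}
```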

/Mikael


