From dszoboszlay@REDACTED Wed Oct 1 14:14:59 2014
From: dszoboszlay@REDACTED (Dániel Szoboszlay)
Date: Wed, 1 Oct 2014 14:14:59 +0200
Subject: [erlang-bugs] gen_server timeout disturbed by system messages
Message-ID:

Hi,

When a gen_server (or gen_fsm, which is very similar in this respect) is
waiting for a message with a finite timeout but receives a system
message, it returns to the message loop with the original timeout value.
I think this behaviour is incorrect.

Let's say I set the timeout to 60 seconds, and 20 seconds later a system
message arrives. After handling it, the gen_server loop will wait a full
60 seconds before I receive the timeout: a total of 80 seconds instead
of 60. Even worse, if a monitoring tool (like observer) keeps polling
the server for its state, I may never get a timeout at all.

The docs say:

> If an integer timeout value is provided, a timeout will occur unless a
> request or a message is received within Timeout milliseconds.

I'm not sure whether "message" in this text is meant to include system
messages too, but I think system messages are not very well known, and
to me the docs read as if my gen_server code will get control back via a
handle_call, handle_cast or handle_info callback within the specified
timeout. Losing control forever is definitely not something I would be
prepared for when setting a timeout.

So I believe the gen_server code should record the time when the wait
started and, after processing a system message, deduct the elapsed time
from the original timeout. This way the timeout would occur when it
should (unless, of course, the process receives system messages faster
than it can handle them and can never clear its message queue).

Let me know whether you agree with me on the expected behaviour. If you
do, I can write a patch and submit a PR, but I don't want to waste my
time working on a non-issue.

Thanks & Regards,
Daniel

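The proposal above (record when the wait started, then resume with only
the remaining time after a system message) can be sketched as follows.
This is a minimal illustration, not the gen_server implementation: the
module name timeout_sketch and the functions remaining/2 and wait/1 are
hypothetical, the {system, From, Request} clause only stands in for real
system-message handling, and erlang:monotonic_time(millisecond) assumes
a newer OTP release than the 17.x discussed in this thread.

```erlang
-module(timeout_sketch).
-export([remaining/2, wait/1]).

%% Milliseconds still left of Timeout, given the monotonic time (in ms)
%% at which the wait started. Never returns a negative value.
remaining(infinity, _StartMs) ->
    infinity;
remaining(Timeout, StartMs) when is_integer(Timeout), Timeout >= 0 ->
    Elapsed = erlang:monotonic_time(millisecond) - StartMs,
    max(0, Timeout - Elapsed).

%% Hypothetical receive loop: wait for a message for at most Timeout ms
%% in total, no matter how many system messages arrive in between.
wait(Timeout) ->
    wait(Timeout, erlang:monotonic_time(millisecond)).

wait(Timeout, StartMs) ->
    receive
        {system, _From, _Request} ->
            %% Assumption: the system message would be handled here.
            %% The loop then resumes with the ORIGINAL start time, so
            %% only the remaining part of the timeout is waited for.
            wait(Timeout, StartMs);
        Msg ->
            {message, Msg}
    after remaining(Timeout, StartMs) ->
        timeout
    end.
```

With this shape, a 60 second timeout interrupted by a system message
after 20 seconds would fire roughly 40 seconds later rather than 60.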
From sverker.eriksson@REDACTED Wed Oct 1 15:19:09 2014
From: sverker.eriksson@REDACTED (Sverker Eriksson)
Date: Wed, 1 Oct 2014 15:19:09 +0200
Subject: [erlang-bugs] Erlang vm beam.smp crash
In-Reply-To: <6a7d6a4b.104f6.148c0eeb097.Coremail.liu1985629@163.com>
References: <6a7d6a4b.104f6.148c0eeb097.Coremail.liu1985629@163.com>
Message-ID: <542BFF4D.7060705@erix.ericsson.se>

Thank you, ???

This is a race bug that has been there since R16B01.
It's quite hard to hit, as it requires a race between socket port usage
and port termination while getting preempted by the OS at a very precise
code location.

A more complete fix will be included in 17.4, most probably looking like
this:

diff --git a/erts/emulator/beam/erl_bif_port.c b/erts/emulator/beam/erl_bif_port.c
index 8a622e5..64bd598 100644
--- a/erts/emulator/beam/erl_bif_port.c
+++ b/erts/emulator/beam/erl_bif_port.c
@@ -493,8 +493,8 @@ void erts_cleanup_port_data(Port *prt)
 {
     ASSERT(erts_atomic32_read_nob(&prt->state)
            & ERTS_PORT_SFLGS_INVALID_LOOKUP);
-    cleanup_old_port_data(erts_smp_atomic_read_nob(&prt->data));
-    erts_smp_atomic_set_nob(&prt->data, (erts_aint_t) THE_NON_VALUE);
+    cleanup_old_port_data(erts_smp_atomic_xchg_nob(&prt->data,
+                                                   (erts_aint_t) NULL));
 }
 
 Uint
@@ -562,8 +562,14 @@ BIF_RETTYPE port_set_data_2(BIF_ALIST_2)
 
     data = erts_smp_atomic_xchg_wb(&prt->data, data);
 
+    if (data == (erts_aint_t)NULL) {
+        /* Port terminated by racing thread */
+        data = erts_smp_atomic_xchg_wb(&prt->data, data);
+        ASSERT(data != (erts_aint_t)NULL);
+        cleanup_old_port_data(data);
+        BIF_ERROR(BIF_P, BADARG);
+    }
     cleanup_old_port_data(data);
-
     BIF_RET(am_true);
 }
 
@@ -582,6 +588,8 @@ BIF_RETTYPE port_get_data_1(BIF_ALIST_1)
         BIF_ERROR(BIF_P, BADARG);
 
     data = erts_smp_atomic_read_ddrb(&prt->data);
+    if (data == (erts_aint_t)NULL)
+        BIF_ERROR(BIF_P, BADARG);  /* Port terminated by racing thread */
 
     if ((data & 0x3) != 0) {
         res = (Eterm) (UWord) data;

/Sverker, Erlang/OTP

On 09/29/2014 12:22 PM, ???
wrote: > I use http://www.erlang.org/download/otp_src_17.0.tar.gz to build the erlang. > > BIF_RETTYPE port_get_data_1(BIF_ALIST_1) > { > /* > * This is not a signal. See comment above. > */ > Eterm res; > erts_aint_t data; > Port* prt; > > prt = data_lookup_port(BIF_P, BIF_ARG_1); > if (!prt) > BIF_ERROR(BIF_P, BADARG); > > data = erts_smp_atomic_read_ddrb(&prt->data); > if (!data) > BIF_ERROR(BIF_P, BADARG); //I add the two lines to correct it. > > if ((data & 0x3) != 0) { > res = (Eterm) (UWord) data; > ASSERT(is_immed(res)); > } > else { > ErtsPortDataHeap *pdhp = (ErtsPortDataHeap *) data; > Eterm *hp = HAlloc(BIF_P, pdhp->hsize); > res = copy_struct(pdhp->data, pdhp->hsize, &hp, &MSO(BIF_P)); > } > > BIF_RET(res); > } > > > (gdb) bt full > #0 0x0000000000514524 in port_get_data_1 (A__p=0x7f4bc0d66488, BIF__ARGS=) at beam/erl_bif_port.c:591 > pdhp = 0x0 > hp = > data = 0 > #1 0x000000000054d517 in process_main () at beam/beam_emu.c:2787 > bf = 0x514490 > result = 1688368833101607 > init_done = 1 > c_p = 0x7f4bc0d66488 > reds_used = 178536832 > x0 = 1688368833101607 > reg = 0x7f4c0aa44180 > HTOP = 0x7f4bc036d350 > E = 0x7f4bc0370b18 > I = 0x7f4bfb5c7af8 > FCALLS = 1984 > tmp_arg1 = 139963324058344 > tmp_arg2 = 15 > tmp_big = {139964436718400, 5662828} > freg = 0x7f4c0aa461c0 > neg_o_reds = 0 > arith_func = 0 > opcodes = {0x54c14a, 0x54b78e, 0x54c06a, 0x54c0eb, 0x54c2f8, 0x54cee5, 0x54cb67, 0x54e173, 0x54ec5c, 0x54ca4d, 0x54ca43, 0x54ca23, 0x54908b, 0x54c5fe, 0x54d5c8, 0x54d605, 0x54d5f6, 0x54957d, 0x549451, 0x54d366, 0x54d26d, > 0x54d29b, 0x54d063, 0x54d092, 0x54d245, 0x54d223, 0x54d176, 0x54d4bb, 0x54ca52, 0x54caa1, 0x54ba31, 0x54ba07, 0x54ba26, 0x54bf45, 0x54bf66, 0x54ccfb, 0x54c949, 0x546667, 0x54ca9c, 0x54674e, 0x546771, 0x54667a, 0x5466ab, > 0x5466dc, 0x546715, 0x5464cd, 0x54eaa8, 0x54e14e, 0x54ea52, 0x54eb16, 0x546795, 0x5467b6, 0x5467d8, 0x5467f5, 0x546823, 0x546852, 0x546870, 0x54689f, 0x5468cf, 0x5468fc, 0x54692a, 0x546957, 0x54699e, 0x5469e6, 
0x546a14, > 0x546a5c, 0x546aa5, 0x546ad3, 0x546b02, 0x546b30, 0x546b78, 0x546bc1, 0x546bf0, 0x546c39, 0x54d3bf, 0x54b2ce, 0x54b39e, 0x54b3c9, 0x54e468, 0x54b3bf, 0x54b5f8, 0x54e1d9, 0x54b046, 0x54b0ac, 0x54b65f, 0x54ce89, 0x54b47a, > 0x54d44b, 0x54e23b, 0x54b4bd, 0x5493b7, 0x54947d, 0x54ddf7, 0x54df22, 0x54e011, 0x54d8dd, 0x54d60f, 0x54d694, 0x54d965, 0x54d9db, 0x54da59, 0x54db67, 0x54d718, 0x54d348, 0x54d2c2, 0x54d340, 0x54b103, 0x54d34d, 0x54d1a0, > 0x54d8d6, 0x54d857, 0x54d8b7, 0x54d6ad, 0x54d57d, 0x54d5ba, 0x54d810, 0x54d849, 0x549430, 0x54bbbe, 0x54e2f0, 0x54d38c, 0x54e331, 0x54e345, 0x54bab9, 0x54e357, 0x549185, 0x54e3c9, 0x54e3e9, 0x5494f1, 0x549306, 0x549515, > 0x54d6a5, 0x54bbc9, 0x54953a, 0x54d0ee, 0x54d0b2, 0x54df04, 0x54ddd1, 0x54bc6c, 0x54dede, 0x54dd60, 0x54dc7d, 0x54dcef, 0x54af53, 0x54afcc, 0x54e7ca, 0x54e7ff, 0x54e294, 0x54e2be, 0x54b6ca, 0x54d355, 0x54d211, 0x54cf47, > 0x54cf9a, 0x54d007, 0x54d12e, 0x54ec20, 0x54b559, 0x54b507, 0x546539, 0x54656a, 0x5465a3, 0x5465e0, 0x54e0ea, 0x54e400, 0x546504, 0x5464e2, 0x54ea08, 0x54be99, 0x54c53b, 0x54c5d7, 0x54e774, 0x54bee0, 0x54bf28, 0x54bf36, > 0x5464cd, 0x54b222, 0x546d1f, 0x546d3e, 0x546d81, 0x546da0, 0x546dd2, 0x546e29, 0x546e49, 0x546e7c, 0x546d04, 0x546d5e, 0x546e05, 0x546ca2, 0x546cbd, 0x546ce0, 0x546c83, 0x54e83c, 0x54b1cc, 0x54b278, 0x54eb5b, 0x54ba46, > 0x54c683, 0x54c726, 0x54c7b4...} > temp_bits = 139964436760704 > pt_arity = 139963334550664 > start_time = 0 > start_time_i = 0x0 > EBS = 0x7f4c0288c898 > #2 0x00000000004a081b in sched_thread_func (vesdp=0x7f4c0288c880) at beam/erl_process.c:7665 > callbacks = {arg = 0x7f4c02882380, wakeup = 0x4a21b0 , prepare_wait = 0x49e370 , wait = 0x49f6f0 , finalize_wait = 0x49e350 } > esdp = 0x7f4c0288c880 > no = 2 > #3 0x00000000005df676 in thr_wrapper (vtwd=) at pthread/ethread.c:110 > result = > res = 0x7fffd82d4b90 > twd = > thr_func = 0x4a0700 > arg = 0x7f4c0288c880 > tsep = 0x7f4c0a2800a0 > #4 0x00000037d10079d1 in start_thread () from 
/lib64/libpthread.so.0
> No symbol table info available.
> #5 0x00000037d0ce8b6d in ?? ()
> No symbol table info available.
> #6 0x0000000000000000 in ?? ()
> No symbol table info available.

From sverker.eriksson@REDACTED Wed Oct 1 15:34:39 2014
From: sverker.eriksson@REDACTED (Sverker Eriksson)
Date: Wed, 1 Oct 2014 15:34:39 +0200
Subject: [erlang-bugs] Erlang vm beam.smp crash
In-Reply-To: <6B0DD9F1-3AC2-443D-ADA3-32324A45F2D0@rogvall.se>
References: <6a7d6a4b.104f6.148c0eeb097.Coremail.liu1985629@163.com> <6B0DD9F1-3AC2-443D-ADA3-32324A45F2D0@rogvall.se>
Message-ID: <542C02EF.2070703@erix.ericsson.se>

Oops.

Guess no one is using port_set_data with non-immediate terms....
not even our tests :-[

Here is the one-liner that fixes the problem.

diff --git a/erts/emulator/beam/erl_bif_port.c b/erts/emulator/beam/erl_bif_port.c
index afb33c1..8a622e5 100644
--- a/erts/emulator/beam/erl_bif_port.c
+++ b/erts/emulator/beam/erl_bif_port.c
@@ -554,6 +554,7 @@ BIF_RETTYPE port_set_data_2(BIF_ALIST_2)
     hp = &pdhp->heap[0];
     pdhp->off_heap.first = NULL;
     pdhp->off_heap.overhead = 0;
+    pdhp->hsize = hsize;
     pdhp->data = copy_struct(BIF_ARG_2, hsize, &hp, &pdhp->off_heap);
     data = (erts_aint_t) pdhp;
     ASSERT((data & 0x3) == 0);

Thanks Tony, for using your bug-sniffing ability for the greater good.

/Sverker, Erlang/OTP

On 09/30/2014 02:39 PM, Tony Rogvall wrote:
> I also found this:
>
> (on unix/mac, on windows use some other program)
>> Port = open_port({spawn, "cat"}, []).
>> erlang:port_set_data(Port, {1,2,3}).
> true
>> erlang:port_get_data(Port).
> ...
hmm my cpu skyrockets and the computer dies a bit :-)
> then
>
> beam.smp(41283,0xb06bb000) malloc: *** mach_vm_map(size=1799625657810944) failed (error code=3)
> *** error: can't allocate region
> *** set a breakpoint in malloc_error_break to debug
> beam.smp(41283,0xb06bb000) malloc: *** mach_vm_map(size=1799625657810944) failed (error code=3)
> *** error: can't allocate region
> *** set a breakpoint in malloc_error_break to debug
> beam.smp(41283,0xb06bb000) malloc: *** mach_vm_map(size=1799625656766464) failed (error code=3)
> *** error: can't allocate region
> *** set a breakpoint in malloc_error_break to debug
> beam.smp(41283,0xb06bb000) malloc: *** mach_vm_map(size=1799625656766464) failed (error code=3)
> *** error: can't allocate region
> *** set a breakpoint in malloc_error_break to debug
>
> Crash dump was written to: erl_crash.dump
> eheap_alloc: Cannot allocate 1799625656762408 bytes of memory (of type "heap_frag").
>
> /Tony

From magnus.ottenklinger@REDACTED Fri Oct 10 09:38:22 2014
From: magnus.ottenklinger@REDACTED (Magnus Ottenklinger)
Date: Fri, 10 Oct 2014 07:38:22 +0000
Subject: [erlang-bugs] Deadlocking application_controller using init:stop/1, 2
In-Reply-To:
References: <71607aa8a2fd48eb95938c1db5552c4a@DBXPR03MB511.eurprd03.prod.outlook.com>
Message-ID:

Hey Siri,

any update on this?

Regards,
Magnus

From: Siri Hansen [mailto:erlangsiri@REDACTED]
Sent: Wednesday, 10 September 2014 16:32
To: Magnus Ottenklinger
Cc: erlang-bugs@REDACTED
Subject: Re: [erlang-bugs] Deadlocking application_controller using init:stop/1, 2

Thanks for the additional information, Magnus! We will discuss this a
bit more in the team before proceeding.

Regards
/siri

2014-09-09 11:38 GMT+02:00 Magnus Ottenklinger:

Hey Siri,

sorry for taking so long to reply.

Our system takes quite some time starting up (around one minute). While
this is being done, multiple applications are started, each with a
supervisor tree.
Within those supervisor trees, processes might start other OTP
applications, such as ssl. The init:stop() is sent to the VM by our
/etc/init.d script. If e.g. an error is detected during the startup
phase, and we want to stop the node, the described deadlock appears,
rendering the system unstoppable (in a clean way).

Regards,
Magnus

From vicent.ferrerguasch@REDACTED Sun Oct 26 19:05:28 2014
From: vicent.ferrerguasch@REDACTED (Vicent Ferrer Guasch)
Date: Sun, 26 Oct 2014 20:05:28 +0200
Subject: [erlang-bugs] ASN.1 PER compile
Message-ID: <544D37E8.6020701@aalto.fi>

Hello,

I am trying to compile the latest S1AP (3GPP 36.413) ASN specifications
using asn1ct:compile/2, but the generated source file is incorrect. I
think compile/2 has a problem with the table constraints using per
encoding, i.e.

InitiatingMessage ::= SEQUENCE {
    procedureCode  S1AP-ELEMENTARY-PROCEDURE.&procedureCode ({S1AP-ELEMENTARY-PROCEDURES}),
    criticality    S1AP-ELEMENTARY-PROCEDURE.&criticality ({S1AP-ELEMENTARY-PROCEDURES}{@procedureCode}),
    value          S1AP-ELEMENTARY-PROCEDURE.&InitiatingMessage ({S1AP-ELEMENTARY-PROCEDURES}{@procedureCode})
}

The generated code contains @, for example:

<> = Bytes1,

{V1@REDACTED,V1@REDACTED}

I have also tried uper encoding with the same results, although ber
encoding generates correct Erlang code.

I was using this compiler with the same specifications last year, and it
was working properly.
I tried this on Debian testing with the
http://packages.erlang-solutions.com/debian repo:
Erlang 17.3
asn1 3.0.2

I have also tried it on Ubuntu 14.04:
Erlang 16.03
asn1 2.0.4

Let me know if you need further details.

Best Regards,
Vicent

From bjorn@REDACTED Mon Oct 27 10:21:16 2014
From: bjorn@REDACTED (Björn Gustavsson)
Date: Mon, 27 Oct 2014 10:21:16 +0100
Subject: [erlang-bugs] ASN.1 PER compile
In-Reply-To: <544D37E8.6020701@aalto.fi>
References: <544D37E8.6020701@aalto.fi>
Message-ID:

On Sun, Oct 26, 2014 at 7:05 PM, Vicent Ferrer Guasch wrote:
> Hello,
>
> I am trying to compile the last S1AP (3GPP 36.413) ASN specifications
> using asn1ct:compile/2 , but the generated source file is incorrect. I
> think compile/2 has a problem with the table constraints using per
> encoding i.e.
>

Do you get an actual compilation error from the BEAM compiler?
If so, please show the error messages.

The S1AP specifications are included in our test suites, but we might
not have the very latest version.

>
> The generated code contains @, an example.
>
> <> = Bytes1,
>
> {V1@REDACTED,V1@REDACTED}

That is not a problem. @ is allowed in variable names.

/Bjorn

--
Björn Gustavsson, Erlang/OTP, Ericsson AB

From bjorn@REDACTED Mon Oct 27 12:59:07 2014
From: bjorn@REDACTED (Björn Gustavsson)
Date: Mon, 27 Oct 2014 12:59:07 +0100
Subject: [erlang-bugs] ASN.1 PER compile
In-Reply-To: <36EDBB147A06874B9DF7A8F578B265BA73DA9AA8@EXMDB04.org.aalto.fi>
References: <544D37E8.6020701@aalto.fi> <36EDBB147A06874B9DF7A8F578B265BA73DA9AA8@EXMDB04.org.aalto.fi>
Message-ID:

On Mon, Oct 27, 2014 at 12:21 PM, Ferrer Guasch Vicent wrote:
>
> I don't get any errors when compiling, the problem appears when using the generated module. That's why I thought the @ were a problem.
> When I try to decode a S1Setup message I obtain this error:
>
> > 'S1AP':decode('S1AP-PDU',Str).
> {error,{asn1,{case_clause,[0,17,0,48,0,0,4,0,59,0,8,0,66,
>                            244,112,0,0,6,224,0,60,64,10|...]}}}

It seems that Str is a list. We now only allow a binary as the second
argument to the decode function.

>
> The test function:
>
> > asn1ct:test('S1AP','S1AP-PDU').
> {error,
>     {asn1,
>         {encode,
>             {{'S1AP','S1AP-PDU',
>                  {successfulOutcome,
>                      {'SuccessfulOutcome',240,notify,<<"\n\topen_type">>}}},
>              {error,{asn1,function_clause}}}}}}
>

That is a bug, but not a new one. The asn1ct:test/2 function does not
understand table constraints.

/Bjorn

--
Björn Gustavsson, Erlang/OTP, Ericsson AB

From vicent.ferrerguasch@REDACTED Mon Oct 27 12:21:28 2014
From: vicent.ferrerguasch@REDACTED (Ferrer Guasch Vicent)
Date: Mon, 27 Oct 2014 11:21:28 +0000
Subject: [erlang-bugs] ASN.1 PER compile
In-Reply-To:
References: <544D37E8.6020701@aalto.fi>
Message-ID: <36EDBB147A06874B9DF7A8F578B265BA73DA9AA8@EXMDB04.org.aalto.fi>

> -----Original Message-----
> From: bgustavsson@REDACTED [mailto:bgustavsson@REDACTED] On Behalf
> Of Björn Gustavsson
> Sent: 27 October 2014 11:21
> To: Ferrer Guasch Vicent
> Cc: erlang-bugs
> Subject: Re: [erlang-bugs] ASN.1 PER compile
>
> On Sun, Oct 26, 2014 at 7:05 PM, Vicent Ferrer Guasch
> wrote:
> > Hello,
> >
> > I am trying to compile the last S1AP (3GPP 36.413) ASN specifications
> > using asn1ct:compile/2 , but the generated source file is incorrect. I
> > think compile/2 has a problem with the table constraints using per
> > encoding i.e.
> >
>
> Do you get an actual compilation error from the BEAM compiler?
> If so, please show the error messages.
>

I don't get any errors when compiling, the problem appears when using
the generated module. That's why I thought the @ were a problem.
When I try to decode a S1Setup message I obtain this error:

> 'S1AP':decode('S1AP-PDU',Str).
{error,{asn1,{case_clause,[0,17,0,48,0,0,4,0,59,0,8,0,66,
                           244,112,0,0,6,224,0,60,64,10|...]}}}

The test function:

> asn1ct:test('S1AP','S1AP-PDU').
{error,
    {asn1,
        {encode,
            {{'S1AP','S1AP-PDU',
                 {successfulOutcome,
                     {'SuccessfulOutcome',240,notify,<<"\n\topen_type">>}}},
             {error,{asn1,function_clause}}}}}}

> The S1AP are included in our test suites, but we might not have the very
> latest version.
>
> >
> > The generated code contains @, an example.
> >
> > <> = Bytes1,
> >
> > {V1@REDACTED,V1@REDACTED}
>
> That is not a problem. @ is allowed in variable names.

Some of the functions are not indented, is that normal?

>
> /Bjorn
>
> --
> Björn Gustavsson, Erlang/OTP, Ericsson AB

Vicent Ferrer

-------------- next part --------------
A non-text attachment was scrubbed...
Name: S1AP.erl
Type: application/octet-stream
Size: 1203683 bytes
Desc: S1AP.erl
-------------- next part --------------
A non-text attachment was scrubbed...
Name: S1AP.hrl
Type: application/octet-stream
Size: 45163 bytes
Desc: S1AP.hrl

From vicent.ferrerguasch@REDACTED Mon Oct 27 13:11:07 2014
From: vicent.ferrerguasch@REDACTED (Ferrer Guasch Vicent)
Date: Mon, 27 Oct 2014 12:11:07 +0000
Subject: [erlang-bugs] ASN.1 PER compile
In-Reply-To:
References: <544D37E8.6020701@aalto.fi> <36EDBB147A06874B9DF7A8F578B265BA73DA9AA8@EXMDB04.org.aalto.fi>
Message-ID: <36EDBB147A06874B9DF7A8F578B265BA73DA9B16@EXMDB04.org.aalto.fi>

> >
> > I don't get any errors when compiling, the problem appears when using the
> > generated module. That's why I thought the @ were a problem.
> > When I try to decode a S1Setup message I obtain this error:
> >
> > > 'S1AP':decode('S1AP-PDU',Str).
> > {error,{asn1,{case_clause,[0,17,0,48,0,0,4,0,59,0,8,0,66,
> >                            244,112,0,0,6,224,0,60,64,10|...]}}}
>
> It seems that Str is a list. We now only allow a binary as the second argument
> to the decode function.
>

You are right, my fault! It is working, but I wasn't using it right. I
was used to passing a list. Thanks for the support.

> >
> > The test function:
> >
> > > asn1ct:test('S1AP','S1AP-PDU').
> > {error,
> >     {asn1,
> >         {encode,
> >             {{'S1AP','S1AP-PDU',
> >                  {successfulOutcome,
> >                      {'SuccessfulOutcome',240,notify,<<"\n\topen_type">>}}},
> >              {error,{asn1,function_clause}}}}}}
> >
>
> That is a bug, but not a new one. The asn1ct:test/2 function does not
> understand table constraints.

I will take that into account, but I think I won't need it.

>
> /Bjorn
>
> --
> Björn Gustavsson, Erlang/OTP, Ericsson AB

Vicent

From holger@REDACTED Tue Oct 28 13:46:35 2014
From: holger@REDACTED (Holger Weiß)
Date: Tue, 28 Oct 2014 13:46:35 +0100
Subject: [erlang-bugs] gen_tcp:send/2 gets stuck despite send_timeout
Message-ID: <20141028124635.GO627691@zedat.fu-berlin.de>

Hi there,

I'm an ejabberd contributor, and we're currently facing the issue that
gen_tcp:send/2 occasionally blocks forever even though a 'send_timeout'
(and 'send_timeout_close') has been specified.¹ This seems to happen
only under rare circumstances, but when it happens, it can crash the VM,
as the process that's stuck in the gen_tcp:send/2 call stops processing
its message queue and therefore eats the available memory, eventually.

This *only* seems to happen when epoll(7) is used, i.e. when "+K true"
is specified on Linux. "+K false" makes the issue go away.

Also, it only happens when the TCP socket is no longer usable. In the
past, it could occur that an ejabberd process called gen_tcp:send/2 even
though an earlier call returned a failure already. Since we changed the
code to fix that, the issue is triggered less frequently; and in those
cases where it still *is* triggered, it's obvious from looking at the
details that the socket got closed more or less at the same time.

The problem is that I'm not able to reproduce this myself. So far,
we've only been made aware of this issue on two servers, both of them
running in production, and it's only easily reproducible on one of them.
That one is running Erlang 17.1 on a Xen instance (I guess I could ask
the admin to update to 17.3).

Without code to reproduce the issue, this is probably non-trivial to
debug :-( At least there's one live system where the issue is usually
triggered multiple times per day. Any suggestions on how to proceed?

Thanks, Holger

¹ According to process_info/1, the current function is prim_inet:send/3.

From vinoski@REDACTED Tue Oct 28 16:15:41 2014
From: vinoski@REDACTED (Steve Vinoski)
Date: Tue, 28 Oct 2014 11:15:41 -0400
Subject: [erlang-bugs] gen_tcp:send/2 gets stuck despite send_timeout
In-Reply-To: <20141028124635.GO627691@zedat.fu-berlin.de>
References: <20141028124635.GO627691@zedat.fu-berlin.de>
Message-ID:

On Tue, Oct 28, 2014 at 8:46 AM, Holger Weiß wrote:

> Hi there,
>
> I'm an ejabberd contributor, and we're currently facing the issue that
> gen_tcp:send/2 occasionally blocks forever even though a 'send_timeout'
> (and 'send_timeout_close') has been specified.¹ This seems to happen
> only under rare circumstances, but when it happens, it can crash the VM,
> as the process that's stuck in the gen_tcp:send/2 call stops processing
> its message queue and therefore eats the available memory, eventually.
>
> This *only* seems to happen when epoll(7) is used, i.e. when "+K true"
> is specified on Linux. "+K false" makes the issue go away.
>
> Also, it only happens when the TCP socket is no longer usable. In the
> past, it could occur that an ejabberd process called gen_tcp:send/2 even
> though an earlier call returned a failure already. Since we changed the
> code to fix that, the issue is triggered less frequently; and in those
> cases where it still *is* triggered, it's obvious from looking at the
> details that the socket got closed more or less at the same time.
>
> The problem is that I'm not able to reproduce this myself. So far,
> we've only been made aware of this issue on two servers, both of them
> running in production, and it's only easily reproducible on one of them.
> That one is running Erlang 17.1 on a Xen instance (I guess I could ask
> the admin to update to 17.3).
>
> Without code to reproduce the issue, this is probably non-trivial to
> debug :-( At least there's one live system where the issue is usually
> triggered multiple times per day. Any suggestions on how to proceed?
>
> Thanks, Holger
>
> ¹ According to process_info/1, the current function is prim_inet:send/3.
>

I've seen this happen when the receiver is not reading from its socket,
thus causing the TCP window to close to apply backpressure to the
sender, and so Erlang's inet driver fills the kernel's send buffers and
then inet_drv fills its own buffers with data to be sent. Once those are
all full, and you try to send again, prim_inet:send blocks waiting to
receive a message from inet_drv that won't be sent until some send
buffer space is freed up due to the receiver reading data from its
socket. Under these conditions, netstat will show your connection's
kernel buffers to be full and unchanging, and a call to
erlang:port_info(Socket, queue_size) will show queued data greater than
the default of 8k for that socket. It can sit that way forever until the
receiver reads.

One thing that might help under these conditions is to send data
directly via port_command like this:

    erlang:port_command(Socket, Data, [nosuspend])

which returns false if the port/socket is busy, but even this doesn't
seem to be foolproof, nor does relying on send_timeout. I've seen both
work, and in fact both worked as expected in little cases I tried before
sending this reply, but I've also seen cases where I expected them to
work but they didn't and I couldn't determine why.

--steve
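
The nosuspend approach described above could be wrapped roughly as
follows. This is only a sketch under stated assumptions: the module name
nosuspend_send_sketch and the function try_send/2 are hypothetical,
Socket is assumed to be a port-based gen_tcp socket (the inet driver,
not the newer socket backend), and, as noted in the message above, a
false return is not a reliable guarantee in every situation.

```erlang
-module(nosuspend_send_sketch).
-export([try_send/2]).

%% Hypothetical helper: try to queue Data on Socket without ever
%% suspending the calling process.
%%   ok              - the data was handed to the driver
%%   {error, busy}   - the port is busy; nothing was queued
%%   {error, closed} - Socket is not (or no longer) an open port
try_send(Socket, Data) ->
    try erlang:port_command(Socket, Data, [nosuspend]) of
        true  -> ok;
        false -> {error, busy}
    catch
        error:badarg -> {error, closed}
    end.
```

When bypassing gen_tcp:send/2 like this, the calling process should
expect the driver's reply to arrive later as an {inet_reply, Socket,
Status} message in its own mailbox, and should consume it rather than
let such messages accumulate.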