From fritchie@REDACTED Wed May 1 00:13:15 2013 From: fritchie@REDACTED (Scott Lystig Fritchie) Date: Tue, 30 Apr 2013 17:13:15 -0500 Subject: [erlang-bugs] Schedulers getting "stuck", part II In-Reply-To: Message of "Tue, 30 Apr 2013 15:34:09 CDT." <78245.1367354049@snookles.snookles.com> Message-ID: <83533.1367359995@snookles.snookles.com> Patrik, there are a couple of synthetic load cases that have an end result of what we occasionally see Riak and Riak CS doing in the wild. Many, many thanks to Joseph Blomstedt for inventing these two modules. test10.erl: https://gist.github.com/jtuple/0d9ca553b7e58adcb6f4 test11.erl: https://gist.github.com/jtuple/8f12ce9c21471f5d6f01 Both can be used by running the 'go/0' function. The test10:go() function creates an oscillation between a couple of workloads: one that tends toward scheduler collapse, and one that tends to wake the schedulers up again. The test11:go() function uses only a single load that tends toward scheduler collapse. Both of them fail fairly regularly on my 8-core MBP using R15B01, R15B03, and R16B. The io:format() messages are sent while load is not running, with very generous pauses before starting the next phase of workload. If you call io:format() during an unfairly-scheduled workload (which these tests excel at creating), the messages can be delayed by dozens of seconds. Note that these synthetic tests use two different functions to cause scheduler collapse: test10.erl uses crypto:md5_update/2, a NIF, and test11.erl uses erlang:external_size/1, a BIF. It's quite likely that erlang:term_to_binary/1 is similarly effective/buggy. Neither of them fails when using this patch on any of those three VM versions: https://github.com/slfritchie/otp/compare/erlang:maint...disable-scheduler-sleeps or https://github.com/slfritchie/otp/tree/disable-scheduler-sleeps ... when also using "+scl false +zdnfgtse 500:500".
-Scott From watson.timothy@REDACTED Wed May 1 13:32:42 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Wed, 1 May 2013 12:32:42 +0100 Subject: [erlang-bugs] Schedulers getting "stuck", part II In-Reply-To: <83533.1367359995@snookles.snookles.com> References: <83533.1367359995@snookles.snookles.com> Message-ID: On 30 Apr 2013, at 23:13, Scott Lystig Fritchie wrote: > > ... when also using "+scl false +zdnfgtse 500:500". > Does dnfgtse stand for what I think it does? :) From spawn.think@REDACTED Wed May 1 15:45:30 2013 From: spawn.think@REDACTED (Ahmed Omar) Date: Wed, 1 May 2013 15:45:30 +0200 Subject: [erlang-bugs] Crash in mnesia_controller with function clause exception from is_tab_blocked Message-ID: On startup of a node in the cluster, we observed the following crash report: 2013-04-29 15:33:12 =ERROR REPORT==== Mnesia('ejabberd@REDACTED'): ** ERROR ** (core dumped to file: "/var/lib/ejabberd/MnesiaCore.ejabberd@REDACTED") ** FATAL ** mnesia_controller crashed: {function_clause,[{mnesia_controller,is_tab_blocked,[{blocked,{blocked,[{'ejabberd@REDACTED ',disc_only_copies},{'ejabberd@REDACTED',disc_only_copies}] The exception can be reproduced by the following steps: mnesia:start(), mnesia:create_table(test1, []), mnesia_controller:block_table(test1), mnesia_controller:block_table(test1), mnesia_controller:add_active_replica(test1,node()). I'm preparing a patch to submit. Best Regards, Ahmed Omar -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From isreal-erlang-bugs-at-erlang.org@REDACTED Wed May 1 18:35:35 2013 From: isreal-erlang-bugs-at-erlang.org@REDACTED (David Buckley) Date: Wed, 1 May 2013 17:35:35 +0100 Subject: [erlang-bugs] Bug in unicode characters_to_list trap Message-ID: <20130501163535.GA29904@cirno.fluorescence.co.uk> Simple test session: [ 17:28 ] bucko@REDACTED:~% erl Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false] Eshell V5.9.1 (abort with ^G) 1> <<_, RR/binary>> = <<$a,164,161,$b>>. <<"a??b">> 2> RR. <<"??b">> 3> unicode:characters_to_list(RR). {error,[],<<"a??">>} 4> unicode:characters_to_list(list_to_binary(binary_to_list(RR))). {error,[],<<"??b">>} I'm using Debian's default erlang build, but I've verified the bug on various others, and can't see it in the release notes. Description: The latter two calls should return the same value, as list_to_binary(binary_to_list(RR)) =:= RR. I would guess that the code in erlang's guts is taking the failure offset into the binary part as an offset into the full binary. At least, the return values are consistent with this. Workaround is just to call list_to_binary(binary_to_list()) on your data before calling unicode:characters_to_list on it. Or manually offsetting into the binary yourself in the case of a failed parse. -- David Buckley From pan@REDACTED Thu May 2 10:48:10 2013 From: pan@REDACTED (Patrik Nyblom) Date: Thu, 2 May 2013 10:48:10 +0200 Subject: [erlang-bugs] Bug in unicode characters_to_list trap In-Reply-To: <20130501163535.GA29904@cirno.fluorescence.co.uk> References: <20130501163535.GA29904@cirno.fluorescence.co.uk> Message-ID: <5182284A.50603@erlang.org> Hi David! On 05/01/2013 06:35 PM, David Buckley wrote: > Simple test session: > > [ 17:28 ] bucko@REDACTED:~% erl > Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false] > > Eshell V5.9.1 (abort with ^G) > 1> <<_, RR/binary>> = <<$a,164,161,$b>>. > <<"a??b">> > 2> RR. 
> <<"??b">> > 3> unicode:characters_to_list(RR). > {error,[],<<"a??">>} > 4> unicode:characters_to_list(list_to_binary(binary_to_list(RR))). > {error,[],<<"??b">>} Yep - that's a bug, no doubt... Can you try a source code patch when I've found a cure? > > I'm using Debian's default erlang build, but I've verified the bug on > various others, and can't see it in the release notes. > > Description: The latter two calls should return the same value, as > list_to_binary(binary_to_list(RR)) =:= RR. > > I would guess that the code in erlang's guts is taking the failure offset > into the binary part as an offset into the full binary. At least, the > return values are consistent with this. Good guess, I agree. > > Workaround is just to call list_to_binary(binary_to_list()) on your data > before calling unicode:characters_to_list on it. Or manually offsetting > into the binary yourself in the case of a failed parse. > Thanks! /Patrik From pan@REDACTED Thu May 2 15:56:36 2013 From: pan@REDACTED (Patrik Nyblom) Date: Thu, 2 May 2013 15:56:36 +0200 Subject: [erlang-bugs] Bug in unicode characters_to_list trap In-Reply-To: <5182284A.50603@erlang.org> References: <20130501163535.GA29904@cirno.fluorescence.co.uk> <5182284A.50603@erlang.org> Message-ID: <51827094.8070907@erlang.org> Hi again! On 05/02/2013 10:48 AM, Patrik Nyblom wrote: > Hi David! > > On 05/01/2013 06:35 PM, David Buckley wrote: >> Simple test session: >> >> [ 17:28 ] bucko@REDACTED:~% erl >> Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] >> [async-threads:0] [hipe] [kernel-poll:false] >> >> Eshell V5.9.1 (abort with ^G) >> 1> <<_, RR/binary>> = <<$a,164,161,$b>>. >> <<"a??b">> >> 2> RR. >> <<"??b">> >> 3> unicode:characters_to_list(RR). >> {error,[],<<"a??">>} >> 4> unicode:characters_to_list(list_to_binary(binary_to_list(RR))). >> {error,[],<<"??b">>} > Yep - that's a bug, no doubt... > Can you try a source code patch when I've found a cure? 
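As an aside, the suspected mechanics are easy to model outside Erlang. The sketch below (Python, purely illustrative; the function names and offset bookkeeping are invented for the sketch and are not the actual erts code) shows how applying the failure offset to the base binary instead of to the sub-binary reproduces exactly the values seen in the transcript:

```python
# Model of the suspected bug: RR is a sub-binary of FULL starting at offset 1.
# UTF-8 decoding of RR fails immediately (0xA4 is a continuation byte), so the
# undecoded "rest" should be all 3 bytes of RR itself.
FULL = bytes([ord('a'), 164, 161, ord('b')])  # <<$a,164,161,$b>>
SUB_OFFSET = 1                                # RR = <<164,161,$b>>
RR = FULL[SUB_OFFSET:]

def rest_correct(base, sub_offset, consumed, rest_len):
    # Apply the failure offset relative to the start of the sub-binary.
    start = sub_offset + consumed
    return base[start:start + rest_len]

def rest_buggy(base, sub_offset, consumed, rest_len):
    # Suspected bug: the failure offset is applied to the base binary,
    # ignoring where the sub-binary starts within it.
    return base[consumed:consumed + rest_len]

# Decoding fails at offset 0 of RR with 3 bytes left over:
print(rest_correct(FULL, SUB_OFFSET, 0, 3))  # b'\xa4\xa1b'  == RR, the expected rest
print(rest_buggy(FULL, SUB_OFFSET, 0, 3))    # b'a\xa4\xa1'  == the <<"a??">> seen above
```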
A small patch is attached; the full patch will of course also contain a test case, but this is the minimal fix. It would be great if you could also test it; I will meanwhile prepare a fix in maint... >> >> I'm using Debian's default erlang build, but I've verified the bug on >> various others, and can't see it in the release notes. >> >> Description: The latter two calls should return the same value, as >> list_to_binary(binary_to_list(RR)) =:= RR. >> >> I would guess that the code in erlang's guts is taking the failure offset >> into the binary part as an offset into the full binary. At least, the >> return values are consistent with this. > Good guess, I agree. And, you were absolutely right! >> >> Workaround is just to call list_to_binary(binary_to_list()) on your data >> before calling unicode:characters_to_list on it. Or manually offsetting >> into the binary yourself in the case of a failed parse. >> > Thanks! > /Patrik Cheers, Patrik > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs -------------- next part -------------- A non-text attachment was scrubbed... Name: unicode_rest.diff Type: text/x-patch Size: 765 bytes Desc: not available URL: From bgustavsson@REDACTED Thu May 2 17:42:54 2013 From: bgustavsson@REDACTED (=?UTF-8?Q?Bj=C3=B6rn_Gustavsson?=) Date: Thu, 2 May 2013 17:42:54 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> Message-ID: On Sun, Mar 31, 2013 at 4:22 PM, Anthony Ramine wrote: > > This patch implements this new error and simplifies how v3_core works with > forbidden unsized tail segments in patterns of bit string generators. 
> > git fetch https://github.com/nox/otp illegal-bitstring-gen-pattern > > > https://github.com/nox/otp/compare/erlang:maint...illegal-bitstring-gen-pattern > > https://github.com/nox/otp/compare/erlang:maint...illegal-bitstring-gen-pattern.patch There is a major and a minor issue. The major issue is that the test suites bs_bincomp_SUITE.erl (compiler application) and erl_eval_SUITE.erl (stdlib application) no longer compile. The minor issue is that erl_eval and eval_bits have assertions to reject bad inputs in case the abstract code has not been verified by erl_lint. The assertion can be written like this to reject unsized tails in binary generators: diff --git a/lib/stdlib/src/eval_bits.erl b/lib/stdlib/src/eval_bits.erl index e49cbc1..56be5a6 100644 --- a/lib/stdlib/src/eval_bits.erl +++ b/lib/stdlib/src/eval_bits.erl @@ -193,6 +193,13 @@ bin_gen_field({bin_element,Line,VE,Size0,Options0}, V = erl_eval:partial_eval(VE), NewV = coerce_to_float(V, Type), match_check_size(Mfun, Size1, BBs0), + case Size1 of + {atom,_,all} -> + %% An unsized field is forbidden in a generator. + throw(invalid); + _ -> + ok + end, {value, Size, _BBs} = Efun(Size1, BBs0), bin_gen_field1(Bin, Type, Size, Unit, Sign, Endian, NewV, Bs0, BBs0, Mfun). -- Björn Gustavsson, Erlang/OTP, Ericsson AB -------------- next part -------------- An HTML attachment was scrubbed... URL: From n.oxyde@REDACTED Thu May 2 21:12:34 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Thu, 2 May 2013 21:12:34 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> Message-ID: Hello Björn, Both issues fixed. Please refetch. Regards, -- Anthony Ramine Le 2 mai 2013 à 17:42, Björn Gustavsson a écrit : > There is a major and a minor issue. 
From stefan.zegenhagen@REDACTED Fri May 3 10:34:40 2013 From: stefan.zegenhagen@REDACTED (Stefan Zegenhagen) Date: Fri, 03 May 2013 10:34:40 +0200 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl Message-ID: <1367570080.31752.25.camel@ax-sze> Dear all, I've stumbled across a small issue in the implementation of the process group server. The code in group.erl spawns a server process that monitors the exit of both the shell and the user_drv that started it. In the regular server_loop, exits of the user_drv (Drv) are handled as follows: receive ... {'EXIT',Drv,R} -> exit(R); When a blocking io_request is being executed, the following code is executed instead: %% 'kill' instead of R, since the shell is not always in %% a state where it is ready to handle a termination %% message. exit_shell(kill), exit(R) Besides the behaviour being inconsistent, it also means that our shell process monitor receives the 'killed' exit reason more often than the real exit reason, which defeats our custom error handling and logging. Looking at the comment above the exit_shell(kill) statement, there seems to have been a reason to put it there at some point. Looking at the code in the io module that performs those io_requests, it should not be necessary. I'm unsure whether it is safe to remove the exit_shell(kill) statement or whether something would break terribly. However, not receiving the correct exit reason does give us a headache. Kind regards, -- Dr. Stefan Zegenhagen arcutronix GmbH Garbsener Landstr. 10 30419 Hannover Germany Tel: +49 511 277-2734 Fax: +49 511 277-2709 Email: stefan.zegenhagen@REDACTED Web: www.arcutronix.com *Synchronize the Ethernet* General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht Hannover; Ust-Id: DE257551767. Please consider the environment before printing this message. 
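To make the inconsistency concrete, here is a tiny model (Python, purely illustrative; the function name and the example reasons are invented, and nothing here is OTP code) of what a monitor on the shell observes under the two code paths quoted above:

```python
# Illustrative model: what a process monitoring the shell observes when
# the user_drv exits with reason drv_reason, depending on which group.erl
# code path runs. All names are invented for this sketch.
def observed_shell_exit_reason(drv_reason, in_io_request):
    if in_io_request:
        # Blocking io_request path: group.erl does exit_shell(kill),
        # so the monitor sees 'killed' and drv_reason is lost.
        return 'killed'
    # Regular server_loop path: exit(R) propagates the real reason.
    return drv_reason

print(observed_shell_exit_reason('cable_unplugged', in_io_request=False))  # cable_unplugged
print(observed_shell_exit_reason('cable_unplugged', in_io_request=True))   # killed
```

Whether the exit arrives during an io_request is pure timing from the monitor's point of view, which is why the same underlying event can yield two different logged reasons.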
From bgustavsson@REDACTED Fri May 3 11:18:49 2013 From: bgustavsson@REDACTED (=?UTF-8?Q?Bj=C3=B6rn_Gustavsson?=) Date: Fri, 3 May 2013 11:18:49 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> Message-ID: On Thu, May 2, 2013 at 9:12 PM, Anthony Ramine wrote: > Hello Björn, > > Both issues fixed. Please refetch. > > > No, bs_bincomp_SUITE still does not compile. You have removed the tail/1 function, but not the export of it. Another thing is that the modification of bs_bincomp_SUITE is done in the wrong commit. It should be done in the same commit that makes tails illegal. (That may cause problems when running 'git bisect'.) -- Björn Gustavsson, Erlang/OTP, Ericsson AB -------------- next part -------------- An HTML attachment was scrubbed... URL: From n.oxyde@REDACTED Fri May 3 11:32:13 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Fri, 3 May 2013 11:32:13 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> Message-ID: <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> Hello Björn, Silly me, to validate my changes I ran erts/compiler/test/bs_bincomp_SUITE.erl instead of the one in lib/compiler/test/. I always forget these damn export attributes. Both issues really fixed for good now. Please refetch. Regards, -- Anthony Ramine Le 3 mai 2013 à 11:18, Björn Gustavsson a écrit : > > No, bs_bincomp_SUITE still does not compile. You have removed the tail/1 > function, but not the export of it. 
From mononcqc@REDACTED Fri May 3 16:42:01 2013 From: mononcqc@REDACTED (Fred Hebert) Date: Fri, 3 May 2013 10:42:01 -0400 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <1367570080.31752.25.camel@ax-sze> References: <1367570080.31752.25.camel@ax-sze> Message-ID: <20130503144159.GA57046@ferdair.local> Hi, The reason I could see to have exit_shell(kill) (which in turn calls exit/2 on the shell iff a shell is attached) is that you want, out of all doubt, to get rid of the shell. The group.erl module is entirely distinct from the shell implementation. For example, most shells use shell.erl, but the one used by the SSH daemon has a custom one going that's different, and there are also concepts such as safe shells. In the event that some shell implementation traps exits (and they should be expected to do so if they want to handle the 'interrupt' signal, which is necessary to deal with some ^G commands such as 'i', in any special manner), if the shell is currently blocked in an IO request, it will *never* see the 'EXIT' signal given it is busy waiting on another message, namely the IO Request's response. Because of this specific reason, it might be necessary to kill the shell with the 'kill' signal, which cannot be trapped. We just can't assume that the other shell will receive it. Now granted, I think it could be possible to send both exit signals there (exit_shell(R), exit_shell(kill), exit(R)) just in case in order to allow more obvious exit messages, but I'm not sure it would necessarily be worth it. Someone from the OTP team (or Robert) could voice their opinion there. Regards, Fred. On 05/03, Stefan Zegenhagen wrote: > Dear all, > > I've stumbled across a small issue in the implementation of the process > group server. > > The code in group.erl spawns a server process that monitors the exit of > both the shell and the user_drv that started it. 
In the regular > server_loop, exits of the user_drv (Drv) are handled as follows: > > receive > ... > {'EXIT',Drv,R} -> > exit(R); > > > When a blocking io_request is being executed, the following code is > executed instead: > > %% 'kill' instead of R, since the shell is not always in > %% a state where it is ready to handle a termination > %% message. > exit_shell(kill), > exit(R) > > Besides the behaviour being inconsistent, it also means that our shell > process monitor receives the 'killed' exit reason more often than the > real exit reason, which defeats our custom error handling and logging. > > Looking at the comment above the exit_shell(kill) statement, there seems > to have been a reason to put it there at some time. Looking at the code > in the io module that does those io_requests, it should not be > necessary. > > I'm unsure whether it is safe to remove the exit_shell(kill) statement > or whether something would terribly break. However, not receiving the > correct exit reason does give us a headache. > > > Kind regards, > > -- > Dr. Stefan Zegenhagen > > arcutronix GmbH > Garbsener Landstr. 10 > 30419 Hannover > Germany > > Tel: +49 511 277-2734 > Fax: +49 511 277-2709 > Email: stefan.zegenhagen@REDACTED > Web: www.arcutronix.com > > *Synchronize the Ethernet* > > General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - > Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht > Hannover; Ust-Id: DE257551767. > > Please consider the environment before printing this message. 
> > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs From stefan.zegenhagen@REDACTED Mon May 6 11:25:17 2013 From: stefan.zegenhagen@REDACTED (Stefan Zegenhagen) Date: Mon, 06 May 2013 11:25:17 +0200 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <20130503144159.GA57046@ferdair.local> References: <1367570080.31752.25.camel@ax-sze> <20130503144159.GA57046@ferdair.local> Message-ID: <1367832317.31752.63.camel@ax-sze> Hi, thank you very much for the response. > The group.erl module is entirely distinct from the shell implementation. > For example, most shells use shell.erl, but the one used by the SSH > daemon has a custom one going that's different, and there are also > concepts such as safe shells. In fact, we've written our own shell ;-) and provide a command line interface to view / change device settings after logon. > In the event that some shell implementation traps exits (and they > should be expected to do so if they want to handle the 'interrupt' > signal, which is necessary to deal with some ^G commands such as 'i', in > any special manner), if the shell is currently blocked in an IO request, > it will *never* see the 'EXIT' signal given it is busy waiting on > another message, namely the IO Request's response. > Because of this specific reason, it might be necessary to kill the > shell with the 'kill' signal, which cannot be trapped. We just can't > assume that the other shell will receive it. Unfortunately, this is only half the truth. I/O requests will usually not stop the shell from listening for 'EXIT' messages from the group_leader(), because I/O requests are implemented as a message exchange between those two. Additionally, the io.erl module (which does the I/O requests) is terribly careful not to miss any exit signal sent by the group_leader() / I/O channel. 
It does the following: - create a monitor for the group_leader() (or the supplied I/O channel) - send an {io_request, *} message to the I/O channel - listen for * an {io_reply, *} * the 'DOWN' message from the process monitor * any 'EXIT' message from the I/O channel If any matching 'DOWN' or 'EXIT' message is received, the corresponding opposite is fetched from the message queue as well and {error, terminated} is returned to the caller. This is already bad by itself because it drops the (possibly important) reason of the error. In conclusion, by using the io module for input/output a shell can never get stuck in a state where it is unkillable by doing an I/O request. But it is true that an I/O request blocks the shell for calls/messages from *OTHER* processes than the I/O channel. I can see that it might be wanted to get rid of the shell for sure. One might imagine a case where the shell is trapping exits but "refuses to die" in response to a trappable exit signal. But then, it is not clear to me why the same measure (e.g. exit_shell(kill)) is not taken in the case where the group.erl's server process is *NOT* executing an I/O request right now and the shell might truly be blocked by activities that prevent it from reacting to the exit signal. But back to the original issue: there are several, distinct reasons why we might need to forcibly terminate a shell session *AND* to do appropriate logging IFF a user is currently logged on (for security/auditing reasons), e.g.: - the serial cable is being unplugged while a user is logged on - someone tries to interfere with the system by sending huge amounts of binary data over the serial port (possible denial-of-service) - ... Our user_drv.erl replacement exits with an appropriate reason in those cases and our shell implementation needs to know the exit reason to do the right thing depending on the situation. This is currently impossible and I was wondering whether anything could be done about it. 
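The request loop sketched in the list above can be modeled roughly as follows (Python used as executable pseudocode; the function and message names are invented for the sketch and this is not the io.erl implementation). It also shows exactly where the exit reason gets dropped:

```python
# Illustrative model of the io-protocol client loop described above.
# A deque stands in for the process mailbox; tuples stand in for the
# {io_reply, ...}, 'DOWN' and 'EXIT' messages.
from collections import deque

def io_request(mailbox):
    # After monitoring the I/O channel and sending the io_request,
    # wait for the reply or for the channel going away.
    while mailbox:
        msg = mailbox.popleft()
        tag = msg[0]
        if tag == 'io_reply':
            return ('ok', msg[1])
        if tag in ('DOWN', 'EXIT'):
            # The channel died: the caller gets {error, terminated},
            # and the actual exit reason in msg[1] is discarded here.
            return ('error', 'terminated')
    return ('error', 'no_reply')

print(io_request(deque([('io_reply', 'data')])))         # ('ok', 'data')
print(io_request(deque([('EXIT', 'cable_unplugged')])))  # ('error', 'terminated')
```

The second call illustrates Stefan's complaint: whatever informative reason the channel exited with, the caller only ever sees 'terminated'.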
> Now granted, I think it could be possible to send both exit signals > there (exit_shell(R), exit_shell(kill), exit(R)) just in case in order > to allow more obvious exit messages, but I'm not sure it would > necessarily be worth it. Someone from the OTP team (or Robert) could > voice their opinion there. Whether this works would certainly depend on the timing. The shell process should be given enough time to have a chance to process the first exit signal before being forcibly killed by the second one. Can this be guaranteed? Kind regards, -- Dr. Stefan Zegenhagen arcutronix GmbH Garbsener Landstr. 10 30419 Hannover Germany Tel: +49 511 277-2734 Fax: +49 511 277-2709 Email: stefan.zegenhagen@REDACTED Web: www.arcutronix.com *Synchronize the Ethernet* General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht Hannover; Ust-Id: DE257551767. Please consider the environment before printing this message. From mononcqc@REDACTED Mon May 6 15:34:21 2013 From: mononcqc@REDACTED (Fred Hebert) Date: Mon, 6 May 2013 09:34:21 -0400 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <1367832317.31752.63.camel@ax-sze> References: <1367570080.31752.25.camel@ax-sze> <20130503144159.GA57046@ferdair.local> <1367832317.31752.63.camel@ax-sze> Message-ID: <20130506133420.GA64025@ferdair.local> On 05/06, Stefan Zegenhagen wrote: > > Unfortunately, this is only half the truth. I/O requests will usually > not cause the shell to not listen to 'EXIT' requests from the > group_leader() because I/O requests are implemented as message exchange > between those two. Additionally, the io.erl module (which does the I/O > requests) is terribly careful not to miss any exit signal sent by the > group_leader() / I/O channel. 
It does the following: > - create a monitor for the group_leader() (or the supplied I/O channel) > - send an {io_request, *} message to the I/O channel > - listen for > * an {io_reply, *} > * the 'DOWN' message from the process monitor > * any 'EXIT' message from the I/O channel > > If any matching 'DOWN' or 'EXIT' message is received, the corresponding > opposite is fetched from the message queue as well and {error, > terminated} is returned to the caller. This is already bad by itself > because it drops the (possibly important) reason of the error. > > In conclusion, by using the io module for input/output a shell can never > get stuck in a state where it is unkillable by doing an I/O request. > But it is true that an I/O request blocks the shell for calls/messages > from *OTHER* processes than the I/O channel. The key point here is 'usually'. In practice, with the 'io' module, things are gonna be safe. I think most if not all functions of the 'file' module also make use of the io protocol to write to files through the 'io' module directly and are generally safe for that. However, I'm looking at it only from within the group.erl implementation and the documented protocol at http://erlang.org/doc/apps/stdlib/io_protocol.html (in Erlang, if it's not documented, it doesn't exist). If you're basing yourself only on the protocol, you can't assume the other side will monitor you, although it's probably what any reasonable Erlang programmer would do. I'm guessing that if the shell had documentation and a notice warning for this usage, there would be no argument that could be made against it. > > I can see that it might be wanted to get rid of the shell for sure. One > might imagine a case where the shell is trapping exits but "refuses to > die" in response to a trappable exit signal. But then, it is not clear > to me why the same measure (e.g. 
exit_shell(kill)) is not taken in the > case where the group.erl's server process is *NOT* executing an I/O > request right now and the shell might truly be blocked by activities > that prevent it from reacting to the exit signal. > It is indeed not very clear. My guess would be that you can make assumptions about your part of the communication and protocol, but not about the other side's. A simpler explanation is probably that some time back, there was a problem with either implementation and it was simpler to fix with a kill than by adding other handling code (say, before monitoring was added to the language, but while trap_exits were available). If this is the case, then there would be no reason to keep things the way they are right now IMO, and it would be possible to go with the other exit. > > But back to the original issue: there are several, distinct reasons why > we might need to forcibly terminate a shell session *AND* to do > appropriate logging IFF a user is currently logged on (for > security/auditing reasons), e.g.: > - the serial cable is being unplugged while a user is logged on > - someone tries to interfere with the system by sending huge amounts > of binary data over the serial port (possible denial-of-service) > - ... > > Our user_drv.erl replacement exits with an appropriate reason in those > cases and our shell implementation needs to know the exit reason to do > the right thing depending on the situation. This is currently impossible > and I was wondering whether anything could be done about it. That is definitely a nice use case and I would be personally more open to allowing that than leaving the 'kill' here. 
Ideally this would not need to be written, although it might still be needed if you deal with older implementations after the fix. > > Whether this works would certainly depend on the timing. The shell > process should be given enough time to have a chance to process the > first exit signal before being forcibly killed by the second one. Can > this be guaranteed? > The two-kill approach should work well in the event where the other process is not trapping exits. In that case, the order of signals should be guaranteed, and the first one will kill the process cleanly. If the process is trapping exits, though, then the first (non-kill) signal will be converted to a message and you're very unlikely to have time to process the first one before being killed by the second one. The cleanest solution is obviously to be able to just exit/2 with the right reason. I don't know if the OTP team has managed to transfer all the changelogs relating to the shells when they moved over to git, but I'd be interested to figure out if the exit(Pid,kill) in there is older than monitors -- if so, it would mean that it was probably a workaround for the io module which is no longer necessary today (because it can monitor without altering links or exits being trapped). Regards, Fred. From stefan.zegenhagen@REDACTED Mon May 6 15:54:35 2013 From: stefan.zegenhagen@REDACTED (Stefan Zegenhagen) Date: Mon, 06 May 2013 15:54:35 +0200 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <20130506133420.GA64025@ferdair.local> References: <1367570080.31752.25.camel@ax-sze> <20130503144159.GA57046@ferdair.local> <1367832317.31752.63.camel@ax-sze> <20130506133420.GA64025@ferdair.local> Message-ID: <1367848475.31752.86.camel@ax-sze> Hi, thanks again for the detailed answer. > > I can see that it might be wanted to get rid of the shell for sure. 
One > > might imagine a case where the shell is trapping exits but "refuses to > > die" in response to a trappable exit signal. But then, it is not clear > > to me, why the same measure (e.g. exit_shell(kill)) is not taken in the > > case where the group.erl's server process is *NOT* executing an I/O > > request right now and the shell might truely be blocked by activities > > that prevent it from reacting on the exit signal. > > > > It is indeed not very clear. My guess would be that you can make > assumptions about your part of the communication and protocol, but not > the others. > > A simpler explanation is probably that sometimes back, there was a > problem with either implementation and it was simpler to fix with a kill > than by adding other ways to handling code (say, before monitoring was > added to the language, but while trap_exits were available). > > If this is the case, then there would be no reason to keep things the > way they are right now IMO, and it would be possible to go with the > other exit. I guess we'll have to wait for the OTP team to have a look at this, then ;-) > > But back to the original issue: there are several, discinct reasons why > > we might need to forcedly terminate a shell session *AND* to do an > > appropriate logging IFF a user is currently logged on (for > > security/auditing reasons), e.g.: > > - the serial cable is being unplugged while a user is logged on > > - someone tries to interfere with the system by sending huge amounts > > of binary data over the serial port (possible denial-of-service) > > - ... > > > > Our user_drv.erl replacement exits with an appropriate reason in those > > cases and our shell implementation needs to know the exit reason to do > > the right thing depending on the situation. This is currently impossible > > and I was wondering whether anything could be done about it. > > That is definitely a nice use case and I would be personally more open > to allowing that than leaving the 'kill' here. 
I am however not in the > OTP team, and do not know everything that has to do with the shell, so > this is only my personal opinion. > > A possible workaround if things do not come to fruition would be to add > layers of indirection -- a process that monitors the shell and the > group.erl process and reports the most useful message. Ideally this > would not need to be written, although it might still be needed if you > deal with older implementations after the fix. I had thought of that as well, but tried to avoid that because personally, I do not feel comfortable with this (that our session monitor would need to know the PID of the shell's group leader). But that's merely a matter of taste and if there is need, it can overrule the headaches :-) > > > > Whether this works would certainly depend on the timing. The shell > > process should be given enough time to have a chance to process the > > first exit signal before being forcibly killed by the second one. Can > > this be guaranteed? > > > > The two-kill approach should work well in the event where the other > process is not trapping exits. In that case, the order of signals should > be guaranteed, and the first one will kill the process cleanly. > > If the process is trapping exits, though, then the first (non-kill) > signal will be converted to a message and you're very unlikely to > have time to process the first one before being killed by > the second one. Unfortunately, since we want to provide ^C interrupt possibilities, we need to trap exits. > The cleanest solution is obviously to be able to just exit/2 with the > right reason. 
> > I don't know if the OTP team has managed to transfer all the changelogs > relating to the shells when they moved over to git, but I'd be > interested to figure out if the exit(Pid,kill) in there is older than > monitors -- if so, it would mean that it was probably a workaround for > the io module which is no longer necessary today (because it can monitor > without altering links or exits being trapped). This would be interesting to know, indeed ;-) I'm just wondering if there's a better chance of getting the change if it is made configurable via "io:setopt([{safe_exit_code, true}])"? In any case I would not mind creating the patch. Kind regards, -- Dr. Stefan Zegenhagen arcutronix GmbH Garbsener Landstr. 10 30419 Hannover Germany Tel: +49 511 277-2734 Fax: +49 511 277-2709 Email: stefan.zegenhagen@REDACTED Web: www.arcutronix.com *Synchronize the Ethernet* General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht Hannover; Ust-Id: DE257551767. Please consider the environment before printing this message. From fredrik@REDACTED Tue May 7 13:38:03 2013 From: fredrik@REDACTED (Fredrik) Date: Tue, 7 May 2013 13:38:03 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> Message-ID: <5188E79B.7030501@erlang.org> On 05/03/2013 11:32 AM, Anthony Ramine wrote: > Hello Björn, > > Silly me, to validate my changes I ran erts/compiler/test/bs_bincomp_SUITE.erl instead of the one in lib/compiler/test/. I always forget these damn export attributes. > > Both issues really fixed for good now. > > Please refetch. > > Regards, > Hello Anthony, This patch is failing the small_SUITE:bin_compr test case in the dialyzer application. Could you have a look at it? 
Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From pan@REDACTED Tue May 7 14:23:22 2013 From: pan@REDACTED (Patrik Nyblom) Date: Tue, 7 May 2013 14:23:22 +0200 Subject: [erlang-bugs] Schedulers getting "stuck", part II In-Reply-To: <83533.1367359995@snookles.snookles.com> References: <83533.1367359995@snookles.snookles.com> Message-ID: <5188F23A.7050707@erlang.org> Hi Scott (and Joe)! Thank you for these tests! I would say Joe's comment at the end of the test10 gist says it all, and is spot on: "This isn't just a NIF problem. Any code that sits in C land and doesn't accurately contribute towards scheduler reductions can cause this. So, BIFs that don't estimate work and perform BIF_TRAPs are also bad. Turns out that the commonly used |term_to_binary| and |external_size| BIFs have this problem. " Joe points out a couple of misbehaving BIFs and NIFs which will cause this, breaking the scheduling algorithm. I bet there are more of them. I can see several problems that need to be fixed: 1) OTP should of course not have code (BIFs or NIFs or whatever) that does not even bump reductions or trap properly. 2) If writing NIFs, you should have a way to monitor the scheduler behavior to easily find long schedules. DTrace is nice, but not available everywhere... 3) If writing NIFs, you should have a simple way to put the execution of your code in a separate worker thread. The answer to (1) is that we continue (or intensify) our work when it comes to adding proper reductions and trapping to BIFs (and NIFs). A first step would be to just add proper reductions to all relevant BIFs, which is fairly easy to do. Whenever there's a BIF whose work depends on the size of the input, it should also at least add a proportional cost to the process. Some old BIFs do not even do that, which really needs to be fixed. Contributions are always welcome... 
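As a user-level illustration of that "proportional cost" idea (the real fix belongs inside the BIFs themselves), a wrapper can charge the calling process extra reductions with erlang:bump_reductions/1 before size-dependent work; the one-reduction-per-64-bytes granularity below is an assumed cost model, not a measured one:

```erlang
%% Sketch (invented module name): charge a reduction cost proportional
%% to the input size before doing size-dependent work, so the scheduler
%% sees the work even if the operation itself runs uninterrupted.
-module(fair_work).
-export([checksum/1]).

-spec checksum(binary()) -> non_neg_integer().
checksum(Bin) when is_binary(Bin) ->
    erlang:bump_reductions(max(1, byte_size(Bin) div 64)),
    erlang:crc32(Bin).   %% stands in for any costly size-dependent call
```

The same pattern applies around a long-running NIF call, although the proper fix for NIFs is to chunk the work or move it off the scheduler thread.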
term_to_binary and external_size are already being worked on, but there are most probably more problem BIFs out there... One step towards (2) is the ability to monitor long schedules in the system. I've extended erlang:system_monitor/2 to have an option to monitor all schedules and port operations that run for more than a specified amount of wall clock time. That should at least help in identifying such problems (the code is not in maint yet, but will be soon). More monitoring options to see the scheduler behavior may be needed, but this is at least a start. As an example, monitoring long schedules in test10 will inform you that the processes run uninterrupted for a whopping 1.5 *seconds*. Just adding reduction cost to the md5 calls will reduce this to a tenth of the scheduling time of course. The answer to (3) is "dirty schedulers", which is in the roadmap for R17. I think all three things need to be done for the scheduling to work properly, but not only for that. A schedule that takes too long also breaks real-time properties of the VM, so fixing this by poking the schedulers to wake up at certain intervals just handles one symptom, but does not remove the cause and does not cure the impact on real-time behavior... So - it's not the scheduling algorithms as such that result in this problem, it's still a problem with uninterrupted C code. These examples show that some (or many) of our BIFs need to be fixed, that we need to intensify the work on monitoring options and that we need dirty schedulers. At least that's how I see it. Cheers, Patrik On 05/01/2013 12:13 AM, Scott Lystig Fritchie wrote: > Patrik, there are a couple of synthetic load cases that have an end > result of what we occasionally see Riak and Riak CS doing in the wild. > Many, many thanks to Joseph Blomstedt for inventing these two modules. 
> > test10.erl: > https://gist.github.com/jtuple/0d9ca553b7e58adcb6f4 > test11.erl: > https://gist.github.com/jtuple/8f12ce9c21471f5d6f01 > > Both can be used by running the 'go/0' function. > > The test10:go() function creates an oscillation between a couple of > workloads: one that tends toward scheduler collapse, and one that tends > to wake them up again. > > The test11:go() function uses only a single load that tends toward > scheduler collapse. > > Both of them fail mostly regularly on my 8 core MBP using R15B01, > R15B03, and R16B. > > The io:format() messages are sent while load is not running, with very > generous pauses before starting the next phase of workload. If you call > io:format() during unfairly-scheduled workload (which these tests excel > at doing), the messages can be delayed by dozens of seconds. > > Note that these synthetic tests are using two different functions to > cause scheduler collapse: test10.erl with crypto:md5_update/2, a NIF, > and test11.erl with erlang:external_size/1, a BIF. It's quite likely > that erlang:term_to_binary/1 is similarly effective/buggy. > > Neither of them fails when using this patch on any of those three VM > versions: > > https://github.com/slfritchie/otp/compare/erlang:maint...disable-scheduler-sleeps > or > https://github.com/slfritchie/otp/tree/disable-scheduler-sleeps > > ... when also using "+scl false +zdnfgtse 500:500". > > -Scott -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From n.oxyde@REDACTED Tue May 7 14:37:23 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Tue, 7 May 2013 14:37:23 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: <5188E79B.7030501@erlang.org> References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> <5188E79B.7030501@erlang.org> Message-ID: <63BB7979-D344-4C71-99C1-2399EB604F58@gmail.com> Hello, I can't build the PLT anymore for stdlib because it complains that it (obviously) can't find the abstract code in the BEAM file. I have two solutions: * I can write a script that takes the .beam file compiled from the .S file and a .abstr file compiled from a kinda-equivalent .erl file and uses beam_lib to add an abstract code chunk. * I can make Dialyzer ignore BEAM files for which there is 'from_asm' in the compile options. Would you be against such a modification, Kostis? Regards, -- Anthony Ramine On 7 May 2013, at 13:38, Fredrik wrote: > On 05/03/2013 11:32 AM, Anthony Ramine wrote: >> Hello Björn, >> >> Silly me, to validate my changes I ran erts/compiler/test/bs_bincomp_SUITE.erl instead of the one in lib/compiler/test/. I always forget these damn export attributes. >> >> Both issues really fixed for good now. >> >> Please refetch. >> >> Regards, >> > Hello Anthony, > This patch is failing the small_SUITE:bin_compr test case in the dialyzer application. > Could you have a look at it? 
> Thanks, > > -- > > BR Fredrik Gustafsson > Erlang OTP Team > From n.oxyde@REDACTED Tue May 7 21:32:52 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Tue, 7 May 2013 21:32:52 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: <5188E79B.7030501@erlang.org> References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> <5188E79B.7030501@erlang.org> Message-ID: Hello Fredrik, I removed the Dialyzer patch as it tests a now-forbidden expression. Please refetch. Regards, -- Anthony Ramine On 7 May 2013, at 13:38, Fredrik wrote: > On 05/03/2013 11:32 AM, Anthony Ramine wrote: >> Hello Björn, >> >> Silly me, to validate my changes I ran erts/compiler/test/bs_bincomp_SUITE.erl instead of the one in lib/compiler/test/. I always forget these damn export attributes. >> >> Both issues really fixed for good now. >> >> Please refetch. >> >> Regards, >> > Hello Anthony, > This patch is failing the small_SUITE:bin_compr test case in the dialyzer application. > Could you have a look at it? > Thanks, > > -- > > BR Fredrik Gustafsson > Erlang OTP Team > From fredrik@REDACTED Wed May 8 10:18:45 2013 From: fredrik@REDACTED (Fredrik) Date: Wed, 8 May 2013 10:18:45 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> <5188E79B.7030501@erlang.org> Message-ID: <518A0A65.7020206@erlang.org> On 05/07/2013 09:32 PM, Anthony Ramine wrote: > Hello Fredrik, > > I removed the Dialyzer patch as it tests a now-forbidden expression. Please refetch. > > Regards, > Hello Anthony, Re-fetched. Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From snar@REDACTED Wed May 8 12:32:13 2013 From: snar@REDACTED (Alexandre Snarskii) Date: Wed, 8 May 2013 14:32:13 +0400 Subject: [erlang-bugs] minor bug in erl_interface. Message-ID: <20130508103213.GB46550@snar.spb.ru> Hi! During valgrinding freeswitch compiled with mod_erlang_event valgrind output was flooded with messages like the following: ==96247== Warning: invalid file descriptor -2 in syscall close() ==96247== at 0x7CA9D3: __sys_close (in /usr/lib32/libc.so.7) ==96247== by 0xBC66F3: ei_accept_tmo (in /usr/local/lib/freeswitch/mod/mod_er lang_event.so) ==96247== by 0xBBF34F: mod_erlang_event_runtime (mod_erlang_event.c:1957) ==96247== by 0x145DF8: switch_loadable_module_exec (switch_loadable_module.c: 98) ==96247== by 0x1F17B3: dummy_worker (thread.c:138) ==96247== by 0x367F19: ??? (in /usr/lib32/libthr.so.3) According to sources (./lib/erl_interface/src/connect/ei_connect.c), -2 is a timeout indication from ei_accept_t: if ((fd = ei_accept_t(lfd, (struct sockaddr*) &cli_addr, &cli_addr_len, ms )) < 0) { EI_TRACE_ERR0("ei_accept","<- ACCEPT socket accept failed"); erl_errno = (fd == -2) ? ETIMEDOUT : EIO; goto error; } [....] 
error: EI_TRACE_ERR0("ei_accept","<- ACCEPT failed"); closesocket(fd); return ERL_ERROR; } /* ei_accept */ and closesocket on unix systems is defined as just close(2), so any timeout or error on accept causes closing invalid file descriptor. Patch is obvious: EI_TRACE_ERR0("ei_accept","<- ACCEPT failed"); - closesocket(fd); + if (fd>=0) + closesocket(fd); return ERL_ERROR; -- In theory, there is no difference between theory and practice. But, in practice, there is. From fredrik@REDACTED Wed May 8 15:44:24 2013 From: fredrik@REDACTED (Fredrik) Date: Wed, 8 May 2013 15:44:24 +0200 Subject: [erlang-bugs] minor bug in erl_interface. In-Reply-To: <20130508103213.GB46550@snar.spb.ru> References: <20130508103213.GB46550@snar.spb.ru> Message-ID: <518A56B8.9000907@erlang.org> On 05/08/2013 12:32 PM, Alexandre Snarskii wrote: > Hi! > > During valgrinding freeswitch compiled with mod_erlang_event > valgrind output was flooded with messages like the following: > > ==96247== Warning: invalid file descriptor -2 in syscall close() > ==96247== at 0x7CA9D3: __sys_close (in /usr/lib32/libc.so.7) > ==96247== by 0xBC66F3: ei_accept_tmo (in /usr/local/lib/freeswitch/mod/mod_er > lang_event.so) > ==96247== by 0xBBF34F: mod_erlang_event_runtime (mod_erlang_event.c:1957) > ==96247== by 0x145DF8: switch_loadable_module_exec (switch_loadable_module.c: > 98) > ==96247== by 0x1F17B3: dummy_worker (thread.c:138) > ==96247== by 0x367F19: ??? (in /usr/lib32/libthr.so.3) > > According to sources (./lib/erl_interface/src/connect/ei_connect.c), -2 is > a timeout indication from ei_accept_t: > > if ((fd = ei_accept_t(lfd, (struct sockaddr*)&cli_addr, > &cli_addr_len, ms ))< 0) { > EI_TRACE_ERR0("ei_accept","<- ACCEPT socket accept failed"); > erl_errno = (fd == -2) ? ETIMEDOUT : EIO; > goto error; > } > [....] 
> error: > EI_TRACE_ERR0("ei_accept","<- ACCEPT failed"); > closesocket(fd); > return ERL_ERROR; > } /* ei_accept */ > > and closesocket on unix systems is defined as just close(2), so any > timeout or error on accept causes closing invalid file descriptor. > > Patch is obvious: > > EI_TRACE_ERR0("ei_accept","<- ACCEPT failed"); > - closesocket(fd); > + if (fd>=0) > + closesocket(fd); > return ERL_ERROR; > Hello Alexandre, I am making a patch out of this and putting it into testing. Thanks for noticing and reporting :) -- BR Fredrik Gustafsson Erlang OTP Team From n.oxyde@REDACTED Thu May 9 15:03:08 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Thu, 9 May 2013 15:03:08 +0200 Subject: [erlang-bugs] Properly guard WIDE_TAG use with HAVE_WCWIDTH in ttsl_drv Message-ID: Hello, I forgot to guard two lines of code where WIDE_TAG is used, crashing the compile process if wcwidth() is unavailable. git fetch https://github.com/nox/otp.git fix-wcwidth https://github.com/nox/otp/compare/erlang:maint...fix-wcwidth https://github.com/nox/otp/compare/erlang:maint...fix-wcwidth.patch Regards, -- Anthony Ramine From robert.virding@REDACTED Thu May 9 22:13:33 2013 From: robert.virding@REDACTED (Robert Virding) Date: Thu, 9 May 2013 21:13:33 +0100 (BST) Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <1367848475.31752.86.camel@ax-sze> Message-ID: <420733710.105442810.1368130413119.JavaMail.root@erlang-solutions.com> The shell might be trapping exits and ignore exit messages or running code which does the same. The minimal case: 1> process_flag(trap_exit, true). false In which case the only guaranteed method to kill it is to send the kill signal. Unfortunately you cannot (must not) assume that code is well-behaved even if it has been written with the best intentions. Erlang's error handling mechanism is based on this assumption. 
Robert ----- Original Message ----- > From: "Stefan Zegenhagen" > To: "Fred Hebert" > Cc: erlang-bugs@REDACTED > Sent: Monday, 6 May, 2013 3:54:35 PM > Subject: Re: [erlang-bugs] Strange thing in lib/kernel/src/group.erl > > Hi, > > > thanks again for the detailed answer. > > > > > I can see that it might be wanted to get rid of the shell for > > > sure. One > > > might imagine a case where the shell is trapping exits but > > > "refuses to > > > die" in response to a trappable exit signal. But then, it is not > > > clear > > > to me, why the same measure (e.g. exit_shell(kill)) is not taken > > > in the > > > case where the group.erl's server process is *NOT* executing an > > > I/O > > > request right now and the shell might truely be blocked by > > > activities > > > that prevent it from reacting on the exit signal. > > > > > > > It is indeed not very clear. My guess would be that you can make > > assumptions about your part of the communication and protocol, but > > not > > the others. > > > > A simpler explanation is probably that sometimes back, there was a > > problem with either implementation and it was simpler to fix with a > > kill > > than by adding other ways to handling code (say, before monitoring > > was > > added to the language, but while trap_exits were available). > > > > If this is the case, then there would be no reason to keep things > > the > > way they are right now IMO, and it would be possible to go with the > > other exit. 
> > I guess we'll have to wait for the OTP team to have a look at this, > then ;-) > > > > > But back to the original issue: there are several, discinct > > > reasons why > > > we might need to forcedly terminate a shell session *AND* to do > > > an > > > appropriate logging IFF a user is currently logged on (for > > > security/auditing reasons), e.g.: > > > - the serial cable is being unplugged while a user is logged on > > > - someone tries to interfere with the system by sending huge > > > amounts > > > of binary data over the serial port (possible > > > denial-of-service) > > > - ... > > > > > > Our user_drv.erl replacement exits with an appropriate reason in > > > those > > > cases and our shell implementation needs to know the exit reason > > > to do > > > the right thing depending on the situation. This is currently > > > impossible > > > and I was wondering whether anything could be done about it. > > > > That is definitely a nice use case and I would be personally more > > open > > to allowing that than leaving the 'kill' here. I am however not in > > the > > OTP team, and do not know everything that has to do with the shell, > > so > > this is only my personal opinion. > > > > A possible workaround if things do not come to fruition would be to > > add > > layers of indirection -- a process that monitors the shell and the > > group.erl process and reports the most useful message. Ideally this > > would not need to be written, although it might still be needed if > > you > > deal with older implementations after the fix. > > I had thought of that as well, but tried to avoid that because > personally, I do not feel comfortable with this (that our session > monitor would need to know the PID of the shell's group leader). But > that's merely a matter of taste and if there is need, it can overrule > the headaches :-) > > > > > > > Whether this works would certainly depend on the timing. 
The > > > shell > > > process should be given enough time to have a chance to process > > > the > > > first exit signal before being forcedly killed by the second one. > > > Can > > > this be guaranteed? > > > > > > > The two-kill approach should work well in the event where the other > > process is not trapping exits. In that case, the order of signals > > should > > be guaranteed, and the first one will kill the process cleanly. > > > > If the process is trapping exits, though, then the first (non-kill) > > signal will be converted to a message and you're absolutely > > unlikely to > > be able to have the time to process the first one before being > > killed by > > the second one. > > Unfortunately, since we want to provide +C interrupt > possibilities, we need to trap exits. > > > > The cleanest solution is obviously to be able to just exit/2 with > > the > > right reason. > > > > I don't know if the OTP team has managed to transfer all the > > changelogs > > relating to the shells when they moved over to git, but I'd be > > interested to figure out if the exit(Pid,kill) in there is older > > than > > monitors -- if so, it would mean that it was probably a workaround > > for > > the io module which is no longer necessary today (because it can > > monitor > > without altering links or exits being trapped). > > > This would be interesting to know, indeed ;-) > > I'm just wondering if there's a better chance of getting the change > if > it is made configurable via "io:setopt([{safe_exit_code, true}])"? In > any case I would not mind to create the patch. > > > Kind regards, > -- > Dr. Stefan Zegenhagen > > arcutronix GmbH > Garbsener Landstr. 10 > 30419 Hannover > Germany > > Tel: +49 511 277-2734 > Fax: +49 511 277-2709 > Email: stefan.zegenhagen@REDACTED > Web: www.arcutronix.com > > *Synchronize the Ethernet* > > General Managers: Dipl. Ing. Juergen Schroeder, Dr. 
Josef Gfrerer - > Legal Form: GmbH, Registered office: Hannover, HRB 202442, > Amtsgericht > Hannover; Ust-Id: DE257551767. > > Please consider the environment before printing this message. > > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs > From stefan.zegenhagen@REDACTED Fri May 10 10:23:51 2013 From: stefan.zegenhagen@REDACTED (Stefan Zegenhagen) Date: Fri, 10 May 2013 10:23:51 +0200 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <420733710.105442810.1368130413119.JavaMail.root@erlang-solutions.com> References: <420733710.105442810.1368130413119.JavaMail.root@erlang-solutions.com> Message-ID: <1368174231.31752.112.camel@ax-sze> Dear Robert, > The shell might be trapping exits and ignore exit messages or running code which does the same. The minimal case: > > 1> process_flag(trap_exit, true). > false > > In which case the only guaranteed method to kill it is to send the kill signal. Unfortunately you cannot (must not) assume that code is well-behaved even if it has been written with the best intentions. Erlang's error handling mechanism is based on this assumption. I fully agree with you. There's just two things that are difficult for me to understand and indicate that a change to the current behaviour might be necessary: 1) The error handling isn't consistently that rigorous. In my opinion, it's the less critical path that uses the definite exit path, whereas other exit paths simply assume that the client code is well-behaved. 2) Even for well-behaved code there is *ALWAYS* a penalty by not being able to reliably retrieve the exit reason. There may be solutions to the problem that satisfy all needs. Two spring into my mind almost immediately: a) Make the behaviour configurable by introducing an io:setopt() option, but let the default behaviour as-is. b) Assume that the code is well-behaved. 
Send a regular exit signal with the correct reason and check that the shell process really exits. If it does not exit within a certain amount of time, forcibly kill it. I would be willing to prepare a patch, but before doing so, I wanted to get an overview of the possible solutions and which of them might be acceptable to the Erlang community. Kind regards, -- Dr. Stefan Zegenhagen arcutronix GmbH Garbsener Landstr. 10 30419 Hannover Germany Tel: +49 511 277-2734 Fax: +49 511 277-2709 Email: stefan.zegenhagen@REDACTED Web: www.arcutronix.com *Synchronize the Ethernet* General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht Hannover; Ust-Id: DE257551767. Please consider the environment before printing this message. From n.oxyde@REDACTED Sun May 12 17:36:45 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Sun, 12 May 2013 17:36:45 +0200 Subject: [erlang-bugs] Lift limitation to FD_SETSIZE file descriptors on Mac OS X in erl_poll Message-ID: <5B66A627-9663-4745-8282-9A4BFAB23229@gmail.com> Hello, I've written a patch that makes erl_poll use _DARWIN_UNLIMITED_SELECT on Mac OS X. This constant makes select() work with more than FD_SETSIZE file descriptors, all that is needed is to manually manage the fd_set values. I've run the port_SUITE.iter_max_ports test case and got a maximum of 2422 ports instead of 502 before. Cc'ing Joel Reymont and Max Lapshin because I know they both encountered that problem. 
git fetch https://github.com/nox/otp.git darwin-unlimited-select https://github.com/nox/otp/compare/erlang:maint...darwin-unlimited-select https://github.com/nox/otp/compare/erlang:maint...darwin-unlimited-select.patch Regards, -- Anthony Ramine From fredrik@REDACTED Mon May 13 10:10:35 2013 From: fredrik@REDACTED (Fredrik) Date: Mon, 13 May 2013 10:10:35 +0200 Subject: [erlang-bugs] [erlang-patches] Properly guard WIDE_TAG use with HAVE_WCWIDTH in ttsl_drv In-Reply-To: References: Message-ID: <51909FFB.3050908@erlang.org> On 05/09/2013 03:03 PM, Anthony Ramine wrote: > Hello, > > I forgot to guard two lines of code where WIDE_TAG is used, crashing the compile process if wcwidth() is unavailable. > > git fetch https://github.com/nox/otp.git fix-wcwidth > > https://github.com/nox/otp/compare/erlang:maint...fix-wcwidth > https://github.com/nox/otp/compare/erlang:maint...fix-wcwidth.patch > > Regards, > Hello Anthony, I've fetched your branch, it should be visible in the 'pu' branch shortly. Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From fredrik@REDACTED Mon May 13 10:27:46 2013 From: fredrik@REDACTED (Fredrik) Date: Mon, 13 May 2013 10:27:46 +0200 Subject: [erlang-bugs] [erlang-patches] Lift limitation to FD_SETSIZE file descriptors on Mac OS X in erl_poll In-Reply-To: <5B66A627-9663-4745-8282-9A4BFAB23229@gmail.com> References: <5B66A627-9663-4745-8282-9A4BFAB23229@gmail.com> Message-ID: <5190A402.4090406@erlang.org> On 05/12/2013 05:36 PM, Anthony Ramine wrote: > Hello, > > I've written a patch that makes erl_poll uses _DARWIN_UNLIMITED_SELECT on Mac OS X. This constant makes select() work with more than FD_SETSIZE file descriptors, all that is needed is to manually manage the fd_set values. > > I've run port_SUITE.iter_max_ports test case and got a maximum of 2422 ports instead of 502 before. > > Cc'ing Joel Reymont and Max Lapshin because I know they both encountered that problem. 
> > git fetch https://github.com/nox/otp.git darwin-unlimited-select > > https://github.com/nox/otp/compare/erlang:maint...darwin-unlimited-select > https://github.com/nox/otp/compare/erlang:maint...darwin-unlimited-select.patch > > Regards, > Hello Anthony, I've fetched your branch and it is now located in the 'pu' branch. Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From erlangsiri@REDACTED Mon May 13 17:41:43 2013 From: erlangsiri@REDACTED (Siri Hansen) Date: Mon, 13 May 2013 17:41:43 +0200 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: <83357CE5-7BFB-4857-82ED-33AC842ACBD8@gmail.com> References: <3161FF70-B6D0-4565-8664-2FCB9F96E08D@gmail.com> <83357CE5-7BFB-4857-82ED-33AC842ACBD8@gmail.com> Message-ID: Bryan and Tim, your analysis is very good, and the problem is complicated. I don't see a "watertight" solution right now, and I cannot spend too much time pondering without having a real priority for this case. I have written a ticket for it, and it will be prioritized along with all other backlog items. Any further thoughts and contributions will be very much appreciated :) Thanks again /siri 2013/4/30 Tim Watson > Hi Bryan, > > On 30 Apr 2013, at 18:34, Bryan Fink wrote: > > > But twiddling the timing there is just as racy, as you've noticed, right? > > > Correct. The length of the timeout is irrelevant. The EXIT signal is > not guaranteed to arrive within any specific amount of time. > > > Indeed. Almost a halting problem this isn't it. :) > > > Isn't the point that the EXIT signal might /never/ come, if the child > un-links, or might come *after* the 'DOWN' if the race you've located > occurs? Surely you've got to be able to handle either case? > > > Yes, the point of the monitor is to handle the case where the EXIT > never comes (because the child unlinks). It is not the case, however, > that the EXIT always arrives after the DOWN in the race I'm seeing. > They might both be delayed. 
> > Waiting without a timeout for the 'DOWN' is acceptable, because you've got > a guarantee (via the runtime) that it *will* arrive, no matter what state > the target process was in when you created the monitor. Waiting some > arbitrary time for the 'EXIT' is a real problem though, because you could > wait forever. > > Handling either order is important, but the problem with this race is > that only the EXIT message contains the actual exit reason when this > happens. The 'noproc' in the DOWN is just saying that there was no > process to monitor. > > > Indeed. But it could equally be true that the 'EXIT' signal was never > dispatched, because the child process unlinked before it died; you can't > wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the > reason, so now you've got to choose how long to wait, but whatever timing > works for one particular case isn't going to solve the general problem. > > > We ran into something similar with our supervisor2 fork a while back, > whilst terminating (multiple) simple children: > http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is > somewhat different though, not only because it was terminating multiple > children (during shutdown) but also because it explicitly unlinks from the > child *after* creating the monitor, and /still/ allowed for an EXIT signal > to have made its way into the mailbox unexpectedly. 
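The monitor-then-unlink flow under discussion, and the point where the 'noproc' race loses the real exit reason, can be sketched roughly like this (a simplified model with invented names, not the actual supervisor code; it assumes the caller traps exits and is linked to Pid beforehand):

```erlang
%% Simplified model of monitor-then-unlink child shutdown.
-module(shutdown_sketch).
-export([await_exit/1]).

await_exit(Pid) ->
    Ref = erlang:monitor(process, Pid),
    unlink(Pid),
    receive
        {'DOWN', Ref, process, Pid, noproc} ->
            %% Child was already dead when the monitor was created.
            %% The 'EXIT' with the real reason may be queued, may still
            %% be in flight, or may never arrive (if the child had
            %% unlinked) -- this is the race described in the thread.
            receive
                {'EXIT', Pid, Reason} -> {ok, Reason}
            after 0 ->
                {ok, unknown}          %% real exit reason is lost
            end;
        {'DOWN', Ref, process, Pid, Reason} ->
            %% Flush any 'EXIT' already queued from the earlier link.
            receive {'EXIT', Pid, _} -> ok after 0 -> ok end,
            {ok, Reason}
    end.
```

Whatever timeout replaces the `after 0` only moves the race, which is the crux of the thread.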
> > Yes that's definitely true and we were aware of that problem, however > since we know we cannot wait for the 'EXIT' forever and whatever arbitrary > timeout we choose is just someone else's race condition, we decided that if > the EXIT signal wasn't delivered expediently to the process' mailbox, that > losing the real exit reason was something we could live with in the worst > case. > > Since we've started merging the R15/R16 changes in though, that code has > disappeared so we're in the same boat as you guys. :) > > Cheers, > Tim > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From erlangsiri@REDACTED Tue May 14 16:43:14 2013 From: erlangsiri@REDACTED (Siri Hansen) Date: Tue, 14 May 2013 16:43:14 +0200 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: References: <3161FF70-B6D0-4565-8664-2FCB9F96E08D@gmail.com> <83357CE5-7BFB-4857-82ED-33AC842ACBD8@gmail.com> Message-ID: Just a thought: would it be an option (and would it help) to monitor each child from birth? /siri 2013/5/13 Siri Hansen > Bryan and Tim, your analysis is very good, and the problem is complicated. > I don't see a "watertight" solution right now, and I cannot spend too > much time pondering without having a real priority for this case. I have > written a ticket for it, and it will be prioritized along with all other > backlog items. Any further thoughts and contributions will be very much > appreciated :) > Thanks again > /siri > > > 2013/4/30 Tim Watson > >> Hi Bryan, >> >> On 30 Apr 2013, at 18:34, Bryan Fink wrote: >> >> >> But twiddling the timing there is just as racy, as you've noticed, right? >> >> >> Correct. The length of the timeout is irrelevant. The EXIT signal is >> not guaranteed to arrive within any specific amount of time. >> >> >> Indeed. Almost a halting problem this isn't it. 
:) >> >> >> Isn't the point that the EXIT signal might /never/ come, if the child >> un-links, or might come *after* the 'DOWN' if the race you've located >> occurs? Surely you've got to be able to handle either case? >> >> >> Yes, the point of the monitor is to handle the case where the EXIT >> never comes (because the child unlinks). It is not the case, however, >> that the EXIT always arrives after the DOWN in the race I'm seeing. >> They might both be delayed. >> >> >> Waiting without a timeout for the 'DOWN' is acceptable, because you've >> got a guarantee (via the runtime) the it *will* arrive, no matter what >> state the target process was in when you created the monitor. Waiting some >> arbitrary time for the 'EXIT' is a real problem though, because you could >> wait forever. >> >> Handling either order is important, but the problem with this race is >> that only the EXIT message contains the actual exit reason when this >> happens. The 'noproc' in the DOWN is just saying that there was no >> process to monitor. >> >> >> Indeed. But it could equally be true that the 'EXIT' signal was never >> dispatched, because the child process unlinked before it died; You can't >> wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the >> reason, so now you've got to choose how long to wait, but whatever timing >> works for one particular case isn't going to solve the general problem. >> >> >> We ran into something similar with our supervisor2 fork a while back, >> whilst terminating (multiple) simple children: >> http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is >> somewhat different though, not only because it was terminating multiple >> children (during shutdown) but also because it explicitly unlinks from the >> child *after* creating the monitor, and /still/ allowed for an EXIT signal >> to have made its way into the mailbox unexpectedly. 
>> >> >> The monitor_child/1 function also unlinks from the child after >> creating the monitor. That patch looks a little bit like the fixes I >> was trying. Basically it's checking for an EXIT message after >> receiving the DOWN, just in case one is in the mailbox, yes? >> >> >> That's correct. >> >> The problem is that it might still miss an EXIT, because it might still >> not have arrived yet, even though it will later. >> >> >> Yes that's definitely true and we were aware of that problem, however >> since we know we cannot wait for the 'EXIT' forever and whatever arbitrary >> timeout we choose is just someone else's race condition, we decided that if >> the EXIT signal wasn't delivered expediently to the process' mailbox, that >> loosing the real exit reason was something we could live with in the worst >> case. >> >> Since we've started merging the R15/R16 changes in though, that code has >> disappeared so we're in the same boat as you guys. :) >> >> Cheers, >> Tim >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From watson.timothy@REDACTED Wed May 15 10:03:31 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Wed, 15 May 2013 09:03:31 +0100 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: References: <3161FF70-B6D0-4565-8664-2FCB9F96E08D@gmail.com> <83357CE5-7BFB-4857-82ED-33AC842ACBD8@gmail.com> Message-ID: <05D1CF78-A894-4C1C-8848-4F98F80707EF@gmail.com> Switching to monitors is, IMHO a better approach, since using both is prone to races and links are open to be interfered with. Are there any disadvantages I've not thought of though? Or are you suggesting to do both from birth? On 14 May 2013, at 15:43, Siri Hansen wrote: > Just a thought: would it be an option (and would it help) to monitor each child from birth? > /siri > > > 2013/5/13 Siri Hansen > Bryan and Tim, your analysis is very good, and the problem is complicated. 
I don't see a "water tight" solution right now, and I can not spend too much time pondering without having a real priority for this case. I have written a ticket for it, and it will be prioritized along with all other backlog items. Any further thoughts and contributions will be very much appreciated :) > Thanks again > /siri > > > 2013/4/30 Tim Watson > Hi Bryan, > > On 30 Apr 2013, at 18:34, Bryan Fink wrote: >>> >>> But twiddling the timing there is just as racy, as you've noticed, right? >> >> Correct. The length of the timeout is irrelevant. The EXIT signal is >> not guaranteed to arrive within any specific amount of time. >> > > Indeed. Almost a halting problem this isn't it. :) > >>> >>> Isn't the point that the EXIT signal might /never/ come, if the child un-links, or might come *after* the 'DOWN' if the race you've located occurs? Surely you've got to be able to handle either case? >> >> Yes, the point of the monitor is to handle the case where the EXIT >> never comes (because the child unlinks). It is not the case, however, >> that the EXIT always arrives after the DOWN in the race I'm seeing. >> They might both be delayed. >> > > Waiting without a timeout for the 'DOWN' is acceptable, because you've got a guarantee (via the runtime) the it *will* arrive, no matter what state the target process was in when you created the monitor. Waiting some arbitrary time for the 'EXIT' is a real problem though, because you could wait forever. > >> Handling either order is important, but the problem with this race is >> that only the EXIT message contains the actual exit reason when this >> happens. The 'noproc' in the DOWN is just saying that there was no >> process to monitor. > > Indeed. 
But it could equally be true that the 'EXIT' signal was never dispatched, because the child process unlinked before it died; You can't wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the reason, so now you've got to choose how long to wait, but whatever timing works for one particular case isn't going to solve the general problem. > >> >>> We ran into something similar with our supervisor2 fork a while back, whilst terminating (multiple) simple children: http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is somewhat different though, not only because it was terminating multiple children (during shutdown) but also because it explicitly unlinks from the child *after* creating the monitor, and /still/ allowed for an EXIT signal to have made its way into the mailbox unexpectedly. >> >> The monitor_child/1 function also unlinks from the child after >> creating the monitor. That patch looks a little bit like the fixes I >> was trying. Basically it's checking for an EXIT message after >> receiving the DOWN, just in case one is in the mailbox, yes? > > That's correct. > >> The problem is that it might still miss an EXIT, because it might still >> not have arrived yet, even though it will later. >> > > Yes that's definitely true and we were aware of that problem, however since we know we cannot wait for the 'EXIT' forever and whatever arbitrary timeout we choose is just someone else's race condition, we decided that if the EXIT signal wasn't delivered expediently to the process' mailbox, that loosing the real exit reason was something we could live with in the worst case. > > Since we've started merging the R15/R16 changes in though, that code has disappeared so we're in the same boat as you guys. :) > > Cheers, > Tim > > > -------------- next part -------------- An HTML attachment was scrubbed... 
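[Editor's sketch, not part of the original thread: the shutdown pattern being discussed above, with module and function names that are mine rather than OTP's. After a 'DOWN' with reason 'noproc', the mailbox is polled for a racing 'EXIT' to recover the real exit reason; the `after 0` clause is exactly the window through which the reported race can still slip, because the 'EXIT' may arrive later or never.]

```erlang
-module(race_sketch).
-export([shutdown/2]).

%% Terminate Pid and try to report its true exit reason.
shutdown(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    unlink(Pid),
    exit(Pid, shutdown),
    receive
        {'DOWN', Ref, process, Pid, noproc} ->
            %% Monitor found no process: the child died (or unlinked
            %% and died) before the monitor was created.  The real
            %% reason, if any, is in a racing 'EXIT' message -- which
            %% may not have arrived yet, and may never arrive at all.
            receive
                {'EXIT', Pid, Reason} -> {error, Reason}
            after 0 ->
                {error, noproc}
            end;
        {'DOWN', Ref, process, Pid, shutdown} ->
            ok;
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        exit(Pid, kill),
        receive {'DOWN', Ref, process, Pid, Killed} -> {error, Killed} end
    end.
```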
URL:

From robert.virding@REDACTED Wed May 15 12:17:28 2013
From: robert.virding@REDACTED (Robert Virding)
Date: Wed, 15 May 2013 11:17:28 +0100 (BST)
Subject: [erlang-bugs] Supervisor terminate_child race
In-Reply-To: <05D1CF78-A894-4C1C-8848-4F98F80707EF@gmail.com>
Message-ID: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com>

Do you mean only using monitors in the supervisor, and no links? If so, that would not work, as you would then not get an exit signal automatically sent to the child when the supervisor dies. Which you do want. Or have I misunderstood you?

Robert

----- Original Message -----
> From: "Tim Watson"
> To: "Siri Hansen"
> Cc: erlang-bugs@REDACTED
> Sent: Wednesday, 15 May, 2013 10:03:31 AM
> Subject: Re: [erlang-bugs] Supervisor terminate_child race

> Switching to monitors is, IMHO, a better approach, since using both is prone to races and links are open to be interfered with.
> Are there any disadvantages I've not thought of though? Or are you suggesting to do both from birth?
> On 14 May 2013, at 15:43, Siri Hansen < erlangsiri@REDACTED > wrote:
> > Just a thought: would it be an option (and would it help) to monitor each child from birth?
> > /siri
> > 2013/5/13 Siri Hansen < erlangsiri@REDACTED >
> > > Bryan and Tim, your analysis is very good, and the problem is complicated. I don't see a "watertight" solution right now, and I cannot spend too much time pondering without having a real priority for this case. I have written a ticket for it, and it will be prioritized along with all other backlog items. Any further thoughts and contributions will be very much appreciated :)
> > > Thanks again
> > > /siri
> > > 2013/4/30 Tim Watson < watson.timothy@REDACTED >
> > > > Hi Bryan,
> > > > On 30 Apr 2013, at 18:34, Bryan Fink wrote:
> > > > > > But twiddling the timing there is just as racy, as you've noticed, right?
> > > > > Correct. The length of the timeout is irrelevant. The EXIT signal is not guaranteed to arrive within any specific amount of time.
> > > > Indeed. Almost a halting problem, this, isn't it. :)
> > > > > > Isn't the point that the EXIT signal might /never/ come, if the child un-links, or might come *after* the 'DOWN' if the race you've located occurs? Surely you've got to be able to handle either case?
> > > > > Yes, the point of the monitor is to handle the case where the EXIT never comes (because the child unlinks). It is not the case, however, that the EXIT always arrives after the DOWN in the race I'm seeing. They might both be delayed.
> > > > Waiting without a timeout for the 'DOWN' is acceptable, because you've got a guarantee (via the runtime) that it *will* arrive, no matter what state the target process was in when you created the monitor. Waiting some arbitrary time for the 'EXIT' is a real problem though, because you could wait forever.
> > > > > Handling either order is important, but the problem with this race is that only the EXIT message contains the actual exit reason when this happens. The 'noproc' in the DOWN is just saying that there was no process to monitor.
> > > > Indeed. But it could equally be true that the 'EXIT' signal was never dispatched, because the child process unlinked before it died; you can't wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the reason, so now you've got to choose how long to wait, but whatever timing works for one particular case isn't going to solve the general problem.
> > > > > > We ran into something similar with our supervisor2 fork a while back, whilst terminating (multiple) simple children: http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is somewhat different though, not only because it was terminating multiple children (during shutdown) but also because it explicitly unlinks from the child *after* creating the monitor, and /still/ allowed for an EXIT signal to have made its way into the mailbox unexpectedly.
> > > > > The monitor_child/1 function also unlinks from the child after creating the monitor. That patch looks a little bit like the fixes I was trying. Basically it's checking for an EXIT message after receiving the DOWN, just in case one is in the mailbox, yes?
> > > > That's correct.
> > > > > The problem is that it might still miss an EXIT, because it might still not have arrived yet, even though it will later.
> > > > Yes, that's definitely true and we were aware of that problem; however, since we know we cannot wait for the 'EXIT' forever, and whatever arbitrary timeout we choose is just someone else's race condition, we decided that if the EXIT signal wasn't delivered expediently to the process's mailbox, losing the real exit reason was something we could live with in the worst case.
> > > > Since we've started merging the R15/R16 changes in though, that code has disappeared, so we're in the same boat as you guys. :)
> > > > Cheers,
> > > > Tim

_______________________________________________
erlang-bugs mailing list
erlang-bugs@REDACTED
http://erlang.org/mailman/listinfo/erlang-bugs
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From watson.timothy@REDACTED Wed May 15 12:53:04 2013
From: watson.timothy@REDACTED (Tim Watson)
Date: Wed, 15 May 2013 11:53:04 +0100
Subject: [erlang-bugs] Supervisor terminate_child race
In-Reply-To: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com>
References: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com>
Message-ID:

On 15 May 2013, at 11:17, Robert Virding wrote:
> Do you mean only using monitors in the supervisor, and no links? If so that would not work as you would then not get an exit signal automatically sent to the child when the supervisor dies.
> Which you do want. Or have I misunderstood you?

Oh gosh, how embarrassing.
I was thinking in terms of Uni-directional Links (viz A Unified Semantics for Future Erlang, Svensson et al), and linking child to parent (so as to propagate supervisor exits) but not the other way around. Of course we can't do that - just ignore this suggestion. [note: I've been implementing the supervisor API for cloud haskell in my spare time and got confused between those semantics (viz http://haskell-distributed.github.io/static/semantics.pdf) and what I do for a day job in the *real world*]. But switching all the supervisor's signal handling to rely on monitor notifications rather than trapped exits (which might be ignored) sounds good to me. The use of linking would be there to guarantee supervisor death is propagated correctly, but we could switch away from handling child 'EXIT' signals to handling 'DOWN' notifications instead. This would IMO be a bit cleaner. Cheers, Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: From erlangsiri@REDACTED Wed May 15 15:54:54 2013 From: erlangsiri@REDACTED (Siri Hansen) Date: Wed, 15 May 2013 15:54:54 +0200 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: References: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com> Message-ID: Then again... it is up to the child's start function to create the link, and from the supervisor's point of view, the only place to add the monitor would be when the start function returns - which would be just another place to get a race :( 2013/5/15 Tim Watson > On 15 May 2013, at 11:17, Robert Virding wrote: > > Do you mean only using monitors in the supervisor, and no links? If so > that would not work as you would then not get an exit signal automatically > sent to the child when the supervisor dies. > > Which you do want. Or have I misunderstood you? > > > Oh gosh, how embarrasing. 
I was thinking in terms of Uni-directional > Links (viz A Unified Semantics for Future Erlang, Svensson et al), and > linking child to parent (so as to propagate supervisor exits) but not the > other way around. Of course we can't do that - just ignore this suggestion. > [note: I've been implementing the supervisor API for cloud haskell in my > spare time and got confused between those semantics (viz > http://haskell-distributed.github.io/static/semantics.pdf) and what I do > for a day job in the *real world*]. > > But switching all the supervisor's signal handling to rely on monitor > notifications rather than trapped exits (which might be ignored) sounds > good to me. The use of linking would be there to guarantee supervisor death > is propagated correctly, but we could switch away from handling child > 'EXIT' signals to handling 'DOWN' notifications instead. This would IMO be > a bit cleaner. > > Cheers, > Tim > -------------- next part -------------- An HTML attachment was scrubbed... URL: From watson.timothy@REDACTED Wed May 15 17:11:17 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Wed, 15 May 2013 16:11:17 +0100 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: References: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com> Message-ID: <2302FD2F-B3F4-4514-88B0-17082D781D1A@gmail.com> On 15 May 2013, at 14:54, Siri Hansen wrote: > Then again... it is up to the child's start function to create the link, and from the supervisor's point of view, the only place to add the monitor would be when the start function returns - which would be just another place to get a race :( > Well quite. *sigh* Perhaps what we do in cloud haskell might be instructive after all, though the approach runs counter to the APIs which its OTP forebears use. 
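[Editor's sketch, for illustration only; the helper name is hypothetical. Siri's objection above is that the supervisor only meets the child once the start function returns. If the supervisor performed the spawn itself — which OTP's supervisor deliberately does not, since the child's own start function (e.g. gen_server:start_link/3) owns the spawn — the link and monitor could be established atomically with process creation via spawn_opt/4, leaving no window at all:]

```erlang
%% Hypothetical supervisor-side helper: with the `monitor' option,
%% spawn_opt/4 returns {Pid, Ref}, and both the link and the monitor
%% exist from the instant the process is created -- the child cannot
%% die in between.
start_child(M, F, A) ->
    {Pid, Ref} = erlang:spawn_opt(M, F, A, [link, monitor]),
    {ok, Pid, Ref}.
```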
Our supervisor performs the actual `spawn' itself, so the child spec provides, as its startup term (which is roughly equivalent to an MFArgs tuple), not a function which spawns the new process but rather the code for its process's main loop. The disadvantage here is that the API is more constrained (in terms of what the main loop looks like), but the advantage is that the supervisor can insert arbitrary code into the start phase of the child's server loop. Thus our supervisor performs the link (from child to parent) and forces the child process to wait until the monitor is set up correctly, before actually entering its loop. This ensures we don't end up with a race with regard to startup, linking and monitor establishment. The relevant bit of the code looks roughly like this:

wrapClosure proc spec' =
  let chId = childKey spec' in do
    supervisor <- getSelfPid
    pid <- spawnLocal $ do
      link supervisor              -- die if our parent dies
      () <- expect                 -- wait for a start signal
      proc >>= checkExitType chId  -- evaluate the child's loop
    void $ monitor pid             -- synchronous call to establish a monitor
    send pid ()                    -- tell the child to go into its main loop
    return $ Right $ ChildRunning pid

Of course, because of this design, our gen_server API looks completely different! The start function, for example, doesn't spawn a process, but rather evaluates the `init' callback and enters the gen server's main loop (or crashes) immediately with the return value, leaving the `spawn' part to its clients. The supervisor is, of course, one of these clients. In fact our supervisor, like its OTP inspiration, is itself a gen_server (we call them managed processes), and thus its start function never returns either:

-- | Starts a supervisor. ...
start :: RestartStrategy -> [ChildSpec] -> ManagedProcessLoop SupervisorState
start strategy' specs' =
  ManagedProcess.start (strategy', specs') supInit serverDefinition

Now obviously, given that Erlang has been used in the real world for decades, we can't go changing gen_server's start_link or the supervisor child spec APIs. But is there a way to achieve something similar without carving things up too much? I'm struggling to think of one, but it would be good if we could avoid the race altogether.

Cheers,
Tim

> 2013/5/15 Tim Watson
> On 15 May 2013, at 11:17, Robert Virding wrote:
>> Do you mean only using monitors in the supervisor, and no links? If so that would not work as you would then not get an exit signal automatically sent to the child when the supervisor dies.
>> Which you do want. Or have I misunderstood you?
>
> Oh gosh, how embarrassing. I was thinking in terms of Uni-directional Links (viz A Unified Semantics for Future Erlang, Svensson et al), and linking child to parent (so as to propagate supervisor exits) but not the other way around. Of course we can't do that - just ignore this suggestion. [note: I've been implementing the supervisor API for cloud haskell in my spare time and got confused between those semantics (viz http://haskell-distributed.github.io/static/semantics.pdf) and what I do for a day job in the *real world*].
>
> But switching all the supervisor's signal handling to rely on monitor notifications rather than trapped exits (which might be ignored) sounds good to me. The use of linking would be there to guarantee supervisor death is propagated correctly, but we could switch away from handling child 'EXIT' signals to handling 'DOWN' notifications instead. This would IMO be a bit cleaner.
>
> Cheers,
> Tim
-------------- next part -------------- An HTML attachment was scrubbed...
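[Editor's note: OTP does already have a synchronous-start handshake that resembles the Cloud Haskell design sketched above — proc_lib:start_link/3 blocks the caller until the child calls proc_lib:init_ack/2. A minimal sketch follows (the module name is hypothetical); note that the link is still created on the child side, so this handshake alone does not remove the DOWN/EXIT race discussed in this thread:]

```erlang
-module(sync_start_sketch).
-export([start_link/0, init/1]).

%% The caller (e.g. a supervisor) blocks here until init_ack/2 is
%% called, so the child is fully initialized before the caller resumes.
start_link() ->
    proc_lib:start_link(?MODULE, init, [self()]).

init(Parent) ->
    %% ... set up state here, while the caller is still waiting ...
    proc_lib:init_ack(Parent, {ok, self()}),  %% release the caller
    loop().

loop() ->
    receive
        stop -> ok;
        _    -> loop()
    end.
```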
URL: From essen@REDACTED Thu May 16 19:02:57 2013 From: essen@REDACTED (=?ISO-8859-1?Q?Lo=EFc_Hoguin?=) Date: Thu, 16 May 2013 19:02:57 +0200 Subject: [erlang-bugs] Wrong type for ssl key option Message-ID: <51951141.9000302@ninenines.eu> Type ssl_option() says: {key, Der::binary()} Documentation says: {key, {'RSAPrivateKey'| 'DSAPrivateKey' | 'PrivateKeyInfo', der_encoded()}} I believe the documentation is correct and the code wrong. Please confirm. -- Lo?c Hoguin Erlang Cowboy Nine Nines http://ninenines.eu From daimon@REDACTED Fri May 17 07:15:14 2013 From: daimon@REDACTED (Masatake Daimon) Date: Fri, 17 May 2013 14:15:14 +0900 Subject: [erlang-bugs] Fix {stream, {self, once}} in httpc Message-ID: <5195BCE2.9000704@ymir.co.jp> Hello, Previously the only difference between {stream, self} and {stream, {self, once}} was an extra Pid in the stream_start message due to a bug in httpc_handler. It was immediately sending a bunch of messages till the end instead of waiting for httpc:stream_next/1 being called. Before applying this patch: https://gist.github.com/phonohawk/5589337#file-erl-before-log After: https://gist.github.com/phonohawk/5589337#file-erl-after-log git fetch git://github.com/phonohawk/otp.git httpc-stream-once-fix https://github.com/phonohawk/otp/compare/erlang:maint...httpc-stream-once-fix https://github.com/phonohawk/otp/compare/erlang:maint...httpc-stream-once-fix.patch Regards, -- ?? ?? From daimon@REDACTED Fri May 17 07:30:22 2013 From: daimon@REDACTED (Masatake Daimon) Date: Fri, 17 May 2013 14:30:22 +0900 Subject: [erlang-bugs] Fix {stream, {self, once}} in httpc In-Reply-To: <5195BCE2.9000704@ymir.co.jp> References: <5195BCE2.9000704@ymir.co.jp> Message-ID: <5195C06E.5080502@ymir.co.jp> Oops. I meant to send this to erlang-patches. Sorry for the noise. 
On 05/17/13 14:15, Masatake Daimon wrote: > Hello, > > Previously the only difference between {stream, self} and {stream, > {self, once}} was an extra Pid in the stream_start message due to a > bug in httpc_handler. It was immediately sending a bunch of messages > till the end instead of waiting for httpc:stream_next/1 being called. > > Before applying this patch: > https://gist.github.com/phonohawk/5589337#file-erl-before-log > > After: > https://gist.github.com/phonohawk/5589337#file-erl-after-log > > git fetch git://github.com/phonohawk/otp.git httpc-stream-once-fix > > > https://github.com/phonohawk/otp/compare/erlang:maint...httpc-stream-once-fix > > > https://github.com/phonohawk/otp/compare/erlang:maint...httpc-stream-once-fix.patch > > > Regards, > -- ?? ?? From mjtruog@REDACTED Fri May 17 08:03:24 2013 From: mjtruog@REDACTED (Michael Truog) Date: Thu, 16 May 2013 23:03:24 -0700 Subject: [erlang-bugs] syntax_tools anonymous function error Message-ID: <5195C82C.104@gmail.com> Hi, I had syntax_tools break on this code "fun M:F/2" with this stack trace: in function erl_syntax_lib:analyze_function_name/1 (erl_syntax_lib.erl, line 1500) in call from igor:transform_implicit_fun/3 (igor.erl, line 1807) in call from igor:transform_list/3 (igor.erl, line 1748) in call from igor:transform_1/3 (igor.erl, line 1741) in call from igor:default_transform/3 (igor.erl, line 1733) in call from igor:transform_list/3 (igor.erl, line 1748) in call from igor:transform_1/3 (igor.erl, line 1741) in call from igor:transform_1/3 (igor.erl, line 1742) Thanks, Michael From daimon@REDACTED Fri May 17 10:55:48 2013 From: daimon@REDACTED (Masatake Daimon) Date: Fri, 17 May 2013 17:55:48 +0900 Subject: [erlang-bugs] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation Message-ID: <5195F094.2010003@ymir.co.jp> Hello, Compiling the following module makes the compiler crash. I'm using R16B. ===== test.erl ===== -module(test). -compile(inline). -compile(inline_list_funcs). 
-export([foo/0]). foo() -> lists:map(fun bar/1, [1]). bar(X) -> X. ===== the crash ==== % erlc test.erl test: function '-foo/0-lists^map/1-0-'/1+15: Internal consistency check failed - please report this bug. Instruction: {move,{x,0},{yy,0}} Error: {invalid_store,{yy,0},term}: Note that the problem disappears with any of these changes: * Commenting out "-compile(inline)." * Commenting out "-compile(inline_list_funcs)." * Changing the definition of foo/0 to: foo() -> lists:map(fun bar/1, []). % [] instead of [1] * Changing the definition of foo/0 to: foo() -> lists:map(fun (A) -> bar(A) end, [1]). Regards, -- ?? ?? From ingela.anderton.andin@REDACTED Fri May 17 15:57:09 2013 From: ingela.anderton.andin@REDACTED (Ingela Anderton Andin) Date: Fri, 17 May 2013 15:57:09 +0200 Subject: [erlang-bugs] Wrong type for ssl key option In-Reply-To: <51951141.9000302@ninenines.eu> References: <51951141.9000302@ninenines.eu> Message-ID: <51963735.9000108@erix.ericsson.se> Hi! Lo?c Hoguin wrote: > Type ssl_option() says: {key, Der::binary()} > > Documentation says: {key, {'RSAPrivateKey'| 'DSAPrivateKey' | > 'PrivateKeyInfo', der_encoded()}} > > I believe the documentation is correct and the code wrong. > > Please confirm. > You are correct the dialyzer spec is incorrect! Regards Ingela Erlang/OTP team - Ericsson AB From mjtruog@REDACTED Fri May 17 18:26:36 2013 From: mjtruog@REDACTED (Michael Truog) Date: Fri, 17 May 2013 09:26:36 -0700 Subject: [erlang-bugs] crash without crash dump Message-ID: <51965A3C.3070205@gmail.com> Hi, I am not sure about the impact of this problem, however, it may have a larger impact. When killing the application_controller process with the -heart option being used, no crash dump is produced: $ erl Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) 1> exit(whereis(application_controller), kill). *** ERROR: Shell process terminated! 
*** {"Kernel pid terminated",application_controller,killed} Crash dump was written to: erl_crash.dump Kernel pid terminated (application_controller) (killed) (erl_crash.dump file exists) $ erl -heart heart_beat_kill_pid = 24300 Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) 1> exit(whereis(application_controller), kill). *** ERROR: Shell process terminated! *** {"Kernel pid terminated",application_controller,killed} heart: Fri May 17 09:20:10 2013: Erlang is crashing .. (waiting for crash dump file) heart: Fri May 17 09:20:10 2013: Would reboot. Terminating. Kernel pid terminated (application_controller) (killed) (erl_crash.dump file does not exist!) Thanks, Michael From n.oxyde@REDACTED Fri May 17 20:06:35 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Fri, 17 May 2013 20:06:35 +0200 Subject: [erlang-bugs] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation In-Reply-To: <5195F094.2010003@ymir.co.jp> References: <5195F094.2010003@ymir.co.jp> Message-ID: Hello, Shorter test case, showing the problem comes from the inline itself and not inline_list_funcs: -module(test). -compile(inline). -export([foo/0]). foo() -> F = fun bar/1, fun (X) when X =:= F -> X end. bar(X) -> X. If you run the core_lint pass, you can see where the problem comes from: $ erlc +clint test.erl test: illegal guard expression in foo/0 The inliner inlines `when 'erlang':'=:='(X, F)` to `'erlang':'=:='(X, 'bar'/1)` but local fun references can't appear in guards. I'll try to make a patch. Regards, -- Anthony Ramine Le 17 mai 2013 ? 10:55, Masatake Daimon a ?crit : > Hello, > > Compiling the following module makes the compiler crash. I'm using > R16B. > > ===== test.erl ===== > -module(test). > -compile(inline). > -compile(inline_list_funcs). > -export([foo/0]). > > foo() -> > lists:map(fun bar/1, [1]). > > bar(X) -> X. 
> > ===== the crash ==== > % erlc test.erl > test: function '-foo/0-lists^map/1-0-'/1+15: > Internal consistency check failed - please report this bug. > Instruction: {move,{x,0},{yy,0}} > Error: {invalid_store,{yy,0},term}: > > > Note that the problem disappears with any of these changes: > > * Commenting out "-compile(inline)." > * Commenting out "-compile(inline_list_funcs)." > * Changing the definition of foo/0 to: > foo() -> > lists:map(fun bar/1, []). % [] instead of [1] > * Changing the definition of foo/0 to: > foo() -> > lists:map(fun (A) -> bar(A) end, [1]). > > Regards, > -- > ?? ?? > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs From mjtruog@REDACTED Sat May 18 04:42:01 2013 From: mjtruog@REDACTED (Michael Truog) Date: Fri, 17 May 2013 19:42:01 -0700 Subject: [erlang-bugs] igor -callback external type bug Message-ID: <5196EA79.1000407@gmail.com> Hi, When using igor (to rename modules) it generates invalid syntax when it finds -callback() types which have been exported from external modules. igor may just not be changing the module names to make the types valid, but somehow no igor error occurs and you will only see the error when attempting to compile the module. Thanks, Michael From mjtruog@REDACTED Sat May 18 05:50:07 2013 From: mjtruog@REDACTED (Michael Truog) Date: Fri, 17 May 2013 20:50:07 -0700 Subject: [erlang-bugs] igor reorders types to create errors Message-ID: <5196FA6F.8070107@gmail.com> Hi, If a type is declared in the same file as a record and the type depends on a record being defined the resulting file will fail to compile due to the record not being defined, simply because the type is automatically put at the top of the file (by igor), above the record definition. This problem may relate to the preprocessor, since I am surprised the order is significant. 
I understand the various igor bugs might be an annoyance, since the module name may simply indicate that the module itself is only meant to annoy and that it may never actually do anything properly for the mad scientist. However, I am still hopeful that it (or something like it) might provide error-less module transformations, despite module names within the Erlang code (like child specs). So, it would be nice if it wasn't simply discarded due to its problems. Thanks, Michael From n.oxyde@REDACTED Sat May 18 18:22:03 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Sat, 18 May 2013 18:22:03 +0200 Subject: [erlang-bugs] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation In-Reply-To: References: <5195F094.2010003@ymir.co.jp> Message-ID: <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> Hello, This patch fixes the bug by forbidding inlining of variables which values are local fun references outside of application contexts. git fetch https://github.com/nox/otp.git fix-fname-inlining https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining.patch Regards, -- Anthony Ramine Le 17 mai 2013 ? 20:06, Anthony Ramine a ?crit : > Hello, > > Shorter test case, showing the problem comes from the inline itself and not inline_list_funcs: > > -module(test). > -compile(inline). > -export([foo/0]). > > foo() -> > F = fun bar/1, > fun (X) when X =:= F -> X end. > > bar(X) -> X. > > If you run the core_lint pass, you can see where the problem comes from: > > $ erlc +clint test.erl > test: illegal guard expression in foo/0 > > The inliner inlines `when 'erlang':'=:='(X, F)` to `'erlang':'=:='(X, 'bar'/1)` but local fun references can't appear in guards. > > I'll try to make a patch. > > Regards, > > -- > Anthony Ramine > > Le 17 mai 2013 ? 10:55, Masatake Daimon a ?crit : > >> Hello, >> >> Compiling the following module makes the compiler crash. I'm using >> R16B. 
>> >> ===== test.erl ===== >> -module(test). >> -compile(inline). >> -compile(inline_list_funcs). >> -export([foo/0]). >> >> foo() -> >> lists:map(fun bar/1, [1]). >> >> bar(X) -> X. >> >> ===== the crash ==== >> % erlc test.erl >> test: function '-foo/0-lists^map/1-0-'/1+15: >> Internal consistency check failed - please report this bug. >> Instruction: {move,{x,0},{yy,0}} >> Error: {invalid_store,{yy,0},term}: >> >> >> Note that the problem disappears with any of these changes: >> >> * Commenting out "-compile(inline)." >> * Commenting out "-compile(inline_list_funcs)." >> * Changing the definition of foo/0 to: >> foo() -> >> lists:map(fun bar/1, []). % [] instead of [1] >> * Changing the definition of foo/0 to: >> foo() -> >> lists:map(fun (A) -> bar(A) end, [1]). >> >> Regards, >> -- >> ?? ?? >> _______________________________________________ >> erlang-bugs mailing list >> erlang-bugs@REDACTED >> http://erlang.org/mailman/listinfo/erlang-bugs > From carlsson.richard@REDACTED Sat May 18 22:14:46 2013 From: carlsson.richard@REDACTED (Richard Carlsson) Date: Sat, 18 May 2013 22:14:46 +0200 Subject: [erlang-bugs] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation In-Reply-To: <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> References: <5195F094.2010003@ymir.co.jp> <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> Message-ID: <5197E136.208@gmail.com> On 2013-05-18 18:22 , Anthony Ramine wrote: > Hello, > > This patch fixes the bug by forbidding inlining of variables which values are local fun references outside of application contexts. > > git fetch https://github.com/nox/otp.git fix-fname-inlining > > https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining > https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining.patch > > Regards, > Looks reasonable to me, but it's ages since I worked on that code. 
/Richard From mjtruog@REDACTED Sun May 19 01:25:06 2013 From: mjtruog@REDACTED (Michael Truog) Date: Sat, 18 May 2013 16:25:06 -0700 Subject: [erlang-bugs] escript file operations fail on halt Message-ID: <51980DD2.7060501@gmail.com> Hi, There is an odd type of failure when: 1) async threads are enabled by default for the Erlang VM 2) an escript is used to spawn the Erlang VM 3) erlang:halt/1 is used to terminate the escript with a known error code The erlang:halt/1 and erlang:halt/2 code here: https://github.com/erlang/otp/blob/maint/erts/emulator/beam/bif.c#L3937 makes the default flush parameter false! The default flush parameter is currently undocumented. So, when an escript performs a file operation that depends on the async thread pool (based on the internal Erlang code and configuration) and then attempts to do erlang:halt(integer()), the file operations may not complete, or may only partially complete. In my particular use case, I can observe a file rename operation getting stuck before the actual completion of the rename (and I am not using anything but a normal/default Linux filesystem, not NFS). It seems important to change the default erlang:halt/1 behaviour for escript usage so that flush is true (I understand fail-fast probably means normal Erlang VM usage shouldn't have flush default to true). An alternative is a new escript function that sets the flush option for the user, which is probably an easier solution to agree on (e.g., escript:exit/1). Thanks, Michael From n.oxyde@REDACTED Sun May 19 12:33:12 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Sun, 19 May 2013 12:33:12 +0200 Subject: [erlang-bugs] syntax_tools anonymous function error In-Reply-To: <5195C82C.104@gmail.com> References: <5195C82C.104@gmail.com> Message-ID: Hello Michael, This patch fixes support of implicit funs with variables in igor.
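For context, here is a minimal sketch (hypothetical module) of the two implicit-fun forms involved; the variable form, legal since R15B, is the shape that previously sent igor into erl_syntax_lib:analyze_function_name/1:

```erlang
%% implicit_funs.erl -- hypothetical example of both implicit-fun forms.
-module(implicit_funs).
-export([static_fun/0, dynamic_fun/2]).

%% Fully literal form: module, function and arity are all known at
%% compile time, so syntax tools can analyze the name statically.
static_fun() ->
    fun lists:map/2.

%% Variable form (allowed since R15B): M and F are only known at run
%% time. This is the form that igor previously failed to transform.
dynamic_fun(M, F) ->
    fun M:F/2.
```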
git fetch https://github.com/nox/otp.git igor-funs https://github.com/nox/otp/compare/erlang:maint...igor-funs https://github.com/nox/otp/compare/erlang:maint...igor-funs.patch Regards, -- Anthony Ramine On 17 May 2013, at 08:03, Michael Truog wrote: > Hi, > > I had syntax_tools break on this code "fun M:F/2" with this stack trace: > in function erl_syntax_lib:analyze_function_name/1 (erl_syntax_lib.erl, line 1500) > in call from igor:transform_implicit_fun/3 (igor.erl, line 1807) > in call from igor:transform_list/3 (igor.erl, line 1748) > in call from igor:transform_1/3 (igor.erl, line 1741) > in call from igor:default_transform/3 (igor.erl, line 1733) > in call from igor:transform_list/3 (igor.erl, line 1748) > in call from igor:transform_1/3 (igor.erl, line 1741) > in call from igor:transform_1/3 (igor.erl, line 1742) > > Thanks, > Michael From fredrik@REDACTED Mon May 20 09:55:43 2013 From: fredrik@REDACTED (Fredrik) Date: Mon, 20 May 2013 09:55:43 +0200 Subject: [erlang-bugs] [erlang-patches] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation In-Reply-To: <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> References: <5195F094.2010003@ymir.co.jp> <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> Message-ID: <5199D6FF.1010302@erlang.org> On 05/18/2013 06:22 PM, Anthony Ramine wrote: > Hello, > > This patch fixes the bug by forbidding inlining of variables whose values are local fun references outside of application contexts. > > git fetch https://github.com/nox/otp.git fix-fname-inlining > > https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining > https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining.patch > > Regards, > Hello Anthony, I've fetched your branch and it should be visible in the 'pu' branch shortly. I also assigned it to the responsible team for review.
Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From lukas@REDACTED Mon May 20 10:11:52 2013 From: lukas@REDACTED (Lukas Larsson) Date: Mon, 20 May 2013 10:11:52 +0200 Subject: [erlang-bugs] crash without crash dump In-Reply-To: <51965A3C.3070205@gmail.com> References: <51965A3C.3070205@gmail.com> Message-ID: <5199DAC8.8090403@erlang.org> Hello Michael, Have you set ERL_CRASH_DUMP_SECONDS[1] to an appropriate value? Lukas [1]: http://www.erlang.org/doc/man/heart.html On 17/05/13 18:26, Michael Truog wrote: > Hi, > > I am not sure about the full impact of this problem; it may be larger than this example suggests. When killing the application_controller process with the -heart option being used, no crash dump is produced: > > $ erl > Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false] > > Eshell V5.10.1 (abort with ^G) > 1> exit(whereis(application_controller), kill). > *** ERROR: Shell process terminated! *** > {"Kernel pid terminated",application_controller,killed} > > Crash dump was written to: erl_crash.dump > Kernel pid terminated (application_controller) (killed) > > (erl_crash.dump file exists) > > $ erl -heart > heart_beat_kill_pid = 24300 > Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false] > > Eshell V5.10.1 (abort with ^G) > 1> exit(whereis(application_controller), kill). > *** ERROR: Shell process terminated! *** > {"Kernel pid terminated",application_controller,killed} > heart: Fri May 17 09:20:10 2013: Erlang is crashing .. (waiting for crash dump file) > heart: Fri May 17 09:20:10 2013: Would reboot. Terminating. > Kernel pid terminated (application_controller) (killed) > > (erl_crash.dump file does not exist!)
> > Thanks, > Michael From fredrik@REDACTED Mon May 20 10:12:31 2013 From: fredrik@REDACTED (Fredrik) Date: Mon, 20 May 2013 10:12:31 +0200 Subject: [erlang-bugs] [erlang-patches] syntax_tools anonymous function error In-Reply-To: References: <5195C82C.104@gmail.com> Message-ID: <5199DAEF.2080603@erlang.org> On 05/19/2013 12:33 PM, Anthony Ramine wrote: > Hello Michael, > > This patch fixes support of implicit funs with variables in igor. > > git fetch https://github.com/nox/otp.git igor-funs > > https://github.com/nox/otp/compare/erlang:maint...igor-funs > https://github.com/nox/otp/compare/erlang:maint...igor-funs.patch > > Regards, > Hello Anthony, I've fetched your patch and it should be visible in the 'pu' branch shortly. Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From Aleksander.Nycz@REDACTED Mon May 20 10:41:46 2013 From: Aleksander.Nycz@REDACTED (Aleksander Nycz) Date: Mon, 20 May 2013 10:41:46 +0200 Subject: [erlang-bugs] Problem with tw timer support in diameter app (otp_R16B) Message-ID: <5199E1CA.6020209@comarch.pl> Hello, I changed the default value of the *restrict_connections* parameter from 'nodes' to 'false'. After that I ran a very simple test using the Seagull simulator. The test scenario was the following: 1. seagull: send CER 2. seagull: recv CEA 3. seagull: send CCR (init) 4. seagull: recv CCA (init) 5. seagull: send CCR (update) 6. seagull: recv CCA (update) 7. seagull: send CCR (terminate) 8. seagull: recv CCA (terminate) After step 8, Seagull doesn't send DPR, but just closes the transport connection (TCP). On the server side everything looks good, but 30 sec.
after CCR (terminate) when tw elapsed, following error message appears in log: 13:40:58.187129: <0.5046.0>: error: error_logger: --:--/--: ** Generic server <0.5046.0> terminating ** Last message in was {timeout,#Ref<0.0.0.14845>,tw} ** When Server state == {watchdog,down,false,30000,0,<0.1009.0>,undefined, #Ref<0.0.0.14845>,diameter_gen_base_rfc3588, {recvdata,4259932,diameterNode, [{diameter_app,diameterNode,dictionaryDCCA, [dccaCallback], diameterNode,4,false, [{answer_errors,report}, {request_errors,answer_3xxx}]}], {0,32}}, {0,32}, {false,false}, false} ** Reason for termination == ** {function_clause, [{diameter_watchdog,set_watchdog, [stop], [{file,"base/diameter_watchdog.erl"},{line,451}]}, {diameter_watchdog,handle_info,2, [{file,"base/diameter_watchdog.erl"},{line,211}]}, {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]} 13:40:58.187500: <0.5046.0>: error: error_logger: --:--/--: [crash_report][[[{initial_call,{diameter_watchdog,init,['Argument__1']}}, {pid,<0.5046.0>}, {registered_name,[]}, {error_info,{exit,{function_clause,[{diameter_watchdog,set_watchdog,[stop],[{file,"base/diameter_watchdog.erl"},{line,451}]}, {diameter_watchdog,handle_info,2,[{file,"base/diameter_watchdog.erl"},{line,211}]}, {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}, [{gen_server,terminate,6,[{file,"gen_server.erl"},{line,737}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}, {ancestors,[diameter_watchdog_sup,diameter_sup,<0.946.0>]}, {messages,[]}, {links,[<0.954.0>]}, {dictionary,[{random_seed,{15047,18051,14647}}, {{diameter_watchdog,restart}, {{accept,#Ref<0.0.0.1696>}, [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true},{ip,{0,0,0,0}},{port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], {diameter_service,<0.1009.0>, 
{diameter_caps,"zyndram.krakow.comarch","krakow.comarch",[],25429,"Comarch DIAMETER Server",[], [12645,10415,8164], [4], [],[],[],[],[]}, [{diameter_app,diameterNode,dictionaryDCCA, [dccaCallback], diameterNode,4,false, [{answer_errors,report},{request_errors,answer_3xxx}]}]}}}, {{diameter_watchdog,dwr}, ['DWR',{'Origin-Host',"zyndram.krakow.comarch"},{'Origin-Realm',"krakow.comarch"},{'Origin-State-Id',[]}]}]}, {trap_exit,false}, {status,running}, {heap_size,75025}, {stack_size,24}, {reductions,294}], []]] 13:40:58.189060: <0.954.0>: error: error_logger: --:--/--: [supervisor_report][[{supervisor,{local,diameter_watchdog_sup}}, {errorContext,child_terminated}, {reason,{function_clause,[{diameter_watchdog,set_watchdog,[stop],[{file,"base/diameter_watchdog.erl"},{line,451}]}, {diameter_watchdog,handle_info,2,[{file,"base/diameter_watchdog.erl"},{line,211}]}, {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}, {offender,[{pid,<0.5046.0>}, {name,diameter_watchdog}, {mfargs,{diameter_watchdog,start_link,undefined}}, {restart_type,temporary}, {shutdown,1000}, {child_type,worker}]}]] You can check, that function set_watchdog should be called with param #watchdog{}, but 'stop' param is used instead. As a result function_clause exception is thrown. I suggest following change in code to correct this problem (file diameter_watchdog.erl): $ diff diameter_watchdog.erl_org diameter_watchdog.erl 385a386,393 > transition({timeout, TRef, tw}, #watchdog{tref = TRef, status = T} = S) > when T == initial; > T == down -> > case restart(S) of > stop -> stop; > #watchdog{} = NewS -> set_watchdog(NewS) > end; > You can find this solution in attachement. Best regards Aleksander Nycz -- Aleksander Nycz Senior Software Engineer Telco_021 BSS R&D Comarch SA Phone: +48 12 646 1216 Mobile: +48 691 464 275 website: www.comarch.pl -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- %% %% %CopyrightBegin% %% %% Copyright Ericsson AB 2010-2013. All Rights Reserved. %% %% The contents of this file are subject to the Erlang Public License, %% Version 1.1, (the "License"); you may not use this file except in %% compliance with the License. You should have received a copy of the %% Erlang Public License along with this software. If not, it can be %% retrieved online at http://www.erlang.org/. %% %% Software distributed under the License is distributed on an "AS IS" %% basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See %% the License for the specific language governing rights and limitations %% under the License. %% %% %CopyrightEnd% %% %% %% This module implements (as a process) the state machine documented %% in Appendix A of RFC 3539. %% -module(diameter_watchdog). -behaviour(gen_server). %% towards diameter_service -export([start/2]). %% gen_server callbacks -export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]). %% diameter_watchdog_sup callback -export([start_link/1]). -include_lib("diameter/include/diameter.hrl"). -include("diameter_internal.hrl"). -define(DEFAULT_TW_INIT, 30000). %% RFC 3539 ch 3.4.1 -define(NOMASK, {0,32}). %% default sequence mask -define(BASE, ?DIAMETER_DICT_COMMON). 
-record(watchdog, {%% PCB - Peer Control Block; see RFC 3539, Appendix A status = initial :: initial | okay | suspect | down | reopen, pending = false :: boolean(), %% DWA tw :: 6000..16#FFFFFFFF | {module(), atom(), list()}, %% {M,F,A} -> integer() >= 0 num_dwa = 0 :: -1 | non_neg_integer(), %% number of DWAs received during reopen %% end PCB parent = self() :: pid(), %% service process transport :: pid() | undefined, %% peer_fsm process tref :: reference(), %% reference for current watchdog timer dictionary :: module(), %% common dictionary receive_data :: term(), %% term passed into diameter_service with incoming message sequence :: diameter:sequence(), %% mask restrict :: {diameter:restriction(), boolean()}, shutdown = false :: boolean()}). %% --------------------------------------------------------------------------- %% start/2 %% %% Start a monitor before the watchdog is allowed to proceed to ensure %% that a failed capabilities exchange produces the desired exit %% reason. %% --------------------------------------------------------------------------- -spec start(Type, {RecvData, [Opt], SvcOpts, #diameter_service{}}) -> {reference(), pid()} when Type :: {connect|accept, diameter:transport_ref()}, RecvData :: term(), Opt :: diameter:transport_opt(), SvcOpts :: [diameter:service_opt()]. start({_,_} = Type, T) -> Ack = make_ref(), {ok, Pid} = diameter_watchdog_sup:start_child({Ack, Type, self(), T}), try {erlang:monitor(process, Pid), Pid} after send(Pid, Ack) end. start_link(T) -> {ok, _} = proc_lib:start_link(?MODULE, init, [T], infinity, diameter_lib:spawn_opts(server, [])). %% =========================================================================== %% =========================================================================== %% init/1 init(T) -> proc_lib:init_ack({ok, self()}), gen_server:enter_loop(?MODULE, [], i(T)). 
i({Ack, T, Pid, {RecvData, Opts, SvcOpts, #diameter_service{applications = Apps, capabilities = Caps} = Svc}}) -> erlang:monitor(process, Pid), wait(Ack, Pid), random:seed(now()), putr(restart, {T, Opts, Svc}), %% save seeing it in trace putr(dwr, dwr(Caps)), %% {_,_} = Mask = proplists:get_value(sequence, SvcOpts), Restrict = proplists:get_value(restrict_connections, SvcOpts), Nodes = restrict_nodes(Restrict), Dict0 = common_dictionary(Apps), #watchdog{parent = Pid, transport = start(T, Opts, Mask, Nodes, Dict0, Svc), tw = proplists:get_value(watchdog_timer, Opts, ?DEFAULT_TW_INIT), receive_data = RecvData, dictionary = Dict0, sequence = Mask, restrict = {Restrict, lists:member(node(), Nodes)}}. wait(Ref, Pid) -> receive Ref -> ok; {'DOWN', _, process, Pid, _} = D -> exit({shutdown, D}) end. %% start/5 start(T, Opts, Mask, Nodes, Dict0, Svc) -> {_MRef, Pid} = diameter_peer_fsm:start(T, Opts, {Mask, Nodes, Dict0, Svc}), Pid. %% common_dictionary/1 %% %% Determine the dictionary of the Diameter common application with %% Application Id 0. Fail on config errors. common_dictionary(Apps) -> case orddict:fold(fun dict0/3, false, lists:foldl(fun(#diameter_app{dictionary = M}, D) -> orddict:append(M:id(), M, D) end, orddict:new(), Apps)) of {value, Mod} -> Mod; false -> %% A transport should configure a common dictionary but we %% don't require it. Not configuring a common dictionary %% means a user won't be able to either send or receive %% messages in the common dictionary: incoming requests %% will be answered with 3007 and outgoing requests cannot %% be sent. The dictionary returned here is only used for %% messages diameter sends and receives: CER/CEA, DPR/DPA %% and DWR/DWA. ?BASE end. %% Each application should be represented by a single dictionary. dict0(Id, [_,_|_] = Ms, _) -> config_error({multiple_dictionaries, Ms, {application_id, Id}}); %% An explicit common dictionary.
dict0(?APP_ID_COMMON, [Mod], _) -> {value, Mod}; %% A pure relay, in which case the common application is implicit. %% This uses the fact that the common application will already have %% been folded. dict0(?APP_ID_RELAY, _, false) -> {value, ?BASE}; dict0(_, _, Acc) -> Acc. config_error(T) -> ?ERROR({configuration_error, T}). %% handle_call/3 handle_call(_, _, State) -> {reply, nok, State}. %% handle_cast/2 handle_cast(_, State) -> {noreply, State}. %% handle_info/2 handle_info(T, #watchdog{} = State) -> case transition(T, State) of ok -> {noreply, State}; #watchdog{} = S -> close(T, State), %% service expects 'close' message event(T, State, S), %% before 'watchdog' {noreply, S}; stop -> ?LOG(stop, T), event(T, State, State#watchdog{status = down}), {stop, {shutdown, T}, State} end. close({'DOWN', _, process, TPid, {shutdown, Reason}}, #watchdog{transport = TPid, parent = Pid}) -> send(Pid, {close, self(), Reason}); close(_, _) -> ok. event(_, #watchdog{status = T}, #watchdog{status = T}) -> ok; event(_, #watchdog{transport = undefined}, #watchdog{transport = undefined}) -> ok; event(Msg, #watchdog{status = From, transport = F, parent = Pid}, #watchdog{status = To, transport = T}) -> TPid = tpid(F,T), E = {[TPid | data(Msg, TPid, From, To)], From, To}, send(Pid, {watchdog, self(), E}), ?LOG(transition, {self(), E}). data(Msg, TPid, reopen, okay) -> {recv, TPid, 'DWA', _Pkt} = Msg, %% assert {TPid, T} = eraser(open), [T]; data({open, TPid, _Hosts, T}, TPid, _From, To) when To == okay; To == reopen -> [T]; data(_, _, _, _) -> []. tpid(_, Pid) when is_pid(Pid) -> Pid; tpid(Pid, _) -> Pid. send(Pid, T) -> Pid ! T. %% terminate/2 terminate(_, _) -> ok. %% code_change/3 code_change(_, State, _) -> {ok, State}. 
%% =========================================================================== %% =========================================================================== %% transition/2 %% %% The state transitions documented here are extracted from RFC 3539, %% the commentary is ours. %% Service or watchdog is telling the watchdog of an accepting %% transport to die after reconnect_timer expiry or reestablished %% connection (in another transport process) respectively. transition(close, #watchdog{status = down}) -> {{accept, _}, _, _} = getr(restart), %% assert stop; transition(close, #watchdog{}) -> ok; %% Service is asking for the peer to be taken down gracefully. transition({shutdown, Pid, _}, #watchdog{parent = Pid, transport = undefined}) -> stop; transition({shutdown = T, Pid, Reason}, #watchdog{parent = Pid, transport = TPid} = S) -> send(TPid, {T, self(), Reason}), S#watchdog{shutdown = true}; %% Parent process has died, transition({'DOWN', _, process, Pid, _Reason}, #watchdog{parent = Pid}) -> stop; %% Transport has accepted a connection. transition({accepted = T, TPid}, #watchdog{transport = TPid, parent = Pid}) -> send(Pid, {T, self(), TPid}), ok; %% STATE Event Actions New State %% ===== ------ ------- ---------- %% INITIAL Connection up SetWatchdog() OKAY %% By construction, the watchdog timer isn't set until we move into %% state okay as the result of the Peer State Machine reaching the %% Open state. %% %% If we're accepting then we may be resuming a connection that went %% down in another watchdog process, in which case this is the %% transition below, from down to reopen. That is, it's not until we %% know the identity of the peer (ie. now) that we know that we're in %% state down rather than initial. 
transition({open, TPid, Hosts, _} = Open, #watchdog{transport = TPid, status = initial, restrict = {_, R}} = S) -> case okay(getr(restart), Hosts, R) of okay -> set_watchdog(S#watchdog{status = okay}); reopen -> transition(Open, S#watchdog{status = down}) end; %% DOWN Connection up NumDWA = 0 %% SendWatchdog() %% SetWatchdog() %% Pending = TRUE REOPEN transition({open = Key, TPid, _Hosts, T}, #watchdog{transport = TPid, status = down} = S) -> %% Store the info we need to notify the parent to reopen the %% connection after the requisite DWA's are received, at which %% time we eraser(open). The reopen message is a later addition, %% to communicate the new capabilities as soon as they're known. putr(Key, {TPid, T}), set_watchdog(send_watchdog(S#watchdog{status = reopen, num_dwa = 0})); %% OKAY Connection down CloseConnection() %% Failover() %% SetWatchdog() DOWN %% SUSPECT Connection down CloseConnection() %% SetWatchdog() DOWN %% REOPEN Connection down CloseConnection() %% SetWatchdog() DOWN transition({'DOWN', _, process, TPid, _Reason}, #watchdog{transport = TPid, shutdown = true}) -> stop; transition({'DOWN', _, process, TPid, _Reason}, #watchdog{transport = TPid, status = T} = S) -> set_watchdog(S#watchdog{status = case T of initial -> T; _ -> down end, pending = false, transport = undefined}); %% Incoming message. transition({recv, TPid, Name, Pkt}, #watchdog{transport = TPid} = S) -> recv(Name, Pkt, S); %% Current watchdog has timed out. transition({timeout, TRef, tw}, #watchdog{tref = TRef, status = T} = S) when T == initial; T == down -> case restart(S) of stop -> stop; #watchdog{} = NewS -> set_watchdog(NewS) end; transition({timeout, TRef, tw}, #watchdog{tref = TRef} = S) -> set_watchdog(timeout(S)); %% Timer was canceled after message was already sent. transition({timeout, _, tw}, #watchdog{}) -> ok; %% State query. transition({state, Pid}, #watchdog{status = S}) -> send(Pid, {self(), S}), ok. 
%% =========================================================================== putr(Key, Val) -> put({?MODULE, Key}, Val). getr(Key) -> get({?MODULE, Key}). eraser(Key) -> erase({?MODULE, Key}). %% encode/3 encode(Msg, Mask, Dict) -> Seq = diameter_session:sequence(Mask), Hdr = #diameter_header{version = ?DIAMETER_VERSION, end_to_end_id = Seq, hop_by_hop_id = Seq}, Pkt = #diameter_packet{header = Hdr, msg = Msg}, #diameter_packet{bin = Bin} = diameter_codec:encode(Dict, Pkt), Bin. %% okay/3 okay({{accept, Ref}, _, _}, Hosts, Restrict) -> T = {?MODULE, connection, Ref, Hosts}, diameter_reg:add(T), if Restrict -> okay(diameter_reg:match(T)); true -> okay end; %% Register before matching so that at least one of two registering %% processes will match the other. okay({{connect, _}, _, _}, _, _) -> okay. %% okay/2 %% The peer hasn't been connected recently ... okay([{_,P}]) -> P = self(), %% assert okay; %% ... or it has. okay(C) -> [_|_] = [send(P, close) || {_,P} <- C, self() /= P], reopen. %% set_watchdog/1 set_watchdog(#watchdog{tw = TwInit, tref = TRef} = S) -> cancel(TRef), S#watchdog{tref = erlang:start_timer(tw(TwInit), self(), tw)}. cancel(undefined) -> ok; cancel(TRef) -> erlang:cancel_timer(TRef). tw(T) when is_integer(T), T >= 6000 -> T - 2000 + (random:uniform(4001) - 1); %% RFC3539 jitter of +/- 2 sec. tw({M,F,A}) -> apply(M,F,A). %% send_watchdog/1 send_watchdog(#watchdog{pending = false, transport = TPid, dictionary = Dict0, sequence = Mask} = S) -> send(TPid, {send, encode(getr(dwr), Mask, Dict0)}), ?LOG(send, 'DWR'), S#watchdog{pending = true}. %% recv/3 recv(Name, Pkt, S) -> try rcv(Name, S) of #watchdog{} = NS -> rcv(Name, Pkt, S), NS catch {?MODULE, throwaway, #watchdog{} = NS} -> NS end. %% rcv/3 rcv(N, _, _) when N == 'CER'; N == 'CEA'; N == 'DWR'; N == 'DWA'; N == 'DPR'; N == 'DPA' -> false; rcv(_, Pkt, #watchdog{transport = TPid, dictionary = Dict0, receive_data = T}) -> diameter_traffic:receive_message(TPid, Pkt, Dict0, T). 
throwaway(S) -> throw({?MODULE, throwaway, S}). %% rcv/2 %% %% The lack of Hop-by-Hop and End-to-End Identifiers checks in a %% received DWA is intentional. The purpose of the message is to %% demonstrate life but a peer that consistently bungles it by sending %% the wrong identifiers causes the connection to toggle between OPEN %% and SUSPECT, with failover and failback as result, despite there %% being no real problem with connectivity. Thus, relax and accept any %% incoming DWA as being in response to an outgoing DWR. %% INITIAL Receive DWA Pending = FALSE %% Throwaway() INITIAL %% INITIAL Receive non-DWA Throwaway() INITIAL rcv('DWA', #watchdog{status = initial} = S) -> throwaway(S#watchdog{pending = false}); rcv(_, #watchdog{status = initial} = S) -> throwaway(S); %% DOWN Receive DWA Pending = FALSE %% Throwaway() DOWN %% DOWN Receive non-DWA Throwaway() DOWN rcv('DWA', #watchdog{status = down} = S) -> throwaway(S#watchdog{pending = false}); rcv(_, #watchdog{status = down} = S) -> throwaway(S); %% OKAY Receive DWA Pending = FALSE %% SetWatchdog() OKAY %% OKAY Receive non-DWA SetWatchdog() OKAY rcv('DWA', #watchdog{status = okay} = S) -> set_watchdog(S#watchdog{pending = false}); rcv(_, #watchdog{status = okay} = S) -> set_watchdog(S); %% SUSPECT Receive DWA Pending = FALSE %% Failback() %% SetWatchdog() OKAY %% SUSPECT Receive non-DWA Failback() %% SetWatchdog() OKAY rcv('DWA', #watchdog{status = suspect} = S) -> set_watchdog(S#watchdog{status = okay, pending = false}); rcv(_, #watchdog{status = suspect} = S) -> set_watchdog(S#watchdog{status = okay}); %% REOPEN Receive DWA & Pending = FALSE %% NumDWA == 2 NumDWA++ %% Failback() OKAY rcv('DWA', #watchdog{status = reopen, num_dwa = 2 = N} = S) -> S#watchdog{status = okay, num_dwa = N+1, pending = false}; %% REOPEN Receive DWA & Pending = FALSE %% NumDWA < 2 NumDWA++ REOPEN rcv('DWA', #watchdog{status = reopen, num_dwa = N} = S) -> S#watchdog{num_dwa = N+1, pending = false}; %% REOPEN Receive non-DWA Throwaway() 
REOPEN rcv(_, #watchdog{status = reopen} = S) -> throwaway(S). %% timeout/1 %% %% The caller sets the watchdog on the return value. %% OKAY Timer expires & SendWatchdog() %% !Pending SetWatchdog() %% Pending = TRUE OKAY %% REOPEN Timer expires & SendWatchdog() %% !Pending SetWatchdog() %% Pending = TRUE REOPEN timeout(#watchdog{status = T, pending = false} = S) when T == okay; T == reopen -> send_watchdog(S); %% OKAY Timer expires & Failover() %% Pending SetWatchdog() SUSPECT timeout(#watchdog{status = okay, pending = true} = S) -> S#watchdog{status = suspect}; %% SUSPECT Timer expires CloseConnection() %% SetWatchdog() DOWN %% REOPEN Timer expires & CloseConnection() %% Pending & SetWatchdog() %% NumDWA < 0 DOWN timeout(#watchdog{status = T, pending = P, num_dwa = N, transport = TPid} = S) when T == suspect; T == reopen, P, N < 0 -> exit(TPid, {shutdown, watchdog_timeout}), S#watchdog{status = down}; %% REOPEN Timer expires & NumDWA = -1 %% Pending & SetWatchdog() %% NumDWA >= 0 REOPEN timeout(#watchdog{status = reopen, pending = true, num_dwa = N} = S) when 0 =< N -> S#watchdog{num_dwa = -1}; %% DOWN Timer expires AttemptOpen() %% SetWatchdog() DOWN %% INITIAL Timer expires AttemptOpen() %% SetWatchdog() INITIAL %% RFC 3539, 3.4.1: %% %% [5] While the connection is in the closed state, the AAA client MUST %% NOT attempt to send further watchdog messages on the connection. %% However, after the connection is closed, the AAA client continues %% to periodically attempt to reopen the connection. %% %% The AAA client SHOULD wait for the transport layer to report %% connection failure before attempting again, but MAY choose to %% bound this wait time by the watchdog interval, Tw. %% Don't bound, restarting the peer process only when the previous %% process has died. We only need to handle state down since we start %% the first watchdog when transitioning out of initial. timeout(#watchdog{status = T} = S) when T == initial; T == down -> restart(S). 
%% restart/1 restart(#watchdog{transport = undefined} = S) -> restart(getr(restart), S); restart(S) -> S. %% restart/2 %% %% Only restart the transport in the connecting case. For an accepting %% transport, there's no guarantee that an accepted connection in a %% restarted transport is from the peer we've lost contact with, so we %% have to be prepared for another watchdog to handle it. This is what %% the diameter_reg registration in this module is for: the peer %% connection is registered when leaving state initial and this is %% used by a new accepting watchdog to realize that it's actually in %% state down rather than initial when receiving notification of an %% open connection. restart({{connect, _} = T, Opts, Svc}, #watchdog{parent = Pid, sequence = Mask, restrict = {R,_}, dictionary = Dict0} = S) -> send(Pid, {reconnect, self()}), Nodes = restrict_nodes(R), S#watchdog{transport = start(T, Opts, Mask, Nodes, Dict0, Svc), restrict = {R, lists:member(node(), Nodes)}}; %% No restriction on the number of connections to the same peer: just %% die. Note that a state machine never enters state REOPEN in this %% case. restart({{accept, _}, _, _}, #watchdog{restrict = {_, false}}) -> stop; %% Otherwise hang around until told to die. restart({{accept, _}, _, _}, S) -> S. %% Don't currently use Opts/Svc in the accept case. %% dwr/1 dwr(#diameter_caps{origin_host = OH, origin_realm = OR, origin_state_id = OSI}) -> ['DWR', {'Origin-Host', OH}, {'Origin-Realm', OR}, {'Origin-State-Id', OSI}]. %% restrict_nodes/1 restrict_nodes(false) -> []; restrict_nodes(nodes) -> [node() | nodes()]; restrict_nodes(node) -> [node()]; restrict_nodes(Nodes) when [] == Nodes; is_atom(hd(Nodes)) -> Nodes; restrict_nodes(F) -> diameter_lib:eval(F). -------------- next part -------------- A non-text attachment was scrubbed...
Name: smime.p7s Type: application/pkcs7-signature Size: 2182 bytes Desc: Cryptographic S/MIME signature URL: From Aleksander.Nycz@REDACTED Mon May 20 11:45:16 2013 From: Aleksander.Nycz@REDACTED (Aleksander Nycz) Date: Mon, 20 May 2013 11:45:16 +0200 Subject: [erlang-bugs] Memory leak in diameter_service module in diameter app (otp_R16B) In-Reply-To: <5199E1CA.6020209@comarch.pl> References: <5199E1CA.6020209@comarch.pl> Message-ID: <5199F0AC.9090800@comarch.pl> Hello, I think there is a resource (memory) leak in the diameter_service module. This module is a gen_server whose state contains the field watchdogT :: ets:tid(). This ETS table contains info about watchdogs. The diameter app service config is: [{'Origin-Host', HostName}, {'Origin-Realm', Realm}, {'Vendor-Id', ...}, {'Product-Name', ...}, {'Auth-Application-Id', [?DCCA_APP_ID]}, {'Supported-Vendor-Id', [...]}, {application, [{alias, diameterNode}, {dictionary, dictionaryDCCA}, {module, dccaCallback}]}, {*restrict_connections, false*}] After starting the diameter app and adding a service and transport, the diameter_service state is: > diameter_service:state(diameterNode). #state{id = {1369,41606,329900}, service_name = diameterNode, service = #diameter_service{pid = <0.1011.0>, capabilities = #diameter_caps{...}, applications = [#diameter_app{...}]}, watchdogT = 4194395,peerT = 4259932,shared_peers = 4325469, local_peers = 4391006,monitor = false, options = [{sequence,{0,32}}, {share_peers,false}, {use_shared_peers,false}, {restrict_connections,false}]} and ets 4194395 has one record: > ets:tab2list(4194395). [#watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,41606,330086}, peer = false}] Next I ran a very simple test using the Seagull simulator. The test scenario is the following: 1. seagull: send CER 2.
seagull: recv CEA
3. seagull: send CCR (init)
4. seagull: recv CCA (init)
5. seagull: send CCR (update)
6. seagull: recv CCA (update)
7. seagull: send CCR (terminate)
8. seagull: recv CCA (terminate)

During the test there are two watchdogs in the ets table:

> ets:tab2list(4194395).
[#watchdog{pid = <0.1816.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,41823,711370}, peer = false}, #watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = okay, started = {1369,41606,330086}, peer = <0.1014.0>}]

After the test, but before the Tw timer elapsed, there are still two watchdogs, which is correct:

> ets:tab2list(4194395).
[#watchdog{pid = <0.1816.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,41823,711370}, peer = false}, #watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = down, started = {1369,41606,330086}, peer = <0.1014.0>}]

But when the Tw timer elapsed, the transport and watchdog processes are finished:

> erlang:is_process_alive(list_to_pid("<0.1014.0>")).
false
> erlang:is_process_alive(list_to_pid("<0.1013.0>")).
false

and two watchdogs are still in the ets table:

> ets:tab2list(4194395).
[#watchdog{pid = <0.1816.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,41823,711370}, peer = false}, #watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = down, started = {1369,41606,330086}, peer = <0.1014.0>}]

I think watchdog <0.1013.0> should be removed when the watchdog process terminates.

I ran the next test and now there are 3 watchdogs in the ets table:

> ets:tab2list(4194395).
[#watchdog{pid = <0.1816.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = down, started = {1369,41823,711370}, peer = <0.1817.0>}, #watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = down, started = {1369,41606,330086}, peer = <0.1014.0>}, #watchdog{pid = <0.3533.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,42342,845898}, peer = false}]

The watchdog and transport processes are not alive:

> erlang:is_process_alive(list_to_pid("<0.1816.0>")).
false
> erlang:is_process_alive(list_to_pid("<0.1817.0>")).
false

I suggest the following change to the code to correct this problem (file diameter_service.erl):

$ diff diameter_service.erl diameter_service.erl_ok
1006c1006
< connection_down(#watchdog{state = WS,
---
> connection_down(#watchdog{state = ?WD_OKAY,
1015,1017c1015,1021
<     ?WD_OKAY == WS
<         andalso
<         connection_down(Wd, fetch(PeerT, TPid), S).
---
>     connection_down(Wd, fetch(PeerT, TPid), S);
>
> connection_down(#watchdog{},
>                 To,
>                 #state{})
>   when is_atom(To) ->
>     ok.

You can find this solution in the attachment.

Regards
Aleksander Nycz

-- 
Aleksander Nycz
Senior Software Engineer
Telco_021 BSS R&D
Comarch SA
Phone: +48 12 646 1216
Mobile: +48 691 464 275
website: www.comarch.pl

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
%% %% %CopyrightBegin% %% %% Copyright Ericsson AB 2010-2013. All Rights Reserved. %% %% The contents of this file are subject to the Erlang Public License, %% Version 1.1, (the "License"); you may not use this file except in %% compliance with the License. You should have received a copy of the %% Erlang Public License along with this software. If not, it can be %% retrieved online at http://www.erlang.org/. %% %% Software distributed under the License is distributed on an "AS IS" %% basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See %% the License for the specific language governing rights and limitations %% under the License. %% %% %CopyrightEnd% %% %% %% Implements the process that represents a service. %% -module(diameter_service). -behaviour(gen_server). %% towards diameter_service_sup -export([start_link/1]). %% towards diameter -export([subscribe/1, unsubscribe/1, services/0, info/2]). %% towards diameter_config -export([start/1, stop/1, start_transport/2, stop_transport/2]). %% towards diameter_peer -export([notify/2]). %% towards diameter_traffic -export([find_incoming_app/4, pick_peer/3]). 
%% test/debug -export([services/1, subscriptions/1, subscriptions/0, call_module/3, whois/1, state/1, uptime/1]). %% gen_server callbacks -export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]). -include_lib("diameter/include/diameter.hrl"). -include("diameter_internal.hrl"). %% RFC 3539 watchdog states. -define(WD_INITIAL, initial). -define(WD_OKAY, okay). -define(WD_SUSPECT, suspect). -define(WD_DOWN, down). -define(WD_REOPEN, reopen). -type wd_state() :: ?WD_INITIAL | ?WD_OKAY | ?WD_SUSPECT | ?WD_DOWN | ?WD_REOPEN. -define(DEFAULT_TC, 30000). %% RFC 3588 ch 2.1 -define(RESTART_TC, 1000). %% if restart was this recent %% Used to be able to swap this with anything else dict-like but now %% rely on the fact that a service's #state{} record does not change %% in storing it in the ?STATE table and not always going through the %% service process. In particular, rely on the fact that operations on %% a ?Dict don't change the handle to it. -define(Dict, diameter_dict). %% Maintains state in a table. In contrast to previously, a service's %% state is not constant and is accessed outside of the service %% process. -define(STATE_TABLE, ?MODULE). %% The default sequence mask. -define(NOMASK, {0,32}). %% The default restrict_connections. -define(RESTRICT, nodes). %% Workaround for dialyzer's lack of understanding of match specs. -type match(T) :: T | '_' | '$1' | '$2'. %% State of service gen_server. Note that the state term itself %% doesn't change, which is relevant for the stateless application %% callbacks since the state is retrieved from ?STATE_TABLE from %% outside the service process. The pid in the service record is used %% to determine whether or not we need to call the process for a %% pick_peer callback in the stateful case. 
-record(state, {id = now(), service_name :: diameter:service_name(), %% key in ?STATE_TABLE service :: #diameter_service{}, watchdogT = ets_new(watchdogs) %% #watchdog{} at start :: ets:tid(), peerT = ets_new(peers) %% #peer{pid = TPid} at okay/reopen :: ets:tid(), shared_peers = ?Dict:new() %% Alias -> [{TPid, Caps}, ...] :: ets:tid(), local_peers = ?Dict:new() %% Alias -> [{TPid, Caps}, ...] :: ets:tid(), monitor = false :: false | pid(), %% process to die with options :: [{sequence, diameter:sequence()} %% sequence mask | {restrict_connections, diameter:restriction()} | {share_peers, boolean()} %% broadcast peers to remote nodes? | {use_shared_peers, boolean()}]}).%% use broadcasted peers? %% shared_peers reflects the peers broadcast from remote nodes. %% Record representing an RFC 3539 watchdog process implemented by %% diameter_watchdog. -record(watchdog, {pid :: match(pid()), type :: match(connect | accept), ref :: match(reference()), %% key into diameter_config options :: match([diameter:transport_opt()]),%% from start_transport state = ?WD_INITIAL :: match(wd_state()), started = now(), %% at process start peer = false :: match(boolean() | pid())}). %% true at accepted, pid() at okay/reopen %% Record representing a Peer State Machine process implemented by %% diameter_peer_fsm. -record(peer, {pid :: pid(), apps :: [{0..16#FFFFFFFF, diameter:app_alias()}], %% {Id, Alias} caps :: #diameter_caps{}, started = now(), %% at process start watchdog :: pid()}). %% key into watchdogT %% --------------------------------------------------------------------------- %% # start/1 %% --------------------------------------------------------------------------- start(SvcName) -> diameter_service_sup:start_child(SvcName). start_link(SvcName) -> Options = [{spawn_opt, diameter_lib:spawn_opts(server, [])}], gen_server:start_link(?MODULE, [SvcName], Options). 
%% Put the arbitrary term SvcName in a list in case we ever want to %% send more than this and need to distinguish old from new. %% --------------------------------------------------------------------------- %% # stop/1 %% --------------------------------------------------------------------------- stop(SvcName) -> case whois(SvcName) of undefined -> {error, not_started}; Pid -> stop(call_service(Pid, stop), Pid) end. stop(ok, Pid) -> MRef = erlang:monitor(process, Pid), receive {'DOWN', MRef, process, _, _} -> ok end; stop(No, _) -> No. %% --------------------------------------------------------------------------- %% # start_transport/2 %% --------------------------------------------------------------------------- start_transport(SvcName, {_Ref, _Type, _Opts} = T) -> call_service_by_name(SvcName, {start, T}). %% --------------------------------------------------------------------------- %% # stop_transport/2 %% --------------------------------------------------------------------------- stop_transport(_, []) -> ok; stop_transport(SvcName, [_|_] = Refs) -> call_service_by_name(SvcName, {stop, Refs}). %% --------------------------------------------------------------------------- %% # info/2 %% --------------------------------------------------------------------------- info(SvcName, Item) -> case lookup_state(SvcName) of [#state{} = S] -> service_info(Item, S); [] -> undefined end. %% lookup_state/1 lookup_state(SvcName) -> ets:lookup(?STATE_TABLE, SvcName). %% --------------------------------------------------------------------------- %% # subscribe/1 %% # unsubscribe/1 %% --------------------------------------------------------------------------- subscribe(SvcName) -> diameter_reg:add({?MODULE, subscriber, SvcName}). unsubscribe(SvcName) -> diameter_reg:del({?MODULE, subscriber, SvcName}). subscriptions(Pat) -> pmap(diameter_reg:match({?MODULE, subscriber, Pat})). subscriptions() -> subscriptions('_'). 
pmap(Props) -> lists:map(fun({{?MODULE, _, Name}, Pid}) -> {Name, Pid} end, Props). %% --------------------------------------------------------------------------- %% # services/1 %% --------------------------------------------------------------------------- services(Pat) -> pmap(diameter_reg:match({?MODULE, service, Pat})). services() -> services('_'). whois(SvcName) -> case diameter_reg:match({?MODULE, service, SvcName}) of [{_, Pid}] -> Pid; [] -> undefined end. %% --------------------------------------------------------------------------- %% # pick_peer/3 %% --------------------------------------------------------------------------- -spec pick_peer(SvcName, AppOrAlias, Opts) -> {{TPid, Caps, App}, Mask} | false | {error, term()} when SvcName :: diameter:service_name(), AppOrAlias :: {alias, diameter:app_alias()} | #diameter_app{}, Opts :: tuple(), TPid :: pid(), Caps :: #diameter_caps{}, App :: #diameter_app{}, Mask :: diameter:sequence(). pick_peer(SvcName, App, Opts) -> pick(lookup_state(SvcName), App, Opts). pick([], _, _) -> {error, no_service}; pick([S], App, Opts) -> pick(S, App, Opts); pick(#state{service = #diameter_service{applications = Apps}} = S, {alias, Alias}, Opts) -> %% initial call from diameter:call/4 pick(S, find_outgoing_app(Alias, Apps), Opts); pick(_, false, _) -> false; pick(#state{options = [{_, Mask} | _]} = S, #diameter_app{module = ModX, dictionary = Dict} = App0, {DestF, Filter, Xtra}) -> App = App0#diameter_app{module = ModX ++ Xtra}, [_,_] = RealmAndHost = diameter_lib:eval([DestF, Dict]), case pick_peer(App, RealmAndHost, Filter, S) of {TPid, Caps} -> {{TPid, Caps, App}, Mask}; false = No -> No end. 
%% --------------------------------------------------------------------------- %% # find_incoming_app/4 %% --------------------------------------------------------------------------- -spec find_incoming_app(PeerT, TPid, Id, Apps) -> {#diameter_app{}, #diameter_caps{}} %% connection and suitable app | #diameter_caps{} %% connection but no suitable app | false %% no connection when PeerT :: ets:tid(), TPid :: pid(), Id :: non_neg_integer(), Apps :: [#diameter_app{}]. find_incoming_app(PeerT, TPid, Id, Apps) -> try ets:lookup(PeerT, TPid) of [#peer{} = P] -> find_incoming_app(P, Id, Apps); [] -> %% transport has gone down false catch error: badarg -> %% service has gone down (and taken table with it) false end. %% --------------------------------------------------------------------------- %% # notify/2 %% --------------------------------------------------------------------------- notify(SvcName, Msg) -> Pid = whois(SvcName), is_pid(Pid) andalso (Pid ! Msg). %% =========================================================================== %% =========================================================================== state(Svc) -> call_service(Svc, state). uptime(Svc) -> call_service(Svc, uptime). %% call_module/3 call_module(Service, AppMod, Request) -> call_service(Service, {call_module, AppMod, Request}). %% --------------------------------------------------------------------------- %% # init/1 %% --------------------------------------------------------------------------- init([SvcName]) -> process_flag(trap_exit, true), %% ensure terminate(shutdown, _) i(SvcName, diameter_reg:add_new({?MODULE, service, SvcName})). i(SvcName, true) -> {ok, i(SvcName)}; i(_, false) -> {stop, {shutdown, already_started}}. 
%% --------------------------------------------------------------------------- %% # handle_call/3 %% --------------------------------------------------------------------------- handle_call(state, _, S) -> {reply, S, S}; handle_call(uptime, _, #state{id = T} = S) -> {reply, diameter_lib:now_diff(T), S}; %% Start a transport. handle_call({start, {Ref, Type, Opts}}, _From, S) -> {reply, start(Ref, {Type, Opts}, S), S}; %% Stop transports. handle_call({stop, Refs}, _From, S) -> shutdown(Refs, S), {reply, ok, S}; %% pick_peer with mutable state handle_call({pick_peer, Local, Remote, App}, _From, S) -> #diameter_app{mutable = true} = App, %% assert {reply, pick_peer(Local, Remote, self(), S#state.service_name, App), S}; handle_call({call_module, AppMod, Req}, From, S) -> call_module(AppMod, Req, From, S); handle_call(stop, _From, S) -> shutdown(service, S), {stop, normal, ok, S}; %% The server currently isn't guaranteed to be dead when the caller %% gets the reply. We deal with this in the call to the server, %% starting a monitor that waits for DOWN before returning. handle_call(Req, From, S) -> unexpected(handle_call, [Req, From], S), {reply, nok, S}. %% --------------------------------------------------------------------------- %% # handle_cast/2 %% --------------------------------------------------------------------------- handle_cast(Req, S) -> unexpected(handle_cast, [Req], S), {noreply, S}. %% --------------------------------------------------------------------------- %% # handle_info/2 %% --------------------------------------------------------------------------- handle_info(T, #state{} = S) -> case transition(T,S) of ok -> {noreply, S}; {stop, Reason} -> {stop, {shutdown, Reason}, S} end. %% transition/2 %% Peer process is telling us to start a new accept process. transition({accepted, Pid, TPid}, S) -> accepted(Pid, TPid, S), ok; %% Connecting transport is being restarted by watchdog. 
transition({reconnect, Pid}, S) -> reconnect(Pid, S), ok; %% Watchdog is sending notification of transport death. transition({close, Pid, Reason}, #state{service_name = SvcName, watchdogT = WatchdogT}) -> #watchdog{state = WS, ref = Ref, type = Type, options = Opts} = fetch(WatchdogT, Pid), WS /= ?WD_OKAY andalso send_event(SvcName, {closed, Ref, Reason, {type(Type), Opts}}), ok; %% Watchdog is sending notification of a state transition. transition({watchdog, Pid, {[TPid | Data], From, To}}, #state{service_name = SvcName, watchdogT = WatchdogT} = S) -> #watchdog{ref = Ref, type = T, options = Opts} = Wd = fetch(WatchdogT, Pid), watchdog(TPid, Data, From, To, Wd, S), send_event(SvcName, {watchdog, Ref, TPid, {From, To}, {T, Opts}}), ok; %% Death of a watchdog process (#watchdog.pid) results in the removal of %% its peer and any associated conn record when 'DOWN' is received. %% Death of a peer process (#peer.pid, #watchdog.peer) results in %% ?WD_DOWN. %% Monitor process has died. Just die with a reason that tells %% diameter_config about the happening. If a cleaner shutdown is %% required then someone should stop us. transition({'DOWN', MRef, process, _, Reason}, #state{monitor = MRef}) -> {stop, {monitor, Reason}}; %% Local watchdog process has died. transition({'DOWN', _, process, Pid, _Reason}, S) when node(Pid) == node() -> watchdog_down(Pid, S), ok; %% Remote service wants to know about shared peers. transition({service, Pid}, S) -> share_peers(Pid, S), ok; %% Remote service is communicating a shared peer. transition({peer, TPid, Aliases, Caps}, S) -> remote_peer_up(TPid, Aliases, Caps, S), ok; %% Remote peer process has died. transition({'DOWN', _, process, TPid, _}, S) -> remote_peer_down(TPid, S), ok; %% Restart after tc expiry. transition({tc_timeout, T}, S) -> tc_timeout(T, S), ok; transition(Req, S) -> unexpected(handle_info, [Req], S), ok. 
%% --------------------------------------------------------------------------- %% # terminate/2 %% --------------------------------------------------------------------------- terminate(Reason, #state{service_name = Name} = S) -> send_event(Name, stop), ets:delete(?STATE_TABLE, Name), shutdown == Reason %% application shutdown andalso shutdown(application, S). %% --------------------------------------------------------------------------- %% # code_change/3 %% --------------------------------------------------------------------------- code_change(FromVsn, #state{service_name = SvcName, service = #diameter_service{applications = Apps}} = S, Extra) -> lists:foreach(fun(A) -> code_change(FromVsn, SvcName, Extra, A) end, Apps), {ok, S}. code_change(FromVsn, SvcName, Extra, #diameter_app{alias = Alias} = A) -> {ok, S} = cb(A, code_change, [FromVsn, mod_state(Alias), Extra, SvcName]), mod_state(Alias, S). %% =========================================================================== %% =========================================================================== unexpected(F, A, #state{service_name = Name}) -> ?UNEXPECTED(F, A ++ [Name]). cb(#diameter_app{module = [_|_] = M}, F, A) -> eval(M, F, A). eval([M|X], F, A) -> apply(M, F, A ++ X). %% Callback with state. state_cb(#diameter_app{module = ModX, mutable = false, init_state = S}, pick_peer = F, A) -> eval(ModX, F, A ++ [S]); state_cb(#diameter_app{module = ModX, alias = Alias}, F, A) -> eval(ModX, F, A ++ [mod_state(Alias)]). choose(true, X, _) -> X; choose(false, _, X) -> X. ets_new(Tbl) -> ets:new(Tbl, [{keypos, 2}]). insert(Tbl, Rec) -> ets:insert(Tbl, Rec), Rec. %% Using the process dictionary for the callback state was initially %% just a way to make what was horrendous trace (big state record and %% much else everywhere) somewhat more readable. There's not as much %% need for it now but it's no worse (except possibly that we don't %% see the table identifier being passed around) than an ets table so %% keep it. 
mod_state(Alias) -> get({?MODULE, mod_state, Alias}). mod_state(Alias, ModS) -> put({?MODULE, mod_state, Alias}, ModS). %% --------------------------------------------------------------------------- %% # shutdown/2 %% --------------------------------------------------------------------------- %% remove_transport shutdown(Refs, #state{watchdogT = WatchdogT}) when is_list(Refs) -> ets:foldl(fun(P,ok) -> st(P, Refs), ok end, ok, WatchdogT); %% application/service shutdown shutdown(Reason, #state{watchdogT = WatchdogT}) when Reason == application; Reason == service -> diameter_lib:wait(ets:foldl(fun(P,A) -> st(P, Reason, A) end, [], WatchdogT)). %% st/2 st(#watchdog{ref = Ref, pid = Pid}, Refs) -> lists:member(Ref, Refs) andalso (Pid ! {shutdown, self(), transport}). %% 'DOWN' cleans up %% st/3 st(#watchdog{pid = Pid}, Reason, Acc) -> Pid ! {shutdown, self(), Reason}, [Pid | Acc]. %% --------------------------------------------------------------------------- %% # call_service/2 %% --------------------------------------------------------------------------- call_service(Pid, Req) when is_pid(Pid) -> cs(Pid, Req); call_service(SvcName, Req) -> call_service_by_name(SvcName, Req). call_service_by_name(SvcName, Req) -> cs(whois(SvcName), Req). cs(Pid, Req) when is_pid(Pid) -> try gen_server:call(Pid, Req, infinity) catch E: Reason when E == exit -> {error, {E, Reason}} end; cs(undefined, _) -> {error, no_service}. %% --------------------------------------------------------------------------- %% # i/1 %% --------------------------------------------------------------------------- %% Initialize the state of a service gen_server. i(SvcName) -> %% Split the config into a server state and a list of transports. {#state{} = S, CL} = lists:foldl(fun cfg_acc/2, {false, []}, diameter_config:lookup(SvcName)), %% Publish the state in order to be able to access it outside of %% the service process. 
Originally table identifiers were only %% known to the service process but we now want to provide the %% option of application callbacks being 'stateless' in order to %% avoid having to go through a common process. (Eg. An agent that %% sends a request for every incoming request.) true = ets:insert_new(?STATE_TABLE, S), %% Start fsms for each transport. send_event(SvcName, start), lists:foreach(fun(T) -> start_fsm(T,S) end, CL), init_shared(S), S. cfg_acc({SvcName, #diameter_service{applications = Apps} = Rec, Opts}, {false, Acc}) -> lists:foreach(fun init_mod/1, Apps), S = #state{service_name = SvcName, service = Rec#diameter_service{pid = self()}, monitor = mref(get_value(monitor, Opts)), options = service_options(Opts)}, {S, Acc}; cfg_acc({_Ref, Type, _Opts} = T, {S, Acc}) when Type == connect; Type == listen -> {S, [T | Acc]}. service_options(Opts) -> [{sequence, proplists:get_value(sequence, Opts, ?NOMASK)}, {share_peers, get_value(share_peers, Opts)}, {use_shared_peers, get_value(use_shared_peers, Opts)}, {restrict_connections, proplists:get_value(restrict_connections, Opts, ?RESTRICT)}]. %% The order of options is significant since we match against the list. mref(false = No) -> No; mref(P) -> erlang:monitor(process, P). init_shared(#state{options = [_, _, {_, true} | _], service_name = Svc}) -> diameter_peer:notify(Svc, {service, self()}); init_shared(#state{options = [_, _, {_, false} | _]}) -> ok. init_mod(#diameter_app{alias = Alias, init_state = S}) -> mod_state(Alias, S). start_fsm({Ref, Type, Opts}, S) -> start(Ref, {Type, Opts}, S). get_value(Key, Vs) -> {_, V} = lists:keyfind(Key, 1, Vs), V. %% --------------------------------------------------------------------------- %% # start/3 %% --------------------------------------------------------------------------- %% If the initial start/3 at service/transport start succeeds then %% subsequent calls to start/4 on the same service will also succeed %% since they involve the same call to merge_service/2. 
We merge here %% rather than earlier since the service may not yet be configured %% when the transport is configured. start(Ref, {T, Opts}, S) when T == connect; T == listen -> try {ok, start(Ref, type(T), Opts, S)} catch ?FAILURE(Reason) -> {error, Reason} end. %% TODO: don't actually raise any errors yet %% There used to be a difference here between the handling of %% configured listening and connecting transports but now we simply %% tell the transport_module to start an accepting or connecting %% process respectively, the transport implementation initiating %% listening on a port as required. type(listen) -> accept; type(accept) -> listen; type(connect = T) -> T. %% start/4 start(Ref, Type, Opts, #state{watchdogT = WatchdogT, peerT = PeerT, options = SvcOpts, service_name = SvcName, service = Svc0}) when Type == connect; Type == accept -> #diameter_service{applications = Apps} = Svc = merge_service(Opts, Svc0), {_,_} = Mask = proplists:get_value(sequence, SvcOpts), Pid = s(Type, Ref, {diameter_traffic:make_recvdata([SvcName, PeerT, Apps, Mask]), Opts, SvcOpts, Svc}), insert(WatchdogT, #watchdog{pid = Pid, type = Type, ref = Ref, options = Opts}), Pid. %% Note that the service record passed into the watchdog is the merged %% record so that each watchdog may get a different record. This %% record is what is passed back into application callbacks. s(Type, Ref, T) -> {_MRef, Pid} = diameter_watchdog:start({Type, Ref}, T), Pid. %% merge_service/2 merge_service(Opts, Svc) -> lists:foldl(fun ms/2, Svc, Opts). %% Limit the applications known to the fsm to those in the 'apps' %% option. That this might be empty is checked by the fsm. It's not %% checked at config-time since there's no requirement that the %% service be configured first. (Which could be considered a bit odd.) 
ms({applications, As}, #diameter_service{applications = Apps} = S) when is_list(As) -> S#diameter_service{applications = [A || A <- Apps, lists:member(A#diameter_app.alias, As)]}; %% The fact that all capabilities can be configured on the transports %% means that the service doesn't necessarily represent a single %% locally implemented Diameter peer as identified by Origin-Host: a %% transport can configure its own Origin-Host. This means that the %% service is little more than a placeholder for default capabilities %% plus a list of applications that individual transports can choose %% to support (or not). ms({capabilities, Opts}, #diameter_service{capabilities = Caps0} = Svc) when is_list(Opts) -> %% make_caps has already succeeded in diameter_config so it will succeed %% again here. {ok, Caps} = diameter_capx:make_caps(Caps0, Opts), Svc#diameter_service{capabilities = Caps}; ms(_, Svc) -> Svc. %% --------------------------------------------------------------------------- %% # accepted/3 %% --------------------------------------------------------------------------- accepted(Pid, _TPid, #state{watchdogT = WatchdogT} = S) -> #watchdog{ref = Ref, type = accept = T, peer = false, options = Opts} = Wd = fetch(WatchdogT, Pid), insert(WatchdogT, Wd#watchdog{peer = true}),%% mark replacement as started start(Ref, T, Opts, S). %% start new watchdog fetch(Tid, Key) -> [T] = ets:lookup(Tid, Key), T. %% --------------------------------------------------------------------------- %% # watchdog/6 %% %% React to a watchdog state transition. %% --------------------------------------------------------------------------- %% Watchdog has a new open connection. watchdog(TPid, [T], _, ?WD_OKAY, Wd, State) -> connection_up({TPid, T}, Wd, State); %% Watchdog has a new connection that will be opened after DW[RA] %% exchange. watchdog(TPid, [T], _, ?WD_REOPEN, Wd, State) -> reopen({TPid, T}, Wd, State); %% Watchdog has recovered a suspect connection. 
watchdog(TPid, [], ?WD_SUSPECT, ?WD_OKAY, Wd, State) -> #watchdog{peer = TPid} = Wd, %% assert connection_up(Wd, State); %% Watchdog has an unresponsive connection. watchdog(TPid, [], ?WD_OKAY, ?WD_SUSPECT = To, Wd, State) -> #watchdog{peer = TPid} = Wd, %% assert connection_down(Wd, To, State); %% Watchdog has lost its connection. watchdog(TPid, [], _, ?WD_DOWN = To, Wd, #state{peerT = PeerT} = S) -> close(Wd, S), connection_down(Wd, To, S), ets:delete(PeerT, TPid); watchdog(_, [], _, _, _, _) -> ok. %% --------------------------------------------------------------------------- %% # connection_up/3 %% --------------------------------------------------------------------------- %% Watchdog process has reached state OKAY. connection_up({TPid, {Caps, SupportedApps, Pkt}}, #watchdog{pid = Pid} = Wd, #state{peerT = PeerT} = S) -> Pr = #peer{pid = TPid, apps = SupportedApps, caps = Caps, watchdog = Pid}, insert(PeerT, Pr), connection_up([Pkt], Wd#watchdog{peer = TPid}, Pr, S). %% --------------------------------------------------------------------------- %% # reopen/3 %% --------------------------------------------------------------------------- reopen({TPid, {Caps, SupportedApps, _Pkt}}, #watchdog{pid = Pid} = Wd, #state{watchdogT = WatchdogT, peerT = PeerT}) -> insert(PeerT, #peer{pid = TPid, apps = SupportedApps, caps = Caps, watchdog = Pid}), insert(WatchdogT, Wd#watchdog{state = ?WD_REOPEN, peer = TPid}). %% --------------------------------------------------------------------------- %% # connection_up/2 %% --------------------------------------------------------------------------- %% Watchdog has recovered a suspect connection. Note that there has %% been no new capabilities exchange in this case. connection_up(#watchdog{peer = TPid} = Wd, #state{peerT = PeerT} = S) -> connection_up([], Wd, fetch(PeerT, TPid), S). 
%% connection_up/4 connection_up(Extra, #watchdog{peer = TPid} = Wd, #peer{apps = SApps, caps = Caps} = Pr, #state{watchdogT = WatchdogT, local_peers = LDict, service_name = SvcName, service = #diameter_service{applications = Apps}} = S) -> insert(WatchdogT, Wd#watchdog{state = ?WD_OKAY}), diameter_traffic:peer_up(TPid), insert_local_peer(SApps, {{TPid, Caps}, {SvcName, Apps}}, LDict), report_status(up, Wd, Pr, S, Extra). insert_local_peer(SApps, T, LDict) -> lists:foldl(fun(A,D) -> ilp(A, T, D) end, LDict, SApps). ilp({Id, Alias}, {TC, SA}, LDict) -> init_conn(Id, Alias, TC, SA), ?Dict:append(Alias, TC, LDict). init_conn(Id, Alias, {TPid, _} = TC, {SvcName, Apps}) -> #diameter_app{id = Id} %% assert = App = find_app(Alias, Apps), peer_cb(App, peer_up, [SvcName, TC]) orelse exit(TPid, kill). %% fake transport failure %% --------------------------------------------------------------------------- %% # find_incoming_app/3 %% --------------------------------------------------------------------------- %% No one should be sending the relay identifier. find_incoming_app(#peer{caps = Caps}, ?APP_ID_RELAY, _) -> Caps; find_incoming_app(Peer, Id, Apps) when is_integer(Id) -> find_incoming_app(Peer, [Id, ?APP_ID_RELAY], Apps); %% Note that the apps represented in SApps may be a strict subset of %% those in Apps. find_incoming_app(#peer{apps = SApps, caps = Caps}, Ids, Apps) -> case keyfind(Ids, 1, SApps) of {_Id, Alias} -> {#diameter_app{} = find_app(Alias, Apps), Caps}; false -> Caps end. %% keyfind/3 keyfind([], _, _) -> false; keyfind([Key | Rest], Pos, L) -> case lists:keyfind(Key, Pos, L) of false -> keyfind(Rest, Pos, L); T -> T end. %% find_outgoing_app/2 find_outgoing_app(Alias, Apps) -> case find_app(Alias, Apps) of #diameter_app{id = ?APP_ID_RELAY} -> false; A -> A end. %% find_app/2 find_app(Alias, Apps) -> lists:keyfind(Alias, #diameter_app.alias, Apps). %% Don't bring down the service (and all associated connections) %% regardless of what happens. 
peer_cb(App, F, A) ->
    try state_cb(App, F, A) of
        ModS ->
            mod_state(App#diameter_app.alias, ModS),
            true
    catch
        E:R ->
            diameter_lib:error_report({failure, {E, R, ?STACK}},
                                      {App, F, A}),
            false
    end.

%% ---------------------------------------------------------------------------
%% # connection_down/3
%% ---------------------------------------------------------------------------

connection_down(#watchdog{state = ?WD_OKAY,
                          peer = TPid}
                = Wd,
                #peer{caps = Caps,
                      apps = SApps}
                = Pr,
                #state{service_name = SvcName,
                       service = #diameter_service{applications = Apps},
                       local_peers = LDict}
                = S) ->
    report_status(down, Wd, Pr, S, []),
    remove_local_peer(SApps, {{TPid, Caps}, {SvcName, Apps}}, LDict),
    diameter_traffic:peer_down(TPid);

connection_down(#watchdog{}, #peer{}, _) ->
    ok;

connection_down(#watchdog{state = ?WD_OKAY,
                          peer = TPid}
                = Wd,
                To,
                #state{watchdogT = WatchdogT,
                       peerT = PeerT}
                = S)
  when is_atom(To) ->
    insert(WatchdogT, Wd#watchdog{state = To}),
    connection_down(Wd, fetch(PeerT, TPid), S);

connection_down(#watchdog{}, To, #state{})
  when is_atom(To) ->
    ok.

remove_local_peer(SApps, T, LDict) ->
    lists:foldl(fun(A,D) -> rlp(A, T, D) end, LDict, SApps).

rlp({Id, Alias}, {TC, SA}, LDict) ->
    L = ?Dict:fetch(Alias, LDict),
    down_conn(Id, Alias, TC, SA),
    ?Dict:store(Alias, lists:delete(TC, L), LDict).

down_conn(Id, Alias, TC, {SvcName, Apps}) ->
    #diameter_app{id = Id}  %% assert
        = App
        = find_app(Alias, Apps),
    peer_cb(App, peer_down, [SvcName, TC]).

%% ---------------------------------------------------------------------------
%% # watchdog_down/2
%% ---------------------------------------------------------------------------

%% Watchdog process has died.

watchdog_down(Pid, #state{watchdogT = WatchdogT} = S) ->
    Wd = fetch(WatchdogT, Pid),
    ets:delete_object(WatchdogT, Wd),
    restart(Wd,S),
    wd_down(Wd,S).

%% Watchdog has never reached OKAY ...
wd_down(#watchdog{peer = B}, _)
  when is_boolean(B) ->
    ok;

%% ... or maybe it has.
wd_down(#watchdog{peer = TPid} = Wd, #state{peerT = PeerT} = S) ->
    connection_down(Wd, ?WD_DOWN, S),
    ets:delete(PeerT, TPid).

%% restart/2

restart(Wd, S) ->
    q_restart(restart(Wd), S).

%% restart/1

%% Always try to reconnect.
restart(#watchdog{ref = Ref,
                  type = connect = T,
                  options = Opts,
                  started = Time}) ->
    {Time, {Ref, T, Opts}};

%% Transport connection hasn't yet been accepted ...
restart(#watchdog{ref = Ref,
                  type = accept = T,
                  options = Opts,
                  peer = false,
                  started = Time}) ->
    {Time, {Ref, T, Opts}};

%% ... or it has: a replacement has already been spawned.
restart(#watchdog{type = accept}) ->
    false.

%% q_restart/2

%% Start the reconnect timer.
q_restart({Time, {_Ref, Type, Opts} = T}, S) ->
    start_tc(tc(Time, default_tc(Type, Opts)), T, S);
q_restart(false, _) ->
    ok.

%% RFC 3588, 2.1:
%%
%%   When no transport connection exists with a peer, an attempt to
%%   connect SHOULD be periodically made.  This behavior is handled via
%%   the Tc timer, whose recommended value is 30 seconds.  There are
%%   certain exceptions to this rule, such as when a peer has terminated
%%   the transport connection stating that it does not wish to
%%   communicate.

default_tc(connect, Opts) ->
    proplists:get_value(reconnect_timer, Opts, ?DEFAULT_TC);
default_tc(accept, _) ->
    0.

%% Bound tc below if the watchdog was restarted recently to avoid
%% continuous restarts in case of faulty config or other problems.
tc(Time, Tc) ->
    choose(Tc > ?RESTART_TC
             orelse timer:now_diff(now(), Time) > 1000*?RESTART_TC,
           Tc,
           ?RESTART_TC).

start_tc(0, T, S) ->
    tc_timeout(T, S);
start_tc(Tc, T, _) ->
    erlang:send_after(Tc, self(), {tc_timeout, T}).

%% tc_timeout/2

tc_timeout({Ref, _Type, _Opts} = T, #state{service_name = SvcName} = S) ->
    tc(diameter_config:have_transport(SvcName, Ref), T, S).

tc(true, {Ref, Type, Opts}, #state{service_name = SvcName} = S) ->
    send_event(SvcName, {reconnect, Ref, Opts}),
    start(Ref, Type, Opts, S);
tc(false = No, _, _) ->  %% removed
    No.
%% ---------------------------------------------------------------------------
%% # close/2
%% ---------------------------------------------------------------------------

%% The watchdog doesn't start a new fsm in the accept case, it
%% simply stays alive until someone tells it to die in order for
%% another watchdog to be able to detect that it should transition
%% from initial into reopen rather than okay. That someone is either
%% the accepting watchdog upon reception of a CER from the previously
%% connected peer, or us after reconnect_timer timeout.

close(#watchdog{type = connect}, _) ->
    ok;
close(#watchdog{type = accept,
                pid = Pid,
                ref = Ref,
                options = Opts},
      #state{service_name = SvcName}) ->
    c(Pid, diameter_config:have_transport(SvcName, Ref), Opts).

%% Tell watchdog to (maybe) die later ...
c(Pid, true, Opts) ->
    Tc = proplists:get_value(reconnect_timer, Opts, 2*?DEFAULT_TC),
    erlang:send_after(Tc, Pid, close);

%% ... or now.
c(Pid, false, _Opts) ->
    Pid ! close.

%% The RFCs only document the behaviour of Tc, our reconnect_timer,
%% for the establishment of connections but we also give
%% reconnect_timer semantics for a listener, being the time within
%% which a new connection attempt is expected of a connecting peer.
%% The value should be greater than the peer's Tc + jitter.

%% ---------------------------------------------------------------------------
%% # reconnect/2
%% ---------------------------------------------------------------------------

reconnect(Pid, #state{service_name = SvcName,
                      watchdogT = WatchdogT}) ->
    #watchdog{ref = Ref,
              type = connect,
              options = Opts}
        = fetch(WatchdogT, Pid),
    send_event(SvcName, {reconnect, Ref, Opts}).

%% ---------------------------------------------------------------------------
%% # call_module/4
%% ---------------------------------------------------------------------------

%% Backwards compatibility and never documented/advertised. May be
%% removed.
call_module(Mod, Req, From, #state{service
                                   = #diameter_service{applications = Apps},
                                   service_name = Svc}
                            = S) ->
    case cm([A || A <- Apps, Mod == hd(A#diameter_app.module)],
            Req,
            From,
            Svc)
    of
        {reply = T, RC} ->
            {T, RC, S};
        noreply = T ->
            {T, S};
        Reason ->
            {reply, {error, Reason}, S}
    end.

cm([#diameter_app{alias = Alias} = App], Req, From, Svc) ->
    Args = [Req, From, Svc],
    try state_cb(App, handle_call, Args) of
        {noreply = T, ModS} ->
            mod_state(Alias, ModS),
            T;
        {reply = T, RC, ModS} ->
            mod_state(Alias, ModS),
            {T, RC};
        T ->
            diameter_lib:error_report({invalid, T},
                                      {App, handle_call, Args}),
            invalid
    catch
        E:Reason ->
            diameter_lib:error_report({failure, {E, Reason, ?STACK}},
                                      {App, handle_call, Args}),
            failure
    end;
cm([], _, _, _) ->
    unknown;
cm([_,_|_], _, _, _) ->
    multiple.

%% ---------------------------------------------------------------------------
%% # report_status/5
%% ---------------------------------------------------------------------------

report_status(Status,
              #watchdog{ref = Ref,
                        peer = TPid,
                        type = Type,
                        options = Opts},
              #peer{apps = [_|_] = As,
                    caps = Caps},
              #state{service_name = SvcName}
              = S,
              Extra) ->
    share_peer(Status, Caps, As, TPid, S),
    Info = [Status, Ref, {TPid, Caps}, {type(Type), Opts} | Extra],
    send_event(SvcName, list_to_tuple(Info)).

%% send_event/2

send_event(SvcName, Info) ->
    send_event(#diameter_event{service = SvcName,
                               info = Info}).

send_event(#diameter_event{service = SvcName} = E) ->
    lists:foreach(fun({_, Pid}) -> Pid ! E end, subscriptions(SvcName)).

%% ---------------------------------------------------------------------------
%% # share_peer/5
%% ---------------------------------------------------------------------------

share_peer(up, Caps, Aliases, TPid, #state{options = [_, {_, true} | _],
                                           service_name = Svc}) ->
    diameter_peer:notify(Svc, {peer, TPid, Aliases, Caps});

share_peer(_, _, _, _, _) ->
    ok.
%% ---------------------------------------------------------------------------
%% # share_peers/2
%% ---------------------------------------------------------------------------

share_peers(Pid, #state{options = [_, {_, true} | _],
                        local_peers = PDict}) ->
    ?Dict:fold(fun(A,Ps,ok) -> sp(Pid, A, Ps), ok end, ok, PDict);

share_peers(_, _) ->
    ok.

sp(Pid, Alias, Peers) ->
    lists:foreach(fun({P,C}) -> Pid ! {peer, P, [Alias], C} end, Peers).

%% ---------------------------------------------------------------------------
%% # remote_peer_up/4
%% ---------------------------------------------------------------------------

remote_peer_up(Pid, Aliases, Caps, #state{options = [_, _, {_, true} | _],
                                          service = Svc,
                                          shared_peers = PDict}) ->
    #diameter_service{applications = Apps} = Svc,
    Key = #diameter_app.alias,
    As = lists:filter(fun(A) -> lists:keymember(A, Key, Apps) end, Aliases),
    rpu(Pid, Caps, PDict, As);

remote_peer_up(_, _, _, #state{options = [_, _, {_, false} | _]}) ->
    ok.

rpu(_, _, PDict, []) ->
    PDict;
rpu(Pid, Caps, PDict, Aliases) ->
    erlang:monitor(process, Pid),
    T = {Pid, Caps},
    lists:foreach(fun(A) -> ?Dict:append(A, T, PDict) end, Aliases).

%% ---------------------------------------------------------------------------
%% # remote_peer_down/2
%% ---------------------------------------------------------------------------

remote_peer_down(Pid, #state{options = [_, _, {_, true} | _],
                             shared_peers = PDict}) ->
    lists:foreach(fun(A) -> rpd(Pid, A, PDict) end,
                  ?Dict:fetch_keys(PDict)).

rpd(Pid, Alias, PDict) ->
    ?Dict:update(Alias, fun(Ps) -> lists:keydelete(Pid, 1, Ps) end, PDict).
%% ---------------------------------------------------------------------------
%% pick_peer/4
%% ---------------------------------------------------------------------------

pick_peer(#diameter_app{alias = Alias}
          = App,
          RealmAndHost,
          Filter,
          #state{local_peers = L,
                 shared_peers = S,
                 service_name = SvcName,
                 service = #diameter_service{pid = Pid}}) ->
    pick_peer(peers(Alias, RealmAndHost, Filter, L),
              peers(Alias, RealmAndHost, Filter, S),
              Pid,
              SvcName,
              App).

%% pick_peer/5

pick_peer([], [], _, _, _) ->
    false;

%% App state is mutable but we're not in the service process: go there.
pick_peer(Local, Remote, Pid, _SvcName, #diameter_app{mutable = true} = App)
  when self() /= Pid ->
    case call_service(Pid, {pick_peer, Local, Remote, App}) of
        {TPid, _} = T when is_pid(TPid) ->
            T;
        {error, _} ->
            false
    end;

%% App state isn't mutable or it is and we're in the service process:
%% do the deed.
pick_peer(Local,
          Remote,
          _Pid,
          SvcName,
          #diameter_app{alias = Alias,
                        init_state = S,
                        mutable = M}
          = App) ->
    Args = [Local, Remote, SvcName],
    try state_cb(App, pick_peer, Args) of
        {ok, {TPid, #diameter_caps{}} = T} when is_pid(TPid) ->
            T;
        {{TPid, #diameter_caps{}} = T, ModS} when is_pid(TPid), M ->
            mod_state(Alias, ModS),
            T;
        {false = No, ModS} when M ->
            mod_state(Alias, ModS),
            No;
        {ok, false = No} ->
            No;
        false = No ->
            No;
        {{TPid, #diameter_caps{}} = T, S} when is_pid(TPid) ->
            T;                  %% Accept returned state in the immutable
        {false = No, S} ->      %% case as long as it isn't changed.
            No;
        T ->
            diameter_lib:error_report({invalid, T, App},
                                      {App, pick_peer, Args})
    catch
        E:Reason ->
            diameter_lib:error_report({failure, {E, Reason, ?STACK}},
                                      {App, pick_peer, Args})
    end.

%% peers/4

peers(Alias, RH, Filter, Peers) ->
    case ?Dict:find(Alias, Peers) of
        {ok, L} ->
            ps(L, RH, Filter, {[],[]});
        error ->
            []
    end.

%% Place a peer whose Destination-Host/Realm matches those of the
%% request at the front of the result list. Could add some sort of
%% 'sort' option to allow more control.
ps([], _, _, {Ys, Ns}) ->
    lists:reverse(Ys, Ns);
ps([{_TPid, #diameter_caps{} = Caps} = TC | Rest], RH, Filter, Acc) ->
    ps(Rest,
       RH,
       Filter,
       pacc(caps_filter(Caps, RH, Filter),
            caps_filter(Caps, RH, {all, [host, realm]}),
            TC,
            Acc)).

pacc(true, true, Peer, {Ts, Fs}) ->
    {[Peer|Ts], Fs};
pacc(true, false, Peer, {Ts, Fs}) ->
    {Ts, [Peer|Fs]};
pacc(_, _, _, Acc) ->
    Acc.

%% caps_filter/3

caps_filter(C, RH, {neg, F}) ->
    not caps_filter(C, RH, F);
caps_filter(C, RH, {all, L})
  when is_list(L) ->
    lists:all(fun(F) -> caps_filter(C, RH, F) end, L);
caps_filter(C, RH, {any, L})
  when is_list(L) ->
    lists:any(fun(F) -> caps_filter(C, RH, F) end, L);
caps_filter(#diameter_caps{origin_host = {_,OH}}, [_,DH], host) ->
    eq(undefined, DH, OH);
caps_filter(#diameter_caps{origin_realm = {_,OR}}, [DR,_], realm) ->
    eq(undefined, DR, OR);
caps_filter(C, _, Filter) ->
    caps_filter(C, Filter).

%% caps_filter/2

caps_filter(_, none) ->
    true;
caps_filter(#diameter_caps{origin_host = {_,OH}}, {host, H}) ->
    eq(any, H, OH);
caps_filter(#diameter_caps{origin_realm = {_,OR}}, {realm, R}) ->
    eq(any, R, OR);

%% Anything else is expected to be an eval filter. Filter failure is
%% documented as being equivalent to a non-matching filter.

caps_filter(C, T) ->
    try
        {eval, F} = T,
        diameter_lib:eval([F,C])
    catch
        _:_ -> false
    end.

eq(Any, Id, PeerId) ->
    Any == Id orelse try
                         iolist_to_binary(Id) == iolist_to_binary(PeerId)
                     catch
                         _:_ -> false
                     end.
%% OctetString() can be specified as an iolist() so test for string
%% rather than term equality.

%% transports/1

transports(#state{watchdogT = WatchdogT}) ->
    ets:select(WatchdogT, [{#watchdog{peer = '$1', _ = '_'},
                            [{'is_pid', '$1'}],
                            ['$1']}]).

%% ---------------------------------------------------------------------------
%% # service_info/2
%% ---------------------------------------------------------------------------

%% The config passed to diameter:start_service/2.
-define(CAP_INFO, ['Origin-Host',
                   'Origin-Realm',
                   'Vendor-Id',
                   'Product-Name',
                   'Origin-State-Id',
                   'Host-IP-Address',
                   'Supported-Vendor-Id',
                   'Auth-Application-Id',
                   'Inband-Security-Id',
                   'Acct-Application-Id',
                   'Vendor-Specific-Application-Id',
                   'Firmware-Revision']).

%% The config returned by diameter:service_info(SvcName, all).
-define(ALL_INFO, [capabilities,
                   applications,
                   transport,
                   pending,
                   options]).

%% The rest.
-define(OTHER_INFO, [connections,
                     name,
                     peers,
                     statistics]).

service_info(Item, S)
  when is_atom(Item) ->
    case tagged_info(Item, S) of
        {_, T} -> T;
        undefined = No -> No
    end;
service_info(Items, S) ->
    tagged_info(Items, S).

tagged_info(Item, S)
  when is_atom(Item) ->
    case complete(Item) of
        {value, I} ->
            {I, complete_info(I,S)};
        false ->
            undefined
    end;

tagged_info(TPid, #state{watchdogT = WatchdogT, peerT = PeerT})
  when is_pid(TPid) ->
    try
        [#peer{watchdog = Pid}] = ets:lookup(PeerT, TPid),
        [#watchdog{ref = Ref, type = Type, options = Opts}]
            = ets:lookup(WatchdogT, Pid),
        [{ref, Ref},
         {type, Type},
         {options, Opts}]
    catch
        error:_ ->
            []
    end;

tagged_info(Items, S)
  when is_list(Items) ->
    [T || I <- Items, T <- [tagged_info(I,S)], T /= undefined, T /= []];

tagged_info(_, _) ->
    undefined.
complete_info(Item, #state{service = Svc} = S) ->
    case Item of
        name ->
            S#state.service_name;
        'Origin-Host' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.origin_host;
        'Origin-Realm' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.origin_realm;
        'Vendor-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.vendor_id;
        'Product-Name' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.product_name;
        'Origin-State-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.origin_state_id;
        'Host-IP-Address' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.host_ip_address;
        'Supported-Vendor-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.supported_vendor_id;
        'Auth-Application-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.auth_application_id;
        'Inband-Security-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.inband_security_id;
        'Acct-Application-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.acct_application_id;
        'Vendor-Specific-Application-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.vendor_specific_application_id;
        'Firmware-Revision' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.firmware_revision;
        capabilities -> service_info(?CAP_INFO, S);
        applications -> info_apps(S);
        transport    -> info_transport(S);
        options      -> info_options(S);
        pending      -> info_pending(S);
        keys         -> ?ALL_INFO ++ ?CAP_INFO ++ ?OTHER_INFO;
        all          -> service_info(?ALL_INFO, S);
        statistics   -> info_stats(S);
        connections  -> info_connections(S);
        peers        -> info_peers(S)
    end.

complete(I)
  when I == keys;
       I == all ->
    {value, I};
complete(Pre) ->
    P = atom_to_list(Pre),
    case [I || I <- ?ALL_INFO ++ ?CAP_INFO ++ ?OTHER_INFO,
               lists:prefix(P, atom_to_list(I))]
    of
        [I] -> {value, I};
        _   -> false
    end.
%% info_stats/1

info_stats(#state{watchdogT = WatchdogT}) ->
    MatchSpec = [{#watchdog{ref = '$1', peer = '$2', _ = '_'},
                  [{'is_pid', '$2'}],
                  [['$1', '$2']]}],
    try ets:select(WatchdogT, MatchSpec) of
        L ->
            diameter_stats:read(lists:append(L))
    catch
        error: badarg -> []  %% service has gone down
    end.

%% info_transport/1
%%
%% One entry per configured transport. Statistics for each entry are
%% the accumulated values for the ref and associated watchdog/peer
%% pids.

info_transport(S) ->
    PeerD = peer_dict(S, config_dict(S)),
    RefsD = dict:map(fun(_, Ls) -> [P || L <- Ls, {peer, {P,_}} <- L] end,
                     PeerD),
    Refs = lists:append(dict:fold(fun(R, Ps, A) -> [[R|Ps] | A] end,
                                  [],
                                  RefsD)),
    Stats = diameter_stats:read(Refs),
    dict:fold(fun(R, Ls, A) ->
                      Ps = dict:fetch(R, RefsD),
                      [[{ref, R} | transport(Ls)] ++ [stats([R|Ps], Stats)]
                       | A]
              end,
              [],
              PeerD).

%% Only a config entry for a listening transport: use it.
transport([[{type, listen}, _] = L]) ->
    L ++ [{accept, []}];

%% Only one config or peer entry for a connecting transport: use it.
transport([[{type, connect} | _] = L]) ->
    L;

%% Peer entries: discard config. Note that the peer entries have
%% length at least 3.
transport([[_,_] | L]) ->
    transport(L);

%% Possibly many peer entries for a listening transport. Note that all
%% have the same options by construction, which is not terribly space
%% efficient.
transport([[{type, accept}, {options, Opts} | _] | _] = Ls) ->
    [{type, listen},
     {options, Opts},
     {accept, [lists:nthtail(2,L) || L <- Ls]}].

peer_dict(#state{watchdogT = WatchdogT, peerT = PeerT}, Dict0) ->
    try ets:tab2list(WatchdogT) of
        L ->
            lists:foldl(fun(T,A) -> peer_acc(PeerT, A, T) end, Dict0, L)
    catch
        error: badarg -> Dict0  %% service has gone down
    end.

peer_acc(PeerT, Acc, #watchdog{pid = Pid,
                               type = Type,
                               ref = Ref,
                               options = Opts,
                               state = WS,
                               started = At,
                               peer = TPid}) ->
    dict:append(Ref,
                [{type, Type},
                 {options, Opts},
                 {watchdog, {Pid, At, WS}}
                 | info_peer(PeerT, TPid, WS)],
                Acc).
info_peer(PeerT, TPid, WS)
  when is_pid(TPid), WS /= ?WD_DOWN ->
    try ets:lookup(PeerT, TPid) of
        T -> info_peer(T)
    catch
        error: badarg -> []  %% service has gone down
    end;
info_peer(_, _, _) ->
    [].

%% The point of extracting the config here is so that 'transport' info
%% has one entry for each transport ref, the peer table only
%% containing entries that have a living watchdog.

config_dict(#state{service_name = SvcName}) ->
    lists:foldl(fun config_acc/2,
                dict:new(),
                diameter_config:lookup(SvcName)).

config_acc({Ref, T, Opts}, Dict)
  when T == listen;
       T == connect ->
    dict:store(Ref, [[{type, T}, {options, Opts}]], Dict);
config_acc(_, Dict) ->
    Dict.

info_peer([#peer{pid = Pid, apps = SApps, caps = Caps, started = T}]) ->
    [{peer, {Pid, T}},
     {apps, SApps},
     {caps, info_caps(Caps)}
     | try [{port, info_port(Pid)}] catch _:_ -> [] end];
info_peer([] = No) ->
    No.

%% Extract information that the processes involved are expected to
%% "publish" in their process dictionaries. Simple but backhanded.

info_port(Pid) ->
    {_, PD} = process_info(Pid, dictionary),
    {_, T} = lists:keyfind({diameter_peer_fsm, start}, 1, PD),
    {TPid, {_Type, TMod, _Cfg}} = T,
    {_, TD} = process_info(TPid, dictionary),
    {_, Data} = lists:keyfind({TMod, info}, 1, TD),
    [{owner, TPid},
     {module, TMod}
     | try TMod:info(Data) catch _:_ -> [] end].

%% Use the field names from diameter_caps instead of
%% diameter_base_CER to distinguish between the 2-tuple values
%% compared to the single capabilities values. Note also that the
%% returned list is tagged 'caps' rather than 'capabilities' to
%% emphasize the difference.

info_caps(#diameter_caps{} = C) ->
    lists:zip(record_info(fields, diameter_caps), tl(tuple_to_list(C))).

info_apps(#state{service = #diameter_service{applications = Apps}}) ->
    lists:map(fun mk_app/1, Apps).

mk_app(#diameter_app{} = A) ->
    lists:zip(record_info(fields, diameter_app), tl(tuple_to_list(A))).

%% info_pending/1
%%
%% One entry for each outgoing request whose answer is outstanding.
info_pending(#state{} = S) ->
    diameter_traffic:pending(transports(S)).

%% info_connections/1
%%
%% One entry per transport connection. Statistics for each entry are
%% for the peer pid only.

info_connections(S) ->
    ConnL = conn_list(S),
    Stats = diameter_stats:read([P || L <- ConnL, {peer, {P,_}} <- L]),
    [L ++ [stats([P], Stats)] || L <- ConnL, {peer, {P,_}} <- L].

conn_list(S) ->
    lists:append(dict:fold(fun conn_acc/3, [], peer_dict(S, dict:new()))).

conn_acc(Ref, Peers, Acc) ->
    [[[{ref, Ref} | L] || L <- Peers, lists:keymember(peer, 1, L)]
     | Acc].

stats(Refs, Stats) ->
    {statistics, dict:to_list(lists:foldl(fun(R,D) ->
                                                  stats_acc(R, D, Stats)
                                          end,
                                          dict:new(),
                                          Refs))}.

stats_acc(Ref, Dict, Stats) ->
    lists:foldl(fun({C,N}, D) -> dict:update_counter(C, N, D) end,
                Dict,
                proplists:get_value(Ref, Stats, [])).

%% info_peers/1
%%
%% One entry per peer Origin-Host. Statistics for each entry are
%% accumulated values for all peer pids.

info_peers(S) ->
    {PeerD, RefD} = lists:foldl(fun peer_acc/2,
                                {dict:new(), dict:new()},
                                conn_list(S)),
    Refs = lists:append(dict:fold(fun(_, Rs, A) -> [Rs|A] end,
                                  [],
                                  RefD)),
    Stats = diameter_stats:read(Refs),
    dict:fold(fun(OH, Cs, A) ->
                      Rs = dict:fetch(OH, RefD),
                      [{OH, [{connections, Cs}, stats(Rs, Stats)]} | A]
              end,
              [],
              PeerD).

peer_acc(Peer, {PeerD, RefD}) ->
    [{TPid, _}, [{origin_host, {_, OH}} | _]]
        = [proplists:get_value(K, Peer) || K <- [peer, caps]],
    {dict:append(OH, Peer, PeerD), dict:append(OH, TPid, RefD)}.

%% info_options/1

info_options(S) ->
    S#state.options.
From bryan@REDACTED Tue May 21 19:59:56 2013
From: bryan@REDACTED (Bryan Fink)
Date: Tue, 21 May 2013 13:59:56 -0400
Subject: [erlang-bugs] Supervisor terminate_child race
In-Reply-To: <2302FD2F-B3F4-4514-88B0-17082D781D1A@gmail.com>
References: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com> <2302FD2F-B3F4-4514-88B0-17082D781D1A@gmail.com>
Message-ID:

On Wed, May 15, 2013 at 11:11 AM, Tim Watson wrote:
> On 15 May 2013, at 14:54, Siri Hansen wrote:
>
> Then again... it is up to the child's start function to create the link, and
> from the supervisor's point of view, the only place to add the monitor would
> be when the start function returns - which would be just another place to
> get a race :(
>
> Well quite. *sigh*

My apologies for dropping out of this conversation. I've been on vacation.

Before vacation, monitoring from the spawn was the best solution I had come up with as well. But, as has already been pointed out, if it's not done atomically (which I think can be done with a flag in spawn_opt, no?), it's just another place for a race. It has also already been pointed out that changing the supervisor-child contract for startup isn't really an option anyway.

The only other possibility I see is to guarantee that if an EXIT message will be delivered, it is always delivered before any DOWN message. If this were the case, all receive expressions could have clauses for both EXIT and DOWN, and simply use whichever arrived first. Tim's method of checking for EXIT after receiving DOWN would also work in this case. I assume the problem with this guarantee is that these messages are generated by different processes, so typical mailbox ordering rules apply.

Fortunately for my use case, I think that simply linking my children to their creating process instead of a supervisor may be a viable option.
All of these children are dynamic under a simple_one_for_one supervisor, and I don't care about restart policies.

-Bryan

From mjtruog@REDACTED Tue May 21 21:45:23 2013
From: mjtruog@REDACTED (Michael Truog)
Date: Tue, 21 May 2013 12:45:23 -0700
Subject: [erlang-bugs] escript file operations fail on halt
In-Reply-To: <51980DD2.7060501@gmail.com>
References: <51980DD2.7060501@gmail.com>
Message-ID: <519BCED3.3050508@gmail.com>

Just as an update: I found that my issue with file operations was not related to the file port driver failing to complete async thread jobs, which made more logical sense. The fact remains that two things would have helped make the situation clearer:
1) clearly document the default flush operation for the erlang:halt/1 function
2) add an escript:exit/1 function which just calls erlang:halt/2 with flush == true as a convenience function (so that people are able to have simpler source code and not care about the halt/flush details)

Thanks,
Michael

On 05/18/2013 04:25 PM, Michael Truog wrote:
> Hi,
>
> There is an odd type of failure when:
> 1) async threads are enabled by default for the Erlang VM
> 2) an escript is used to spawn the Erlang VM
> 3) erlang:halt/1 is used to terminate the escript with a known error code
>
> The erlang:halt/1 and erlang:halt/2 code here:
> https://github.com/erlang/otp/blob/maint/erts/emulator/beam/bif.c#L3937
> makes the default flush parameter false! The default flush parameter is currently undocumented. So, when an escript performs a file operation that depends on the async thread pool (based on the internal Erlang code and configuration) and then attempts to do erlang:halt(integer()), the file operations may not complete, or may only partially complete. In my particular use case, I can observe a rename file operation getting stuck before the actual completion of the rename (and I am not using anything but a normal/default Linux filesystem, not NFS).
> It seems important to change the default erlang:halt/1 behaviour for escript usage so that flush is true (I understand fail-fast probably means normal Erlang VM usage shouldn't have flush default to true). An alternative is a new escript function that sets the flush option for the user (which is probably an easier solution to agree on), e.g., escript:exit/1.
>
> Thanks,
> Michael

From watson.timothy@REDACTED Fri May 24 16:28:42 2013
From: watson.timothy@REDACTED (Tim Watson)
Date: Fri, 24 May 2013 15:28:42 +0100
Subject: [erlang-bugs] Strange application shutdown deadlock
Message-ID:

We came across this at a customer's site, where one of the nodes was apparently in the process of stopping and had been in that state for at least 24 hours. The short version is that an application_master appears to be stuck waiting for a child pid (is that the X process, or the root supervisor?) which is *not* linked to it...

The application controller is in the process of stopping an application, during which process a `get_child' message appears to have come in to that application's application_master from somewhere - we are *not* running appmon, so I'm really confused how this can happen, as the only other place where I see (indirect) calls is via the sasl release_handler!? At the bottom of this email is a dump for the application_controller and the application_master for the app it is trying to shut down. I can verify that the pid which the application_master is waiting on is definitely not linked to it - i.e., process_info(links, AppMasterPid) doesn't contain the process <0.256.0> that the master appears to be waiting on.

My reading of the code is that the application_master cannot end up in get_child_i unless a get_child request was made which arrives whilst it is in its terminate loop. As I said, we're not using appmon, therefore I assume this originated in the sasl application's release_handler_1, though I'm not sure quite which route would take us there.
The relevant bit of code in application_master appears to be:

get_child_i(Child) ->
    Child ! {self(), get_child},
    receive
        {Child, GrandChild, Mod} -> {GrandChild, Mod}
    end.

This in turn originates, I'd guess, in the third receive clause of terminate_loop/2. Anyway, should that code not be dealing with a potentially dead pid for Child, either by handling links effectively - perhaps there is an EXIT signal in the mailbox already which is being ignored here in get_child_i/1 - or by some other means?

What follows below is the trace/dump output. Feel free to poke me for more info as needed.

Cheers,
Tim

[TRACE/DUMP]

pid: <6676.7.0>
registered name: application_controller
stacktrace: [{application_master,call,2,
                 [{file,"application_master.erl"},{line,75}]},
             {application_controller,stop_appl,3,
                 [{file,"application_controller.erl"},{line,1393}]},
             {application_controller,handle_call,3,
                 [{file,"application_controller.erl"},{line,810}]},
             {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]}]
-------------------------
Program counter: 0x00007f9bf9a53720 (application_master:call/2 + 288)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007f9bd7948360 Return addr 0x00007f9bfb97de40 (application_controller:stop_appl/3 + 176)
y(0)     #Ref<0.0.20562.258360>
y(1)     #Ref<0.0.20562.258361>
y(2)     []

0x00007f9bd7948380 Return addr 0x00007f9bfb973c68 (application_controller:handle_call/3 + 1392)
y(0)     temporary
y(1)     rabbitmq_web_dispatch

0x00007f9bd7948398 Return addr 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
y(0)
{state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
y(1)     rabbitmq_web_dispatch
y(2)     [{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}]
y(3)     [{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}]

0x00007f9bd79483c0 Return addr 0x00000000008827d8 ()
y(0)     application_controller
y(1)
{state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
y(2)     application_controller
y(3)     <0.2.0>
y(4)     {stop_application,rabbitmq_web_dispatch}
y(5)     {<0.5864.275>,#Ref<0.0.20562.258345>}
y(6)     Catch 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
-------------------------

pid: <6676.255.0>
registered name: none
stacktrace: [{application_master,get_child_i,1,
                 [{file,"application_master.erl"},{line,392}]},
             {application_master,handle_msg,2,
                 [{file,"application_master.erl"},{line,216}]},
             {application_master,terminate_loop,2,
                 [{file,"application_master.erl"},{line,206}]},
             {application_master,terminate,2,
                 [{file,"application_master.erl"},{line,227}]},
             {application_master,handle_msg,2,
                 [{file,"application_master.erl"},{line,219}]},
             {application_master,main_loop,2,
                 [{file,"application_master.erl"},{line,194}]},
             {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]
-------------------------
Program counter: 0x00007f9bf9a570e0 (application_master:get_child_i/1 + 120)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007f9c1adc3dc8 Return addr 0x00007f9bf9a54eb0 (application_master:handle_msg/2 + 280)
y(0) <0.256.0> 0x00007f9c1adc3dd8 Return addr 0x00007f9bf9a54d20 (application_master:terminate_loop/2 + 520) y(0) #Ref<0.0.20562.258362> y(1) <0.9596.275> y(2) {state,<0.256.0>,{appl_data,rabbitmq_web_dispatch,[],undefined,{rabbit_web_dispatch_app,[]},[rabbit_web_dispatch,rabbit_web_dispatch_app,rabbit_web_dispatch_registry,rabbit_web_dispatch_sup,rabbit_web_dispatch_util,rabbit_webmachine],[],infinity,infinity},[],0,<0.29.0>} 0x00007f9c1adc3df8 Return addr 0x00007f9bf9a55108 (application_master:terminate/2 + 192) y(0) <0.256.0> 0x00007f9c1adc3e08 Return addr 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) y(0) [] y(1) normal 0x00007f9c1adc3e20 Return addr 0x00007f9bf9a54a60 (application_master:main_loop/2 + 1600) y(0) <0.7.0> y(1) #Ref<0.0.20562.258360> y(2) Catch 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) 0x00007f9c1adc3e40 Return addr 0x00007f9bfb969420 (proc_lib:init_p_do_apply/3 + 56) y(0) <0.7.0> 0x00007f9c1adc3e50 Return addr 0x00000000008827d8 () y(0) Catch 0x00007f9bfb969440 (proc_lib:init_p_do_apply/3 + 88) ------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 235 bytes Desc: Message signed with OpenPGP using GPGMail URL: From mononcqc@REDACTED Fri May 24 16:45:47 2013 From: mononcqc@REDACTED (Fred Hebert) Date: Fri, 24 May 2013 10:45:47 -0400 Subject: [erlang-bugs] Strange application shutdown deadlock In-Reply-To: References: Message-ID: <20130524144546.GB14817@ferdair.local> Quick question: are you running a release? If so, last time I've seen deadlocks like that was solved by making sure *all* my applications did depend on stdlib and kernel in their app file. When I skipped them, sometimes I'd find that things would lock up. 
My guess was that dependencies from stdlib or kernel got unloaded before my app and broke something, but I'm not sure -- In my case, I wasn't able to inspect the node as it appeared to be 100% blocked. Adding the apps ended up fixing the problem on the next shutdown. I'm not sure if it might be a good fix for you, but it's a stab in the dark, Regards, Fred. On 05/24, Tim Watson wrote: > We came across this at a customer's site, where one of the nodes was apparently in the process of stopping and had been in that state for at least 24 hours. The short version is that an application_master appears to be stuck waiting for a child pid (is that the X process, or the root supervisor?) which is *not* linked to it... > > The application controller is in the process of stopping an application, during which process a `get_child' message appears to have come in to that application's application_master from somewhere - we are *not* running appmon, so I'm really confused how this can happen, as the only other place where I see (indirect) calls are via the sasl release_handler!? At the bottom of this email is a dump for the application_controller and the application_master for the app it is trying to shut down. I can verify that the pid which the application_master is waiting on is definitely not linked to it - i.e., process_info(links, AppMasterPid) doesn't contain the process <0.256.0> that the master appears to be waiting on. > > My reading of the code is that the application_master cannot end up in get_child_i unless a get_child request was made which arrives whilst it is in its terminate loop. As I said, we're not using appmon, therefore I assume this originated in the sasl application's release_handler_1, though I'm not sure quite which route would take us there. The relevant bit of code in application_master appears to be: > > get_child_i(Child) -> > Child ! {self(), get_child}, > receive > {Child, GrandChild, Mod} -> {GrandChild, Mod} > end. 
> > This in turn originates, I'd guess, in the third receive clause of terminate_loop/2. Anyway, should that code not be dealing with a potentially dead pid for Child, either by handling links effectively - perhaps there is an EXIT signal in the mailbox already which is being ignored here in get_child_i/1 - or by some other means? > > What follows below is the trace/dump output. Feel free to poke me for more info as needed. > > Cheers, > Tim > > [TRACE/DUMP] > > pid: <6676.7.0> > registered name: application_controller > stacktrace: [{application_master,call,2, > [{file,"application_master.erl"},{line,75}]}, > {application_controller,stop_appl,3, > [{file,"application_controller.erl"}, > {line,1393}]}, > {application_controller,handle_call,3, > [{file,"application_controller.erl"}, > {line,810}]}, > {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]}] > ------------------------- > Program counter: 0x00007f9bf9a53720 (application_master:call/2 + 288) > CP: 0x0000000000000000 (invalid) > arity = 0 > > 0x00007f9bd7948360 Return addr 0x00007f9bfb97de40 (application_controller:stop_appl/3 + 176) > y(0) #Ref<0.0.20562.258360> > y(1) #Ref<0.0.20562.258361> > y(2) [] > > 0x00007f9bd7948380 Return addr 0x00007f9bfb973c68 (application_controller:handle_call/3 + 1392) > y(0) temporary > y(1) rabbitmq_web_dispatch > > 0x00007f9bd7948398 Return addr 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272) > y(0) 
{state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]} > y(1) rabbitmq_web_dispatch > y(2) [{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}] > y(3) [{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}] > > 0x00007f9bd79483c0 Return addr 0x00000000008827d8 () > y(0) application_controller > y(1) 
{state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]} > y(2) application_controller > y(3) <0.2.0> > y(4) {stop_application,rabbitmq_web_dispatch} > y(5) {<0.5864.275>,#Ref<0.0.20562.258345>} > y(6) Catch 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272) > ------------------------- > > pid: <6676.255.0> > registered name: none > stacktrace: [{application_master,get_child_i,1, > [{file,"application_master.erl"},{line,392}]}, > {application_master,handle_msg,2, > [{file,"application_master.erl"},{line,216}]}, > {application_master,terminate_loop,2, > [{file,"application_master.erl"},{line,206}]}, > {application_master,terminate,2, > [{file,"application_master.erl"},{line,227}]}, > {application_master,handle_msg,2, > [{file,"application_master.erl"},{line,219}]}, > {application_master,main_loop,2, > [{file,"application_master.erl"},{line,194}]}, > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}] > ------------------------- > Program counter: 0x00007f9bf9a570e0 (application_master:get_child_i/1 + 120) > CP: 0x0000000000000000 (invalid) > arity = 0 > > 0x00007f9c1adc3dc8 Return addr 
0x00007f9bf9a54eb0 (application_master:handle_msg/2 + 280) > y(0) <0.256.0> > > 0x00007f9c1adc3dd8 Return addr 0x00007f9bf9a54d20 (application_master:terminate_loop/2 + 520) > y(0) #Ref<0.0.20562.258362> > y(1) <0.9596.275> > y(2) {state,<0.256.0>,{appl_data,rabbitmq_web_dispatch,[],undefined,{rabbit_web_dispatch_app,[]},[rabbit_web_dispatch,rabbit_web_dispatch_app,rabbit_web_dispatch_registry,rabbit_web_dispatch_sup,rabbit_web_dispatch_util,rabbit_webmachine],[],infinity,infinity},[],0,<0.29.0>} > > 0x00007f9c1adc3df8 Return addr 0x00007f9bf9a55108 (application_master:terminate/2 + 192) > y(0) <0.256.0> > > 0x00007f9c1adc3e08 Return addr 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) > y(0) [] > y(1) normal > > 0x00007f9c1adc3e20 Return addr 0x00007f9bf9a54a60 (application_master:main_loop/2 + 1600) > y(0) <0.7.0> > y(1) #Ref<0.0.20562.258360> > y(2) Catch 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) > > 0x00007f9c1adc3e40 Return addr 0x00007f9bfb969420 (proc_lib:init_p_do_apply/3 + 56) > y(0) <0.7.0> > > 0x00007f9c1adc3e50 Return addr 0x00000000008827d8 () > y(0) Catch 0x00007f9bfb969440 (proc_lib:init_p_do_apply/3 + 88) > ------------------------- > > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs From watson.timothy@REDACTED Fri May 24 16:55:43 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Fri, 24 May 2013 15:55:43 +0100 Subject: [erlang-bugs] Strange application shutdown deadlock In-Reply-To: <20130524144546.GB14817@ferdair.local> References: <20130524144546.GB14817@ferdair.local> Message-ID: <463AD101-7287-486D-926C-4108C35483B0@gmail.com> Hi Fred, On 24 May 2013, at 15:45, Fred Hebert wrote: > Quick question: are you running a release? > > If so, last time I've seen deadlocks like that was solved by making sure > *all* my applications did depend on stdlib and kernel in their app file. 
> When I skipped them, sometimes I'd find that things would lock up.
>

No, unfortunately RabbitMQ doesn't run as part of a release.

> My guess was that dependencies from stdlib or kernel got unloaded before
> my app and broke something, but I'm not sure -- In my case, I wasn't
> able to inspect the node as it appeared to be 100% blocked.
>

I suppose it's possible that that could happen to us, for a different set of apps. I can't see how the release handler would be involved though, since we start our nodes with start_sasl and launch applications by hand... The code we use to shut applications down explicitly calculates the dependency order itself, so perhaps there's something wrong in there. What we do is essentially this:

stop() ->
    case whereis(rabbit_boot) of
        undefined -> ok;
        _         -> await_startup()
    end,
    rabbit_log:info("Stopping RabbitMQ~n"),
    ok = app_utils:stop_applications(app_shutdown_order()).

stop_and_halt() ->
    try
        stop()
    after
        rabbit_misc:local_info_msg("Halting Erlang VM~n", []),
        init:stop()
    end,
    ok.

app_shutdown_order() ->
    Apps = ?APPS ++ rabbit_plugins:active(),
    app_utils:app_dependency_order(Apps, true).

And that app_utils shutdown order is calculated thus:

app_dependency_order(RootApps, StripUnreachable) ->
    {ok, G} = rabbit_misc:build_acyclic_graph(
                fun (App, _Deps) -> [{App, App}] end,
                fun (App, Deps) -> [{Dep, App} || Dep <- Deps] end,
                [{App, app_dependencies(App)} ||
                    {App, _Desc, _Vsn} <- application:loaded_applications()]),
    try
        case StripUnreachable of
            true  -> digraph:del_vertices(G, digraph:vertices(G) --
                                             digraph_utils:reachable(RootApps, G));
            false -> ok
        end,
        digraph_utils:topsort(G)
    after
        true = digraph:delete(G)
    end.

So even if we've shut things down in the wrong order - which I don't think we have - I still don't see where the `get_child' request comes from if the release_handler isn't involved...
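[For what it's worth, the blocking receive in the get_child_i/1 code quoted earlier in the thread can be made robust against a dead Child pid with a monitor. This is only a sketch of the idea, not the actual OTP code or fix; the {undefined, undefined} fallback value is invented for illustration:]

```erlang
%% Sketch only (not the application_master code): monitoring the child
%% before the call turns a dead pid into a 'DOWN' message instead of an
%% eternal receive.
get_child_i(Child) ->
    Ref = erlang:monitor(process, Child),
    Child ! {self(), get_child},
    receive
        {Child, GrandChild, Mod} ->
            erlang:demonitor(Ref, [flush]),
            {GrandChild, Mod};
        {'DOWN', Ref, process, Child, _Reason} ->
            {undefined, undefined}   %% illustrative fallback, not OTP's choice
    end.
```

If Child is already dead, the monitor delivers {'DOWN', Ref, process, Child, noproc} immediately, so the caller can never hang the shutdown sequence.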
Cheers,
Tim

From daniel.goertzen@REDACTED Fri May 24 17:33:05 2013
From: daniel.goertzen@REDACTED (Daniel Goertzen)
Date: Fri, 24 May 2013 10:33:05 -0500
Subject: [erlang-bugs] printing NaN causes exception
Message-ID:

I am working with a C++-generated floating point data stream that encodes certain events as NaN (not a number). When I try to print out this number, io_lib_format crashes.

Here is a [C++11] NIF that creates a NaN:

static ERL_NIF_TERM quiet_nan(ErlNifEnv* env, int, const ERL_NIF_TERM argv[])
{
    double num = std::numeric_limits<double>::quiet_NaN();
    cerr << "quiet_nan: iostream prints num as " << num << endl;
    return enif_make_double(env, num);
}

... and then when I run this interactively I get:

Erlang R15B01 (erts-5.9.1) [source] [smp:3:3] [async-threads:0]

Eshell V5.9.1 (abort with ^G)
1> channel_nif:quiet_nan().
quiet_nan: iostream prints num as nan
** exception error: no case clause matching <<127,248,0,0,0,0,0,0>>
     in function io_lib_format:mantissa_exponent/1 (io_lib_format.erl, line 374)
     in call from io_lib_format:fwrite_g/1 (io_lib_format.erl, line 365)
2>

Expected behaviour is to just print "NaN" or something similar. For my use case I can work around this problem by just using binary representations.

For reference, there's a thread about inf and nan from about 28 Feb 2012 that focuses on doing math with these numbers. (I do not want to do math, just move them around and print them without crashing.)

Cheers,
Dan.

From bob@REDACTED Fri May 24 18:33:02 2013
From: bob@REDACTED (Bob Ippolito)
Date: Fri, 24 May 2013 09:33:02 -0700
Subject: [erlang-bugs] printing NaN causes exception
In-Reply-To:
References:
Message-ID:

Erlang's floating point type doesn't allow for NaN or Inf. The only way to safely get float specials in and out is to leave them as binary, or match the bit patterns for them and convert to special atoms. It's a pretty sad state.
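[Bob's suggestion - match the special bit patterns and convert them to atoms - can be sketched in a few clauses. This is illustrative only, not library code; the function name and the atoms chosen are made up:]

```erlang
%% Illustrative sketch: decode an IEEE-754 double from an 8-byte binary,
%% mapping the special bit patterns (exponent of all ones) to atoms
%% instead of letting a float match fail.  Specials must be matched
%% before the plain 64/float clause.
decode_double(<<Sign:1, 2047:11, 0:52>>) ->           % all-ones exponent, zero mantissa
    case Sign of 0 -> pos_infinity; 1 -> neg_infinity end;
decode_double(<<_Sign:1, 2047:11, _Mantissa:52>>) ->  % nonzero mantissa: NaN
    nan;
decode_double(<<F:64/float>>) ->                      % ordinary finite double
    F.
```

For example, the binary from the crash report above, <<127,248,0,0,0,0,0,0>>, decodes to the atom nan (sign 0, exponent 16#7FF, quiet-NaN mantissa).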
On Friday, May 24, 2013, Daniel Goertzen wrote:
> [original message quoted in full; snipped]

From watson.timothy@REDACTED Fri May 24 19:41:34 2013
From: watson.timothy@REDACTED (Tim Watson)
Date: Fri, 24 May 2013 18:41:34 +0100
Subject: [erlang-bugs] Strange application shutdown deadlock
In-Reply-To: <20130524144546.GB14817@ferdair.local>
References: <20130524144546.GB14817@ferdair.local>
Message-ID:

Gah, sorry folks - this has nothing to do with release handling, that was a red herring. Someone just pointed out that the call to get_child originates in a status check in our code.
This still looks like a bug to me though, since if you're going to handle "other" messages in terminate_loop you ought to ensure they can't deadlock the vm's shutdown sequence. Cheers, Tim On 24 May 2013, at 15:45, Fred Hebert wrote: > Quick question: are you running a release? > > If so, last time I've seen deadlocks like that was solved by making sure > *all* my applications did depend on stdlib and kernel in their app file. > When I skipped them, sometimes I'd find that things would lock up. > > My guess was that dependencies from stdlib or kernel got unloaded before > my app and broke something, but I'm not sure -- In my case, I wasn't > able to inspect the node as it appeared to be 100% blocked. > > Adding the apps ended up fixing the problem on the next shutdown. I'm > not sure if it might be a good fix for you, but it's a stab in the dark, > > Regards, > Fred. > > On 05/24, Tim Watson wrote: >> We came across this at a customer's site, where one of the nodes was apparently in the process of stopping and had been in that state for at least 24 hours. The short version is that an application_master appears to be stuck waiting for a child pid (is that the X process, or the root supervisor?) which is *not* linked to it... >> >> The application controller is in the process of stopping an application, during which process a `get_child' message appears to have come in to that application's application_master from somewhere - we are *not* running appmon, so I'm really confused how this can happen, as the only other place where I see (indirect) calls are via the sasl release_handler!? At the bottom of this email is a dump for the application_controller and the application_master for the app it is trying to shut down. I can verify that the pid which the application_master is waiting on is definitely not linked to it - i.e., process_info(links, AppMasterPid) doesn't contain the process <0.256.0> that the master appears to be waiting on. 
>> [remainder of quoted message and crash dump snipped]
>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-bugs
>

From n.oxyde@REDACTED Fri May 24 21:10:42 2013
From: n.oxyde@REDACTED (Anthony Ramine)
Date: Fri, 24 May 2013 21:10:42 +0200
Subject: [erlang-bugs] Fix renaming of bs_put_string instructions
Message-ID:

Hello,

If an Erlang module is compiled to BEAM assembly and the result contains a bs_put_string instruction, the output can't be compiled to binary anymore and the compiler crashes with the following error:

$ erlc prs.S
Function: compress/1
prs.S:none: internal error in beam_block;
crash reason: {{case_clause,
                   {'EXIT',
                       {function_clause,
                           [{beam_utils,live_opt,
[[{bs_put_string,1,{string,[0]}},
  {bs_init,
      {f,0},
      {bs_append,0,8,{field_flags,[]}},
      0,
      [{integer,8},{x,0}],
      {x,1}},
  {label,2}],
 2,
 {1,{1,1,nil,nil}},
 [{block,
      [{'%live',2},
       {set,[{x,0}],[{x,1}],move},
       {'%live',1}]},
  return]],
[{file,"beam_utils.erl"},{line,639}]},
{beam_utils,live_opt,1,
    [{file,"beam_utils.erl"},{line,205}]},
{beam_block,function,2,
    [{file,"beam_block.erl"},{line,38}]},
{lists,mapfoldl,3,
    [{file,"lists.erl"},{line,1329}]},
{beam_block,module,2,
    [{file,"beam_block.erl"},{line,29}]},
{compile,'-select_passes/2-anonymous-2-',2,
    [{file,"compile.erl"},{line,476}]},
{compile,'-internal_comp/4-anonymous-1-',2,
    [{file,"compile.erl"},{line,276}]},
{compile,fold_comp,3,
    [{file,"compile.erl"},{line,294}]}]}}},
[{compile,'-select_passes/2-anonymous-2-',2,
    [{file,"compile.erl"},{line,476}]},
{compile,'-internal_comp/4-anonymous-1-',2,
    [{file,"compile.erl"},{line,276}]},
{compile,fold_comp,3,[{file,"compile.erl"},{line,294}]},
{compile,internal_comp,4,[{file,"compile.erl"},{line,278}]},
{compile,'-do_compile/2-anonymous-0-',2,
    [{file,"compile.erl"},{line,152}]}]}

The clause was probably commented-out because at this point in the code, no bs_put_string instruction has been generated yet when compiling from Erlang.

This bug was reported by Loïc Hoguin.

git fetch https://github.com/nox/otp.git fix-bs_put_string-renaming

https://github.com/nox/otp/compare/erlang:maint...fix-bs_put_string-renaming
https://github.com/nox/otp/compare/erlang:maint...fix-bs_put_string-renaming.patch

Regards,

--
Anthony Ramine

From bgustavsson@REDACTED Mon May 27 09:47:28 2013
From: bgustavsson@REDACTED (Björn Gustavsson)
Date: Mon, 27 May 2013 09:47:28 +0200
Subject: [erlang-bugs] Fix renaming of bs_put_string instructions
In-Reply-To:
References:
Message-ID:

Looks good.
Will be graduated after a few days of testing in our daily builds.

On Fri, May 24, 2013 at 9:10 PM, Anthony Ramine wrote:
> [original message and crash dump snipped]
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
>

--
Björn Gustavsson, Erlang/OTP, Ericsson AB

From anders.otp@REDACTED Mon May 27 10:40:27 2013
From: anders.otp@REDACTED (Anders Svensson)
Date: Mon, 27 May 2013 10:40:27 +0200
Subject: [erlang-bugs] Memory leak in diameter_service module in diameter app (otp_R16B)
In-Reply-To: <5199F0AC.9090800@comarch.pl>
References: <5199E1CA.6020209@comarch.pl> <5199F0AC.9090800@comarch.pl>
Message-ID:

Hi Aleksander.

Yes, it is indeed a bug that was introduced in R16B. The fix was merged into maint on April 12, in this commit:

https://github.com/erlang/otp/commit/656b37f1b6fbc3611f5e0f8b8c0e4f61bef9092b

The commit for the fix itself points at the one that introduced the error:

https://github.com/erlang/otp/commit/c609108ce017069a77708f80dae9e89c45ff222d

So, fetch maint and the problem should be solved. Sorry for the slow reply: I've been on vacation.

Anders

On Mon, May 20, 2013 at 11:45 AM, Aleksander Nycz wrote:
> Hello,
>
> I think there is a memory leak in the diameter_service
> module.
>
> This module is a gen_server whose state contains the field watchdogT ::
> ets:tid().
> This ets table contains info about watchdogs.
> > Diameter app service cfg is: > > [{'Origin-Host', HostName}, > {'Origin-Realm', Realm}, > {'Vendor-Id', ...}, > {'Product-Name', ...}, > {'Auth-Application-Id', [?DCCA_APP_ID]}, > {'Supported-Vendor-Id', [...]}, > {application, [{alias, diameterNode}, > {dictionary, dictionaryDCCA}, > {module, dccaCallback}]}, > {restrict_connections, false}] > > After start dimeter app, adding service and transport, diameter_service > state is: > >> diameter_service:state(diameterNode). > #state{id = {1369,41606,329900}, > service_name = diameterNode, > service = #diameter_service{pid = <0.1011.0>, > capabilities = #diameter_caps{...}, > applications = [#diameter_app{...}]}, > watchdogT = 4194395,peerT = 4259932,shared_peers = 4325469, > local_peers = 4391006,monitor = false, > options = [{sequence,{0,32}}, > {share_peers,false}, > {use_shared_peers,false}, > {restrict_connections,false}]} > > and ets 4194395 has one record: > >> ets:tab2list(4194395). > [#watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,41606,330086}, > peer = false}] > > Next I run very simple test using seagull symulator. Test scenario is > following: > > 1. seagull: send CER > 2. seagull: recv CEA > 3. seagull: send CCR (init) > 4. seagull: recv CCA (init) > 5. seagull: send CCR (update) > 6. seagull: recv CCR (update) > 7. seagull: send CCR (terminate) > 8. seagull: recv CCA (terminate) > > Durring test there are two watchdogs in ets: > >> ets:tab2list(4194395). 
> [#watchdog{pid = <0.1816.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,41823,711370}, > peer = false}, > #watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = okay, > started = {1369,41606,330086}, > peer = <0.1014.0>}] > > After the test, but before the tw timer elapsed, there are still two watchdogs, and this > is ok: > >> ets:tab2list(4194395). > [#watchdog{pid = <0.1816.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,41823,711370}, > peer = false}, > #watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = down, > started = {1369,41606,330086}, > peer = <0.1014.0>}] > > But when the tw timer elapsed, the transport and watchdog processes are finished: > >> erlang:is_process_alive(list_to_pid("<0.1014.0>")). > false >> erlang:is_process_alive(list_to_pid("<0.1013.0>")). > false > > and two watchdogs are still in the ets table: > >> ets:tab2list(4194395).
> [#watchdog{pid = <0.1816.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,41823,711370}, > peer = false}, > #watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = down, > started = {1369,41606,330086}, > peer = <0.1014.0>}] > > I think watchdog <0.1013.0> should be removed when watchdog process is being > finished. > > I run next test and now there are 3 watchdogs in ets: > >> ets:tab2list(4194395). > [#watchdog{pid = <0.1816.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = down, > started = {1369,41823,711370}, > peer = <0.1817.0>}, > #watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = down, > started = {1369,41606,330086}, > peer = <0.1014.0>}, > #watchdog{pid = <0.3533.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,42342,845898}, > peer = false}] > > Watchdog and transport process are not alive: > >> erlang:is_process_alive(list_to_pid("<0.1816.0>")). 
> false >> erlang:is_process_alive(list_to_pid("<0.1817.0>")). > false > > > I suggest the following code change to correct this problem (file > diameter_service.erl): > > $ diff diameter_service.erl diameter_service.erl_ok > 1006c1006 > < connection_down(#watchdog{state = WS, > --- >> connection_down(#watchdog{state = ?WD_OKAY, > 1015,1017c1015,1021 > < ?WD_OKAY == WS > < andalso > < connection_down(Wd, fetch(PeerT, TPid), S). > --- >> connection_down(Wd, fetch(PeerT, TPid), S); >> >> connection_down(#watchdog{}, >> To, >> #state{}) >> when is_atom(To) -> >> ok. > > You can find this solution in the attachment. > > Regards > Aleksander Nycz > > > -- > Aleksander Nycz > Senior Software Engineer > Telco_021 BSS R&D > Comarch SA > Phone: +48 12 646 1216 > Mobile: +48 691 464 275 > website: www.comarch.pl > From anders.otp@REDACTED Mon May 27 17:12:44 2013 From: anders.otp@REDACTED (Anders Svensson) Date: Mon, 27 May 2013 17:12:44 +0200 Subject: [erlang-bugs] Problem with tw timer support in diameter app (otp_R16B) In-Reply-To: <5199E1CA.6020209@comarch.pl> References: <5199E1CA.6020209@comarch.pl> Message-ID: Thanks for the report. The fix should be in the maint branch (destined for R16B01) by the end of the week. Anders On Mon, May 20, 2013 at 10:41 AM, Aleksander Nycz wrote: > Hello, > > I changed the default value of the restrict_connections param from 'nodes' to > 'false'. > After that I ran a very simple test using the Seagull simulator. The test scenario was as > follows: > > 1. seagull: send CER > 2. seagull: recv CEA > 3. seagull: send CCR (init) > 4. seagull: recv CCA (init) > 5. seagull: send CCR (update) > 6. seagull: recv CCA (update) > 7. seagull: send CCR (terminate) > 8. seagull: recv CCA (terminate) > > After step 8.
seagull does't send DPR, but just closes transport connection > (TCP) > > On server side every think looks good, but 30 sec. after CCR (terminate) > when tw elapsed, following error message appears in log: > > > 13:40:58.187129: <0.5046.0>: error: error_logger: --:--/--: ** Generic > server <0.5046.0> terminating > ** Last message in was {timeout,#Ref<0.0.0.14845>,tw} > ** When Server state == {watchdog,down,false,30000,0,<0.1009.0>,undefined, > #Ref<0.0.0.14845>,diameter_gen_base_rfc3588, > {recvdata,4259932,diameterNode, > [{diameter_app,diameterNode,dictionaryDCCA, > [dccaCallback], > diameterNode,4,false, > [{answer_errors,report}, > {request_errors,answer_3xxx}]}], > {0,32}}, > {0,32}, > {false,false}, > false} > ** Reason for termination == > ** {function_clause, > [{diameter_watchdog,set_watchdog, > [stop], > [{file,"base/diameter_watchdog.erl"},{line,451}]}, > {diameter_watchdog,handle_info,2, > [{file,"base/diameter_watchdog.erl"},{line,211}]}, > {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]} > > 13:40:58.187500: <0.5046.0>: error: error_logger: --:--/--: > [crash_report][[[{initial_call,{diameter_watchdog,init,['Argument__1']}}, > {pid,<0.5046.0>}, > {registered_name,[]}, > > {error_info,{exit,{function_clause,[{diameter_watchdog,set_watchdog,[stop],[{file,"base/diameter_watchdog.erl"},{line,451}]}, > > {diameter_watchdog,handle_info,2,[{file,"base/diameter_watchdog.erl"},{line,211}]}, > > {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, > > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}, > > [{gen_server,terminate,6,[{file,"gen_server.erl"},{line,737}]}, > > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}, > {ancestors,[diameter_watchdog_sup,diameter_sup,<0.946.0>]}, > {messages,[]}, > {links,[<0.954.0>]}, > {dictionary,[{random_seed,{15047,18051,14647}}, > {{diameter_watchdog,restart}, > {{accept,#Ref<0.0.0.1696>}, > 
[{transport_module,diameter_tcp}, > > {transport_config,[{reuseaddr,true},{ip,{0,0,0,0}},{port,4068}]}, > > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > {diameter_service,<0.1009.0>, > > {diameter_caps,"zyndram.krakow.comarch","krakow.comarch",[],25429,"Comarch > DIAMETER Server",[], > > [12645,10415,8164], > [4], > > [],[],[],[],[]}, > > [{diameter_app,diameterNode,dictionaryDCCA, > > [dccaCallback], > > diameterNode,4,false, > > [{answer_errors,report},{request_errors,answer_3xxx}]}]}}}, > {{diameter_watchdog,dwr}, > > ['DWR',{'Origin-Host',"zyndram.krakow.comarch"},{'Origin-Realm',"krakow.comarch"},{'Origin-State-Id',[]}]}]}, > {trap_exit,false}, > {status,running}, > {heap_size,75025}, > {stack_size,24}, > {reductions,294}], > []]] > 13:40:58.189060: <0.954.0>: error: error_logger: --:--/--: > [supervisor_report][[{supervisor,{local,diameter_watchdog_sup}}, > {errorContext,child_terminated}, > > {reason,{function_clause,[{diameter_watchdog,set_watchdog,[stop],[{file,"base/diameter_watchdog.erl"},{line,451}]}, > > {diameter_watchdog,handle_info,2,[{file,"base/diameter_watchdog.erl"},{line,211}]}, > > {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, > > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}, > {offender,[{pid,<0.5046.0>}, > {name,diameter_watchdog}, > > {mfargs,{diameter_watchdog,start_link,undefined}}, > {restart_type,temporary}, > {shutdown,1000}, > {child_type,worker}]}]] > > You can check, that function set_watchdog should be called with param > #watchdog{}, but 'stop' param is used instead. > As a result function_clause exception is thrown. 
> > I suggest the following code change to correct this problem (file > diameter_watchdog.erl): > > $ diff diameter_watchdog.erl_org diameter_watchdog.erl > 385a386,393 >> transition({timeout, TRef, tw}, #watchdog{tref = TRef, status = T} = S) >> when T == initial; >> T == down -> >> case restart(S) of >> stop -> stop; >> #watchdog{} = NewS -> set_watchdog(NewS) >> end; >> > > You can find this solution in the attachment. > > Best regards > Aleksander Nycz > > -- > Aleksander Nycz > Senior Software Engineer > Telco_021 BSS R&D > Comarch SA > Phone: +48 12 646 1216 > Mobile: +48 691 464 275 > website: www.comarch.pl > From andrew.pennebaker@REDACTED Wed May 29 03:30:21 2013 From: andrew.pennebaker@REDACTED (Andrew Pennebaker) Date: Tue, 28 May 2013 21:30:21 -0400 Subject: [erlang-bugs] erl -s crashes Message-ID: When I try to `erl -s` any module, Erlang crashes. Dump attached. Trace: $ cat hello.erl %% 22 Feb 2011 -module(hello). -author("andrew.pennebaker@REDACTED"). -export([main/1]). main(_) -> io:format("Hello World!~n", []). $ erlc hello.erl $ erl -s hello Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:2:2] [async-threads:0] [hipe] [kernel-poll:false] [dtrace] {"init terminating in do_boot",{undef,[{hello,start,[],[]},{init,start_it,1,[]},{init,start_em,1,[]}]}} Crash dump was written to: erl_crash.dump init terminating in do_boot () System: $ specs erlang os Specs: specs 0.4 https://github.com/mcandre/specs#readme rebar -V rebar 2.1.0-pre R15B03 20130528_213220 git 2.1.0-pre-46-g78fa8fc erl -eval 'erlang:display(erlang:system_info(otp_release)), halt().' -noshell "R15B03" system_profiler SPSoftwareDataType | grep 'System Version' System Version: OS X 10.8.3 (12D78) -- Cheers, Andrew Pennebaker www.yellosoft.us
-------------- next part -------------- A non-text attachment was scrubbed... Name: erl_crash.dump Type: application/octet-stream Size: 341771 bytes Desc: not available From vladdu55@REDACTED Wed May 29 10:51:35 2013 From: vladdu55@REDACTED (Vlad Dumitrescu) Date: Wed, 29 May 2013 10:51:35 +0200 Subject: [erlang-bugs] erl -s crashes In-Reply-To: References: Message-ID: Hi! This part of the error message On Wed, May 29, 2013 at 3:30 AM, Andrew Pennebaker < andrew.pennebaker@REDACTED> wrote: > ,{undef,[{hello,start,[],[]} > tells you that the system tried to execute hello:start(), which is the behaviour when only the module name is specified after -s. From the docs: -s Mod [Func [Arg1, Arg2, ...]] (init flag) Makes init call the specified function. Func defaults to start. If no arguments are provided, the function is assumed to be of arity 0. Otherwise it is assumed to be of arity 1, taking the list [Arg1,Arg2,...] as argument. All arguments are passed as atoms. See init(3). regards, Vlad From a.zhuravlev@REDACTED Wed May 29 10:53:29 2013 From: a.zhuravlev@REDACTED (Alexander Zhuravlev) Date: Wed, 29 May 2013 12:53:29 +0400 Subject: [erlang-bugs] erl -s crashes In-Reply-To: References: Message-ID: <20130529085329.GA8854@zmac.js-kit.local> On Tue, May 28, 2013 at 09:30:21PM -0400, Andrew Pennebaker wrote: > When I try to `erl -s` any module, Erlang crashes. Dump attached. > > Trace: > > $ cat hello.erl > %% 22 Feb 2011 > > -module(hello). > -author("andrew.pennebaker@REDACTED"). > -export([main/1]). > > main(_) -> io:format("Hello World!~n", []).
> > $ erlc hello.erl > > $ erl -s hello > Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:2:2] > [async-threads:0] [hipe] [kernel-poll:false] [dtrace] > > {"init terminating in > do_boot",{undef,[{hello,start,[],[]},{init,start_it,1,[]},{init,start_em,1,[]}]}} > > Crash dump was written to: erl_crash.dump > init terminating in do_boot () You need to check description of the erl's "-s" flag. It accepts a module and an optional function name (by default set to "start" if not specified). zmac:~> cat hello.erl -module(hello). -export([main/0]). main() -> io:format("Hello World!~n", []). zmac:~> erl -s hello main -s erlang halt Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false] [dtrace] Hello World! > > System: > > $ specs erlang os > Specs: > > specs 0.4 > https://github.com/mcandre/specs#readme > > rebar -V > rebar 2.1.0-pre R15B03 20130528_213220 git 2.1.0-pre-46-g78fa8fc > > erl -eval 'erlang:display(erlang:system_info(otp_release)), halt().' > -noshell > "R15B03" > > system_profiler SPSoftwareDataType | grep 'System Version' > System Version: OS X 10.8.3 (12D78) > > -- > Cheers, > > Andrew Pennebaker > www.yellosoft.us > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs -- Alexander Zhuravlev From watson.timothy@REDACTED Wed May 29 11:12:24 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Wed, 29 May 2013 10:12:24 +0100 Subject: [erlang-bugs] Strange application shutdown deadlock In-Reply-To: References: <20130524144546.GB14817@ferdair.local> Message-ID: <7E9616D5-BD4E-4F87-9EDA-A3AF2262FE05@gmail.com> Any word from the OTP folks on this one? On 24 May 2013, at 18:41, Tim Watson wrote: > Gah, sorry folks - this has nothing to do with release handling, that was a red herring. Someone just pointed out that the call to get_child originates in a status check in our code. 
> > This still looks like a bug to me though, since if you're going to handle "other" messages in terminate_loop you ought to ensure they can't deadlock the vm's shutdown sequence. > > Cheers, > Tim > > On 24 May 2013, at 15:45, Fred Hebert wrote: > >> Quick question: are you running a release? >> >> If so, last time I've seen deadlocks like that was solved by making sure >> *all* my applications did depend on stdlib and kernel in their app file. >> When I skipped them, sometimes I'd find that things would lock up. >> >> My guess was that dependencies from stdlib or kernel got unloaded before >> my app and broke something, but I'm not sure -- In my case, I wasn't >> able to inspect the node as it appeared to be 100% blocked. >> >> Adding the apps ended up fixing the problem on the next shutdown. I'm >> not sure if it might be a good fix for you, but it's a stab in the dark, >> >> Regards, >> Fred. >> >> On 05/24, Tim Watson wrote: >>> We came across this at a customer's site, where one of the nodes was apparently in the process of stopping and had been in that state for at least 24 hours. The short version is that an application_master appears to be stuck waiting for a child pid (is that the X process, or the root supervisor?) which is *not* linked to it... >>> >>> The application controller is in the process of stopping an application, during which process a `get_child' message appears to have come in to that application's application_master from somewhere - we are *not* running appmon, so I'm really confused how this can happen, as the only other place where I see (indirect) calls are via the sasl release_handler!? At the bottom of this email is a dump for the application_controller and the application_master for the app it is trying to shut down. 
I can verify that the pid which the application_master is waiting on is definitely not linked to it - i.e., process_info(links, AppMasterPid) doesn't contain the process <0.256.0> that the master appears to be waiting on. >>> >>> My reading of the code is that the application_master cannot end up in get_child_i unless a get_child request was made which arrives whilst it is in its terminate loop. As I said, we're not using appmon, therefore I assume this originated in the sasl application's release_handler_1, though I'm not sure quite which route would take us there. The relevant bit of code in application_master appears to be: >>> >>> get_child_i(Child) -> >>> Child ! {self(), get_child}, >>> receive >>> {Child, GrandChild, Mod} -> {GrandChild, Mod} >>> end. >>> >>> This in turn originates, I'd guess, in the third receive clause of terminate_loop/2. Anyway, should that code not be dealing with a potentially dead pid for Child, either by handling links effectively - perhaps there is an EXIT signal in the mailbox already which is being ignored here in get_child_i/1 - or by some other means? >>> >>> What follows below is the trace/dump output. Feel free to poke me for more info as needed. 
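[Editorial note: the defensive handling of a dead Child that Tim asks about above could be sketched with a monitor-guarded receive. This is an illustrative sketch only, not the actual OTP code; the {undefined, undefined} fallback is a hypothetical choice for what to return when the child has already exited.]

```erlang
%% Sketch: a get_child_i/1 that tolerates Child dying before it replies.
%% Without the monitor, the bare receive below can block forever.
get_child_i(Child) ->
    Ref = erlang:monitor(process, Child),
    Child ! {self(), get_child},
    receive
        {Child, GrandChild, Mod} ->
            %% Normal reply: drop the monitor (flush any racing 'DOWN').
            erlang:demonitor(Ref, [flush]),
            {GrandChild, Mod};
        {'DOWN', Ref, process, Child, _Reason} ->
            %% Child exited without replying; return a placeholder
            %% instead of deadlocking the shutdown sequence.
            {undefined, undefined}
    end.
```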
>>> >>> Cheers, >>> Tim >>> >>> [TRACE/DUMP] >>> >>> pid: <6676.7.0> >>> registered name: application_controller >>> stacktrace: [{application_master,call,2, >>> [{file,"application_master.erl"},{line,75}]}, >>> {application_controller,stop_appl,3, >>> [{file,"application_controller.erl"}, >>> {line,1393}]}, >>> {application_controller,handle_call,3, >>> [{file,"application_controller.erl"}, >>> {line,810}]}, >>> {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]}] >>> ------------------------- >>> Program counter: 0x00007f9bf9a53720 (application_master:call/2 + 288) >>> CP: 0x0000000000000000 (invalid) >>> arity = 0 >>> >>> 0x00007f9bd7948360 Return addr 0x00007f9bfb97de40 (application_controller:stop_appl/3 + 176) >>> y(0) #Ref<0.0.20562.258360> >>> y(1) #Ref<0.0.20562.258361> >>> y(2) [] >>> >>> 0x00007f9bd7948380 Return addr 0x00007f9bfb973c68 (application_controller:handle_call/3 + 1392) >>> y(0) temporary >>> y(1) rabbitmq_web_dispatch >>> >>> 0x00007f9bd7948398 Return addr 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272) >>> y(0) {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 
bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]} >>> y(1) rabbitmq_web_dispatch >>> y(2) [{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}] >>> y(3) [{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}] >>> >>> 0x00007f9bd79483c0 Return addr 0x00000000008827d8 () >>> y(0) application_controller >>> y(1) {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]} >>> y(2) application_controller >>> y(3) <0.2.0> >>> y(4) {stop_application,rabbitmq_web_dispatch} >>> y(5) {<0.5864.275>,#Ref<0.0.20562.258345>} >>> y(6) Catch 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272) >>> ------------------------- >>> >>> pid: 
<6676.255.0> >>> registered name: none >>> stacktrace: [{application_master,get_child_i,1, >>> [{file,"application_master.erl"},{line,392}]}, >>> {application_master,handle_msg,2, >>> [{file,"application_master.erl"},{line,216}]}, >>> {application_master,terminate_loop,2, >>> [{file,"application_master.erl"},{line,206}]}, >>> {application_master,terminate,2, >>> [{file,"application_master.erl"},{line,227}]}, >>> {application_master,handle_msg,2, >>> [{file,"application_master.erl"},{line,219}]}, >>> {application_master,main_loop,2, >>> [{file,"application_master.erl"},{line,194}]}, >>> {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}] >>> ------------------------- >>> Program counter: 0x00007f9bf9a570e0 (application_master:get_child_i/1 + 120) >>> CP: 0x0000000000000000 (invalid) >>> arity = 0 >>> >>> 0x00007f9c1adc3dc8 Return addr 0x00007f9bf9a54eb0 (application_master:handle_msg/2 + 280) >>> y(0) <0.256.0> >>> >>> 0x00007f9c1adc3dd8 Return addr 0x00007f9bf9a54d20 (application_master:terminate_loop/2 + 520) >>> y(0) #Ref<0.0.20562.258362> >>> y(1) <0.9596.275> >>> y(2) {state,<0.256.0>,{appl_data,rabbitmq_web_dispatch,[],undefined,{rabbit_web_dispatch_app,[]},[rabbit_web_dispatch,rabbit_web_dispatch_app,rabbit_web_dispatch_registry,rabbit_web_dispatch_sup,rabbit_web_dispatch_util,rabbit_webmachine],[],infinity,infinity},[],0,<0.29.0>} >>> >>> 0x00007f9c1adc3df8 Return addr 0x00007f9bf9a55108 (application_master:terminate/2 + 192) >>> y(0) <0.256.0> >>> >>> 0x00007f9c1adc3e08 Return addr 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) >>> y(0) [] >>> y(1) normal >>> >>> 0x00007f9c1adc3e20 Return addr 0x00007f9bf9a54a60 (application_master:main_loop/2 + 1600) >>> y(0) <0.7.0> >>> y(1) #Ref<0.0.20562.258360> >>> y(2) Catch 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) >>> >>> 0x00007f9c1adc3e40 Return addr 0x00007f9bfb969420 (proc_lib:init_p_do_apply/3 + 56) >>> y(0) <0.7.0> >>> >>> 0x00007f9c1adc3e50 Return addr 
0x00000000008827d8 () >>> y(0) Catch 0x00007f9bfb969440 (proc_lib:init_p_do_apply/3 + 88) >>> ------------------------- >>> >> >> >> >>> _______________________________________________ >>> erlang-bugs mailing list >>> erlang-bugs@REDACTED >>> http://erlang.org/mailman/listinfo/erlang-bugs >> From Anders.Ramsell@REDACTED Wed May 29 20:02:23 2013 From: Anders.Ramsell@REDACTED (Anders.Ramsell@REDACTED) Date: Wed, 29 May 2013 18:02:23 +0000 Subject: [erlang-bugs] Compiler/linter bug breaking unused variable warnings Message-ID: <82DC27D088947C4D943175FDA0DA60F411526021@EXMB13TSTRZ2.tcad.telia.se> When a function creates a record and more than one field is bound to the value of a list comprehension the compiler/linter fails to generate warnings for unused variables in that function. I just tested this on R16B and the problem is still there. I use the following module to test this: --8<---------------------------------------------------------- -module(missing_warning). -export([test_missing_warning/2, test_with_warning1/2, test_with_warning2/2 ]). -record(data, {aList, bList}). test_missing_warning(Data, KeyList) -> %% Data never used - no warning. KeyList2 = filter(KeyList), %% KeyList2 never used - no warning. #data{aList = [Key || Key <- KeyList], bList = [Key || Key <- KeyList]}. test_with_warning1(Data, KeyList) -> %% Data never used - get warning. KeyList2 = filter(KeyList), %% KeyList2 never used - get warning. #data{aList = [Key || Key <- KeyList]}. %% Only one LC in the record. test_with_warning2(Data, KeyList) -> %% Data never used - get warning. KeyList2 = filter(KeyList), %% KeyList2 never used - get warning. {data, [Key || Key <- KeyList], %% Not in a record. [Key || Key <- KeyList]}. filter(L) -> L. --8<---------------------------------------------------------- In all three test functions the variables Data (in the function head) and KeyList2 (in the function body) are unused. Compiling the module should produce six warnings but I only get four. 
You get the same result with other "advanced" calls like lists:map(fun(Key) -> Key end, KeyList) so it's not limited to list comprehensions. If the fields are bound to e.g. the variable KeyList directly the warnings work just fine. /Anders From andrew@REDACTED Wed May 29 21:46:24 2013 From: andrew@REDACTED (Andrew Thompson) Date: Wed, 29 May 2013 15:46:24 -0400 Subject: [erlang-bugs] Eunit, test generators and code:purge() Message-ID: <20130529194624.GE31341@hijacked.us> So, I've been chasing a failure in a test suite for the last couple days. Turns out, the problem is the test suite does this: * Test module A, with a test generator function * Test module B, and meck module A Eunit's runner is holding a reference to something in module A (probably a fun), so when meck does a purge on A as part of test B, the code server kills the eunit test runner process. This bug was actually reported three years ago: http://erlang.org/pipermail/erlang-bugs/2010-June/001844.html But it still affects at least R15B03, which is what I'm using. I have a slightly modified version of b_mod that proves that eunit is holding a ref to something from a_mod: -module(b_mod). -include_lib("eunit/include/eunit.hrl"). second_test() -> ?debugFmt("I am ~p ~p~n", [self(), erlang:process_info(self())]), true = code:delete(a_mod), ?debugFmt("processes using a_mod: ~p~n", [[P || P <- processes(), erlang:check_process_code(P, a_mod)]]), true = code:soft_purge(a_mod), ok. I looked into trying to patch this, but the eunit code is too convoluted for me to understand where it is holding the problematic reference. Andrew From joearms@REDACTED Wed May 29 15:02:26 2013 From: joearms@REDACTED (Joe Armstrong) Date: Wed, 29 May 2013 15:02:26 +0200 Subject: [erlang-bugs] ioi:columns bug Message-ID: io:columns() does not work in a process spawned from the command line -module(bug). -compile(export_all). test() -> io:format("~p~n", [io:columns()]). When I run this in a shell it works 2> c(bug). 
{ok,bug} 3> bug:test(). {ok,132} ok But not from the command line > erl -s bug test Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) 1> {error,enotsup} Something is strange - the group leader of a process launched from the command line is different to the shell group leader. But I can do io:format from a command launched from the command line --- what's happening here? /Joe From ml.jc.campagne@REDACTED Thu May 30 10:17:47 2013 From: ml.jc.campagne@REDACTED (Jean-Charles Campagne) Date: Thu, 30 May 2013 10:17:47 +0200 Subject: [erlang-bugs] ioi:columns bug In-Reply-To: References: Message-ID: <96CFB37C-B16F-40DF-B83A-3E65DDAA6990@gmail.com> Hi Joe, Not sure what is going on here either; I stumbled upon the same issue. I did not have the opportunity to get to the bottom of it though. However, using "-noshell" does not generate an error. $ erl -s bug test -noshell {ok,143} Then again that might be incompatible with what you are trying to achieve. Hope that sheds some light. It worked for me as I did not need to have a shell in the end. Also, I noticed that specifying 'standard_error' as the IoDevice works (but not 'standard_io'), as in: ====================================================== -module(bug_err). -compile(export_all). test_err() -> io:format("~p~n", [io:columns(standard_error)]). ====================================================== $ erl -s bug_err test_err Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:2:2] [async-threads:0] [hipe] [kernel-poll:false] {ok,143} Eshell V5.9.3.1 (abort with ^G) 1> ====================================================== My guess is standard_io somehow is not opened/accessible. My 2 cents. Regards, Jc On 29 May 2013, at 15:02, Joe Armstrong wrote: > io:columns() does not work in a process spawned from the command line > > -module(bug). > -compile(export_all). > > test() -> > io:format("~p~n", [io:columns()]).
> > When I run this in a shell it works > > 2> c(bug). > {ok,bug} > 3> bug:test(). > {ok,132} > ok > > But not from the command line > >> erl -s bug test > Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] > [async-threads:10] [hipe] [kernel-poll:false] > > Eshell V5.10.1 (abort with ^G) > 1> {error,enotsup} > > Something is strange - the group leader of a process launched from the > command line is different to the shell group leader. But I can do io:format > from a command launched from the command line --- what's happening here? > > /Joe > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs From magnus@REDACTED Thu May 30 12:30:28 2013 From: magnus@REDACTED (Magnus Henoch) Date: Thu, 30 May 2013 11:30:28 +0100 Subject: [erlang-bugs] Eunit, test generators and code:purge() In-Reply-To: <20130529194624.GE31341@hijacked.us> (Andrew Thompson's message of "Wed, 29 May 2013 15:46:24 -0400") References: <20130529194624.GE31341@hijacked.us> Message-ID: Andrew Thompson writes: > So, I've been chasing a failure in a test suite for the last couple > days. Turns out, the problem is the test suite does this: > > * Test module A, with a test generator function > * Test module B, and meck module A > > Eunit's runner is holding a reference to something in module A (probably > a fun), so when meck does a purge on A as part of test B, the code > server kills the eunit test runner process. This bug was actually > reported three years ago: > > http://erlang.org/pipermail/erlang-bugs/2010-June/001844.html > > But it still affects at least R15B03, which is what I'm using. > > I have a slightly modified version of b_mod that proves that eunit is > holding a ref to something from a_mod: > > -module(b_mod). > > -include_lib("eunit/include/eunit.hrl"). 
> > second_test() -> > ?debugFmt("I am ~p ~p~n", [self(), erlang:process_info(self())]), > true = code:delete(a_mod), > ?debugFmt("processes using a_mod: ~p~n", [[P || P <- processes(), erlang:check_process_code(P, a_mod)]]), > true = code:soft_purge(a_mod), > ok. > > > I looked into trying to patch this, but the eunit code is too convoluted > for me to understand where it is holding the problematic reference. I've had the same problem, and somehow discovered that it works if the test generator function in A has a title. That is, instead of: my_test_() -> ?_test(do_something()). write: my_test_() -> {"do something", ?_test(do_something())}. That led me to think that Eunit holds on to the fun object as a "name" if the test has no explicit title. Regards, Magnus From andrew@REDACTED Thu May 30 15:14:28 2013 From: andrew@REDACTED (Andrew Thompson) Date: Thu, 30 May 2013 09:14:28 -0400 Subject: [erlang-bugs] Eunit, test generators and code:purge() In-Reply-To: References: <20130529194624.GE31341@hijacked.us> Message-ID: <20130530131428.GF31341@hijacked.us> On Thu, May 30, 2013 at 11:30:28AM +0100, Magnus Henoch wrote: > I've had the same problem, and somehow discovered that it works if the > test generator function in A has a title. Interesting idea. Unfortunately, the test I have is using a setup fixture (with named tests), so your workaround doesn't seem to apply here. 
Andrew

From magnus@REDACTED Thu May 30 15:48:45 2013
From: magnus@REDACTED (Magnus Henoch)
Date: Thu, 30 May 2013 14:48:45 +0100
Subject: [erlang-bugs] Eunit, test generators and code:purge()
In-Reply-To: <20130530131428.GF31341@hijacked.us> (Andrew Thompson's message of "Thu, 30 May 2013 09:14:28 -0400")
References: <20130529194624.GE31341@hijacked.us> <20130530131428.GF31341@hijacked.us>
Message-ID:

Andrew Thompson writes:

> On Thu, May 30, 2013 at 11:30:28AM +0100, Magnus Henoch wrote:
>> I've had the same problem, and somehow discovered that it works if the
>> test generator function in A has a title.
>
> Interesting idea. Unfortunately, the test I have is using a setup
> fixture (with named tests), so your workaround doesn't seem to apply
> here.

I found that the same workaround worked for setup fixtures:

    my_test_() ->
        {"this title saves the test", setup, Setup, Cleanup, Tests}.

/m

From andrew@REDACTED Thu May 30 16:21:36 2013
From: andrew@REDACTED (Andrew Thompson)
Date: Thu, 30 May 2013 10:21:36 -0400
Subject: [erlang-bugs] Eunit, test generators and code:purge()
In-Reply-To:
References: <20130529194624.GE31341@hijacked.us> <20130530131428.GF31341@hijacked.us>
Message-ID: <20130530142136.GG31341@hijacked.us>

On Thu, May 30, 2013 at 02:48:45PM +0100, Magnus Henoch wrote:
> I found that the same workaround worked for setup fixtures:
>
>     my_test_() ->
>         {"this title saves the test", setup, Setup, Cleanup, Tests}.
>

Thank you! I had no idea that was even a valid fixture, but it worked!
Andrew

From n.oxyde@REDACTED Fri May 31 00:46:57 2013
From: n.oxyde@REDACTED (Anthony Ramine)
Date: Fri, 31 May 2013 00:46:57 +0200
Subject: [erlang-bugs] Compiler/linter bug breaking unused variable warnings
In-Reply-To: <82DC27D088947C4D943175FDA0DA60F411526021@EXMB13TSTRZ2.tcad.telia.se>
References: <82DC27D088947C4D943175FDA0DA60F411526021@EXMB13TSTRZ2.tcad.telia.se>
Message-ID: <5312CBA2-4C31-46FF-9E8A-74589DB5349D@gmail.com>

Hello,

Smaller test case reproducing the bug, without KeyList2 nor filter/1:

-8<--
-module(missing_warning).

-export([test_missing_warning/2,
         test_with_warning1/2,
         test_with_warning2/2
        ]).

-record(data, {aList, bList}).

test_missing_warning(Data, KeyList) -> %% Data, KeyList never used - no warning.
    #data{aList = [Key || Key <- []],
          bList = [Key || Key <- []]}.

test_with_warning1(Data, KeyList) -> %% Data, KeyList never used - get warning.
    #data{aList = [Key || Key <- []]}. %% Only one LC in the record.

test_with_warning2(Data, KeyList) -> %% Data, KeyList never used - get warning.
    {data,
     [Key || Key <- []], %% Not in a record.
     [Key || Key <- []]}.
-->8-

Regards,

--
Anthony Ramine

On 29 May 2013, at 20:02, wrote:

> When a function creates a record and more than one field is bound to
> the value of a list comprehension the compiler/linter fails to
> generate warnings for unused variables in that function. I just
> tested this on R16B and the problem is still there.
>
> I use the following module to test this:
>
> --8<----------------------------------------------------------
> -module(missing_warning).
>
> -export([test_missing_warning/2,
>          test_with_warning1/2,
>          test_with_warning2/2
>         ]).
>
> -record(data, {aList, bList}).
>
> test_missing_warning(Data, KeyList) -> %% Data never used - no warning.
>     KeyList2 = filter(KeyList), %% KeyList2 never used - no warning.
>     #data{aList = [Key || Key <- KeyList],
>           bList = [Key || Key <- KeyList]}.
>
> test_with_warning1(Data, KeyList) -> %% Data never used - get warning.
>     KeyList2 = filter(KeyList), %% KeyList2 never used - get warning.
>     #data{aList = [Key || Key <- KeyList]}. %% Only one LC in the record.
>
> test_with_warning2(Data, KeyList) -> %% Data never used - get warning.
>     KeyList2 = filter(KeyList), %% KeyList2 never used - get warning.
>     {data,
>      [Key || Key <- KeyList], %% Not in a record.
>      [Key || Key <- KeyList]}.
>
> filter(L) -> L.
> --8<----------------------------------------------------------
>
> In all three test functions the variables Data (in the function head) and
> KeyList2 (in the function body) are unused.
> Compiling the module should produce six warnings but I only get four.
> You get the same result with other "advanced" calls like
>     lists:map(fun(Key) -> Key end, KeyList)
> so it's not limited to list comprehensions.
> If the fields are bound to e.g. the variable KeyList directly the warnings
> work just fine.
>
> /Anders
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
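[Editorial note, not part of the thread.] The six-versus-four warning count in Anders's report can be checked mechanically from the shell. The sketch below is a hypothetical helper (the module name `warn_check` and function `count_unused/1` are inventions for illustration): it compiles a source file in memory with `compile:file/2` and the `return_warnings` option, then counts the `unused_var` warnings that erl_lint produced. Run over the `missing_warning` module above, it should return 6 on a fixed compiler and only 4 on releases exhibiting the bug.

```erlang
%% Hypothetical helper (not from the thread): compile a file and count the
%% unused-variable warnings reported by erl_lint.
-module(warn_check).
-export([count_unused/1]).

count_unused(File) ->
    %% binary: compile in memory, don't write a .beam file;
    %% return_warnings: get {ok, Mod, Bin, Warnings} instead of printing them.
    {ok, _Mod, _Bin, Warnings} = compile:file(File, [binary, return_warnings]),
    %% Warnings is [{SourceFile, [{Location, Module, Descriptor}]}]; an unused
    %% variable shows up as an {unused_var, VarName} descriptor from erl_lint.
    length([V || {_Src, Ws} <- Warnings,
                 {_Loc, erl_lint, {unused_var, V}} <- Ws]).
```

For a quick sanity check on any release, a module with exactly one unused binding should make `count_unused/1` return 1.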