From fritchie@REDACTED Wed May 1 00:13:15 2013 From: fritchie@REDACTED (Scott Lystig Fritchie) Date: Tue, 30 Apr 2013 17:13:15 -0500 Subject: [erlang-bugs] Schedulers getting "stuck", part II In-Reply-To: Message of "Tue, 30 Apr 2013 15:34:09 CDT." <78245.1367354049@snookles.snookles.com> Message-ID: <83533.1367359995@snookles.snookles.com> Patrik, there are a couple of synthetic load cases that have an end result of what we occasionally see Riak and Riak CS doing in the wild. Many, many thanks to Joseph Blomstedt for inventing these two modules. test10.erl: https://gist.github.com/jtuple/0d9ca553b7e58adcb6f4 test11.erl: https://gist.github.com/jtuple/8f12ce9c21471f5d6f01 Both can be used by running the 'go/0' function. The test10:go() function creates an oscillation between a couple of workloads: one that tends toward scheduler collapse, and one that tends to wake the schedulers up again. The test11:go() function uses only a single load that tends toward scheduler collapse. Both of them fail fairly regularly on my 8-core MBP using R15B01, R15B03, and R16B. The io:format() messages are sent while load is not running, with very generous pauses before starting the next phase of workload. If you call io:format() during an unfairly-scheduled workload (which these tests excel at creating), the messages can be delayed by dozens of seconds. Note that these synthetic tests use two different functions to cause scheduler collapse: test10.erl uses crypto:md5_update/2, a NIF, and test11.erl uses erlang:external_size/1, a BIF. It's quite likely that erlang:term_to_binary/1 is similarly effective/buggy. Neither of them fails when using this patch on any of those three VM versions: https://github.com/slfritchie/otp/compare/erlang:maint...disable-scheduler-sleeps or https://github.com/slfritchie/otp/tree/disable-scheduler-sleeps ... when also using "+scl false +zdnfgtse 500:500".
-Scott From watson.timothy@REDACTED Wed May 1 13:32:42 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Wed, 1 May 2013 12:32:42 +0100 Subject: [erlang-bugs] Schedulers getting "stuck", part II In-Reply-To: <83533.1367359995@snookles.snookles.com> References: <83533.1367359995@snookles.snookles.com> Message-ID: On 30 Apr 2013, at 23:13, Scott Lystig Fritchie wrote: > > ... when also using "+scl false +zdnfgtse 500:500". > Does dnfgtse stand for what I think it does? :) From spawn.think@REDACTED Wed May 1 15:45:30 2013 From: spawn.think@REDACTED (Ahmed Omar) Date: Wed, 1 May 2013 15:45:30 +0200 Subject: [erlang-bugs] Crash in mnesia_controller with function clause exception from is_tab_blocked Message-ID: On startup of a node in the cluster, we observed the following crash report: 2013-04-29 15:33:12 =ERROR REPORT==== Mnesia('ejabberd@REDACTED'): ** ERROR ** (core dumped to file: "/var/lib/ejabberd/MnesiaCore.ejabberd@REDACTED") ** FATAL ** mnesia_controller crashed: {function_clause,[{mnesia_controller,is_tab_blocked,[{blocked,{blocked,[{'ejabberd@REDACTED ',disc_only_copies},{'ejabberd@REDACTED',disc_only_copies}] The exception can be reproduced by the following steps: mnesia:start(), mnesia:create_table(test1, []), mnesia_controller:block_table(test1), mnesia_controller:block_table(test1), mnesia_controller:add_active_replica(test1,node()). I'm preparing a patch to submit. Best Regards, Ahmed Omar -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From isreal-erlang-bugs-at-erlang.org@REDACTED Wed May 1 18:35:35 2013 From: isreal-erlang-bugs-at-erlang.org@REDACTED (David Buckley) Date: Wed, 1 May 2013 17:35:35 +0100 Subject: [erlang-bugs] Bug in unicode characters_to_list trap Message-ID: <20130501163535.GA29904@cirno.fluorescence.co.uk> Simple test session: [ 17:28 ] bucko@REDACTED:~% erl Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false] Eshell V5.9.1 (abort with ^G) 1> <<_, RR/binary>> = <<$a,164,161,$b>>. <<"a??b">> 2> RR. <<"??b">> 3> unicode:characters_to_list(RR). {error,[],<<"a??">>} 4> unicode:characters_to_list(list_to_binary(binary_to_list(RR))). {error,[],<<"??b">>} I'm using Debian's default erlang build, but I've verified the bug on various others, and can't see it in the release notes. Description: The latter two calls should return the same value, as list_to_binary(binary_to_list(RR)) =:= RR. I would guess that the code in erlang's guts is taking the failure offset into the binary part as an offset into the full binary. At least, the return values are consistent with this. Workaround is just to call list_to_binary(binary_to_list()) on your data before calling unicode:characters_to_list on it. Or manually offsetting into the binary yourself in the case of a failed parse. -- David Buckley From pan@REDACTED Thu May 2 10:48:10 2013 From: pan@REDACTED (Patrik Nyblom) Date: Thu, 2 May 2013 10:48:10 +0200 Subject: [erlang-bugs] Bug in unicode characters_to_list trap In-Reply-To: <20130501163535.GA29904@cirno.fluorescence.co.uk> References: <20130501163535.GA29904@cirno.fluorescence.co.uk> Message-ID: <5182284A.50603@erlang.org> Hi David! On 05/01/2013 06:35 PM, David Buckley wrote: > Simple test session: > > [ 17:28 ] bucko@REDACTED:~% erl > Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [hipe] [kernel-poll:false] > > Eshell V5.9.1 (abort with ^G) > 1> <<_, RR/binary>> = <<$a,164,161,$b>>. > <<"a??b">> > 2> RR. 
> <<"??b">> > 3> unicode:characters_to_list(RR). > {error,[],<<"a??">>} > 4> unicode:characters_to_list(list_to_binary(binary_to_list(RR))). > {error,[],<<"??b">>} Yep - that's a bug, no doubt... Can you try a source code patch when I've found a cure? > > I'm using Debian's default erlang build, but I've verified the bug on > various others, and can't see it in the release notes. > > Description: The latter two calls should return the same value, as > list_to_binary(binary_to_list(RR)) =:= RR. > > I would guess that the code in erlang's guts is taking the failure offset > into the binary part as an offset into the full binary. At least, the > return values are consistent with this. Good guess, I agree. > > Workaround is just to call list_to_binary(binary_to_list()) on your data > before calling unicode:characters_to_list on it. Or manually offsetting > into the binary yourself in the case of a failed parse. > Thanks! /Patrik From pan@REDACTED Thu May 2 15:56:36 2013 From: pan@REDACTED (Patrik Nyblom) Date: Thu, 2 May 2013 15:56:36 +0200 Subject: [erlang-bugs] Bug in unicode characters_to_list trap In-Reply-To: <5182284A.50603@erlang.org> References: <20130501163535.GA29904@cirno.fluorescence.co.uk> <5182284A.50603@erlang.org> Message-ID: <51827094.8070907@erlang.org> Hi again! On 05/02/2013 10:48 AM, Patrik Nyblom wrote: > Hi David! > > On 05/01/2013 06:35 PM, David Buckley wrote: >> Simple test session: >> >> [ 17:28 ] bucko@REDACTED:~% erl >> Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] >> [async-threads:0] [hipe] [kernel-poll:false] >> >> Eshell V5.9.1 (abort with ^G) >> 1> <<_, RR/binary>> = <<$a,164,161,$b>>. >> <<"a??b">> >> 2> RR. >> <<"??b">> >> 3> unicode:characters_to_list(RR). >> {error,[],<<"a??">>} >> 4> unicode:characters_to_list(list_to_binary(binary_to_list(RR))). >> {error,[],<<"??b">>} > Yep - that's a bug, no doubt... > Can you try a source code patch when I've found a cure? 
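As an aside, the suspected mechanics are easy to model outside Erlang. The sketch below (Python, purely illustrative; the function names and offset bookkeeping are invented for the sketch and are not the actual erts code) shows how applying the failure offset to the base binary instead of to the sub-binary reproduces exactly the values seen in the transcript:

```python
# Model of the suspected bug: RR is a sub-binary of FULL starting at offset 1.
# UTF-8 decoding of RR fails immediately (0xA4 is a continuation byte), so the
# undecoded "rest" should be all 3 bytes of RR itself.
FULL = bytes([ord('a'), 164, 161, ord('b')])  # <<$a,164,161,$b>>
SUB_OFFSET = 1                                # RR = <<164,161,$b>>
RR = FULL[SUB_OFFSET:]

def rest_correct(base, sub_offset, consumed, rest_len):
    # Apply the failure offset relative to the start of the sub-binary.
    start = sub_offset + consumed
    return base[start:start + rest_len]

def rest_buggy(base, sub_offset, consumed, rest_len):
    # Suspected bug: the failure offset is applied to the base binary,
    # ignoring where the sub-binary starts within it.
    return base[consumed:consumed + rest_len]

# Decoding fails at offset 0 of RR with 3 bytes left over:
print(rest_correct(FULL, SUB_OFFSET, 0, 3))  # b'\xa4\xa1b'  == RR, the expected rest
print(rest_buggy(FULL, SUB_OFFSET, 0, 3))    # b'a\xa4\xa1'  == the <<"a??">> seen above
```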
A small patch is attached; the full patch will of course also contain a test case, but this is the minimal fix. It would be great if you could also test it; I will meanwhile prepare a fix in maint... >> >> I'm using Debian's default erlang build, but I've verified the bug on >> various others, and can't see it in the release notes. >> >> Description: The latter two calls should return the same value, as >> list_to_binary(binary_to_list(RR)) =:= RR. >> >> I would guess that the code in erlang's guts is taking the failure offset >> into the binary part as an offset into the full binary. At least, the >> return values are consistent with this. > Good guess, I agree. And, you were absolutely right! >> >> Workaround is just to call list_to_binary(binary_to_list()) on your data >> before calling unicode:characters_to_list on it. Or manually offsetting >> into the binary yourself in the case of a failed parse. >> > Thanks! > /Patrik Cheers, Patrik > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs -------------- next part -------------- A non-text attachment was scrubbed... Name: unicode_rest.diff Type: text/x-patch Size: 765 bytes Desc: not available URL: From bgustavsson@REDACTED Thu May 2 17:42:54 2013 From: bgustavsson@REDACTED (=?UTF-8?Q?Bj=C3=B6rn_Gustavsson?=) Date: Thu, 2 May 2013 17:42:54 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> Message-ID: On Sun, Mar 31, 2013 at 4:22 PM, Anthony Ramine wrote: > > This patch implements this new error and simplifies how v3_core works with > forbidden unsized tail segments in patterns of bit string generators. 
> > git fetch https://github.com/nox/otp illegal-bitstring-gen-pattern > > > https://github.com/nox/otp/compare/erlang:maint...illegal-bitstring-gen-pattern > > https://github.com/nox/otp/compare/erlang:maint...illegal-bitstring-gen-pattern.patch There is a major and a minor issue. The major issue is that the test suites bs_bincomp_SUITE.erl (compiler application) and erl_eval_SUITE.erl (stdlib application) no longer compile. The minor issue is that erl_eval and eval_bits have assertions to reject bad inputs in case the abstract code has not been verified by erl_lint. The assertion can be written like this to reject unsized tails in binary generators: diff --git a/lib/stdlib/src/eval_bits.erl b/lib/stdlib/src/eval_bits.erl index e49cbc1..56be5a6 100644 --- a/lib/stdlib/src/eval_bits.erl +++ b/lib/stdlib/src/eval_bits.erl @@ -193,6 +193,13 @@ bin_gen_field({bin_element,Line,VE,Size0,Options0}, V = erl_eval:partial_eval(VE), NewV = coerce_to_float(V, Type), match_check_size(Mfun, Size1, BBs0), + case Size1 of + {atom,_,all} -> + %% An unsized field is forbidden in a generator. + throw(invalid); + _ -> + ok + end, {value, Size, _BBs} = Efun(Size1, BBs0), bin_gen_field1(Bin, Type, Size, Unit, Sign, Endian, NewV, Bs0, BBs0, Mfun). -- Björn Gustavsson, Erlang/OTP, Ericsson AB -------------- next part -------------- An HTML attachment was scrubbed... URL: From n.oxyde@REDACTED Thu May 2 21:12:34 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Thu, 2 May 2013 21:12:34 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> Message-ID: Hello Björn, Both issues fixed. Please refetch. Regards, -- Anthony Ramine Le 2 mai 2013 à 17:42, Björn Gustavsson a écrit : > There is a major and a minor issue. 
From stefan.zegenhagen@REDACTED Fri May 3 10:34:40 2013 From: stefan.zegenhagen@REDACTED (Stefan Zegenhagen) Date: Fri, 03 May 2013 10:34:40 +0200 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl Message-ID: <1367570080.31752.25.camel@ax-sze> Dear all, I've stumbled across a small issue in the implementation of the process group server. The code in group.erl spawns a server process that monitors the exit of both the shell and the user_drv that started it. In the regular server_loop, exits of the user_drv (Drv) are handled as follows: receive ... {'EXIT',Drv,R} -> exit(R); When a blocking io_request is being executed, the following code is executed instead: %% 'kill' instead of R, since the shell is not always in %% a state where it is ready to handle a termination %% message. exit_shell(kill), exit(R) Besides the behaviour being inconsistent, it also means that our shell process monitor receives the 'killed' exit reason more often than the real exit reason, which defeats our custom error handling and logging. Looking at the comment above the exit_shell(kill) statement, there seems to have been a reason to put it there at some point. Looking at the code in the io module that performs those io_requests, it should not be necessary. I'm unsure whether it is safe to remove the exit_shell(kill) statement or whether something would break terribly. However, not receiving the correct exit reason does give us a headache. Kind regards, -- Dr. Stefan Zegenhagen arcutronix GmbH Garbsener Landstr. 10 30419 Hannover Germany Tel: +49 511 277-2734 Fax: +49 511 277-2709 Email: stefan.zegenhagen@REDACTED Web: www.arcutronix.com *Synchronize the Ethernet* General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht Hannover; Ust-Id: DE257551767. Please consider the environment before printing this message. 
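To make the inconsistency concrete, here is a tiny model (Python, purely illustrative; the function name and the example reasons are invented, and nothing here is OTP code) of what a monitor on the shell observes under the two code paths quoted above:

```python
# Illustrative model: what a process monitoring the shell observes when
# the user_drv exits with reason drv_reason, depending on which group.erl
# code path runs. All names are invented for this sketch.
def observed_shell_exit_reason(drv_reason, in_io_request):
    if in_io_request:
        # Blocking io_request path: group.erl does exit_shell(kill),
        # so the monitor sees 'killed' and drv_reason is lost.
        return 'killed'
    # Regular server_loop path: exit(R) propagates the real reason.
    return drv_reason

print(observed_shell_exit_reason('cable_unplugged', in_io_request=False))  # cable_unplugged
print(observed_shell_exit_reason('cable_unplugged', in_io_request=True))   # killed
```

Whether the exit arrives during an io_request is pure timing from the monitor's point of view, which is why the same underlying event can yield two different logged reasons.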
From bgustavsson@REDACTED Fri May 3 11:18:49 2013 From: bgustavsson@REDACTED (=?UTF-8?Q?Bj=C3=B6rn_Gustavsson?=) Date: Fri, 3 May 2013 11:18:49 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> Message-ID: On Thu, May 2, 2013 at 9:12 PM, Anthony Ramine wrote: > Hello Björn, > > Both issues fixed. Please refetch. > > > No, bs_bincomp_SUITE still does not compile. You have removed the tail/1 function, but not the export of it. Another thing is that the modification of bs_bincomp_SUITE is done in the wrong commit. It should be done in the same commit that makes tails illegal. (That may cause problems when running 'git bisect'.) -- Björn Gustavsson, Erlang/OTP, Ericsson AB -------------- next part -------------- An HTML attachment was scrubbed... URL: From n.oxyde@REDACTED Fri May 3 11:32:13 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Fri, 3 May 2013 11:32:13 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> Message-ID: <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> Hello Björn, Silly me, to validate my changes I ran erts/compiler/test/bs_bincomp_SUITE.erl instead of the one in lib/compiler/test/. I always forget these damn export attributes. Both issues really fixed for good now. Please refetch. Regards, -- Anthony Ramine Le 3 mai 2013 à 11:18, Björn Gustavsson a écrit : > > No, bs_bincomp_SUITE still does not compile. You have removed the tail/1 > function, but not the export of it. 
From mononcqc@REDACTED Fri May 3 16:42:01 2013 From: mononcqc@REDACTED (Fred Hebert) Date: Fri, 3 May 2013 10:42:01 -0400 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <1367570080.31752.25.camel@ax-sze> References: <1367570080.31752.25.camel@ax-sze> Message-ID: <20130503144159.GA57046@ferdair.local> Hi, The reason I could see to have exit_shell(kill) (which in turn calls exit/2 on the shell iff a shell is attached) is that you want, out of all doubt, to get rid of the shell. The group.erl module is entirely distinct from the shell implementation. For example, most shells use shell.erl, but the one used by the SSH daemon has a custom one going that's different, and there are also concepts such as safe shells. In the event that some shell implementation traps exits (and they should be expected to do so if they want to handle the 'interrupt' signal, which is necessary to deal with some ^G commands such as 'i', in any special manner), if the shell is currently blocked in an IO request, it will *never* see the 'EXIT' signal given it is busy waiting on another message, namely the IO Request's response. Because of this specific reason, it might be necessary to kill the shell with the 'kill' signal, which cannot be trapped. We just can't assume that the other shell will receive it. Now granted, I think it could be possible to send both exit signals there (exit_shell(R), exit_shell(kill), exit(R)) just in case in order to allow more obvious exit messages, but I'm not sure it would necessarily be worth it. Someone from the OTP team (or Robert) could voice their opinion there. Regards, Fred. On 05/03, Stefan Zegenhagen wrote: > Dear all, > > I've stumbled across a small issue in the implementation of the process > group server. > > The code in group.erl spawns a server process that monitors the exit of > both the shell and the user_drv that started it. 
In the regular > server_loop, exits of the user_drv (Drv) are handled as follows: > > receive > ... > {'EXIT',Drv,R} -> > exit(R); > > > When a blocking io_request is being executed, the following code is > executed instead: > > %% 'kill' instead of R, since the shell is not always in > %% a state where it is ready to handle a termination > %% message. > exit_shell(kill), > exit(R) > > Besides the behaviour being inconsistent, it also means that our shell > process monitor receives the 'killed' exit reason more often than the > real exit reason, which defeats our custom error handling and logging. > > Looking at the comment above the exit_shell(kill) statement, there seems > to have been a reason to put it there at some time. Looking at the code > in the io module that does those io_requests, it should not be > necessary. > > I'm unsure whether it is safe to remove the exit_shell(kill) statement > or whether something would terribly break. However, not receiving the > correct exit reason does give us a headache. > > > Kind regards, > > -- > Dr. Stefan Zegenhagen > > arcutronix GmbH > Garbsener Landstr. 10 > 30419 Hannover > Germany > > Tel: +49 511 277-2734 > Fax: +49 511 277-2709 > Email: stefan.zegenhagen@REDACTED > Web: www.arcutronix.com > > *Synchronize the Ethernet* > > General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - > Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht > Hannover; Ust-Id: DE257551767. > > Please consider the environment before printing this message. 
> > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs From stefan.zegenhagen@REDACTED Mon May 6 11:25:17 2013 From: stefan.zegenhagen@REDACTED (Stefan Zegenhagen) Date: Mon, 06 May 2013 11:25:17 +0200 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <20130503144159.GA57046@ferdair.local> References: <1367570080.31752.25.camel@ax-sze> <20130503144159.GA57046@ferdair.local> Message-ID: <1367832317.31752.63.camel@ax-sze> Hi, thank you very much for the response. > The group.erl module is entirely distinct from the shell implementation. > For example, most shells use shell.erl, but the one used by the SSH > daemon has a custom one going that's different, and there are also > concepts such as safe shells. In fact, we've written our own shell ;-) and provide a command line interface to view / change device settings after logon. > In the event that some shell implementation traps exits (and they > should be expected to do so if they want to handle the 'interrupt' > signal, which is necessary to deal with some ^G commands such as 'i', in > any special manner), if the shell is currently blocked in an IO request, > it will *never* see the 'EXIT' signal given it is busy waiting on > another message, namely the IO Request's response. > Because of this specific reason, it might be necessary to kill the > shell with the 'kill' signal, which cannot be trapped. We just can't > assume that the other shell will receive it. Unfortunately, this is only half the truth. I/O requests will usually not stop the shell from listening for 'EXIT' messages from the group_leader(), because I/O requests are implemented as a message exchange between those two. Additionally, the io.erl module (which does the I/O requests) is terribly careful not to miss any exit signal sent by the group_leader() / I/O channel. 
It does the following: - create a monitor for the group_leader() (or the supplied I/O channel) - send an {io_request, *} message to the I/O channel - listen for * an {io_reply, *} * the 'DOWN' message from the process monitor * any 'EXIT' message from the I/O channel If any matching 'DOWN' or 'EXIT' message is received, the corresponding opposite is fetched from the message queue as well and {error, terminated} is returned to the caller. This is already bad by itself because it drops the (possibly important) reason of the error. In conclusion, by using the io module for input/output a shell can never get stuck in a state where it is unkillable by doing an I/O request. But it is true that an I/O request blocks the shell for calls/messages from *OTHER* processes than the I/O channel. I can see that it might be wanted to get rid of the shell for sure. One might imagine a case where the shell is trapping exits but "refuses to die" in response to a trappable exit signal. But then, it is not clear to me why the same measure (e.g. exit_shell(kill)) is not taken in the case where the group.erl's server process is *NOT* executing an I/O request right now and the shell might truly be blocked by activities that prevent it from reacting to the exit signal. But back to the original issue: there are several, distinct reasons why we might need to forcibly terminate a shell session *AND* to do appropriate logging IFF a user is currently logged on (for security/auditing reasons), e.g.: - the serial cable is being unplugged while a user is logged on - someone tries to interfere with the system by sending huge amounts of binary data over the serial port (possible denial-of-service) - ... Our user_drv.erl replacement exits with an appropriate reason in those cases and our shell implementation needs to know the exit reason to do the right thing depending on the situation. This is currently impossible and I was wondering whether anything could be done about it. 
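The request loop sketched in the list above can be modeled roughly as follows (Python used as executable pseudocode; the function and message names are invented for the sketch and this is not the io.erl implementation). It also shows exactly where the exit reason gets dropped:

```python
# Illustrative model of the io-protocol client loop described above.
# A deque stands in for the process mailbox; tuples stand in for the
# {io_reply, ...}, 'DOWN' and 'EXIT' messages.
from collections import deque

def io_request(mailbox):
    # After monitoring the I/O channel and sending the io_request,
    # wait for the reply or for the channel going away.
    while mailbox:
        msg = mailbox.popleft()
        tag = msg[0]
        if tag == 'io_reply':
            return ('ok', msg[1])
        if tag in ('DOWN', 'EXIT'):
            # The channel died: the caller gets {error, terminated},
            # and the actual exit reason in msg[1] is discarded here.
            return ('error', 'terminated')
    return ('error', 'no_reply')

print(io_request(deque([('io_reply', 'data')])))         # ('ok', 'data')
print(io_request(deque([('EXIT', 'cable_unplugged')])))  # ('error', 'terminated')
```

The second call illustrates Stefan's complaint: whatever informative reason the channel exited with, the caller only ever sees 'terminated'.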
> Now granted, I think it could be possible to send both exit signals > there (exit_shell(R), exit_shell(kill), exit(R)) just in case in order > to allow more obvious exit messages, but I'm not sure it would > necessarily be worth it. Someone from the OTP team (or Robert) could > voice their opinion there. Whether this works would certainly depend on the timing. The shell process should be given enough time to have a chance to process the first exit signal before being forcibly killed by the second one. Can this be guaranteed? Kind regards, -- Dr. Stefan Zegenhagen arcutronix GmbH Garbsener Landstr. 10 30419 Hannover Germany Tel: +49 511 277-2734 Fax: +49 511 277-2709 Email: stefan.zegenhagen@REDACTED Web: www.arcutronix.com *Synchronize the Ethernet* General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht Hannover; Ust-Id: DE257551767. Please consider the environment before printing this message. From mononcqc@REDACTED Mon May 6 15:34:21 2013 From: mononcqc@REDACTED (Fred Hebert) Date: Mon, 6 May 2013 09:34:21 -0400 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <1367832317.31752.63.camel@ax-sze> References: <1367570080.31752.25.camel@ax-sze> <20130503144159.GA57046@ferdair.local> <1367832317.31752.63.camel@ax-sze> Message-ID: <20130506133420.GA64025@ferdair.local> On 05/06, Stefan Zegenhagen wrote: > > Unfortunately, this is only half the truth. I/O requests will usually > not cause the shell to not listen to 'EXIT' requests from the > group_leader() because I/O requests are implemented as message exchange > between those two. Additionally, the io.erl module (which does the I/O > requests) is terribly careful not to miss any exit signal sent by the > group_leader() / I/O channel. 
It does the following: > - create a monitor for the group_leader() (or the supplied I/O channel) > - send an {io_request, *} message to the I/O channel > - listen for > * an {io_reply, *} > * the 'DOWN' message from the process monitor > * any 'EXIT' message from the I/O channel > > If any matching 'DOWN' or 'EXIT' message is received, the corresponding > opposite is fetched from the message queue as well and {error, > terminated} is returned to the caller. This is already bad by itself > because it drops the (possibly important) reason of the error. > > In conclusion, by using the io module for input/output a shell can never > get stuck in a state where it is unkillable by doing an I/O request. > But it is true that an I/O request blocks the shell for calls/messages > from *OTHER* processes than the I/O channel. The key point here is 'usually'. In practice, with the 'io' module, things are gonna be safe. I think most if not all functions of the 'file' module also make use of the io protocol to write to files through the 'io' module directly and are generally safe for that. However, I'm looking at it only from within the group.erl implementation and the documented protocol at http://erlang.org/doc/apps/stdlib/io_protocol.html (in Erlang, if it's not documented, it doesn't exist). If you're basing yourself only on the protocol, you can't assume the other side will monitor you, although it's probably what any reasonable Erlang programmer would do. I'm guessing that if the shell had documentation and a notice warning for this usage, there would be no argument that could be made against it. > > I can see that it might be wanted to get rid of the shell for sure. One > might imagine a case where the shell is trapping exits but "refuses to > die" in response to a trappable exit signal. But then, it is not clear > to me why the same measure (e.g. 
exit_shell(kill)) is not taken in the > case where the group.erl's server process is *NOT* executing an I/O > request right now and the shell might truly be blocked by activities > that prevent it from reacting to the exit signal. > It is indeed not very clear. My guess would be that you can make assumptions about your part of the communication and protocol, but not about the other side's. A simpler explanation is probably that some time back, there was a problem with either implementation and it was simpler to fix with a kill than by adding other handling code (say, before monitoring was added to the language, but while trap_exits were available). If this is the case, then there would be no reason to keep things the way they are right now IMO, and it would be possible to go with the other exit. > > But back to the original issue: there are several, distinct reasons why > we might need to forcibly terminate a shell session *AND* to do > appropriate logging IFF a user is currently logged on (for > security/auditing reasons), e.g.: > - the serial cable is being unplugged while a user is logged on > - someone tries to interfere with the system by sending huge amounts > of binary data over the serial port (possible denial-of-service) > - ... > > Our user_drv.erl replacement exits with an appropriate reason in those > cases and our shell implementation needs to know the exit reason to do > the right thing depending on the situation. This is currently impossible > and I was wondering whether anything could be done about it. That is definitely a nice use case and I would be personally more open to allowing that than leaving the 'kill' here. 
Ideally this would not need to be written, although it might still be needed if you deal with older implementations after the fix. > > Whether this works would certainly depend on the timing. The shell > process should be given enough time to have a chance to process the > first exit signal before being forcibly killed by the second one. Can > this be guaranteed? > The two-kill approach should work well in the event where the other process is not trapping exits. In that case, the order of signals should be guaranteed, and the first one will kill the process cleanly. If the process is trapping exits, though, then the first (non-kill) signal will be converted to a message and you're very unlikely to have time to process the first one before being killed by the second one. The cleanest solution is obviously to be able to just exit/2 with the right reason. I don't know if the OTP team has managed to transfer all the changelogs relating to the shells when they moved over to git, but I'd be interested to figure out if the exit(Pid,kill) in there is older than monitors -- if so, it would mean that it was probably a workaround for the io module which is no longer necessary today (because it can monitor without altering links or exits being trapped). Regards, Fred. From stefan.zegenhagen@REDACTED Mon May 6 15:54:35 2013 From: stefan.zegenhagen@REDACTED (Stefan Zegenhagen) Date: Mon, 06 May 2013 15:54:35 +0200 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <20130506133420.GA64025@ferdair.local> References: <1367570080.31752.25.camel@ax-sze> <20130503144159.GA57046@ferdair.local> <1367832317.31752.63.camel@ax-sze> <20130506133420.GA64025@ferdair.local> Message-ID: <1367848475.31752.86.camel@ax-sze> Hi, thanks again for the detailed answer. > > I can see that it might be wanted to get rid of the shell for sure. 
One > > might imagine a case where the shell is trapping exits but "refuses to > > die" in response to a trappable exit signal. But then, it is not clear > > to me, why the same measure (e.g. exit_shell(kill)) is not taken in the > > case where the group.erl's server process is *NOT* executing an I/O > > request right now and the shell might truely be blocked by activities > > that prevent it from reacting on the exit signal. > > > > It is indeed not very clear. My guess would be that you can make > assumptions about your part of the communication and protocol, but not > the others. > > A simpler explanation is probably that sometimes back, there was a > problem with either implementation and it was simpler to fix with a kill > than by adding other ways to handling code (say, before monitoring was > added to the language, but while trap_exits were available). > > If this is the case, then there would be no reason to keep things the > way they are right now IMO, and it would be possible to go with the > other exit. I guess we'll have to wait for the OTP team to have a look at this, then ;-) > > But back to the original issue: there are several, discinct reasons why > > we might need to forcedly terminate a shell session *AND* to do an > > appropriate logging IFF a user is currently logged on (for > > security/auditing reasons), e.g.: > > - the serial cable is being unplugged while a user is logged on > > - someone tries to interfere with the system by sending huge amounts > > of binary data over the serial port (possible denial-of-service) > > - ... > > > > Our user_drv.erl replacement exits with an appropriate reason in those > > cases and our shell implementation needs to know the exit reason to do > > the right thing depending on the situation. This is currently impossible > > and I was wondering whether anything could be done about it. > > That is definitely a nice use case and I would be personally more open > to allowing that than leaving the 'kill' here. 
I am however not in the > OTP team, and do not know everything that has to do with the shell, so > this is only my personal opinion. > > A possible workaround if things do not come to fruition would be to add > layers of indirection -- a process that monitors the shell and the > group.erl process and reports the most useful message. Ideally this > would not need to be written, although it might still be needed if you > deal with older implementations after the fix. I had thought of that as well, but tried to avoid that because personally, I do not feel comfortable with this (that our session monitor would need to know the PID of the shell's group leader). But that's merely a matter of taste and if there is need, it can overrule the headaches :-) > > > > Whether this works would certainly depend on the timing. The shell > > process should be given enough time to have a chance to process the > > first exit signal before being forcibly killed by the second one. Can > > this be guaranteed? > > > > The two-kill approach should work well in the event where the other > process is not trapping exits. In that case, the order of signals should > be guaranteed, and the first one will kill the process cleanly. > > If the process is trapping exits, though, then the first (non-kill) > signal will be converted to a message and you're very unlikely to > have time to process the first one before being killed by > the second one. Unfortunately, since we want to provide ^C interrupt possibilities, we need to trap exits. > The cleanest solution is obviously to be able to just exit/2 with the > right reason. 
> > I don't know if the OTP team has managed to transfer all the changelogs > relating to the shells when they moved over to git, but I'd be > interested to figure out if the exit(Pid,kill) in there is older than > monitors -- if so, it would mean that it was probably a workaround for > the io module which is no longer necessary today (because it can monitor > without altering links or exits being trapped). This would be interesting to know, indeed ;-) I'm just wondering if there's a better chance of getting the change if it is made configurable via "io:setopt([{safe_exit_code, true}])"? In any case I would not mind creating the patch. Kind regards, -- Dr. Stefan Zegenhagen arcutronix GmbH Garbsener Landstr. 10 30419 Hannover Germany Tel: +49 511 277-2734 Fax: +49 511 277-2709 Email: stefan.zegenhagen@REDACTED Web: www.arcutronix.com *Synchronize the Ethernet* General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht Hannover; Ust-Id: DE257551767. Please consider the environment before printing this message. From fredrik@REDACTED Tue May 7 13:38:03 2013 From: fredrik@REDACTED (Fredrik) Date: Tue, 7 May 2013 13:38:03 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> Message-ID: <5188E79B.7030501@erlang.org> On 05/03/2013 11:32 AM, Anthony Ramine wrote: > Hello Björn, > > Silly me, to validate my changes I ran erts/compiler/test/bs_bincomp_SUITE.erl instead of the one in lib/compiler/test/. I always forget these damn export attributes. > > Both issues really fixed for good now. > > Please refetch. > > Regards, > Hello Anthony, This patch is failing the small_SUITE:bin_compr test case in the dialyzer application. Could you have a look at it? 
Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From pan@REDACTED Tue May 7 14:23:22 2013 From: pan@REDACTED (Patrik Nyblom) Date: Tue, 7 May 2013 14:23:22 +0200 Subject: [erlang-bugs] Schedulers getting "stuck", part II In-Reply-To: <83533.1367359995@snookles.snookles.com> References: <83533.1367359995@snookles.snookles.com> Message-ID: <5188F23A.7050707@erlang.org> Hi Scott (and Joe)! Thank you for these tests! I would say Joe's comment at the end of the test10 gist says it all, and is spot on: "This isn't just a NIF problem. Any code that sits in C land and doesn't accurately contribute towards scheduler reductions can cause this. So, BIFs that don't estimate work and perform BIF_TRAPs are also bad. Turns out that the commonly used |term_to_binary| and |external_size| BIFs have this problem. " Joe points out a couple of misbehaving BIFs and NIFs which will cause this, breaking the scheduling algorithm. I bet there are more of them. I can see several problems that need to be fixed: 1) OTP should of course not have code (BIFs or NIFs or whatever) that does not even bump reductions or trap properly. 2) If writing NIFs, you should have a way to monitor the scheduler behavior to easily find long schedules. DTrace is nice, but not available everywhere... 3) If writing NIFs, you should have a simple way to put the execution of your code in a separate worker thread. The answer to (1) is that we continue (or intensify) our work when it comes to adding proper reductions and trapping to BIFs (and NIFs). A first step would be to just add proper reductions to all relevant BIFs, which is fairly easy to do. Whenever there's a BIF whose work depends on the size of the input, it should also at least add a proportional cost to the process. Some old BIFs do not even do that, which really needs to be fixed. Contributions are always welcome... 
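As a user-level illustration of that "proportional cost" idea (the real fix belongs inside the BIFs themselves), a wrapper can charge the calling process extra reductions with erlang:bump_reductions/1 before size-dependent work; the one-reduction-per-64-bytes granularity below is an assumed cost model, not a measured one:

```erlang
%% Sketch (invented module name): charge a reduction cost proportional
%% to the input size before doing size-dependent work, so the scheduler
%% sees the work even if the operation itself runs uninterrupted.
-module(fair_work).
-export([checksum/1]).

-spec checksum(binary()) -> non_neg_integer().
checksum(Bin) when is_binary(Bin) ->
    erlang:bump_reductions(max(1, byte_size(Bin) div 64)),
    erlang:crc32(Bin).   %% stands in for any costly size-dependent call
```

The same pattern applies around a long-running NIF call, although the proper fix for NIFs is to chunk the work or move it off the scheduler thread.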
term_to_binary and external_size are already being worked on, but there are most probably more problem BIFs out there... One step towards (2) is the ability to monitor long schedules in the system. I've extended erlang:system_monitor/2 to have an option to monitor all schedules and port operations that run for more than a specified amount of wall clock time. That should at least help in identifying such problems (the code is not in maint yet, but will be soon). More monitoring options to see the scheduler behavior may be needed, but this is at least a start. As an example, monitoring long schedules in test10 will inform you that the processes run uninterrupted for a whopping 1.5 *seconds*. Just adding reduction cost to the md5 calls will reduce this to a tenth of the scheduling time of course. The answer to (3) is "dirty schedulers", which is in the roadmap for R17. I think all three things need to be done for the scheduling to work properly, but not only for that. A schedule that takes too long also breaks real-time properties of the VM, so fixing this by poking the schedulers to wake up at certain intervals just handles one symptom, but does not remove the cause and does not cure the impact on real-time behavior... So - it's not the scheduling algorithms as such that result in this problem, it's still a problem with uninterrupted C code. These examples show that some (or many) of our BIFs need to be fixed, that we need to intensify the work on monitoring options and that we need dirty schedulers. At least that's how I see it. Cheers, Patrik On 05/01/2013 12:13 AM, Scott Lystig Fritchie wrote: > Patrik, there are a couple of synthetic load cases that have an end > result of what we occasionally see Riak and Riak CS doing in the wild. > Many, many thanks to Joseph Blomstedt for inventing these two modules. 
> > test10.erl: > https://gist.github.com/jtuple/0d9ca553b7e58adcb6f4 > test11.erl: > https://gist.github.com/jtuple/8f12ce9c21471f5d6f01 > > Both can be used by running the 'go/0' function. > > The test10:go() function creates an oscillation between a couple of > workloads: one that tends toward scheduler collapse, and one that tends > to wake them up again. > > The test11:go() function uses only a single load that tends toward > scheduler collapse. > > Both of them fail mostly regularly on my 8 core MBP using R15B01, > R15B03, and R16B. > > The io:format() messages are sent while load is not running, with very > generous pauses before starting the next phase of workload. If you call > io:format() during unfairly-scheduled workload (which these tests excel > at doing), the messages can be delayed by dozens of seconds. > > Note that these synthetic tests are using two different functions to > cause scheduler collapse: test10.erl with crypto:md5_update/2, a NIF, > and test11.erl with erlang:external_size/1, a BIF. It's quite likely > that erlang:term_to_binary/1 is similarly effective/buggy. > > Neither of them fails when using this patch on any of those three VM > versions: > > https://github.com/slfritchie/otp/compare/erlang:maint...disable-scheduler-sleeps > or > https://github.com/slfritchie/otp/tree/disable-scheduler-sleeps > > ... when also using "+scl false +zdnfgtse 500:500". > > -Scott -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From n.oxyde@REDACTED Tue May 7 14:37:23 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Tue, 7 May 2013 14:37:23 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: <5188E79B.7030501@erlang.org> References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> <5188E79B.7030501@erlang.org> Message-ID: <63BB7979-D344-4C71-99C1-2399EB604F58@gmail.com> Hello, I can't build the PLT anymore for stdlib because it complains that it (obviously) can't find the abstract code in the BEAM file. I have two solutions: * I can write a script that takes the .beam file compiled from the .S file and a .abstr file compiled from a kinda-equivalent .erl file and uses beam_lib to add an abstract code chunk. * I can make Dialyzer ignore BEAM files for which there is 'from_asm' in the compile options. Would you be against such a modification, Kostis? Regards, -- Anthony Ramine On 7 May 2013, at 13:38, Fredrik wrote: > On 05/03/2013 11:32 AM, Anthony Ramine wrote: >> Hello Björn, >> >> Silly me, to validate my changes I ran erts/compiler/test/bs_bincomp_SUITE.erl instead of the one in lib/compiler/test/. I always forget these damn export attributes. >> >> Both issues really fixed for good now. >> >> Please refetch. >> >> Regards, >> > Hello Anthony, > This patch is failing the small_SUITE:bin_compr test case in the dialyzer application. > Could you have a look at it? 
> Thanks, > > -- > > BR Fredrik Gustafsson > Erlang OTP Team > From n.oxyde@REDACTED Tue May 7 21:32:52 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Tue, 7 May 2013 21:32:52 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: <5188E79B.7030501@erlang.org> References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> <5188E79B.7030501@erlang.org> Message-ID: Hello Fredrik, I removed the Dialyzer patch as it tests a now-forbidden expression. Please refetch. Regards, -- Anthony Ramine On 7 May 2013, at 13:38, Fredrik wrote: > On 05/03/2013 11:32 AM, Anthony Ramine wrote: >> Hello Björn, >> >> Silly me, to validate my changes I ran erts/compiler/test/bs_bincomp_SUITE.erl instead of the one in lib/compiler/test/. I always forget these damn export attributes. >> >> Both issues really fixed for good now. >> >> Please refetch. >> >> Regards, >> > Hello Anthony, > This patch is failing the small_SUITE:bin_compr test case in the dialyzer application. > Could you have a look at it? > Thanks, > > -- > > BR Fredrik Gustafsson > Erlang OTP Team > From fredrik@REDACTED Wed May 8 10:18:45 2013 From: fredrik@REDACTED (Fredrik) Date: Wed, 8 May 2013 10:18:45 +0200 Subject: [erlang-bugs] [erlang-patches] Bit string generators, unsized binaries, modules and the REPL In-Reply-To: References: <649B6ECF-85AD-40BB-9CB1-C04DC348C499@gmail.com> <3BA7F90A-E656-4CF3-A27F-00D721B94BF7@gmail.com> <5188E79B.7030501@erlang.org> Message-ID: <518A0A65.7020206@erlang.org> On 05/07/2013 09:32 PM, Anthony Ramine wrote: > Hello Fredrik, > > I removed the Dialyzer patch as it tests a now-forbidden expression. Please refetch. > > Regards, > Hello Anthony, Re-fetched. Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From snar@REDACTED Wed May 8 12:32:13 2013 From: snar@REDACTED (Alexandre Snarskii) Date: Wed, 8 May 2013 14:32:13 +0400 Subject: [erlang-bugs] minor bug in erl_interface. Message-ID: <20130508103213.GB46550@snar.spb.ru> Hi! During valgrinding freeswitch compiled with mod_erlang_event valgrind output was flooded with messages like the following: ==96247== Warning: invalid file descriptor -2 in syscall close() ==96247== at 0x7CA9D3: __sys_close (in /usr/lib32/libc.so.7) ==96247== by 0xBC66F3: ei_accept_tmo (in /usr/local/lib/freeswitch/mod/mod_er lang_event.so) ==96247== by 0xBBF34F: mod_erlang_event_runtime (mod_erlang_event.c:1957) ==96247== by 0x145DF8: switch_loadable_module_exec (switch_loadable_module.c: 98) ==96247== by 0x1F17B3: dummy_worker (thread.c:138) ==96247== by 0x367F19: ??? (in /usr/lib32/libthr.so.3) According to sources (./lib/erl_interface/src/connect/ei_connect.c), -2 is a timeout indication from ei_accept_t: if ((fd = ei_accept_t(lfd, (struct sockaddr*) &cli_addr, &cli_addr_len, ms )) < 0) { EI_TRACE_ERR0("ei_accept","<- ACCEPT socket accept failed"); erl_errno = (fd == -2) ? ETIMEDOUT : EIO; goto error; } [....] 
error: EI_TRACE_ERR0("ei_accept","<- ACCEPT failed"); closesocket(fd); return ERL_ERROR; } /* ei_accept */ and closesocket on unix systems is defined as just close(2), so any timeout or error on accept causes closing invalid file descriptor. Patch is obvious: EI_TRACE_ERR0("ei_accept","<- ACCEPT failed"); - closesocket(fd); + if (fd>=0) + closesocket(fd); return ERL_ERROR; -- In theory, there is no difference between theory and practice. But, in practice, there is. From fredrik@REDACTED Wed May 8 15:44:24 2013 From: fredrik@REDACTED (Fredrik) Date: Wed, 8 May 2013 15:44:24 +0200 Subject: [erlang-bugs] minor bug in erl_interface. In-Reply-To: <20130508103213.GB46550@snar.spb.ru> References: <20130508103213.GB46550@snar.spb.ru> Message-ID: <518A56B8.9000907@erlang.org> On 05/08/2013 12:32 PM, Alexandre Snarskii wrote: > Hi! > > During valgrinding freeswitch compiled with mod_erlang_event > valgrind output was flooded with messages like the following: > > ==96247== Warning: invalid file descriptor -2 in syscall close() > ==96247== at 0x7CA9D3: __sys_close (in /usr/lib32/libc.so.7) > ==96247== by 0xBC66F3: ei_accept_tmo (in /usr/local/lib/freeswitch/mod/mod_er > lang_event.so) > ==96247== by 0xBBF34F: mod_erlang_event_runtime (mod_erlang_event.c:1957) > ==96247== by 0x145DF8: switch_loadable_module_exec (switch_loadable_module.c: > 98) > ==96247== by 0x1F17B3: dummy_worker (thread.c:138) > ==96247== by 0x367F19: ??? (in /usr/lib32/libthr.so.3) > > According to sources (./lib/erl_interface/src/connect/ei_connect.c), -2 is > a timeout indication from ei_accept_t: > > if ((fd = ei_accept_t(lfd, (struct sockaddr*)&cli_addr, > &cli_addr_len, ms ))< 0) { > EI_TRACE_ERR0("ei_accept","<- ACCEPT socket accept failed"); > erl_errno = (fd == -2) ? ETIMEDOUT : EIO; > goto error; > } > [....] 
> error: > EI_TRACE_ERR0("ei_accept","<- ACCEPT failed"); > closesocket(fd); > return ERL_ERROR; > } /* ei_accept */ > > and closesocket on unix systems is defined as just close(2), so any > timeout or error on accept causes closing invalid file descriptor. > > Patch is obvious: > > EI_TRACE_ERR0("ei_accept","<- ACCEPT failed"); > - closesocket(fd); > + if (fd>=0) > + closesocket(fd); > return ERL_ERROR; > Hello Alexandre, I am making a patch out of this and putting it into testing. Thanks for noticing and reporting :) -- BR Fredrik Gustafsson Erlang OTP Team From n.oxyde@REDACTED Thu May 9 15:03:08 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Thu, 9 May 2013 15:03:08 +0200 Subject: [erlang-bugs] Properly guard WIDE_TAG use with HAVE_WCWIDTH in ttsl_drv Message-ID: Hello, I forgot to guard two lines of code where WIDE_TAG is used, crashing the compile process if wcwidth() is unavailable. git fetch https://github.com/nox/otp.git fix-wcwidth https://github.com/nox/otp/compare/erlang:maint...fix-wcwidth https://github.com/nox/otp/compare/erlang:maint...fix-wcwidth.patch Regards, -- Anthony Ramine From robert.virding@REDACTED Thu May 9 22:13:33 2013 From: robert.virding@REDACTED (Robert Virding) Date: Thu, 9 May 2013 21:13:33 +0100 (BST) Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <1367848475.31752.86.camel@ax-sze> Message-ID: <420733710.105442810.1368130413119.JavaMail.root@erlang-solutions.com> The shell might be trapping exits and ignore exit messages or running code which does the same. The minimal case: 1> process_flag(trap_exit, true). false In which case the only guaranteed method to kill it is to send the kill signal. Unfortunately you cannot (must not) assume that code is well-behaved even if it has been written with the best intentions. Erlang's error handling mechanism is based on this assumption. 
Robert ----- Original Message ----- > From: "Stefan Zegenhagen" > To: "Fred Hebert" > Cc: erlang-bugs@REDACTED > Sent: Monday, 6 May, 2013 3:54:35 PM > Subject: Re: [erlang-bugs] Strange thing in lib/kernel/src/group.erl > > Hi, > > > thanks again for the detailed answer. > > > > > I can see that it might be wanted to get rid of the shell for > > > sure. One > > > might imagine a case where the shell is trapping exits but > > > "refuses to > > > die" in response to a trappable exit signal. But then, it is not > > > clear > > > to me, why the same measure (e.g. exit_shell(kill)) is not taken > > > in the > > > case where the group.erl's server process is *NOT* executing an > > > I/O > > > request right now and the shell might truely be blocked by > > > activities > > > that prevent it from reacting on the exit signal. > > > > > > > It is indeed not very clear. My guess would be that you can make > > assumptions about your part of the communication and protocol, but > > not > > the others. > > > > A simpler explanation is probably that sometimes back, there was a > > problem with either implementation and it was simpler to fix with a > > kill > > than by adding other ways to handling code (say, before monitoring > > was > > added to the language, but while trap_exits were available). > > > > If this is the case, then there would be no reason to keep things > > the > > way they are right now IMO, and it would be possible to go with the > > other exit. 
> > I guess we'll have to wait for the OTP team to have a look at this, > then ;-) > > > > > But back to the original issue: there are several, discinct > > > reasons why > > > we might need to forcedly terminate a shell session *AND* to do > > > an > > > appropriate logging IFF a user is currently logged on (for > > > security/auditing reasons), e.g.: > > > - the serial cable is being unplugged while a user is logged on > > > - someone tries to interfere with the system by sending huge > > > amounts > > > of binary data over the serial port (possible > > > denial-of-service) > > > - ... > > > > > > Our user_drv.erl replacement exits with an appropriate reason in > > > those > > > cases and our shell implementation needs to know the exit reason > > > to do > > > the right thing depending on the situation. This is currently > > > impossible > > > and I was wondering whether anything could be done about it. > > > > That is definitely a nice use case and I would be personally more > > open > > to allowing that than leaving the 'kill' here. I am however not in > > the > > OTP team, and do not know everything that has to do with the shell, > > so > > this is only my personal opinion. > > > > A possible workaround if things do not come to fruition would be to > > add > > layers of indirection -- a process that monitors the shell and the > > group.erl process and reports the most useful message. Ideally this > > would not need to be written, although it might still be needed if > > you > > deal with older implementations after the fix. > > I had thought of that as well, but tried to avoid that because > personally, I do not feel comfortable with this (that our session > monitor would need to know the PID of the shell's group leader). But > that's merely a matter of taste and if there is need, it can overrule > the headaches :-) > > > > > > > Whether this works would certainly depend on the timing. 
The > > > shell > > > process should be given enough time to have a chance to process > > > the > > > first exit signal before being forcedly killed by the second one. > > > Can > > > this be guaranteed? > > > > > > > The two-kill approach should work well in the event where the other > > process is not trapping exits. In that case, the order of signals > > should > > be guaranteed, and the first one will kill the process cleanly. > > > > If the process is trapping exits, though, then the first (non-kill) > > signal will be converted to a message and you're absolutely > > unlikely to > > be able to have the time to process the first one before being > > killed by > > the second one. > > Unfortunately, since we want to provide +C interrupt > possibilities, we need to trap exits. > > > > The cleanest solution is obviously to be able to just exit/2 with > > the > > right reason. > > > > I don't know if the OTP team has managed to transfer all the > > changelogs > > relating to the shells when they moved over to git, but I'd be > > interested to figure out if the exit(Pid,kill) in there is older > > than > > monitors -- if so, it would mean that it was probably a workaround > > for > > the io module which is no longer necessary today (because it can > > monitor > > without altering links or exits being trapped). > > > This would be interesting to know, indeed ;-) > > I'm just wondering if there's a better chance of getting the change > if > it is made configurable via "io:setopt([{safe_exit_code, true}])"? In > any case I would not mind to create the patch. > > > Kind regards, > -- > Dr. Stefan Zegenhagen > > arcutronix GmbH > Garbsener Landstr. 10 > 30419 Hannover > Germany > > Tel: +49 511 277-2734 > Fax: +49 511 277-2709 > Email: stefan.zegenhagen@REDACTED > Web: www.arcutronix.com > > *Synchronize the Ethernet* > > General Managers: Dipl. Ing. Juergen Schroeder, Dr. 
Josef Gfrerer - > Legal Form: GmbH, Registered office: Hannover, HRB 202442, > Amtsgericht > Hannover; Ust-Id: DE257551767. > > Please consider the environment before printing this message. > > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs > From stefan.zegenhagen@REDACTED Fri May 10 10:23:51 2013 From: stefan.zegenhagen@REDACTED (Stefan Zegenhagen) Date: Fri, 10 May 2013 10:23:51 +0200 Subject: [erlang-bugs] Strange thing in lib/kernel/src/group.erl In-Reply-To: <420733710.105442810.1368130413119.JavaMail.root@erlang-solutions.com> References: <420733710.105442810.1368130413119.JavaMail.root@erlang-solutions.com> Message-ID: <1368174231.31752.112.camel@ax-sze> Dear Robert, > The shell might be trapping exits and ignore exit messages or running code which does the same. The minimal case: > > 1> process_flag(trap_exit, true). > false > > In which case the only guaranteed method to kill it is to send the kill signal. Unfortunately you cannot (must not) assume that code is well-behaved even if it has been written with the best intentions. Erlang's error handling mechanism is based on this assumption. I fully agree with you. There's just two things that are difficult for me to understand and indicate that a change to the current behaviour might be necessary: 1) The error handling isn't consistently that rigorous. In my opinion, it's the less critical path that uses the definite exit path, whereas other exit paths simply assume that the client code is well-behaved. 2) Even for well-behaved code there is *ALWAYS* a penalty by not being able to reliably retrieve the exit reason. There may be solutions to the problem that satisfy all needs. Two spring into my mind almost immediately: a) Make the behaviour configurable by introducing an io:setopt() option, but let the default behaviour as-is. b) Assume that the code is well-behaved. 
Send a regular exit signal with the correct reason and check that the shell process really exits. If it does not exit within a certain amount of time, forcibly kill it. I would be willing to prepare a patch, but before doing so, I wanted to get an overview of the possible solutions and which of them might be acceptable to the Erlang community. Kind regards, -- Dr. Stefan Zegenhagen arcutronix GmbH Garbsener Landstr. 10 30419 Hannover Germany Tel: +49 511 277-2734 Fax: +49 511 277-2709 Email: stefan.zegenhagen@REDACTED Web: www.arcutronix.com *Synchronize the Ethernet* General Managers: Dipl. Ing. Juergen Schroeder, Dr. Josef Gfrerer - Legal Form: GmbH, Registered office: Hannover, HRB 202442, Amtsgericht Hannover; Ust-Id: DE257551767. Please consider the environment before printing this message. From n.oxyde@REDACTED Sun May 12 17:36:45 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Sun, 12 May 2013 17:36:45 +0200 Subject: [erlang-bugs] Lift limitation to FD_SETSIZE file descriptors on Mac OS X in erl_poll Message-ID: <5B66A627-9663-4745-8282-9A4BFAB23229@gmail.com> Hello, I've written a patch that makes erl_poll use _DARWIN_UNLIMITED_SELECT on Mac OS X. This constant makes select() work with more than FD_SETSIZE file descriptors, all that is needed is to manually manage the fd_set values. I've run the port_SUITE.iter_max_ports test case and got a maximum of 2422 ports instead of 502 before. Cc'ing Joel Reymont and Max Lapshin because I know they both encountered that problem. 
git fetch https://github.com/nox/otp.git darwin-unlimited-select https://github.com/nox/otp/compare/erlang:maint...darwin-unlimited-select https://github.com/nox/otp/compare/erlang:maint...darwin-unlimited-select.patch Regards, -- Anthony Ramine From fredrik@REDACTED Mon May 13 10:10:35 2013 From: fredrik@REDACTED (Fredrik) Date: Mon, 13 May 2013 10:10:35 +0200 Subject: [erlang-bugs] [erlang-patches] Properly guard WIDE_TAG use with HAVE_WCWIDTH in ttsl_drv In-Reply-To: References: Message-ID: <51909FFB.3050908@erlang.org> On 05/09/2013 03:03 PM, Anthony Ramine wrote: > Hello, > > I forgot to guard two lines of code where WIDE_TAG is used, crashing the compile process if wcwidth() is unavailable. > > git fetch https://github.com/nox/otp.git fix-wcwidth > > https://github.com/nox/otp/compare/erlang:maint...fix-wcwidth > https://github.com/nox/otp/compare/erlang:maint...fix-wcwidth.patch > > Regards, > Hello Anthony, I've fetched your branch, it should be visible in the 'pu' branch shortly. Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From fredrik@REDACTED Mon May 13 10:27:46 2013 From: fredrik@REDACTED (Fredrik) Date: Mon, 13 May 2013 10:27:46 +0200 Subject: [erlang-bugs] [erlang-patches] Lift limitation to FD_SETSIZE file descriptors on Mac OS X in erl_poll In-Reply-To: <5B66A627-9663-4745-8282-9A4BFAB23229@gmail.com> References: <5B66A627-9663-4745-8282-9A4BFAB23229@gmail.com> Message-ID: <5190A402.4090406@erlang.org> On 05/12/2013 05:36 PM, Anthony Ramine wrote: > Hello, > > I've written a patch that makes erl_poll uses _DARWIN_UNLIMITED_SELECT on Mac OS X. This constant makes select() work with more than FD_SETSIZE file descriptors, all that is needed is to manually manage the fd_set values. > > I've run port_SUITE.iter_max_ports test case and got a maximum of 2422 ports instead of 502 before. > > Cc'ing Joel Reymont and Max Lapshin because I know they both encountered that problem. 
> > git fetch https://github.com/nox/otp.git darwin-unlimited-select > > https://github.com/nox/otp/compare/erlang:maint...darwin-unlimited-select > https://github.com/nox/otp/compare/erlang:maint...darwin-unlimited-select.patch > > Regards, > Hello Anthony, I've fetched your branch and it is now located in the 'pu' branch. Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From erlangsiri@REDACTED Mon May 13 17:41:43 2013 From: erlangsiri@REDACTED (Siri Hansen) Date: Mon, 13 May 2013 17:41:43 +0200 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: <83357CE5-7BFB-4857-82ED-33AC842ACBD8@gmail.com> References: <3161FF70-B6D0-4565-8664-2FCB9F96E08D@gmail.com> <83357CE5-7BFB-4857-82ED-33AC842ACBD8@gmail.com> Message-ID: Bryan and Tim, your analysis is very good, and the problem is complicated. I don't see a "watertight" solution right now, and I cannot spend too much time pondering without having a real priority for this case. I have written a ticket for it, and it will be prioritized along with all other backlog items. Any further thoughts and contributions will be very much appreciated :) Thanks again /siri 2013/4/30 Tim Watson > Hi Bryan, > > On 30 Apr 2013, at 18:34, Bryan Fink wrote: > > > But twiddling the timing there is just as racy, as you've noticed, right? > > > Correct. The length of the timeout is irrelevant. The EXIT signal is > not guaranteed to arrive within any specific amount of time. > > > Indeed. Almost a halting problem this isn't it. :) > > > Isn't the point that the EXIT signal might /never/ come, if the child > un-links, or might come *after* the 'DOWN' if the race you've located > occurs? Surely you've got to be able to handle either case? > > > Yes, the point of the monitor is to handle the case where the EXIT > never comes (because the child unlinks). It is not the case, however, > that the EXIT always arrives after the DOWN in the race I'm seeing. > They might both be delayed. 
> > Waiting without a timeout for the 'DOWN' is acceptable, because you've got > a guarantee (via the runtime) that it *will* arrive, no matter what state > the target process was in when you created the monitor. Waiting some > arbitrary time for the 'EXIT' is a real problem though, because you could > wait forever. > > Handling either order is important, but the problem with this race is > that only the EXIT message contains the actual exit reason when this > happens. The 'noproc' in the DOWN is just saying that there was no > process to monitor. > > > Indeed. But it could equally be true that the 'EXIT' signal was never > dispatched, because the child process unlinked before it died; you can't > wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the > reason, so now you've got to choose how long to wait, but whatever timing > works for one particular case isn't going to solve the general problem. > > > We ran into something similar with our supervisor2 fork a while back, > whilst terminating (multiple) simple children: > http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is > somewhat different though, not only because it was terminating multiple > children (during shutdown) but also because it explicitly unlinks from the > child *after* creating the monitor, and /still/ allowed for an EXIT signal > to have made its way into the mailbox unexpectedly. 
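The monitor-then-unlink flow under discussion, and the point where the 'noproc' race loses the real exit reason, can be sketched roughly like this (a simplified model with invented names, not the actual supervisor code; it assumes the caller traps exits and is linked to Pid beforehand):

```erlang
%% Simplified model of monitor-then-unlink child shutdown.
-module(shutdown_sketch).
-export([await_exit/1]).

await_exit(Pid) ->
    Ref = erlang:monitor(process, Pid),
    unlink(Pid),
    receive
        {'DOWN', Ref, process, Pid, noproc} ->
            %% Child was already dead when the monitor was created.
            %% The 'EXIT' with the real reason may be queued, may still
            %% be in flight, or may never arrive (if the child had
            %% unlinked) -- this is the race described in the thread.
            receive
                {'EXIT', Pid, Reason} -> {ok, Reason}
            after 0 ->
                {ok, unknown}          %% real exit reason is lost
            end;
        {'DOWN', Ref, process, Pid, Reason} ->
            %% Flush any 'EXIT' already queued from the earlier link.
            receive {'EXIT', Pid, _} -> ok after 0 -> ok end,
            {ok, Reason}
    end.
```

Whatever timeout replaces the `after 0` only moves the race, which is the crux of the thread.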
> > Yes that's definitely true and we were aware of that problem, however > since we know we cannot wait for the 'EXIT' forever and whatever arbitrary > timeout we choose is just someone else's race condition, we decided that if > the EXIT signal wasn't delivered expediently to the process' mailbox, that > losing the real exit reason was something we could live with in the worst > case. > > Since we've started merging the R15/R16 changes in though, that code has > disappeared so we're in the same boat as you guys. :) > > Cheers, > Tim > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From erlangsiri@REDACTED Tue May 14 16:43:14 2013 From: erlangsiri@REDACTED (Siri Hansen) Date: Tue, 14 May 2013 16:43:14 +0200 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: References: <3161FF70-B6D0-4565-8664-2FCB9F96E08D@gmail.com> <83357CE5-7BFB-4857-82ED-33AC842ACBD8@gmail.com> Message-ID: Just a thought: would it be an option (and would it help) to monitor each child from birth? /siri 2013/5/13 Siri Hansen > Bryan and Tim, your analysis is very good, and the problem is complicated. > I don't see a "watertight" solution right now, and I cannot spend too > much time pondering without having a real priority for this case. I have > written a ticket for it, and it will be prioritized along with all other > backlog items. Any further thoughts and contributions will be very much > appreciated :) > Thanks again > /siri > > > 2013/4/30 Tim Watson > >> Hi Bryan, >> >> On 30 Apr 2013, at 18:34, Bryan Fink wrote: >> >> >> But twiddling the timing there is just as racy, as you've noticed, right? >> >> >> Correct. The length of the timeout is irrelevant. The EXIT signal is >> not guaranteed to arrive within any specific amount of time. >> >> >> Indeed. Almost a halting problem this isn't it. 
:) >> >> >> Isn't the point that the EXIT signal might /never/ come, if the child >> un-links, or might come *after* the 'DOWN' if the race you've located >> occurs? Surely you've got to be able to handle either case? >> >> >> Yes, the point of the monitor is to handle the case where the EXIT >> never comes (because the child unlinks). It is not the case, however, >> that the EXIT always arrives after the DOWN in the race I'm seeing. >> They might both be delayed. >> >> >> Waiting without a timeout for the 'DOWN' is acceptable, because you've >> got a guarantee (via the runtime) the it *will* arrive, no matter what >> state the target process was in when you created the monitor. Waiting some >> arbitrary time for the 'EXIT' is a real problem though, because you could >> wait forever. >> >> Handling either order is important, but the problem with this race is >> that only the EXIT message contains the actual exit reason when this >> happens. The 'noproc' in the DOWN is just saying that there was no >> process to monitor. >> >> >> Indeed. But it could equally be true that the 'EXIT' signal was never >> dispatched, because the child process unlinked before it died; You can't >> wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the >> reason, so now you've got to choose how long to wait, but whatever timing >> works for one particular case isn't going to solve the general problem. >> >> >> We ran into something similar with our supervisor2 fork a while back, >> whilst terminating (multiple) simple children: >> http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is >> somewhat different though, not only because it was terminating multiple >> children (during shutdown) but also because it explicitly unlinks from the >> child *after* creating the monitor, and /still/ allowed for an EXIT signal >> to have made its way into the mailbox unexpectedly. 
>> >> >> The monitor_child/1 function also unlinks from the child after >> creating the monitor. That patch looks a little bit like the fixes I >> was trying. Basically it's checking for an EXIT message after >> receiving the DOWN, just in case one is in the mailbox, yes? >> >> >> That's correct. >> >> The problem is that it might still miss an EXIT, because it might still >> not have arrived yet, even though it will later. >> >> >> Yes that's definitely true and we were aware of that problem, however >> since we know we cannot wait for the 'EXIT' forever and whatever arbitrary >> timeout we choose is just someone else's race condition, we decided that if >> the EXIT signal wasn't delivered expediently to the process' mailbox, that >> loosing the real exit reason was something we could live with in the worst >> case. >> >> Since we've started merging the R15/R16 changes in though, that code has >> disappeared so we're in the same boat as you guys. :) >> >> Cheers, >> Tim >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From watson.timothy@REDACTED Wed May 15 10:03:31 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Wed, 15 May 2013 09:03:31 +0100 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: References: <3161FF70-B6D0-4565-8664-2FCB9F96E08D@gmail.com> <83357CE5-7BFB-4857-82ED-33AC842ACBD8@gmail.com> Message-ID: <05D1CF78-A894-4C1C-8848-4F98F80707EF@gmail.com> Switching to monitors is, IMHO a better approach, since using both is prone to races and links are open to be interfered with. Are there any disadvantages I've not thought of though? Or are you suggesting to do both from birth? On 14 May 2013, at 15:43, Siri Hansen wrote: > Just a thought: would it be an option (and would it help) to monitor each child from birth? > /siri > > > 2013/5/13 Siri Hansen > Bryan and Tim, your analysis is very good, and the problem is complicated. 
I don't see a "water tight" solution right now, and I can not spend too much time pondering without having a real priority for this case. I have written a ticket for it, and it will be prioritized along with all other backlog items. Any further thoughts and contributions will be very much appreciated :) > Thanks again > /siri > > > 2013/4/30 Tim Watson > Hi Bryan, > > On 30 Apr 2013, at 18:34, Bryan Fink wrote: >>> >>> But twiddling the timing there is just as racy, as you've noticed, right? >> >> Correct. The length of the timeout is irrelevant. The EXIT signal is >> not guaranteed to arrive within any specific amount of time. >> > > Indeed. Almost a halting problem this isn't it. :) > >>> >>> Isn't the point that the EXIT signal might /never/ come, if the child un-links, or might come *after* the 'DOWN' if the race you've located occurs? Surely you've got to be able to handle either case? >> >> Yes, the point of the monitor is to handle the case where the EXIT >> never comes (because the child unlinks). It is not the case, however, >> that the EXIT always arrives after the DOWN in the race I'm seeing. >> They might both be delayed. >> > > Waiting without a timeout for the 'DOWN' is acceptable, because you've got a guarantee (via the runtime) the it *will* arrive, no matter what state the target process was in when you created the monitor. Waiting some arbitrary time for the 'EXIT' is a real problem though, because you could wait forever. > >> Handling either order is important, but the problem with this race is >> that only the EXIT message contains the actual exit reason when this >> happens. The 'noproc' in the DOWN is just saying that there was no >> process to monitor. > > Indeed. 
But it could equally be true that the 'EXIT' signal was never dispatched, because the child process unlinked before it died; You can't wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the reason, so now you've got to choose how long to wait, but whatever timing works for one particular case isn't going to solve the general problem. > >> >>> We ran into something similar with our supervisor2 fork a while back, whilst terminating (multiple) simple children: http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is somewhat different though, not only because it was terminating multiple children (during shutdown) but also because it explicitly unlinks from the child *after* creating the monitor, and /still/ allowed for an EXIT signal to have made its way into the mailbox unexpectedly. >> >> The monitor_child/1 function also unlinks from the child after >> creating the monitor. That patch looks a little bit like the fixes I >> was trying. Basically it's checking for an EXIT message after >> receiving the DOWN, just in case one is in the mailbox, yes? > > That's correct. > >> The problem is that it might still miss an EXIT, because it might still >> not have arrived yet, even though it will later. >> > > Yes that's definitely true and we were aware of that problem, however since we know we cannot wait for the 'EXIT' forever and whatever arbitrary timeout we choose is just someone else's race condition, we decided that if the EXIT signal wasn't delivered expediently to the process' mailbox, that loosing the real exit reason was something we could live with in the worst case. > > Since we've started merging the R15/R16 changes in though, that code has disappeared so we're in the same boat as you guys. :) > > Cheers, > Tim > > > -------------- next part -------------- An HTML attachment was scrubbed... 
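[Editor's sketch, not part of the original thread: the shutdown pattern being discussed above, with module and function names that are mine rather than OTP's. After a 'DOWN' with reason 'noproc', the mailbox is polled for a racing 'EXIT' to recover the real exit reason; the `after 0` clause is exactly the window through which the reported race can still slip, because the 'EXIT' may arrive later or never.]

```erlang
-module(race_sketch).
-export([shutdown/2]).

%% Terminate Pid and try to report its true exit reason.
shutdown(Pid, Timeout) ->
    Ref = erlang:monitor(process, Pid),
    unlink(Pid),
    exit(Pid, shutdown),
    receive
        {'DOWN', Ref, process, Pid, noproc} ->
            %% Monitor found no process: the child died (or unlinked
            %% and died) before the monitor was created.  The real
            %% reason, if any, is in a racing 'EXIT' message -- which
            %% may not have arrived yet, and may never arrive at all.
            receive
                {'EXIT', Pid, Reason} -> {error, Reason}
            after 0 ->
                {error, noproc}
            end;
        {'DOWN', Ref, process, Pid, shutdown} ->
            ok;
        {'DOWN', Ref, process, Pid, Reason} ->
            {error, Reason}
    after Timeout ->
        exit(Pid, kill),
        receive {'DOWN', Ref, process, Pid, Killed} -> {error, Killed} end
    end.
```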
URL:

From robert.virding@REDACTED Wed May 15 12:17:28 2013
From: robert.virding@REDACTED (Robert Virding)
Date: Wed, 15 May 2013 11:17:28 +0100 (BST)
Subject: [erlang-bugs] Supervisor terminate_child race
In-Reply-To: <05D1CF78-A894-4C1C-8848-4F98F80707EF@gmail.com>
Message-ID: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com>

Do you mean only using monitors in the supervisor, and no links? If so, that would not work, as you would then not get an exit signal automatically sent to the child when the supervisor dies. Which you do want. Or have I misunderstood you?

Robert

----- Original Message -----
> From: "Tim Watson"
> To: "Siri Hansen"
> Cc: erlang-bugs@REDACTED
> Sent: Wednesday, 15 May, 2013 10:03:31 AM
> Subject: Re: [erlang-bugs] Supervisor terminate_child race

> Switching to monitors is, IMHO, a better approach, since using both is prone to races and links are open to be interfered with.
> Are there any disadvantages I've not thought of though? Or are you suggesting to do both from birth?
> On 14 May 2013, at 15:43, Siri Hansen < erlangsiri@REDACTED > wrote:
> > Just a thought: would it be an option (and would it help) to monitor each child from birth?
> > /siri
> > 2013/5/13 Siri Hansen < erlangsiri@REDACTED >
> > > Bryan and Tim, your analysis is very good, and the problem is complicated. I don't see a "watertight" solution right now, and I cannot spend too much time pondering without having a real priority for this case. I have written a ticket for it, and it will be prioritized along with all other backlog items. Any further thoughts and contributions will be very much appreciated :)
> > > Thanks again
> > > /siri
> > > 2013/4/30 Tim Watson < watson.timothy@REDACTED >
> > > > Hi Bryan,
> > > > On 30 Apr 2013, at 18:34, Bryan Fink wrote:
> > > > > > But twiddling the timing there is just as racy, as you've noticed, right?
> > > > > Correct. The length of the timeout is irrelevant. The EXIT signal is not guaranteed to arrive within any specific amount of time.
> > > > Indeed. Almost a halting problem, this, isn't it. :)
> > > > > > Isn't the point that the EXIT signal might /never/ come, if the child un-links, or might come *after* the 'DOWN' if the race you've located occurs? Surely you've got to be able to handle either case?
> > > > > Yes, the point of the monitor is to handle the case where the EXIT never comes (because the child unlinks). It is not the case, however, that the EXIT always arrives after the DOWN in the race I'm seeing. They might both be delayed.
> > > > Waiting without a timeout for the 'DOWN' is acceptable, because you've got a guarantee (via the runtime) that it *will* arrive, no matter what state the target process was in when you created the monitor. Waiting some arbitrary time for the 'EXIT' is a real problem though, because you could wait forever.
> > > > > Handling either order is important, but the problem with this race is that only the EXIT message contains the actual exit reason when this happens. The 'noproc' in the DOWN is just saying that there was no process to monitor.
> > > > Indeed. But it could equally be true that the 'EXIT' signal was never dispatched, because the child process unlinked before it died; you can't wait forever for the 'EXIT' after you've seen a 'DOWN' with 'noproc' as the reason, so now you've got to choose how long to wait, but whatever timing works for one particular case isn't going to solve the general problem.
> > > > > > We ran into something similar with our supervisor2 fork a while back, whilst terminating (multiple) simple children: http://hg.rabbitmq.com/rabbitmq-server/rev/812d71d0716c . That code is somewhat different though, not only because it was terminating multiple children (during shutdown) but also because it explicitly unlinks from the child *after* creating the monitor, and /still/ allowed for an EXIT signal to have made its way into the mailbox unexpectedly.
> > > > > The monitor_child/1 function also unlinks from the child after creating the monitor. That patch looks a little bit like the fixes I was trying. Basically it's checking for an EXIT message after receiving the DOWN, just in case one is in the mailbox, yes?
> > > > That's correct.
> > > > > The problem is that it might still miss an EXIT, because it might still not have arrived yet, even though it will later.
> > > > Yes, that's definitely true and we were aware of that problem; however, since we know we cannot wait for the 'EXIT' forever, and whatever arbitrary timeout we choose is just someone else's race condition, we decided that if the EXIT signal wasn't delivered expediently to the process's mailbox, losing the real exit reason was something we could live with in the worst case.
> > > > Since we've started merging the R15/R16 changes in though, that code has disappeared, so we're in the same boat as you guys. :)
> > > > Cheers,
> > > > Tim

_______________________________________________
erlang-bugs mailing list
erlang-bugs@REDACTED
http://erlang.org/mailman/listinfo/erlang-bugs
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From watson.timothy@REDACTED Wed May 15 12:53:04 2013
From: watson.timothy@REDACTED (Tim Watson)
Date: Wed, 15 May 2013 11:53:04 +0100
Subject: [erlang-bugs] Supervisor terminate_child race
In-Reply-To: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com>
References: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com>
Message-ID:

On 15 May 2013, at 11:17, Robert Virding wrote:
> Do you mean only using monitors in the supervisor, and no links? If so that would not work as you would then not get an exit signal automatically sent to the child when the supervisor dies.
> Which you do want. Or have I misunderstood you?

Oh gosh, how embarrassing.
I was thinking in terms of Uni-directional Links (viz A Unified Semantics for Future Erlang, Svensson et al), and linking child to parent (so as to propagate supervisor exits) but not the other way around. Of course we can't do that - just ignore this suggestion. [note: I've been implementing the supervisor API for cloud haskell in my spare time and got confused between those semantics (viz http://haskell-distributed.github.io/static/semantics.pdf) and what I do for a day job in the *real world*]. But switching all the supervisor's signal handling to rely on monitor notifications rather than trapped exits (which might be ignored) sounds good to me. The use of linking would be there to guarantee supervisor death is propagated correctly, but we could switch away from handling child 'EXIT' signals to handling 'DOWN' notifications instead. This would IMO be a bit cleaner. Cheers, Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: From erlangsiri@REDACTED Wed May 15 15:54:54 2013 From: erlangsiri@REDACTED (Siri Hansen) Date: Wed, 15 May 2013 15:54:54 +0200 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: References: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com> Message-ID: Then again... it is up to the child's start function to create the link, and from the supervisor's point of view, the only place to add the monitor would be when the start function returns - which would be just another place to get a race :( 2013/5/15 Tim Watson > On 15 May 2013, at 11:17, Robert Virding wrote: > > Do you mean only using monitors in the supervisor, and no links? If so > that would not work as you would then not get an exit signal automatically > sent to the child when the supervisor dies. > > Which you do want. Or have I misunderstood you? > > > Oh gosh, how embarrasing. 
I was thinking in terms of Uni-directional > Links (viz A Unified Semantics for Future Erlang, Svensson et al), and > linking child to parent (so as to propagate supervisor exits) but not the > other way around. Of course we can't do that - just ignore this suggestion. > [note: I've been implementing the supervisor API for cloud haskell in my > spare time and got confused between those semantics (viz > http://haskell-distributed.github.io/static/semantics.pdf) and what I do > for a day job in the *real world*]. > > But switching all the supervisor's signal handling to rely on monitor > notifications rather than trapped exits (which might be ignored) sounds > good to me. The use of linking would be there to guarantee supervisor death > is propagated correctly, but we could switch away from handling child > 'EXIT' signals to handling 'DOWN' notifications instead. This would IMO be > a bit cleaner. > > Cheers, > Tim > -------------- next part -------------- An HTML attachment was scrubbed... URL: From watson.timothy@REDACTED Wed May 15 17:11:17 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Wed, 15 May 2013 16:11:17 +0100 Subject: [erlang-bugs] Supervisor terminate_child race In-Reply-To: References: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com> Message-ID: <2302FD2F-B3F4-4514-88B0-17082D781D1A@gmail.com> On 15 May 2013, at 14:54, Siri Hansen wrote: > Then again... it is up to the child's start function to create the link, and from the supervisor's point of view, the only place to add the monitor would be when the start function returns - which would be just another place to get a race :( > Well quite. *sigh* Perhaps what we do in cloud haskell might be instructive after all, though the approach runs counter to the APIs which its OTP forebears use. 
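[Editor's sketch, for illustration only; the helper name is hypothetical. Siri's objection above is that the supervisor only meets the child once the start function returns. If the supervisor performed the spawn itself — which OTP's supervisor deliberately does not, since the child's own start function (e.g. gen_server:start_link/3) owns the spawn — the link and monitor could be established atomically with process creation via spawn_opt/4, leaving no window at all:]

```erlang
%% Hypothetical supervisor-side helper: with the `monitor' option,
%% spawn_opt/4 returns {Pid, Ref}, and both the link and the monitor
%% exist from the instant the process is created -- the child cannot
%% die in between.
start_child(M, F, A) ->
    {Pid, Ref} = erlang:spawn_opt(M, F, A, [link, monitor]),
    {ok, Pid, Ref}.
```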
Our supervisor performs the actual `spawn' itself, so the child spec provides, as its startup term (which is roughly equivalent to an MFArgs tuple), not a function which spawns the new process but rather the code for its process's main loop. The disadvantage here is that the API is more constrained (in terms of what the main loop looks like), but the advantage is that the supervisor can insert arbitrary code into the start phase of the child's server loop. Thus our supervisor performs the link (from child to parent) and forces the child process to wait until the monitor is set up correctly, before actually entering its loop. This ensures we don't end up with a race with regard to startup, linking and monitor establishment. The relevant bit of the code looks roughly like this:

wrapClosure proc spec' =
  let chId = childKey spec' in do
    supervisor <- getSelfPid
    pid <- spawnLocal $ do
      link supervisor              -- die if our parent dies
      () <- expect                 -- wait for a start signal
      proc >>= checkExitType chId  -- evaluate the child's loop
    void $ monitor pid             -- synchronous call to establish a monitor
    send pid ()                    -- tell the child to go into its main loop
    return $ Right $ ChildRunning pid

Of course, because of this design, our gen_server API looks completely different! The start function, for example, doesn't spawn a process, but rather evaluates the `init' callback and enters the gen server's main loop (or crashes) immediately with the return value, leaving the `spawn' part to its clients. The supervisor is, of course, one of these clients. In fact our supervisor, like its OTP inspiration, is itself a gen_server (we call them managed processes), and thus its start function never returns either:

-- | Starts a supervisor. ...
start :: RestartStrategy -> [ChildSpec] -> ManagedProcessLoop SupervisorState
start strategy' specs' =
  ManagedProcess.start (strategy', specs') supInit serverDefinition

Now obviously, given that Erlang has been used in the real world for decades, we can't go changing gen_server's start_link or the supervisor child spec APIs. But is there a way to achieve something similar without carving things up too much? I'm struggling to think of one, but it would be good if we could avoid the race altogether.

Cheers,
Tim

> 2013/5/15 Tim Watson
> On 15 May 2013, at 11:17, Robert Virding wrote:
>> Do you mean only using monitors in the supervisor, and no links? If so that would not work as you would then not get an exit signal automatically sent to the child when the supervisor dies.
>> Which you do want. Or have I misunderstood you?
>
> Oh gosh, how embarrassing. I was thinking in terms of Uni-directional Links (viz A Unified Semantics for Future Erlang, Svensson et al), and linking child to parent (so as to propagate supervisor exits) but not the other way around. Of course we can't do that - just ignore this suggestion. [note: I've been implementing the supervisor API for cloud haskell in my spare time and got confused between those semantics (viz http://haskell-distributed.github.io/static/semantics.pdf) and what I do for a day job in the *real world*].
>
> But switching all the supervisor's signal handling to rely on monitor notifications rather than trapped exits (which might be ignored) sounds good to me. The use of linking would be there to guarantee supervisor death is propagated correctly, but we could switch away from handling child 'EXIT' signals to handling 'DOWN' notifications instead. This would IMO be a bit cleaner.
>
> Cheers,
> Tim
-------------- next part -------------- An HTML attachment was scrubbed...
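[Editor's note: OTP does already have a synchronous-start handshake that resembles the Cloud Haskell design sketched above — proc_lib:start_link/3 blocks the caller until the child calls proc_lib:init_ack/2. A minimal sketch follows (the module name is hypothetical); note that the link is still created on the child side, so this handshake alone does not remove the DOWN/EXIT race discussed in this thread:]

```erlang
-module(sync_start_sketch).
-export([start_link/0, init/1]).

%% The caller (e.g. a supervisor) blocks here until init_ack/2 is
%% called, so the child is fully initialized before the caller resumes.
start_link() ->
    proc_lib:start_link(?MODULE, init, [self()]).

init(Parent) ->
    %% ... set up state here, while the caller is still waiting ...
    proc_lib:init_ack(Parent, {ok, self()}),  %% release the caller
    loop().

loop() ->
    receive
        stop -> ok;
        _    -> loop()
    end.
```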
URL: From essen@REDACTED Thu May 16 19:02:57 2013 From: essen@REDACTED (=?ISO-8859-1?Q?Lo=EFc_Hoguin?=) Date: Thu, 16 May 2013 19:02:57 +0200 Subject: [erlang-bugs] Wrong type for ssl key option Message-ID: <51951141.9000302@ninenines.eu> Type ssl_option() says: {key, Der::binary()} Documentation says: {key, {'RSAPrivateKey'| 'DSAPrivateKey' | 'PrivateKeyInfo', der_encoded()}} I believe the documentation is correct and the code wrong. Please confirm. -- Lo?c Hoguin Erlang Cowboy Nine Nines http://ninenines.eu From daimon@REDACTED Fri May 17 07:15:14 2013 From: daimon@REDACTED (Masatake Daimon) Date: Fri, 17 May 2013 14:15:14 +0900 Subject: [erlang-bugs] Fix {stream, {self, once}} in httpc Message-ID: <5195BCE2.9000704@ymir.co.jp> Hello, Previously the only difference between {stream, self} and {stream, {self, once}} was an extra Pid in the stream_start message due to a bug in httpc_handler. It was immediately sending a bunch of messages till the end instead of waiting for httpc:stream_next/1 being called. Before applying this patch: https://gist.github.com/phonohawk/5589337#file-erl-before-log After: https://gist.github.com/phonohawk/5589337#file-erl-after-log git fetch git://github.com/phonohawk/otp.git httpc-stream-once-fix https://github.com/phonohawk/otp/compare/erlang:maint...httpc-stream-once-fix https://github.com/phonohawk/otp/compare/erlang:maint...httpc-stream-once-fix.patch Regards, -- ?? ?? From daimon@REDACTED Fri May 17 07:30:22 2013 From: daimon@REDACTED (Masatake Daimon) Date: Fri, 17 May 2013 14:30:22 +0900 Subject: [erlang-bugs] Fix {stream, {self, once}} in httpc In-Reply-To: <5195BCE2.9000704@ymir.co.jp> References: <5195BCE2.9000704@ymir.co.jp> Message-ID: <5195C06E.5080502@ymir.co.jp> Oops. I meant to send this to erlang-patches. Sorry for the noise. 
On 05/17/13 14:15, Masatake Daimon wrote: > Hello, > > Previously the only difference between {stream, self} and {stream, > {self, once}} was an extra Pid in the stream_start message due to a > bug in httpc_handler. It was immediately sending a bunch of messages > till the end instead of waiting for httpc:stream_next/1 being called. > > Before applying this patch: > https://gist.github.com/phonohawk/5589337#file-erl-before-log > > After: > https://gist.github.com/phonohawk/5589337#file-erl-after-log > > git fetch git://github.com/phonohawk/otp.git httpc-stream-once-fix > > > https://github.com/phonohawk/otp/compare/erlang:maint...httpc-stream-once-fix > > > https://github.com/phonohawk/otp/compare/erlang:maint...httpc-stream-once-fix.patch > > > Regards, > -- ?? ?? From mjtruog@REDACTED Fri May 17 08:03:24 2013 From: mjtruog@REDACTED (Michael Truog) Date: Thu, 16 May 2013 23:03:24 -0700 Subject: [erlang-bugs] syntax_tools anonymous function error Message-ID: <5195C82C.104@gmail.com> Hi, I had syntax_tools break on this code "fun M:F/2" with this stack trace: in function erl_syntax_lib:analyze_function_name/1 (erl_syntax_lib.erl, line 1500) in call from igor:transform_implicit_fun/3 (igor.erl, line 1807) in call from igor:transform_list/3 (igor.erl, line 1748) in call from igor:transform_1/3 (igor.erl, line 1741) in call from igor:default_transform/3 (igor.erl, line 1733) in call from igor:transform_list/3 (igor.erl, line 1748) in call from igor:transform_1/3 (igor.erl, line 1741) in call from igor:transform_1/3 (igor.erl, line 1742) Thanks, Michael From daimon@REDACTED Fri May 17 10:55:48 2013 From: daimon@REDACTED (Masatake Daimon) Date: Fri, 17 May 2013 17:55:48 +0900 Subject: [erlang-bugs] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation Message-ID: <5195F094.2010003@ymir.co.jp> Hello, Compiling the following module makes the compiler crash. I'm using R16B. ===== test.erl ===== -module(test). -compile(inline). -compile(inline_list_funcs). 
-export([foo/0]). foo() -> lists:map(fun bar/1, [1]). bar(X) -> X. ===== the crash ==== % erlc test.erl test: function '-foo/0-lists^map/1-0-'/1+15: Internal consistency check failed - please report this bug. Instruction: {move,{x,0},{yy,0}} Error: {invalid_store,{yy,0},term}: Note that the problem disappears with any of these changes: * Commenting out "-compile(inline)." * Commenting out "-compile(inline_list_funcs)." * Changing the definition of foo/0 to: foo() -> lists:map(fun bar/1, []). % [] instead of [1] * Changing the definition of foo/0 to: foo() -> lists:map(fun (A) -> bar(A) end, [1]). Regards, -- ?? ?? From ingela.anderton.andin@REDACTED Fri May 17 15:57:09 2013 From: ingela.anderton.andin@REDACTED (Ingela Anderton Andin) Date: Fri, 17 May 2013 15:57:09 +0200 Subject: [erlang-bugs] Wrong type for ssl key option In-Reply-To: <51951141.9000302@ninenines.eu> References: <51951141.9000302@ninenines.eu> Message-ID: <51963735.9000108@erix.ericsson.se> Hi! Lo?c Hoguin wrote: > Type ssl_option() says: {key, Der::binary()} > > Documentation says: {key, {'RSAPrivateKey'| 'DSAPrivateKey' | > 'PrivateKeyInfo', der_encoded()}} > > I believe the documentation is correct and the code wrong. > > Please confirm. > You are correct the dialyzer spec is incorrect! Regards Ingela Erlang/OTP team - Ericsson AB From mjtruog@REDACTED Fri May 17 18:26:36 2013 From: mjtruog@REDACTED (Michael Truog) Date: Fri, 17 May 2013 09:26:36 -0700 Subject: [erlang-bugs] crash without crash dump Message-ID: <51965A3C.3070205@gmail.com> Hi, I am not sure about the impact of this problem, however, it may have a larger impact. When killing the application_controller process with the -heart option being used, no crash dump is produced: $ erl Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) 1> exit(whereis(application_controller), kill). *** ERROR: Shell process terminated! 
*** {"Kernel pid terminated",application_controller,killed} Crash dump was written to: erl_crash.dump Kernel pid terminated (application_controller) (killed) (erl_crash.dump file exists) $ erl -heart heart_beat_kill_pid = 24300 Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) 1> exit(whereis(application_controller), kill). *** ERROR: Shell process terminated! *** {"Kernel pid terminated",application_controller,killed} heart: Fri May 17 09:20:10 2013: Erlang is crashing .. (waiting for crash dump file) heart: Fri May 17 09:20:10 2013: Would reboot. Terminating. Kernel pid terminated (application_controller) (killed) (erl_crash.dump file does not exist!) Thanks, Michael From n.oxyde@REDACTED Fri May 17 20:06:35 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Fri, 17 May 2013 20:06:35 +0200 Subject: [erlang-bugs] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation In-Reply-To: <5195F094.2010003@ymir.co.jp> References: <5195F094.2010003@ymir.co.jp> Message-ID: Hello, Shorter test case, showing the problem comes from the inline itself and not inline_list_funcs: -module(test). -compile(inline). -export([foo/0]). foo() -> F = fun bar/1, fun (X) when X =:= F -> X end. bar(X) -> X. If you run the core_lint pass, you can see where the problem comes from: $ erlc +clint test.erl test: illegal guard expression in foo/0 The inliner inlines `when 'erlang':'=:='(X, F)` to `'erlang':'=:='(X, 'bar'/1)` but local fun references can't appear in guards. I'll try to make a patch. Regards, -- Anthony Ramine Le 17 mai 2013 ? 10:55, Masatake Daimon a ?crit : > Hello, > > Compiling the following module makes the compiler crash. I'm using > R16B. > > ===== test.erl ===== > -module(test). > -compile(inline). > -compile(inline_list_funcs). > -export([foo/0]). > > foo() -> > lists:map(fun bar/1, [1]). > > bar(X) -> X. 
> > ===== the crash ==== > % erlc test.erl > test: function '-foo/0-lists^map/1-0-'/1+15: > Internal consistency check failed - please report this bug. > Instruction: {move,{x,0},{yy,0}} > Error: {invalid_store,{yy,0},term}: > > > Note that the problem disappears with any of these changes: > > * Commenting out "-compile(inline)." > * Commenting out "-compile(inline_list_funcs)." > * Changing the definition of foo/0 to: > foo() -> > lists:map(fun bar/1, []). % [] instead of [1] > * Changing the definition of foo/0 to: > foo() -> > lists:map(fun (A) -> bar(A) end, [1]). > > Regards, > -- > ?? ?? > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs From mjtruog@REDACTED Sat May 18 04:42:01 2013 From: mjtruog@REDACTED (Michael Truog) Date: Fri, 17 May 2013 19:42:01 -0700 Subject: [erlang-bugs] igor -callback external type bug Message-ID: <5196EA79.1000407@gmail.com> Hi, When using igor (to rename modules) it generates invalid syntax when it finds -callback() types which have been exported from external modules. igor may just not be changing the module names to make the types valid, but somehow no igor error occurs and you will only see the error when attempting to compile the module. Thanks, Michael From mjtruog@REDACTED Sat May 18 05:50:07 2013 From: mjtruog@REDACTED (Michael Truog) Date: Fri, 17 May 2013 20:50:07 -0700 Subject: [erlang-bugs] igor reorders types to create errors Message-ID: <5196FA6F.8070107@gmail.com> Hi, If a type is declared in the same file as a record and the type depends on a record being defined the resulting file will fail to compile due to the record not being defined, simply because the type is automatically put at the top of the file (by igor), above the record definition. This problem may relate to the preprocessor, since I am surprised the order is significant. 
I understand the various igor bugs might be an annoyance, since the module name may simply indicate that the module itself is only meant to annoy and that it may never actually do anything properly for the mad scientist. However, I am still hopeful that it (or something like it) might provide error-less module transformations, despite module names within the Erlang code (like child specs). So, it would be nice if it wasn't simply discarded due to its problems. Thanks, Michael From n.oxyde@REDACTED Sat May 18 18:22:03 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Sat, 18 May 2013 18:22:03 +0200 Subject: [erlang-bugs] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation In-Reply-To: References: <5195F094.2010003@ymir.co.jp> Message-ID: <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> Hello, This patch fixes the bug by forbidding inlining of variables which values are local fun references outside of application contexts. git fetch https://github.com/nox/otp.git fix-fname-inlining https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining.patch Regards, -- Anthony Ramine Le 17 mai 2013 ? 20:06, Anthony Ramine a ?crit : > Hello, > > Shorter test case, showing the problem comes from the inline itself and not inline_list_funcs: > > -module(test). > -compile(inline). > -export([foo/0]). > > foo() -> > F = fun bar/1, > fun (X) when X =:= F -> X end. > > bar(X) -> X. > > If you run the core_lint pass, you can see where the problem comes from: > > $ erlc +clint test.erl > test: illegal guard expression in foo/0 > > The inliner inlines `when 'erlang':'=:='(X, F)` to `'erlang':'=:='(X, 'bar'/1)` but local fun references can't appear in guards. > > I'll try to make a patch. > > Regards, > > -- > Anthony Ramine > > Le 17 mai 2013 ? 10:55, Masatake Daimon a ?crit : > >> Hello, >> >> Compiling the following module makes the compiler crash. I'm using >> R16B. 
>> >> ===== test.erl ===== >> -module(test). >> -compile(inline). >> -compile(inline_list_funcs). >> -export([foo/0]). >> >> foo() -> >> lists:map(fun bar/1, [1]). >> >> bar(X) -> X. >> >> ===== the crash ==== >> % erlc test.erl >> test: function '-foo/0-lists^map/1-0-'/1+15: >> Internal consistency check failed - please report this bug. >> Instruction: {move,{x,0},{yy,0}} >> Error: {invalid_store,{yy,0},term}: >> >> >> Note that the problem disappears with any of these changes: >> >> * Commenting out "-compile(inline)." >> * Commenting out "-compile(inline_list_funcs)." >> * Changing the definition of foo/0 to: >> foo() -> >> lists:map(fun bar/1, []). % [] instead of [1] >> * Changing the definition of foo/0 to: >> foo() -> >> lists:map(fun (A) -> bar(A) end, [1]). >> >> Regards, >> -- >> ?? ?? >> _______________________________________________ >> erlang-bugs mailing list >> erlang-bugs@REDACTED >> http://erlang.org/mailman/listinfo/erlang-bugs > From carlsson.richard@REDACTED Sat May 18 22:14:46 2013 From: carlsson.richard@REDACTED (Richard Carlsson) Date: Sat, 18 May 2013 22:14:46 +0200 Subject: [erlang-bugs] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation In-Reply-To: <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> References: <5195F094.2010003@ymir.co.jp> <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> Message-ID: <5197E136.208@gmail.com> On 2013-05-18 18:22 , Anthony Ramine wrote: > Hello, > > This patch fixes the bug by forbidding inlining of variables which values are local fun references outside of application contexts. > > git fetch https://github.com/nox/otp.git fix-fname-inlining > > https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining > https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining.patch > > Regards, > Looks reasonable to me, but it's ages since I worked on that code. 
/Richard From mjtruog@REDACTED Sun May 19 01:25:06 2013 From: mjtruog@REDACTED (Michael Truog) Date: Sat, 18 May 2013 16:25:06 -0700 Subject: [erlang-bugs] escript file operations fail on halt Message-ID: <51980DD2.7060501@gmail.com> Hi, There is an odd type of failure when: 1) async threads are enabled by default for the Erlang VM 2) an escript is used to spawn the Erlang VM 3) erlang:halt/1 is used to terminate the escript with a known error code The erlang:halt/1 and erlang:halt/2 code here: https://github.com/erlang/otp/blob/maint/erts/emulator/beam/bif.c#L3937 makes the default flush parameter false! The default flush parameter is currently undocumented. So, when an escript performs a file operation that depends on the async thread pool (based on the internal Erlang code and configuration) and then attempts to do erlang:halt(integer()), the file operations may not complete, or may only partially complete. In my particular use case, I can observe a file rename operation getting stuck before the actual completion of the rename (and I am not using anything but a normal/default Linux filesystem, not NFS). It seems important to change the default erlang:halt/1 behaviour for escript usage so that flush is true (I understand fail-fast probably means normal Erlang VM usage shouldn't have flush default to true). An alternative is a new escript function that sets the flush option for the user, which is probably an easier solution to agree on (e.g., escript:exit/1). Thanks, Michael From n.oxyde@REDACTED Sun May 19 12:33:12 2013 From: n.oxyde@REDACTED (Anthony Ramine) Date: Sun, 19 May 2013 12:33:12 +0200 Subject: [erlang-bugs] syntax_tools anonymous function error In-Reply-To: <5195C82C.104@gmail.com> References: <5195C82C.104@gmail.com> Message-ID: Hello Michael, This patch fixes support of implicit funs with variables in igor.
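For context, here is a minimal sketch (hypothetical module) of the two implicit-fun forms involved; the variable form, legal since R15B, is the shape that previously sent igor into erl_syntax_lib:analyze_function_name/1:

```erlang
%% implicit_funs.erl -- hypothetical example of both implicit-fun forms.
-module(implicit_funs).
-export([static_fun/0, dynamic_fun/2]).

%% Fully literal form: module, function and arity are all known at
%% compile time, so syntax tools can analyze the name statically.
static_fun() ->
    fun lists:map/2.

%% Variable form (allowed since R15B): M and F are only known at run
%% time. This is the form that igor previously failed to transform.
dynamic_fun(M, F) ->
    fun M:F/2.
```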
git fetch https://github.com/nox/otp.git igor-funs https://github.com/nox/otp/compare/erlang:maint...igor-funs https://github.com/nox/otp/compare/erlang:maint...igor-funs.patch Regards, -- Anthony Ramine On 17 May 2013, at 08:03, Michael Truog wrote: > Hi, > > I had syntax_tools break on this code "fun M:F/2" with this stack trace: > in function erl_syntax_lib:analyze_function_name/1 (erl_syntax_lib.erl, line 1500) > in call from igor:transform_implicit_fun/3 (igor.erl, line 1807) > in call from igor:transform_list/3 (igor.erl, line 1748) > in call from igor:transform_1/3 (igor.erl, line 1741) > in call from igor:default_transform/3 (igor.erl, line 1733) > in call from igor:transform_list/3 (igor.erl, line 1748) > in call from igor:transform_1/3 (igor.erl, line 1741) > in call from igor:transform_1/3 (igor.erl, line 1742) > > Thanks, > Michael From fredrik@REDACTED Mon May 20 09:55:43 2013 From: fredrik@REDACTED (Fredrik) Date: Mon, 20 May 2013 09:55:43 +0200 Subject: [erlang-bugs] [erlang-patches] Compiler crash with 'inline_list_funcs' and "fun Fun/Arity" notation In-Reply-To: <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> References: <5195F094.2010003@ymir.co.jp> <21AE91E1-9EF1-41B8-913E-AC0C959AC3F7@gmail.com> Message-ID: <5199D6FF.1010302@erlang.org> On 05/18/2013 06:22 PM, Anthony Ramine wrote: > Hello, > > This patch fixes the bug by forbidding inlining of variables whose values are local fun references outside of application contexts. > > git fetch https://github.com/nox/otp.git fix-fname-inlining > > https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining > https://github.com/nox/otp/compare/erlang:maint...fix-fname-inlining.patch > > Regards, > Hello Anthony, I've fetched your branch and it should be visible in the 'pu' branch shortly. I also assigned it to the responsible team for review.
Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From lukas@REDACTED Mon May 20 10:11:52 2013 From: lukas@REDACTED (Lukas Larsson) Date: Mon, 20 May 2013 10:11:52 +0200 Subject: [erlang-bugs] crash without crash dump In-Reply-To: <51965A3C.3070205@gmail.com> References: <51965A3C.3070205@gmail.com> Message-ID: <5199DAC8.8090403@erlang.org> Hello Michael, Have you set ERL_CRASH_DUMP_SECONDS[1] to an appropriate value? Lukas [1]: http://www.erlang.org/doc/man/heart.html On 17/05/13 18:26, Michael Truog wrote: > Hi, > > I am not sure about the full impact of this problem; it may be larger than this example suggests. When killing the application_controller process with the -heart option being used, no crash dump is produced: > > $ erl > Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false] > > Eshell V5.10.1 (abort with ^G) > 1> exit(whereis(application_controller), kill). > *** ERROR: Shell process terminated! *** > {"Kernel pid terminated",application_controller,killed} > > Crash dump was written to: erl_crash.dump > Kernel pid terminated (application_controller) (killed) > > (erl_crash.dump file exists) > > $ erl -heart > heart_beat_kill_pid = 24300 > Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false] > > Eshell V5.10.1 (abort with ^G) > 1> exit(whereis(application_controller), kill). > *** ERROR: Shell process terminated! *** > {"Kernel pid terminated",application_controller,killed} > heart: Fri May 17 09:20:10 2013: Erlang is crashing .. (waiting for crash dump file) > heart: Fri May 17 09:20:10 2013: Would reboot. Terminating. > Kernel pid terminated (application_controller) (killed) > > (erl_crash.dump file does not exist!)
> > Thanks, > Michael From fredrik@REDACTED Mon May 20 10:12:31 2013 From: fredrik@REDACTED (Fredrik) Date: Mon, 20 May 2013 10:12:31 +0200 Subject: [erlang-bugs] [erlang-patches] syntax_tools anonymous function error In-Reply-To: References: <5195C82C.104@gmail.com> Message-ID: <5199DAEF.2080603@erlang.org> On 05/19/2013 12:33 PM, Anthony Ramine wrote: > Hello Michael, > > This patch fixes support of implicit funs with variables in igor. > > git fetch https://github.com/nox/otp.git igor-funs > > https://github.com/nox/otp/compare/erlang:maint...igor-funs > https://github.com/nox/otp/compare/erlang:maint...igor-funs.patch > > Regards, > Hello Anthony, I've fetched your patch and it should be visible in the 'pu' branch shortly. Thanks, -- BR Fredrik Gustafsson Erlang OTP Team From Aleksander.Nycz@REDACTED Mon May 20 10:41:46 2013 From: Aleksander.Nycz@REDACTED (Aleksander Nycz) Date: Mon, 20 May 2013 10:41:46 +0200 Subject: [erlang-bugs] Problem with tw timer support in diameter app (otp_R16B) Message-ID: <5199E1CA.6020209@comarch.pl> Hello, I changed the default value of the *restrict_connections* parameter from 'nodes' to 'false'. After that I ran a very simple test using the Seagull simulator. The test scenario was the following: 1. seagull: send CER 2. seagull: recv CEA 3. seagull: send CCR (init) 4. seagull: recv CCA (init) 5. seagull: send CCR (update) 6. seagull: recv CCA (update) 7. seagull: send CCR (terminate) 8. seagull: recv CCA (terminate) After step 8, Seagull doesn't send DPR, but just closes the transport connection (TCP). On the server side everything looks good, but 30 sec.
after CCR (terminate) when tw elapsed, following error message appears in log: 13:40:58.187129: <0.5046.0>: error: error_logger: --:--/--: ** Generic server <0.5046.0> terminating ** Last message in was {timeout,#Ref<0.0.0.14845>,tw} ** When Server state == {watchdog,down,false,30000,0,<0.1009.0>,undefined, #Ref<0.0.0.14845>,diameter_gen_base_rfc3588, {recvdata,4259932,diameterNode, [{diameter_app,diameterNode,dictionaryDCCA, [dccaCallback], diameterNode,4,false, [{answer_errors,report}, {request_errors,answer_3xxx}]}], {0,32}}, {0,32}, {false,false}, false} ** Reason for termination == ** {function_clause, [{diameter_watchdog,set_watchdog, [stop], [{file,"base/diameter_watchdog.erl"},{line,451}]}, {diameter_watchdog,handle_info,2, [{file,"base/diameter_watchdog.erl"},{line,211}]}, {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]} 13:40:58.187500: <0.5046.0>: error: error_logger: --:--/--: [crash_report][[[{initial_call,{diameter_watchdog,init,['Argument__1']}}, {pid,<0.5046.0>}, {registered_name,[]}, {error_info,{exit,{function_clause,[{diameter_watchdog,set_watchdog,[stop],[{file,"base/diameter_watchdog.erl"},{line,451}]}, {diameter_watchdog,handle_info,2,[{file,"base/diameter_watchdog.erl"},{line,211}]}, {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}, [{gen_server,terminate,6,[{file,"gen_server.erl"},{line,737}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}, {ancestors,[diameter_watchdog_sup,diameter_sup,<0.946.0>]}, {messages,[]}, {links,[<0.954.0>]}, {dictionary,[{random_seed,{15047,18051,14647}}, {{diameter_watchdog,restart}, {{accept,#Ref<0.0.0.1696>}, [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true},{ip,{0,0,0,0}},{port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], {diameter_service,<0.1009.0>, 
{diameter_caps,"zyndram.krakow.comarch","krakow.comarch",[],25429,"Comarch DIAMETER Server",[], [12645,10415,8164], [4], [],[],[],[],[]}, [{diameter_app,diameterNode,dictionaryDCCA, [dccaCallback], diameterNode,4,false, [{answer_errors,report},{request_errors,answer_3xxx}]}]}}}, {{diameter_watchdog,dwr}, ['DWR',{'Origin-Host',"zyndram.krakow.comarch"},{'Origin-Realm',"krakow.comarch"},{'Origin-State-Id',[]}]}]}, {trap_exit,false}, {status,running}, {heap_size,75025}, {stack_size,24}, {reductions,294}], []]] 13:40:58.189060: <0.954.0>: error: error_logger: --:--/--: [supervisor_report][[{supervisor,{local,diameter_watchdog_sup}}, {errorContext,child_terminated}, {reason,{function_clause,[{diameter_watchdog,set_watchdog,[stop],[{file,"base/diameter_watchdog.erl"},{line,451}]}, {diameter_watchdog,handle_info,2,[{file,"base/diameter_watchdog.erl"},{line,211}]}, {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}, {offender,[{pid,<0.5046.0>}, {name,diameter_watchdog}, {mfargs,{diameter_watchdog,start_link,undefined}}, {restart_type,temporary}, {shutdown,1000}, {child_type,worker}]}]] You can check, that function set_watchdog should be called with param #watchdog{}, but 'stop' param is used instead. As a result function_clause exception is thrown. I suggest following change in code to correct this problem (file diameter_watchdog.erl): $ diff diameter_watchdog.erl_org diameter_watchdog.erl 385a386,393 > transition({timeout, TRef, tw}, #watchdog{tref = TRef, status = T} = S) > when T == initial; > T == down -> > case restart(S) of > stop -> stop; > #watchdog{} = NewS -> set_watchdog(NewS) > end; > You can find this solution in attachement. Best regards Aleksander Nycz -- Aleksander Nycz Senior Software Engineer Telco_021 BSS R&D Comarch SA Phone: +48 12 646 1216 Mobile: +48 691 464 275 website: www.comarch.pl -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- %% %% %CopyrightBegin% %% %% Copyright Ericsson AB 2010-2013. All Rights Reserved. %% %% The contents of this file are subject to the Erlang Public License, %% Version 1.1, (the "License"); you may not use this file except in %% compliance with the License. You should have received a copy of the %% Erlang Public License along with this software. If not, it can be %% retrieved online at http://www.erlang.org/. %% %% Software distributed under the License is distributed on an "AS IS" %% basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See %% the License for the specific language governing rights and limitations %% under the License. %% %% %CopyrightEnd% %% %% %% This module implements (as a process) the state machine documented %% in Appendix A of RFC 3539. %% -module(diameter_watchdog). -behaviour(gen_server). %% towards diameter_service -export([start/2]). %% gen_server callbacks -export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]). %% diameter_watchdog_sup callback -export([start_link/1]). -include_lib("diameter/include/diameter.hrl"). -include("diameter_internal.hrl"). -define(DEFAULT_TW_INIT, 30000). %% RFC 3539 ch 3.4.1 -define(NOMASK, {0,32}). %% default sequence mask -define(BASE, ?DIAMETER_DICT_COMMON). 
-record(watchdog, {%% PCB - Peer Control Block; see RFC 3539, Appendix A status = initial :: initial | okay | suspect | down | reopen, pending = false :: boolean(), %% DWA tw :: 6000..16#FFFFFFFF | {module(), atom(), list()}, %% {M,F,A} -> integer() >= 0 num_dwa = 0 :: -1 | non_neg_integer(), %% number of DWAs received during reopen %% end PCB parent = self() :: pid(), %% service process transport :: pid() | undefined, %% peer_fsm process tref :: reference(), %% reference for current watchdog timer dictionary :: module(), %% common dictionary receive_data :: term(), %% term passed into diameter_service with incoming message sequence :: diameter:sequence(), %% mask restrict :: {diameter:restriction(), boolean()}, shutdown = false :: boolean()}). %% --------------------------------------------------------------------------- %% start/2 %% %% Start a monitor before the watchdog is allowed to proceed to ensure %% that a failed capabilities exchange produces the desired exit %% reason. %% --------------------------------------------------------------------------- -spec start(Type, {RecvData, [Opt], SvcOpts, #diameter_service{}}) -> {reference(), pid()} when Type :: {connect|accept, diameter:transport_ref()}, RecvData :: term(), Opt :: diameter:transport_opt(), SvcOpts :: [diameter:service_opt()]. start({_,_} = Type, T) -> Ack = make_ref(), {ok, Pid} = diameter_watchdog_sup:start_child({Ack, Type, self(), T}), try {erlang:monitor(process, Pid), Pid} after send(Pid, Ack) end. start_link(T) -> {ok, _} = proc_lib:start_link(?MODULE, init, [T], infinity, diameter_lib:spawn_opts(server, [])). %% =========================================================================== %% =========================================================================== %% init/1 init(T) -> proc_lib:init_ack({ok, self()}), gen_server:enter_loop(?MODULE, [], i(T)). 
i({Ack, T, Pid, {RecvData, Opts, SvcOpts, #diameter_service{applications = Apps, capabilities = Caps} = Svc}}) -> erlang:monitor(process, Pid), wait(Ack, Pid), random:seed(now()), putr(restart, {T, Opts, Svc}), %% save seeing it in trace putr(dwr, dwr(Caps)), %% {_,_} = Mask = proplists:get_value(sequence, SvcOpts), Restrict = proplists:get_value(restrict_connections, SvcOpts), Nodes = restrict_nodes(Restrict), Dict0 = common_dictionary(Apps), #watchdog{parent = Pid, transport = start(T, Opts, Mask, Nodes, Dict0, Svc), tw = proplists:get_value(watchdog_timer, Opts, ?DEFAULT_TW_INIT), receive_data = RecvData, dictionary = Dict0, sequence = Mask, restrict = {Restrict, lists:member(node(), Nodes)}}. wait(Ref, Pid) -> receive Ref -> ok; {'DOWN', _, process, Pid, _} = D -> exit({shutdown, D}) end. %% start/5 start(T, Opts, Mask, Nodes, Dict0, Svc) -> {_MRef, Pid} = diameter_peer_fsm:start(T, Opts, {Mask, Nodes, Dict0, Svc}), Pid. %% common_dictionary/1 %% %% Determine the dictionary of the Diameter common application with %% Application Id 0. Fail on config errors. common_dictionary(Apps) -> case orddict:fold(fun dict0/3, false, lists:foldl(fun(#diameter_app{dictionary = M}, D) -> orddict:append(M:id(), M, D) end, orddict:new(), Apps)) of {value, Mod} -> Mod; false -> %% A transport should configure a common dictionary but we %% don't require it. Not configuring a common dictionary %% means a user won't be able to either send or receive %% messages in the common dictionary: incoming requests %% will be answered with 3007 and outgoing requests cannot %% be sent. The dictionary returned here is only used for %% messages diameter sends and receives: CER/CEA, DPR/DPA %% and DWR/DWA. ?BASE end. %% Each application should be represented by a single dictionary. dict0(Id, [_,_|_] = Ms, _) -> config_error({multiple_dictionaries, Ms, {application_id, Id}}); %% An explicit common dictionary.
dict0(?APP_ID_COMMON, [Mod], _) -> {value, Mod}; %% A pure relay, in which case the common application is implicit. %% This uses the fact that the common application will already have %% been folded. dict0(?APP_ID_RELAY, _, false) -> {value, ?BASE}; dict0(_, _, Acc) -> Acc. config_error(T) -> ?ERROR({configuration_error, T}). %% handle_call/3 handle_call(_, _, State) -> {reply, nok, State}. %% handle_cast/2 handle_cast(_, State) -> {noreply, State}. %% handle_info/2 handle_info(T, #watchdog{} = State) -> case transition(T, State) of ok -> {noreply, State}; #watchdog{} = S -> close(T, State), %% service expects 'close' message event(T, State, S), %% before 'watchdog' {noreply, S}; stop -> ?LOG(stop, T), event(T, State, State#watchdog{status = down}), {stop, {shutdown, T}, State} end. close({'DOWN', _, process, TPid, {shutdown, Reason}}, #watchdog{transport = TPid, parent = Pid}) -> send(Pid, {close, self(), Reason}); close(_, _) -> ok. event(_, #watchdog{status = T}, #watchdog{status = T}) -> ok; event(_, #watchdog{transport = undefined}, #watchdog{transport = undefined}) -> ok; event(Msg, #watchdog{status = From, transport = F, parent = Pid}, #watchdog{status = To, transport = T}) -> TPid = tpid(F,T), E = {[TPid | data(Msg, TPid, From, To)], From, To}, send(Pid, {watchdog, self(), E}), ?LOG(transition, {self(), E}). data(Msg, TPid, reopen, okay) -> {recv, TPid, 'DWA', _Pkt} = Msg, %% assert {TPid, T} = eraser(open), [T]; data({open, TPid, _Hosts, T}, TPid, _From, To) when To == okay; To == reopen -> [T]; data(_, _, _, _) -> []. tpid(_, Pid) when is_pid(Pid) -> Pid; tpid(Pid, _) -> Pid. send(Pid, T) -> Pid ! T. %% terminate/2 terminate(_, _) -> ok. %% code_change/3 code_change(_, State, _) -> {ok, State}. 
%% =========================================================================== %% =========================================================================== %% transition/2 %% %% The state transitions documented here are extracted from RFC 3539, %% the commentary is ours. %% Service or watchdog is telling the watchdog of an accepting %% transport to die after reconnect_timer expiry or reestablished %% connection (in another transport process) respectively. transition(close, #watchdog{status = down}) -> {{accept, _}, _, _} = getr(restart), %% assert stop; transition(close, #watchdog{}) -> ok; %% Service is asking for the peer to be taken down gracefully. transition({shutdown, Pid, _}, #watchdog{parent = Pid, transport = undefined}) -> stop; transition({shutdown = T, Pid, Reason}, #watchdog{parent = Pid, transport = TPid} = S) -> send(TPid, {T, self(), Reason}), S#watchdog{shutdown = true}; %% Parent process has died, transition({'DOWN', _, process, Pid, _Reason}, #watchdog{parent = Pid}) -> stop; %% Transport has accepted a connection. transition({accepted = T, TPid}, #watchdog{transport = TPid, parent = Pid}) -> send(Pid, {T, self(), TPid}), ok; %% STATE Event Actions New State %% ===== ------ ------- ---------- %% INITIAL Connection up SetWatchdog() OKAY %% By construction, the watchdog timer isn't set until we move into %% state okay as the result of the Peer State Machine reaching the %% Open state. %% %% If we're accepting then we may be resuming a connection that went %% down in another watchdog process, in which case this is the %% transition below, from down to reopen. That is, it's not until we %% know the identity of the peer (ie. now) that we know that we're in %% state down rather than initial. 
transition({open, TPid, Hosts, _} = Open, #watchdog{transport = TPid, status = initial, restrict = {_, R}} = S) -> case okay(getr(restart), Hosts, R) of okay -> set_watchdog(S#watchdog{status = okay}); reopen -> transition(Open, S#watchdog{status = down}) end; %% DOWN Connection up NumDWA = 0 %% SendWatchdog() %% SetWatchdog() %% Pending = TRUE REOPEN transition({open = Key, TPid, _Hosts, T}, #watchdog{transport = TPid, status = down} = S) -> %% Store the info we need to notify the parent to reopen the %% connection after the requisite DWA's are received, at which %% time we eraser(open). The reopen message is a later addition, %% to communicate the new capabilities as soon as they're known. putr(Key, {TPid, T}), set_watchdog(send_watchdog(S#watchdog{status = reopen, num_dwa = 0})); %% OKAY Connection down CloseConnection() %% Failover() %% SetWatchdog() DOWN %% SUSPECT Connection down CloseConnection() %% SetWatchdog() DOWN %% REOPEN Connection down CloseConnection() %% SetWatchdog() DOWN transition({'DOWN', _, process, TPid, _Reason}, #watchdog{transport = TPid, shutdown = true}) -> stop; transition({'DOWN', _, process, TPid, _Reason}, #watchdog{transport = TPid, status = T} = S) -> set_watchdog(S#watchdog{status = case T of initial -> T; _ -> down end, pending = false, transport = undefined}); %% Incoming message. transition({recv, TPid, Name, Pkt}, #watchdog{transport = TPid} = S) -> recv(Name, Pkt, S); %% Current watchdog has timed out. transition({timeout, TRef, tw}, #watchdog{tref = TRef, status = T} = S) when T == initial; T == down -> case restart(S) of stop -> stop; #watchdog{} = NewS -> set_watchdog(NewS) end; transition({timeout, TRef, tw}, #watchdog{tref = TRef} = S) -> set_watchdog(timeout(S)); %% Timer was canceled after message was already sent. transition({timeout, _, tw}, #watchdog{}) -> ok; %% State query. transition({state, Pid}, #watchdog{status = S}) -> send(Pid, {self(), S}), ok. 
%% =========================================================================== putr(Key, Val) -> put({?MODULE, Key}, Val). getr(Key) -> get({?MODULE, Key}). eraser(Key) -> erase({?MODULE, Key}). %% encode/3 encode(Msg, Mask, Dict) -> Seq = diameter_session:sequence(Mask), Hdr = #diameter_header{version = ?DIAMETER_VERSION, end_to_end_id = Seq, hop_by_hop_id = Seq}, Pkt = #diameter_packet{header = Hdr, msg = Msg}, #diameter_packet{bin = Bin} = diameter_codec:encode(Dict, Pkt), Bin. %% okay/3 okay({{accept, Ref}, _, _}, Hosts, Restrict) -> T = {?MODULE, connection, Ref, Hosts}, diameter_reg:add(T), if Restrict -> okay(diameter_reg:match(T)); true -> okay end; %% Register before matching so that at least one of two registering %% processes will match the other. okay({{connect, _}, _, _}, _, _) -> okay. %% okay/2 %% The peer hasn't been connected recently ... okay([{_,P}]) -> P = self(), %% assert okay; %% ... or it has. okay(C) -> [_|_] = [send(P, close) || {_,P} <- C, self() /= P], reopen. %% set_watchdog/1 set_watchdog(#watchdog{tw = TwInit, tref = TRef} = S) -> cancel(TRef), S#watchdog{tref = erlang:start_timer(tw(TwInit), self(), tw)}. cancel(undefined) -> ok; cancel(TRef) -> erlang:cancel_timer(TRef). tw(T) when is_integer(T), T >= 6000 -> T - 2000 + (random:uniform(4001) - 1); %% RFC3539 jitter of +/- 2 sec. tw({M,F,A}) -> apply(M,F,A). %% send_watchdog/1 send_watchdog(#watchdog{pending = false, transport = TPid, dictionary = Dict0, sequence = Mask} = S) -> send(TPid, {send, encode(getr(dwr), Mask, Dict0)}), ?LOG(send, 'DWR'), S#watchdog{pending = true}. %% recv/3 recv(Name, Pkt, S) -> try rcv(Name, S) of #watchdog{} = NS -> rcv(Name, Pkt, S), NS catch {?MODULE, throwaway, #watchdog{} = NS} -> NS end. %% rcv/3 rcv(N, _, _) when N == 'CER'; N == 'CEA'; N == 'DWR'; N == 'DWA'; N == 'DPR'; N == 'DPA' -> false; rcv(_, Pkt, #watchdog{transport = TPid, dictionary = Dict0, receive_data = T}) -> diameter_traffic:receive_message(TPid, Pkt, Dict0, T). 
throwaway(S) -> throw({?MODULE, throwaway, S}). %% rcv/2 %% %% The lack of Hop-by-Hop and End-to-End Identifiers checks in a %% received DWA is intentional. The purpose of the message is to %% demonstrate life but a peer that consistently bungles it by sending %% the wrong identifiers causes the connection to toggle between OPEN %% and SUSPECT, with failover and failback as result, despite there %% being no real problem with connectivity. Thus, relax and accept any %% incoming DWA as being in response to an outgoing DWR. %% INITIAL Receive DWA Pending = FALSE %% Throwaway() INITIAL %% INITIAL Receive non-DWA Throwaway() INITIAL rcv('DWA', #watchdog{status = initial} = S) -> throwaway(S#watchdog{pending = false}); rcv(_, #watchdog{status = initial} = S) -> throwaway(S); %% DOWN Receive DWA Pending = FALSE %% Throwaway() DOWN %% DOWN Receive non-DWA Throwaway() DOWN rcv('DWA', #watchdog{status = down} = S) -> throwaway(S#watchdog{pending = false}); rcv(_, #watchdog{status = down} = S) -> throwaway(S); %% OKAY Receive DWA Pending = FALSE %% SetWatchdog() OKAY %% OKAY Receive non-DWA SetWatchdog() OKAY rcv('DWA', #watchdog{status = okay} = S) -> set_watchdog(S#watchdog{pending = false}); rcv(_, #watchdog{status = okay} = S) -> set_watchdog(S); %% SUSPECT Receive DWA Pending = FALSE %% Failback() %% SetWatchdog() OKAY %% SUSPECT Receive non-DWA Failback() %% SetWatchdog() OKAY rcv('DWA', #watchdog{status = suspect} = S) -> set_watchdog(S#watchdog{status = okay, pending = false}); rcv(_, #watchdog{status = suspect} = S) -> set_watchdog(S#watchdog{status = okay}); %% REOPEN Receive DWA & Pending = FALSE %% NumDWA == 2 NumDWA++ %% Failback() OKAY rcv('DWA', #watchdog{status = reopen, num_dwa = 2 = N} = S) -> S#watchdog{status = okay, num_dwa = N+1, pending = false}; %% REOPEN Receive DWA & Pending = FALSE %% NumDWA < 2 NumDWA++ REOPEN rcv('DWA', #watchdog{status = reopen, num_dwa = N} = S) -> S#watchdog{num_dwa = N+1, pending = false}; %% REOPEN Receive non-DWA Throwaway() 
REOPEN rcv(_, #watchdog{status = reopen} = S) -> throwaway(S). %% timeout/1 %% %% The caller sets the watchdog on the return value. %% OKAY Timer expires & SendWatchdog() %% !Pending SetWatchdog() %% Pending = TRUE OKAY %% REOPEN Timer expires & SendWatchdog() %% !Pending SetWatchdog() %% Pending = TRUE REOPEN timeout(#watchdog{status = T, pending = false} = S) when T == okay; T == reopen -> send_watchdog(S); %% OKAY Timer expires & Failover() %% Pending SetWatchdog() SUSPECT timeout(#watchdog{status = okay, pending = true} = S) -> S#watchdog{status = suspect}; %% SUSPECT Timer expires CloseConnection() %% SetWatchdog() DOWN %% REOPEN Timer expires & CloseConnection() %% Pending & SetWatchdog() %% NumDWA < 0 DOWN timeout(#watchdog{status = T, pending = P, num_dwa = N, transport = TPid} = S) when T == suspect; T == reopen, P, N < 0 -> exit(TPid, {shutdown, watchdog_timeout}), S#watchdog{status = down}; %% REOPEN Timer expires & NumDWA = -1 %% Pending & SetWatchdog() %% NumDWA >= 0 REOPEN timeout(#watchdog{status = reopen, pending = true, num_dwa = N} = S) when 0 =< N -> S#watchdog{num_dwa = -1}; %% DOWN Timer expires AttemptOpen() %% SetWatchdog() DOWN %% INITIAL Timer expires AttemptOpen() %% SetWatchdog() INITIAL %% RFC 3539, 3.4.1: %% %% [5] While the connection is in the closed state, the AAA client MUST %% NOT attempt to send further watchdog messages on the connection. %% However, after the connection is closed, the AAA client continues %% to periodically attempt to reopen the connection. %% %% The AAA client SHOULD wait for the transport layer to report %% connection failure before attempting again, but MAY choose to %% bound this wait time by the watchdog interval, Tw. %% Don't bound, restarting the peer process only when the previous %% process has died. We only need to handle state down since we start %% the first watchdog when transitioning out of initial. timeout(#watchdog{status = T} = S) when T == initial; T == down -> restart(S). 
%% restart/1 restart(#watchdog{transport = undefined} = S) -> restart(getr(restart), S); restart(S) -> S. %% restart/2 %% %% Only restart the transport in the connecting case. For an accepting %% transport, there's no guarantee that an accepted connection in a %% restarted transport is from the peer we've lost contact with, so we %% have to be prepared for another watchdog to handle it. This is what %% the diameter_reg registration in this module is for: the peer %% connection is registered when leaving state initial and this is %% used by a new accepting watchdog to realize that it's actually in %% state down rather than initial when receiving notification of an %% open connection. restart({{connect, _} = T, Opts, Svc}, #watchdog{parent = Pid, sequence = Mask, restrict = {R,_}, dictionary = Dict0} = S) -> send(Pid, {reconnect, self()}), Nodes = restrict_nodes(R), S#watchdog{transport = start(T, Opts, Mask, Nodes, Dict0, Svc), restrict = {R, lists:member(node(), Nodes)}}; %% No restriction on the number of connections to the same peer: just %% die. Note that a state machine never enters state REOPEN in this %% case. restart({{accept, _}, _, _}, #watchdog{restrict = {_, false}}) -> stop; %% Otherwise hang around until told to die. restart({{accept, _}, _, _}, S) -> S. %% Don't currently use Opts/Svc in the accept case. %% dwr/1 dwr(#diameter_caps{origin_host = OH, origin_realm = OR, origin_state_id = OSI}) -> ['DWR', {'Origin-Host', OH}, {'Origin-Realm', OR}, {'Origin-State-Id', OSI}]. %% restrict_nodes/1 restrict_nodes(false) -> []; restrict_nodes(nodes) -> [node() | nodes()]; restrict_nodes(node) -> [node()]; restrict_nodes(Nodes) when [] == Nodes; is_atom(hd(Nodes)) -> Nodes; restrict_nodes(F) -> diameter_lib:eval(F). -------------- next part -------------- A non-text attachment was scrubbed...
Name: smime.p7s Type: application/pkcs7-signature Size: 2182 bytes Desc: Cryptographic S/MIME signature URL: From Aleksander.Nycz@REDACTED Mon May 20 11:45:16 2013 From: Aleksander.Nycz@REDACTED (Aleksander Nycz) Date: Mon, 20 May 2013 11:45:16 +0200 Subject: [erlang-bugs] Memory leak in diameter_service module in diameter app (otp_R16B) In-Reply-To: <5199E1CA.6020209@comarch.pl> References: <5199E1CA.6020209@comarch.pl> Message-ID: <5199F0AC.9090800@comarch.pl> Hello, I think there is a resource (memory) leak in the diameter_service module. This module is a gen_server whose state contains the field watchdogT :: ets:tid(). This ETS table contains info about watchdogs. The diameter app service config is: [{'Origin-Host', HostName}, {'Origin-Realm', Realm}, {'Vendor-Id', ...}, {'Product-Name', ...}, {'Auth-Application-Id', [?DCCA_APP_ID]}, {'Supported-Vendor-Id', [...]}, {application, [{alias, diameterNode}, {dictionary, dictionaryDCCA}, {module, dccaCallback}]}, {*restrict_connections, false*}] After starting the diameter app and adding a service and transport, the diameter_service state is: > diameter_service:state(diameterNode). #state{id = {1369,41606,329900}, service_name = diameterNode, service = #diameter_service{pid = <0.1011.0>, capabilities = #diameter_caps{...}, applications = [#diameter_app{...}]}, watchdogT = 4194395,peerT = 4259932,shared_peers = 4325469, local_peers = 4391006,monitor = false, options = [{sequence,{0,32}}, {share_peers,false}, {use_shared_peers,false}, {restrict_connections,false}]} and ets 4194395 has one record: > ets:tab2list(4194395). [#watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,41606,330086}, peer = false}] Next I ran a very simple test using the Seagull simulator. The test scenario is the following: 1. seagull: send CER 2.
seagull: recv CEA
3. seagull: send CCR (init)
4. seagull: recv CCA (init)
5. seagull: send CCR (update)
6. seagull: recv CCA (update)
7. seagull: send CCR (terminate)
8. seagull: recv CCA (terminate)

During the test there are two watchdogs in the ets table:

> ets:tab2list(4194395).
[#watchdog{pid = <0.1816.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,41823,711370}, peer = false}, #watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = okay, started = {1369,41606,330086}, peer = <0.1014.0>}]

After the test, but before the Tw timer elapsed, there are still two watchdogs, which is correct:

> ets:tab2list(4194395).
[#watchdog{pid = <0.1816.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,41823,711370}, peer = false}, #watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = down, started = {1369,41606,330086}, peer = <0.1014.0>}]

But when the Tw timer elapsed, the transport and watchdog processes are finished:

> erlang:is_process_alive(list_to_pid("<0.1014.0>")).
false
> erlang:is_process_alive(list_to_pid("<0.1013.0>")).
false

and two watchdogs are still in the ets table:

> ets:tab2list(4194395).
[#watchdog{pid = <0.1816.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,41823,711370}, peer = false}, #watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = down, started = {1369,41606,330086}, peer = <0.1014.0>}]

I think watchdog <0.1013.0> should be removed when the watchdog process terminates.

I ran the next test and now there are 3 watchdogs in the ets table:

> ets:tab2list(4194395).
[#watchdog{pid = <0.1816.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = down, started = {1369,41823,711370}, peer = <0.1817.0>}, #watchdog{pid = <0.1013.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = down, started = {1369,41606,330086}, peer = <0.1014.0>}, #watchdog{pid = <0.3533.0>,type = accept, ref = #Ref<0.0.0.1696>, options = [{transport_module,diameter_tcp}, {transport_config,[{reuseaddr,true}, {ip,{0,0,0,0}}, {port,4068}]}, {capabilities_cb,[#Fun]}, {watchdog_timer,30000}, {reconnect_timer,60000}], state = initial, started = {1369,42342,845898}, peer = false}]

The watchdog and transport processes are not alive:

> erlang:is_process_alive(list_to_pid("<0.1816.0>")).
false
> erlang:is_process_alive(list_to_pid("<0.1817.0>")).
false

I suggest the following change to the code to correct this problem (file diameter_service.erl):

$ diff diameter_service.erl diameter_service.erl_ok
1006c1006
< connection_down(#watchdog{state = WS,
---
> connection_down(#watchdog{state = ?WD_OKAY,
1015,1017c1015,1021
<     ?WD_OKAY == WS
<         andalso
<         connection_down(Wd, fetch(PeerT, TPid), S).
---
>     connection_down(Wd, fetch(PeerT, TPid), S);
>
> connection_down(#watchdog{},
>                 To,
>                 #state{})
>   when is_atom(To) ->
>     ok.

You can find this solution in the attachment.

Regards
Aleksander Nycz

-- 
Aleksander Nycz
Senior Software Engineer
Telco_021 BSS R&D
Comarch SA
Phone: +48 12 646 1216
Mobile: +48 691 464 275
website: www.comarch.pl

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
%% %% %CopyrightBegin% %% %% Copyright Ericsson AB 2010-2013. All Rights Reserved. %% %% The contents of this file are subject to the Erlang Public License, %% Version 1.1, (the "License"); you may not use this file except in %% compliance with the License. You should have received a copy of the %% Erlang Public License along with this software. If not, it can be %% retrieved online at http://www.erlang.org/. %% %% Software distributed under the License is distributed on an "AS IS" %% basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See %% the License for the specific language governing rights and limitations %% under the License. %% %% %CopyrightEnd% %% %% %% Implements the process that represents a service. %% -module(diameter_service). -behaviour(gen_server). %% towards diameter_service_sup -export([start_link/1]). %% towards diameter -export([subscribe/1, unsubscribe/1, services/0, info/2]). %% towards diameter_config -export([start/1, stop/1, start_transport/2, stop_transport/2]). %% towards diameter_peer -export([notify/2]). %% towards diameter_traffic -export([find_incoming_app/4, pick_peer/3]). 
%% test/debug -export([services/1, subscriptions/1, subscriptions/0, call_module/3, whois/1, state/1, uptime/1]). %% gen_server callbacks -export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]). -include_lib("diameter/include/diameter.hrl"). -include("diameter_internal.hrl"). %% RFC 3539 watchdog states. -define(WD_INITIAL, initial). -define(WD_OKAY, okay). -define(WD_SUSPECT, suspect). -define(WD_DOWN, down). -define(WD_REOPEN, reopen). -type wd_state() :: ?WD_INITIAL | ?WD_OKAY | ?WD_SUSPECT | ?WD_DOWN | ?WD_REOPEN. -define(DEFAULT_TC, 30000). %% RFC 3588 ch 2.1 -define(RESTART_TC, 1000). %% if restart was this recent %% Used to be able to swap this with anything else dict-like but now %% rely on the fact that a service's #state{} record does not change %% in storing it in the ?STATE table and not always going through the %% service process. In particular, rely on the fact that operations on %% a ?Dict don't change the handle to it. -define(Dict, diameter_dict). %% Maintains state in a table. In contrast to previously, a service's %% state is not constant and is accessed outside of the service %% process. -define(STATE_TABLE, ?MODULE). %% The default sequence mask. -define(NOMASK, {0,32}). %% The default restrict_connections. -define(RESTRICT, nodes). %% Workaround for dialyzer's lack of understanding of match specs. -type match(T) :: T | '_' | '$1' | '$2'. %% State of service gen_server. Note that the state term itself %% doesn't change, which is relevant for the stateless application %% callbacks since the state is retrieved from ?STATE_TABLE from %% outside the service process. The pid in the service record is used %% to determine whether or not we need to call the process for a %% pick_peer callback in the stateful case. 
-record(state, {id = now(), service_name :: diameter:service_name(), %% key in ?STATE_TABLE service :: #diameter_service{}, watchdogT = ets_new(watchdogs) %% #watchdog{} at start :: ets:tid(), peerT = ets_new(peers) %% #peer{pid = TPid} at okay/reopen :: ets:tid(), shared_peers = ?Dict:new() %% Alias -> [{TPid, Caps}, ...] :: ets:tid(), local_peers = ?Dict:new() %% Alias -> [{TPid, Caps}, ...] :: ets:tid(), monitor = false :: false | pid(), %% process to die with options :: [{sequence, diameter:sequence()} %% sequence mask | {restrict_connections, diameter:restriction()} | {share_peers, boolean()} %% broadcast peers to remote nodes? | {use_shared_peers, boolean()}]}).%% use broadcasted peers? %% shared_peers reflects the peers broadcast from remote nodes. %% Record representing an RFC 3539 watchdog process implemented by %% diameter_watchdog. -record(watchdog, {pid :: match(pid()), type :: match(connect | accept), ref :: match(reference()), %% key into diameter_config options :: match([diameter:transport_opt()]),%% from start_transport state = ?WD_INITIAL :: match(wd_state()), started = now(), %% at process start peer = false :: match(boolean() | pid())}). %% true at accepted, pid() at okay/reopen %% Record representing a Peer State Machine process implemented by %% diameter_peer_fsm. -record(peer, {pid :: pid(), apps :: [{0..16#FFFFFFFF, diameter:app_alias()}], %% {Id, Alias} caps :: #diameter_caps{}, started = now(), %% at process start watchdog :: pid()}). %% key into watchdogT %% --------------------------------------------------------------------------- %% # start/1 %% --------------------------------------------------------------------------- start(SvcName) -> diameter_service_sup:start_child(SvcName). start_link(SvcName) -> Options = [{spawn_opt, diameter_lib:spawn_opts(server, [])}], gen_server:start_link(?MODULE, [SvcName], Options). 
%% Put the arbitrary term SvcName in a list in case we ever want to %% send more than this and need to distinguish old from new. %% --------------------------------------------------------------------------- %% # stop/1 %% --------------------------------------------------------------------------- stop(SvcName) -> case whois(SvcName) of undefined -> {error, not_started}; Pid -> stop(call_service(Pid, stop), Pid) end. stop(ok, Pid) -> MRef = erlang:monitor(process, Pid), receive {'DOWN', MRef, process, _, _} -> ok end; stop(No, _) -> No. %% --------------------------------------------------------------------------- %% # start_transport/2 %% --------------------------------------------------------------------------- start_transport(SvcName, {_Ref, _Type, _Opts} = T) -> call_service_by_name(SvcName, {start, T}). %% --------------------------------------------------------------------------- %% # stop_transport/2 %% --------------------------------------------------------------------------- stop_transport(_, []) -> ok; stop_transport(SvcName, [_|_] = Refs) -> call_service_by_name(SvcName, {stop, Refs}). %% --------------------------------------------------------------------------- %% # info/2 %% --------------------------------------------------------------------------- info(SvcName, Item) -> case lookup_state(SvcName) of [#state{} = S] -> service_info(Item, S); [] -> undefined end. %% lookup_state/1 lookup_state(SvcName) -> ets:lookup(?STATE_TABLE, SvcName). %% --------------------------------------------------------------------------- %% # subscribe/1 %% # unsubscribe/1 %% --------------------------------------------------------------------------- subscribe(SvcName) -> diameter_reg:add({?MODULE, subscriber, SvcName}). unsubscribe(SvcName) -> diameter_reg:del({?MODULE, subscriber, SvcName}). subscriptions(Pat) -> pmap(diameter_reg:match({?MODULE, subscriber, Pat})). subscriptions() -> subscriptions('_'). 
pmap(Props) -> lists:map(fun({{?MODULE, _, Name}, Pid}) -> {Name, Pid} end, Props). %% --------------------------------------------------------------------------- %% # services/1 %% --------------------------------------------------------------------------- services(Pat) -> pmap(diameter_reg:match({?MODULE, service, Pat})). services() -> services('_'). whois(SvcName) -> case diameter_reg:match({?MODULE, service, SvcName}) of [{_, Pid}] -> Pid; [] -> undefined end. %% --------------------------------------------------------------------------- %% # pick_peer/3 %% --------------------------------------------------------------------------- -spec pick_peer(SvcName, AppOrAlias, Opts) -> {{TPid, Caps, App}, Mask} | false | {error, term()} when SvcName :: diameter:service_name(), AppOrAlias :: {alias, diameter:app_alias()} | #diameter_app{}, Opts :: tuple(), TPid :: pid(), Caps :: #diameter_caps{}, App :: #diameter_app{}, Mask :: diameter:sequence(). pick_peer(SvcName, App, Opts) -> pick(lookup_state(SvcName), App, Opts). pick([], _, _) -> {error, no_service}; pick([S], App, Opts) -> pick(S, App, Opts); pick(#state{service = #diameter_service{applications = Apps}} = S, {alias, Alias}, Opts) -> %% initial call from diameter:call/4 pick(S, find_outgoing_app(Alias, Apps), Opts); pick(_, false, _) -> false; pick(#state{options = [{_, Mask} | _]} = S, #diameter_app{module = ModX, dictionary = Dict} = App0, {DestF, Filter, Xtra}) -> App = App0#diameter_app{module = ModX ++ Xtra}, [_,_] = RealmAndHost = diameter_lib:eval([DestF, Dict]), case pick_peer(App, RealmAndHost, Filter, S) of {TPid, Caps} -> {{TPid, Caps, App}, Mask}; false = No -> No end. 
%% --------------------------------------------------------------------------- %% # find_incoming_app/4 %% --------------------------------------------------------------------------- -spec find_incoming_app(PeerT, TPid, Id, Apps) -> {#diameter_app{}, #diameter_caps{}} %% connection and suitable app | #diameter_caps{} %% connection but no suitable app | false %% no connection when PeerT :: ets:tid(), TPid :: pid(), Id :: non_neg_integer(), Apps :: [#diameter_app{}]. find_incoming_app(PeerT, TPid, Id, Apps) -> try ets:lookup(PeerT, TPid) of [#peer{} = P] -> find_incoming_app(P, Id, Apps); [] -> %% transport has gone down false catch error: badarg -> %% service has gone down (and taken table with it) false end. %% --------------------------------------------------------------------------- %% # notify/2 %% --------------------------------------------------------------------------- notify(SvcName, Msg) -> Pid = whois(SvcName), is_pid(Pid) andalso (Pid ! Msg). %% =========================================================================== %% =========================================================================== state(Svc) -> call_service(Svc, state). uptime(Svc) -> call_service(Svc, uptime). %% call_module/3 call_module(Service, AppMod, Request) -> call_service(Service, {call_module, AppMod, Request}). %% --------------------------------------------------------------------------- %% # init/1 %% --------------------------------------------------------------------------- init([SvcName]) -> process_flag(trap_exit, true), %% ensure terminate(shutdown, _) i(SvcName, diameter_reg:add_new({?MODULE, service, SvcName})). i(SvcName, true) -> {ok, i(SvcName)}; i(_, false) -> {stop, {shutdown, already_started}}. 
%% --------------------------------------------------------------------------- %% # handle_call/3 %% --------------------------------------------------------------------------- handle_call(state, _, S) -> {reply, S, S}; handle_call(uptime, _, #state{id = T} = S) -> {reply, diameter_lib:now_diff(T), S}; %% Start a transport. handle_call({start, {Ref, Type, Opts}}, _From, S) -> {reply, start(Ref, {Type, Opts}, S), S}; %% Stop transports. handle_call({stop, Refs}, _From, S) -> shutdown(Refs, S), {reply, ok, S}; %% pick_peer with mutable state handle_call({pick_peer, Local, Remote, App}, _From, S) -> #diameter_app{mutable = true} = App, %% assert {reply, pick_peer(Local, Remote, self(), S#state.service_name, App), S}; handle_call({call_module, AppMod, Req}, From, S) -> call_module(AppMod, Req, From, S); handle_call(stop, _From, S) -> shutdown(service, S), {stop, normal, ok, S}; %% The server currently isn't guaranteed to be dead when the caller %% gets the reply. We deal with this in the call to the server, %% starting a monitor that waits for DOWN before returning. handle_call(Req, From, S) -> unexpected(handle_call, [Req, From], S), {reply, nok, S}. %% --------------------------------------------------------------------------- %% # handle_cast/2 %% --------------------------------------------------------------------------- handle_cast(Req, S) -> unexpected(handle_cast, [Req], S), {noreply, S}. %% --------------------------------------------------------------------------- %% # handle_info/2 %% --------------------------------------------------------------------------- handle_info(T, #state{} = S) -> case transition(T,S) of ok -> {noreply, S}; {stop, Reason} -> {stop, {shutdown, Reason}, S} end. %% transition/2 %% Peer process is telling us to start a new accept process. transition({accepted, Pid, TPid}, S) -> accepted(Pid, TPid, S), ok; %% Connecting transport is being restarted by watchdog. 
transition({reconnect, Pid}, S) -> reconnect(Pid, S), ok; %% Watchdog is sending notification of transport death. transition({close, Pid, Reason}, #state{service_name = SvcName, watchdogT = WatchdogT}) -> #watchdog{state = WS, ref = Ref, type = Type, options = Opts} = fetch(WatchdogT, Pid), WS /= ?WD_OKAY andalso send_event(SvcName, {closed, Ref, Reason, {type(Type), Opts}}), ok; %% Watchdog is sending notification of a state transition. transition({watchdog, Pid, {[TPid | Data], From, To}}, #state{service_name = SvcName, watchdogT = WatchdogT} = S) -> #watchdog{ref = Ref, type = T, options = Opts} = Wd = fetch(WatchdogT, Pid), watchdog(TPid, Data, From, To, Wd, S), send_event(SvcName, {watchdog, Ref, TPid, {From, To}, {T, Opts}}), ok; %% Death of a watchdog process (#watchdog.pid) results in the removal of %% its peer and any associated conn record when 'DOWN' is received. %% Death of a peer process (#peer.pid, #watchdog.peer) results in %% ?WD_DOWN. %% Monitor process has died. Just die with a reason that tells %% diameter_config about the happening. If a cleaner shutdown is %% required then someone should stop us. transition({'DOWN', MRef, process, _, Reason}, #state{monitor = MRef}) -> {stop, {monitor, Reason}}; %% Local watchdog process has died. transition({'DOWN', _, process, Pid, _Reason}, S) when node(Pid) == node() -> watchdog_down(Pid, S), ok; %% Remote service wants to know about shared peers. transition({service, Pid}, S) -> share_peers(Pid, S), ok; %% Remote service is communicating a shared peer. transition({peer, TPid, Aliases, Caps}, S) -> remote_peer_up(TPid, Aliases, Caps, S), ok; %% Remote peer process has died. transition({'DOWN', _, process, TPid, _}, S) -> remote_peer_down(TPid, S), ok; %% Restart after tc expiry. transition({tc_timeout, T}, S) -> tc_timeout(T, S), ok; transition(Req, S) -> unexpected(handle_info, [Req], S), ok. 
%% --------------------------------------------------------------------------- %% # terminate/2 %% --------------------------------------------------------------------------- terminate(Reason, #state{service_name = Name} = S) -> send_event(Name, stop), ets:delete(?STATE_TABLE, Name), shutdown == Reason %% application shutdown andalso shutdown(application, S). %% --------------------------------------------------------------------------- %% # code_change/3 %% --------------------------------------------------------------------------- code_change(FromVsn, #state{service_name = SvcName, service = #diameter_service{applications = Apps}} = S, Extra) -> lists:foreach(fun(A) -> code_change(FromVsn, SvcName, Extra, A) end, Apps), {ok, S}. code_change(FromVsn, SvcName, Extra, #diameter_app{alias = Alias} = A) -> {ok, S} = cb(A, code_change, [FromVsn, mod_state(Alias), Extra, SvcName]), mod_state(Alias, S). %% =========================================================================== %% =========================================================================== unexpected(F, A, #state{service_name = Name}) -> ?UNEXPECTED(F, A ++ [Name]). cb(#diameter_app{module = [_|_] = M}, F, A) -> eval(M, F, A). eval([M|X], F, A) -> apply(M, F, A ++ X). %% Callback with state. state_cb(#diameter_app{module = ModX, mutable = false, init_state = S}, pick_peer = F, A) -> eval(ModX, F, A ++ [S]); state_cb(#diameter_app{module = ModX, alias = Alias}, F, A) -> eval(ModX, F, A ++ [mod_state(Alias)]). choose(true, X, _) -> X; choose(false, _, X) -> X. ets_new(Tbl) -> ets:new(Tbl, [{keypos, 2}]). insert(Tbl, Rec) -> ets:insert(Tbl, Rec), Rec. %% Using the process dictionary for the callback state was initially %% just a way to make what was horrendous trace (big state record and %% much else everywhere) somewhat more readable. There's not as much %% need for it now but it's no worse (except possibly that we don't %% see the table identifier being passed around) than an ets table so %% keep it. 
mod_state(Alias) -> get({?MODULE, mod_state, Alias}). mod_state(Alias, ModS) -> put({?MODULE, mod_state, Alias}, ModS). %% --------------------------------------------------------------------------- %% # shutdown/2 %% --------------------------------------------------------------------------- %% remove_transport shutdown(Refs, #state{watchdogT = WatchdogT}) when is_list(Refs) -> ets:foldl(fun(P,ok) -> st(P, Refs), ok end, ok, WatchdogT); %% application/service shutdown shutdown(Reason, #state{watchdogT = WatchdogT}) when Reason == application; Reason == service -> diameter_lib:wait(ets:foldl(fun(P,A) -> st(P, Reason, A) end, [], WatchdogT)). %% st/2 st(#watchdog{ref = Ref, pid = Pid}, Refs) -> lists:member(Ref, Refs) andalso (Pid ! {shutdown, self(), transport}). %% 'DOWN' cleans up %% st/3 st(#watchdog{pid = Pid}, Reason, Acc) -> Pid ! {shutdown, self(), Reason}, [Pid | Acc]. %% --------------------------------------------------------------------------- %% # call_service/2 %% --------------------------------------------------------------------------- call_service(Pid, Req) when is_pid(Pid) -> cs(Pid, Req); call_service(SvcName, Req) -> call_service_by_name(SvcName, Req). call_service_by_name(SvcName, Req) -> cs(whois(SvcName), Req). cs(Pid, Req) when is_pid(Pid) -> try gen_server:call(Pid, Req, infinity) catch E: Reason when E == exit -> {error, {E, Reason}} end; cs(undefined, _) -> {error, no_service}. %% --------------------------------------------------------------------------- %% # i/1 %% --------------------------------------------------------------------------- %% Initialize the state of a service gen_server. i(SvcName) -> %% Split the config into a server state and a list of transports. {#state{} = S, CL} = lists:foldl(fun cfg_acc/2, {false, []}, diameter_config:lookup(SvcName)), %% Publish the state in order to be able to access it outside of %% the service process. 
Originally table identifiers were only %% known to the service process but we now want to provide the %% option of application callbacks being 'stateless' in order to %% avoid having to go through a common process. (Eg. An agent that %% sends a request for every incoming request.) true = ets:insert_new(?STATE_TABLE, S), %% Start fsms for each transport. send_event(SvcName, start), lists:foreach(fun(T) -> start_fsm(T,S) end, CL), init_shared(S), S. cfg_acc({SvcName, #diameter_service{applications = Apps} = Rec, Opts}, {false, Acc}) -> lists:foreach(fun init_mod/1, Apps), S = #state{service_name = SvcName, service = Rec#diameter_service{pid = self()}, monitor = mref(get_value(monitor, Opts)), options = service_options(Opts)}, {S, Acc}; cfg_acc({_Ref, Type, _Opts} = T, {S, Acc}) when Type == connect; Type == listen -> {S, [T | Acc]}. service_options(Opts) -> [{sequence, proplists:get_value(sequence, Opts, ?NOMASK)}, {share_peers, get_value(share_peers, Opts)}, {use_shared_peers, get_value(use_shared_peers, Opts)}, {restrict_connections, proplists:get_value(restrict_connections, Opts, ?RESTRICT)}]. %% The order of options is significant since we match against the list. mref(false = No) -> No; mref(P) -> erlang:monitor(process, P). init_shared(#state{options = [_, _, {_, true} | _], service_name = Svc}) -> diameter_peer:notify(Svc, {service, self()}); init_shared(#state{options = [_, _, {_, false} | _]}) -> ok. init_mod(#diameter_app{alias = Alias, init_state = S}) -> mod_state(Alias, S). start_fsm({Ref, Type, Opts}, S) -> start(Ref, {Type, Opts}, S). get_value(Key, Vs) -> {_, V} = lists:keyfind(Key, 1, Vs), V. %% --------------------------------------------------------------------------- %% # start/3 %% --------------------------------------------------------------------------- %% If the initial start/3 at service/transport start succeeds then %% subsequent calls to start/4 on the same service will also succeed %% since they involve the same call to merge_service/2. 
We merge here %% rather than earlier since the service may not yet be configured %% when the transport is configured. start(Ref, {T, Opts}, S) when T == connect; T == listen -> try {ok, start(Ref, type(T), Opts, S)} catch ?FAILURE(Reason) -> {error, Reason} end. %% TODO: don't actually raise any errors yet %% There used to be a difference here between the handling of %% configured listening and connecting transports but now we simply %% tell the transport_module to start an accepting or connecting %% process respectively, the transport implementation initiating %% listening on a port as required. type(listen) -> accept; type(accept) -> listen; type(connect = T) -> T. %% start/4 start(Ref, Type, Opts, #state{watchdogT = WatchdogT, peerT = PeerT, options = SvcOpts, service_name = SvcName, service = Svc0}) when Type == connect; Type == accept -> #diameter_service{applications = Apps} = Svc = merge_service(Opts, Svc0), {_,_} = Mask = proplists:get_value(sequence, SvcOpts), Pid = s(Type, Ref, {diameter_traffic:make_recvdata([SvcName, PeerT, Apps, Mask]), Opts, SvcOpts, Svc}), insert(WatchdogT, #watchdog{pid = Pid, type = Type, ref = Ref, options = Opts}), Pid. %% Note that the service record passed into the watchdog is the merged %% record so that each watchdog may get a different record. This %% record is what is passed back into application callbacks. s(Type, Ref, T) -> {_MRef, Pid} = diameter_watchdog:start({Type, Ref}, T), Pid. %% merge_service/2 merge_service(Opts, Svc) -> lists:foldl(fun ms/2, Svc, Opts). %% Limit the applications known to the fsm to those in the 'apps' %% option. That this might be empty is checked by the fsm. It's not %% checked at config-time since there's no requirement that the %% service be configured first. (Which could be considered a bit odd.) 
ms({applications, As}, #diameter_service{applications = Apps} = S) when is_list(As) -> S#diameter_service{applications = [A || A <- Apps, lists:member(A#diameter_app.alias, As)]}; %% The fact that all capabilities can be configured on the transports %% means that the service doesn't necessarily represent a single %% locally implemented Diameter peer as identified by Origin-Host: a %% transport can configure its own Origin-Host. This means that the %% service is little more than a placeholder for default capabilities %% plus a list of applications that individual transports can choose %% to support (or not). ms({capabilities, Opts}, #diameter_service{capabilities = Caps0} = Svc) when is_list(Opts) -> %% make_caps has already succeeded in diameter_config so it will succeed %% again here. {ok, Caps} = diameter_capx:make_caps(Caps0, Opts), Svc#diameter_service{capabilities = Caps}; ms(_, Svc) -> Svc. %% --------------------------------------------------------------------------- %% # accepted/3 %% --------------------------------------------------------------------------- accepted(Pid, _TPid, #state{watchdogT = WatchdogT} = S) -> #watchdog{ref = Ref, type = accept = T, peer = false, options = Opts} = Wd = fetch(WatchdogT, Pid), insert(WatchdogT, Wd#watchdog{peer = true}),%% mark replacement as started start(Ref, T, Opts, S). %% start new watchdog fetch(Tid, Key) -> [T] = ets:lookup(Tid, Key), T. %% --------------------------------------------------------------------------- %% # watchdog/6 %% %% React to a watchdog state transition. %% --------------------------------------------------------------------------- %% Watchdog has a new open connection. watchdog(TPid, [T], _, ?WD_OKAY, Wd, State) -> connection_up({TPid, T}, Wd, State); %% Watchdog has a new connection that will be opened after DW[RA] %% exchange. watchdog(TPid, [T], _, ?WD_REOPEN, Wd, State) -> reopen({TPid, T}, Wd, State); %% Watchdog has recovered a suspect connection. 
watchdog(TPid, [], ?WD_SUSPECT, ?WD_OKAY, Wd, State) -> #watchdog{peer = TPid} = Wd, %% assert connection_up(Wd, State); %% Watchdog has an unresponsive connection. watchdog(TPid, [], ?WD_OKAY, ?WD_SUSPECT = To, Wd, State) -> #watchdog{peer = TPid} = Wd, %% assert connection_down(Wd, To, State); %% Watchdog has lost its connection. watchdog(TPid, [], _, ?WD_DOWN = To, Wd, #state{peerT = PeerT} = S) -> close(Wd, S), connection_down(Wd, To, S), ets:delete(PeerT, TPid); watchdog(_, [], _, _, _, _) -> ok. %% --------------------------------------------------------------------------- %% # connection_up/3 %% --------------------------------------------------------------------------- %% Watchdog process has reached state OKAY. connection_up({TPid, {Caps, SupportedApps, Pkt}}, #watchdog{pid = Pid} = Wd, #state{peerT = PeerT} = S) -> Pr = #peer{pid = TPid, apps = SupportedApps, caps = Caps, watchdog = Pid}, insert(PeerT, Pr), connection_up([Pkt], Wd#watchdog{peer = TPid}, Pr, S). %% --------------------------------------------------------------------------- %% # reopen/3 %% --------------------------------------------------------------------------- reopen({TPid, {Caps, SupportedApps, _Pkt}}, #watchdog{pid = Pid} = Wd, #state{watchdogT = WatchdogT, peerT = PeerT}) -> insert(PeerT, #peer{pid = TPid, apps = SupportedApps, caps = Caps, watchdog = Pid}), insert(WatchdogT, Wd#watchdog{state = ?WD_REOPEN, peer = TPid}). %% --------------------------------------------------------------------------- %% # connection_up/2 %% --------------------------------------------------------------------------- %% Watchdog has recovered a suspect connection. Note that there has %% been no new capabilities exchange in this case. connection_up(#watchdog{peer = TPid} = Wd, #state{peerT = PeerT} = S) -> connection_up([], Wd, fetch(PeerT, TPid), S). 
%% connection_up/4 connection_up(Extra, #watchdog{peer = TPid} = Wd, #peer{apps = SApps, caps = Caps} = Pr, #state{watchdogT = WatchdogT, local_peers = LDict, service_name = SvcName, service = #diameter_service{applications = Apps}} = S) -> insert(WatchdogT, Wd#watchdog{state = ?WD_OKAY}), diameter_traffic:peer_up(TPid), insert_local_peer(SApps, {{TPid, Caps}, {SvcName, Apps}}, LDict), report_status(up, Wd, Pr, S, Extra). insert_local_peer(SApps, T, LDict) -> lists:foldl(fun(A,D) -> ilp(A, T, D) end, LDict, SApps). ilp({Id, Alias}, {TC, SA}, LDict) -> init_conn(Id, Alias, TC, SA), ?Dict:append(Alias, TC, LDict). init_conn(Id, Alias, {TPid, _} = TC, {SvcName, Apps}) -> #diameter_app{id = Id} %% assert = App = find_app(Alias, Apps), peer_cb(App, peer_up, [SvcName, TC]) orelse exit(TPid, kill). %% fake transport failure %% --------------------------------------------------------------------------- %% # find_incoming_app/3 %% --------------------------------------------------------------------------- %% No one should be sending the relay identifier. find_incoming_app(#peer{caps = Caps}, ?APP_ID_RELAY, _) -> Caps; find_incoming_app(Peer, Id, Apps) when is_integer(Id) -> find_incoming_app(Peer, [Id, ?APP_ID_RELAY], Apps); %% Note that the apps represented in SApps may be a strict subset of %% those in Apps. find_incoming_app(#peer{apps = SApps, caps = Caps}, Ids, Apps) -> case keyfind(Ids, 1, SApps) of {_Id, Alias} -> {#diameter_app{} = find_app(Alias, Apps), Caps}; false -> Caps end. %% keyfind/3 keyfind([], _, _) -> false; keyfind([Key | Rest], Pos, L) -> case lists:keyfind(Key, Pos, L) of false -> keyfind(Rest, Pos, L); T -> T end. %% find_outgoing_app/2 find_outgoing_app(Alias, Apps) -> case find_app(Alias, Apps) of #diameter_app{id = ?APP_ID_RELAY} -> false; A -> A end. %% find_app/2 find_app(Alias, Apps) -> lists:keyfind(Alias, #diameter_app.alias, Apps). %% Don't bring down the service (and all associated connections) %% regardless of what happens. 
peer_cb(App, F, A) ->
    try state_cb(App, F, A) of
        ModS ->
            mod_state(App#diameter_app.alias, ModS),
            true
    catch
        E:R ->
            diameter_lib:error_report({failure, {E, R, ?STACK}},
                                      {App, F, A}),
            false
    end.

%% ---------------------------------------------------------------------------
%% # connection_down/3
%% ---------------------------------------------------------------------------

connection_down(#watchdog{state = ?WD_OKAY,
                          peer = TPid}
                = Wd,
                #peer{caps = Caps,
                      apps = SApps}
                = Pr,
                #state{service_name = SvcName,
                       service = #diameter_service{applications = Apps},
                       local_peers = LDict}
                = S) ->
    report_status(down, Wd, Pr, S, []),
    remove_local_peer(SApps, {{TPid, Caps}, {SvcName, Apps}}, LDict),
    diameter_traffic:peer_down(TPid);

connection_down(#watchdog{}, #peer{}, _) ->
    ok;

connection_down(#watchdog{state = ?WD_OKAY,
                          peer = TPid}
                = Wd,
                To,
                #state{watchdogT = WatchdogT,
                       peerT = PeerT}
                = S)
  when is_atom(To) ->
    insert(WatchdogT, Wd#watchdog{state = To}),
    connection_down(Wd, fetch(PeerT, TPid), S);

connection_down(#watchdog{}, To, #state{})
  when is_atom(To) ->
    ok.

remove_local_peer(SApps, T, LDict) ->
    lists:foldl(fun(A,D) -> rlp(A, T, D) end, LDict, SApps).

rlp({Id, Alias}, {TC, SA}, LDict) ->
    L = ?Dict:fetch(Alias, LDict),
    down_conn(Id, Alias, TC, SA),
    ?Dict:store(Alias, lists:delete(TC, L), LDict).

down_conn(Id, Alias, TC, {SvcName, Apps}) ->
    #diameter_app{id = Id}  %% assert
        = App
        = find_app(Alias, Apps),
    peer_cb(App, peer_down, [SvcName, TC]).

%% ---------------------------------------------------------------------------
%% # watchdog_down/2
%% ---------------------------------------------------------------------------

%% Watchdog process has died.

watchdog_down(Pid, #state{watchdogT = WatchdogT} = S) ->
    Wd = fetch(WatchdogT, Pid),
    ets:delete_object(WatchdogT, Wd),
    restart(Wd,S),
    wd_down(Wd,S).

%% Watchdog has never reached OKAY ...
wd_down(#watchdog{peer = B}, _)
  when is_boolean(B) ->
    ok;

%% ... or maybe it has.
wd_down(#watchdog{peer = TPid} = Wd, #state{peerT = PeerT} = S) ->
    connection_down(Wd, ?WD_DOWN, S),
    ets:delete(PeerT, TPid).

%% restart/2

restart(Wd, S) ->
    q_restart(restart(Wd), S).

%% restart/1

%% Always try to reconnect.
restart(#watchdog{ref = Ref,
                  type = connect = T,
                  options = Opts,
                  started = Time}) ->
    {Time, {Ref, T, Opts}};

%% Transport connection hasn't yet been accepted ...
restart(#watchdog{ref = Ref,
                  type = accept = T,
                  options = Opts,
                  peer = false,
                  started = Time}) ->
    {Time, {Ref, T, Opts}};

%% ... or it has: a replacement has already been spawned.
restart(#watchdog{type = accept}) ->
    false.

%% q_restart/2

%% Start the reconnect timer.
q_restart({Time, {_Ref, Type, Opts} = T}, S) ->
    start_tc(tc(Time, default_tc(Type, Opts)), T, S);
q_restart(false, _) ->
    ok.

%% RFC 3588, 2.1:
%%
%%   When no transport connection exists with a peer, an attempt to
%%   connect SHOULD be periodically made.  This behavior is handled via
%%   the Tc timer, whose recommended value is 30 seconds.  There are
%%   certain exceptions to this rule, such as when a peer has terminated
%%   the transport connection stating that it does not wish to
%%   communicate.

default_tc(connect, Opts) ->
    proplists:get_value(reconnect_timer, Opts, ?DEFAULT_TC);
default_tc(accept, _) ->
    0.

%% Bound tc below if the watchdog was restarted recently to avoid
%% continuous restarts in case of faulty config or other problems.
tc(Time, Tc) ->
    choose(Tc > ?RESTART_TC
             orelse timer:now_diff(now(), Time) > 1000*?RESTART_TC,
           Tc,
           ?RESTART_TC).

start_tc(0, T, S) ->
    tc_timeout(T, S);
start_tc(Tc, T, _) ->
    erlang:send_after(Tc, self(), {tc_timeout, T}).

%% tc_timeout/2

tc_timeout({Ref, _Type, _Opts} = T, #state{service_name = SvcName} = S) ->
    tc(diameter_config:have_transport(SvcName, Ref), T, S).

tc(true, {Ref, Type, Opts}, #state{service_name = SvcName} = S) ->
    send_event(SvcName, {reconnect, Ref, Opts}),
    start(Ref, Type, Opts, S);
tc(false = No, _, _) ->  %% removed
    No.
%% ---------------------------------------------------------------------------
%% # close/2
%% ---------------------------------------------------------------------------

%% The watchdog doesn't start a new fsm in the accept case, it
%% simply stays alive until someone tells it to die in order for
%% another watchdog to be able to detect that it should transition
%% from initial into reopen rather than okay. That someone is either
%% the accepting watchdog upon reception of a CER from the previously
%% connected peer, or us after reconnect_timer timeout.

close(#watchdog{type = connect}, _) ->
    ok;
close(#watchdog{type = accept,
                pid = Pid,
                ref = Ref,
                options = Opts},
      #state{service_name = SvcName}) ->
    c(Pid, diameter_config:have_transport(SvcName, Ref), Opts).

%% Tell watchdog to (maybe) die later ...
c(Pid, true, Opts) ->
    Tc = proplists:get_value(reconnect_timer, Opts, 2*?DEFAULT_TC),
    erlang:send_after(Tc, Pid, close);

%% ... or now.
c(Pid, false, _Opts) ->
    Pid ! close.

%% The RFCs only document the behaviour of Tc, our reconnect_timer,
%% for the establishment of connections but we also give
%% reconnect_timer semantics for a listener, being the time within
%% which a new connection attempt is expected of a connecting peer.
%% The value should be greater than the peer's Tc + jitter.

%% ---------------------------------------------------------------------------
%% # reconnect/2
%% ---------------------------------------------------------------------------

reconnect(Pid, #state{service_name = SvcName,
                      watchdogT = WatchdogT}) ->
    #watchdog{ref = Ref,
              type = connect,
              options = Opts}
        = fetch(WatchdogT, Pid),
    send_event(SvcName, {reconnect, Ref, Opts}).

%% ---------------------------------------------------------------------------
%% # call_module/4
%% ---------------------------------------------------------------------------

%% Backwards compatibility and never documented/advertised. May be
%% removed.
call_module(Mod, Req, From, #state{service
                                   = #diameter_service{applications = Apps},
                                   service_name = Svc}
                            = S) ->
    case cm([A || A <- Apps, Mod == hd(A#diameter_app.module)],
            Req,
            From,
            Svc)
    of
        {reply = T, RC} ->
            {T, RC, S};
        noreply = T ->
            {T, S};
        Reason ->
            {reply, {error, Reason}, S}
    end.

cm([#diameter_app{alias = Alias} = App], Req, From, Svc) ->
    Args = [Req, From, Svc],
    try state_cb(App, handle_call, Args) of
        {noreply = T, ModS} ->
            mod_state(Alias, ModS),
            T;
        {reply = T, RC, ModS} ->
            mod_state(Alias, ModS),
            {T, RC};
        T ->
            diameter_lib:error_report({invalid, T},
                                      {App, handle_call, Args}),
            invalid
    catch
        E:Reason ->
            diameter_lib:error_report({failure, {E, Reason, ?STACK}},
                                      {App, handle_call, Args}),
            failure
    end;
cm([], _, _, _) ->
    unknown;
cm([_,_|_], _, _, _) ->
    multiple.

%% ---------------------------------------------------------------------------
%% # report_status/5
%% ---------------------------------------------------------------------------

report_status(Status,
              #watchdog{ref = Ref,
                        peer = TPid,
                        type = Type,
                        options = Opts},
              #peer{apps = [_|_] = As,
                    caps = Caps},
              #state{service_name = SvcName}
              = S,
              Extra) ->
    share_peer(Status, Caps, As, TPid, S),
    Info = [Status, Ref, {TPid, Caps}, {type(Type), Opts} | Extra],
    send_event(SvcName, list_to_tuple(Info)).

%% send_event/2

send_event(SvcName, Info) ->
    send_event(#diameter_event{service = SvcName,
                               info = Info}).

send_event(#diameter_event{service = SvcName} = E) ->
    lists:foreach(fun({_, Pid}) -> Pid ! E end, subscriptions(SvcName)).

%% ---------------------------------------------------------------------------
%% # share_peer/5
%% ---------------------------------------------------------------------------

share_peer(up, Caps, Aliases, TPid, #state{options = [_, {_, true} | _],
                                           service_name = Svc}) ->
    diameter_peer:notify(Svc, {peer, TPid, Aliases, Caps});

share_peer(_, _, _, _, _) ->
    ok.
%% ---------------------------------------------------------------------------
%% # share_peers/2
%% ---------------------------------------------------------------------------

share_peers(Pid, #state{options = [_, {_, true} | _],
                        local_peers = PDict}) ->
    ?Dict:fold(fun(A,Ps,ok) -> sp(Pid, A, Ps), ok end, ok, PDict);

share_peers(_, _) ->
    ok.

sp(Pid, Alias, Peers) ->
    lists:foreach(fun({P,C}) -> Pid ! {peer, P, [Alias], C} end, Peers).

%% ---------------------------------------------------------------------------
%% # remote_peer_up/4
%% ---------------------------------------------------------------------------

remote_peer_up(Pid, Aliases, Caps, #state{options = [_, _, {_, true} | _],
                                          service = Svc,
                                          shared_peers = PDict}) ->
    #diameter_service{applications = Apps} = Svc,
    Key = #diameter_app.alias,
    As = lists:filter(fun(A) -> lists:keymember(A, Key, Apps) end, Aliases),
    rpu(Pid, Caps, PDict, As);

remote_peer_up(_, _, _, #state{options = [_, _, {_, false} | _]}) ->
    ok.

rpu(_, _, PDict, []) ->
    PDict;
rpu(Pid, Caps, PDict, Aliases) ->
    erlang:monitor(process, Pid),
    T = {Pid, Caps},
    lists:foreach(fun(A) -> ?Dict:append(A, T, PDict) end, Aliases).

%% ---------------------------------------------------------------------------
%% # remote_peer_down/2
%% ---------------------------------------------------------------------------

remote_peer_down(Pid, #state{options = [_, _, {_, true} | _],
                             shared_peers = PDict}) ->
    lists:foreach(fun(A) -> rpd(Pid, A, PDict) end,
                  ?Dict:fetch_keys(PDict)).

rpd(Pid, Alias, PDict) ->
    ?Dict:update(Alias, fun(Ps) -> lists:keydelete(Pid, 1, Ps) end, PDict).
%% ---------------------------------------------------------------------------
%% pick_peer/4
%% ---------------------------------------------------------------------------

pick_peer(#diameter_app{alias = Alias}
          = App,
          RealmAndHost,
          Filter,
          #state{local_peers = L,
                 shared_peers = S,
                 service_name = SvcName,
                 service = #diameter_service{pid = Pid}}) ->
    pick_peer(peers(Alias, RealmAndHost, Filter, L),
              peers(Alias, RealmAndHost, Filter, S),
              Pid,
              SvcName,
              App).

%% pick_peer/5

pick_peer([], [], _, _, _) ->
    false;

%% App state is mutable but we're not in the service process: go there.
pick_peer(Local, Remote, Pid, _SvcName, #diameter_app{mutable = true} = App)
  when self() /= Pid ->
    case call_service(Pid, {pick_peer, Local, Remote, App}) of
        {TPid, _} = T when is_pid(TPid) ->
            T;
        {error, _} ->
            false
    end;

%% App state isn't mutable or it is and we're in the service process:
%% do the deed.
pick_peer(Local,
          Remote,
          _Pid,
          SvcName,
          #diameter_app{alias = Alias,
                        init_state = S,
                        mutable = M}
          = App) ->
    Args = [Local, Remote, SvcName],
    try state_cb(App, pick_peer, Args) of
        {ok, {TPid, #diameter_caps{}} = T} when is_pid(TPid) ->
            T;
        {{TPid, #diameter_caps{}} = T, ModS} when is_pid(TPid), M ->
            mod_state(Alias, ModS),
            T;
        {false = No, ModS} when M ->
            mod_state(Alias, ModS),
            No;
        {ok, false = No} ->
            No;
        false = No ->
            No;
        {{TPid, #diameter_caps{}} = T, S} when is_pid(TPid) ->
            T;                  %% Accept returned state in the immutable
        {false = No, S} ->      %% case as long as it isn't changed.
            No;
        T ->
            diameter_lib:error_report({invalid, T, App},
                                      {App, pick_peer, Args})
    catch
        E:Reason ->
            diameter_lib:error_report({failure, {E, Reason, ?STACK}},
                                      {App, pick_peer, Args})
    end.

%% peers/4

peers(Alias, RH, Filter, Peers) ->
    case ?Dict:find(Alias, Peers) of
        {ok, L} ->
            ps(L, RH, Filter, {[],[]});
        error ->
            []
    end.

%% Place a peer whose Destination-Host/Realm matches those of the
%% request at the front of the result list. Could add some sort of
%% 'sort' option to allow more control.
ps([], _, _, {Ys, Ns}) ->
    lists:reverse(Ys, Ns);
ps([{_TPid, #diameter_caps{} = Caps} = TC | Rest], RH, Filter, Acc) ->
    ps(Rest,
       RH,
       Filter,
       pacc(caps_filter(Caps, RH, Filter),
            caps_filter(Caps, RH, {all, [host, realm]}),
            TC,
            Acc)).

pacc(true, true, Peer, {Ts, Fs}) ->
    {[Peer|Ts], Fs};
pacc(true, false, Peer, {Ts, Fs}) ->
    {Ts, [Peer|Fs]};
pacc(_, _, _, Acc) ->
    Acc.

%% caps_filter/3

caps_filter(C, RH, {neg, F}) ->
    not caps_filter(C, RH, F);
caps_filter(C, RH, {all, L})
  when is_list(L) ->
    lists:all(fun(F) -> caps_filter(C, RH, F) end, L);
caps_filter(C, RH, {any, L})
  when is_list(L) ->
    lists:any(fun(F) -> caps_filter(C, RH, F) end, L);
caps_filter(#diameter_caps{origin_host = {_,OH}}, [_,DH], host) ->
    eq(undefined, DH, OH);
caps_filter(#diameter_caps{origin_realm = {_,OR}}, [DR,_], realm) ->
    eq(undefined, DR, OR);
caps_filter(C, _, Filter) ->
    caps_filter(C, Filter).

%% caps_filter/2

caps_filter(_, none) ->
    true;
caps_filter(#diameter_caps{origin_host = {_,OH}}, {host, H}) ->
    eq(any, H, OH);
caps_filter(#diameter_caps{origin_realm = {_,OR}}, {realm, R}) ->
    eq(any, R, OR);

%% Anything else is expected to be an eval filter. Filter failure is
%% documented as being equivalent to a non-matching filter.

caps_filter(C, T) ->
    try
        {eval, F} = T,
        diameter_lib:eval([F,C])
    catch
        _:_ -> false
    end.

eq(Any, Id, PeerId) ->
    Any == Id orelse try
                         iolist_to_binary(Id) == iolist_to_binary(PeerId)
                     catch
                         _:_ -> false
                     end.
%% OctetString() can be specified as an iolist() so test for string
%% rather than term equality.

%% transports/1

transports(#state{watchdogT = WatchdogT}) ->
    ets:select(WatchdogT, [{#watchdog{peer = '$1', _ = '_'},
                            [{'is_pid', '$1'}],
                            ['$1']}]).

%% ---------------------------------------------------------------------------
%% # service_info/2
%% ---------------------------------------------------------------------------

%% The config passed to diameter:start_service/2.
-define(CAP_INFO, ['Origin-Host',
                   'Origin-Realm',
                   'Vendor-Id',
                   'Product-Name',
                   'Origin-State-Id',
                   'Host-IP-Address',
                   'Supported-Vendor-Id',
                   'Auth-Application-Id',
                   'Inband-Security-Id',
                   'Acct-Application-Id',
                   'Vendor-Specific-Application-Id',
                   'Firmware-Revision']).

%% The config returned by diameter:service_info(SvcName, all).
-define(ALL_INFO, [capabilities,
                   applications,
                   transport,
                   pending,
                   options]).

%% The rest.
-define(OTHER_INFO, [connections,
                     name,
                     peers,
                     statistics]).

service_info(Item, S)
  when is_atom(Item) ->
    case tagged_info(Item, S) of
        {_, T} -> T;
        undefined = No -> No
    end;
service_info(Items, S) ->
    tagged_info(Items, S).

tagged_info(Item, S)
  when is_atom(Item) ->
    case complete(Item) of
        {value, I} ->
            {I, complete_info(I,S)};
        false ->
            undefined
    end;

tagged_info(TPid, #state{watchdogT = WatchdogT, peerT = PeerT})
  when is_pid(TPid) ->
    try
        [#peer{watchdog = Pid}] = ets:lookup(PeerT, TPid),
        [#watchdog{ref = Ref, type = Type, options = Opts}]
            = ets:lookup(WatchdogT, Pid),
        [{ref, Ref},
         {type, Type},
         {options, Opts}]
    catch
        error:_ ->
            []
    end;

tagged_info(Items, S)
  when is_list(Items) ->
    [T || I <- Items, T <- [tagged_info(I,S)], T /= undefined, T /= []];

tagged_info(_, _) ->
    undefined.
complete_info(Item, #state{service = Svc} = S) ->
    case Item of
        name ->
            S#state.service_name;
        'Origin-Host' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.origin_host;
        'Origin-Realm' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.origin_realm;
        'Vendor-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.vendor_id;
        'Product-Name' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.product_name;
        'Origin-State-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.origin_state_id;
        'Host-IP-Address' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.host_ip_address;
        'Supported-Vendor-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.supported_vendor_id;
        'Auth-Application-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.auth_application_id;
        'Inband-Security-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.inband_security_id;
        'Acct-Application-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.acct_application_id;
        'Vendor-Specific-Application-Id' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.vendor_specific_application_id;
        'Firmware-Revision' ->
            (Svc#diameter_service.capabilities)
                #diameter_caps.firmware_revision;
        capabilities -> service_info(?CAP_INFO, S);
        applications -> info_apps(S);
        transport    -> info_transport(S);
        options      -> info_options(S);
        pending      -> info_pending(S);
        keys         -> ?ALL_INFO ++ ?CAP_INFO ++ ?OTHER_INFO;
        all          -> service_info(?ALL_INFO, S);
        statistics   -> info_stats(S);
        connections  -> info_connections(S);
        peers        -> info_peers(S)
    end.

complete(I)
  when I == keys;
       I == all ->
    {value, I};
complete(Pre) ->
    P = atom_to_list(Pre),
    case [I || I <- ?ALL_INFO ++ ?CAP_INFO ++ ?OTHER_INFO,
               lists:prefix(P, atom_to_list(I))]
    of
        [I] -> {value, I};
        _   -> false
    end.
%% info_stats/1

info_stats(#state{watchdogT = WatchdogT}) ->
    MatchSpec = [{#watchdog{ref = '$1', peer = '$2', _ = '_'},
                  [{'is_pid', '$2'}],
                  [['$1', '$2']]}],
    try ets:select(WatchdogT, MatchSpec) of
        L ->
            diameter_stats:read(lists:append(L))
    catch
        error: badarg -> []  %% service has gone down
    end.

%% info_transport/1
%%
%% One entry per configured transport. Statistics for each entry are
%% the accumulated values for the ref and associated watchdog/peer
%% pids.

info_transport(S) ->
    PeerD = peer_dict(S, config_dict(S)),
    RefsD = dict:map(fun(_, Ls) -> [P || L <- Ls, {peer, {P,_}} <- L] end,
                     PeerD),
    Refs = lists:append(dict:fold(fun(R, Ps, A) -> [[R|Ps] | A] end,
                                  [],
                                  RefsD)),
    Stats = diameter_stats:read(Refs),
    dict:fold(fun(R, Ls, A) ->
                      Ps = dict:fetch(R, RefsD),
                      [[{ref, R} | transport(Ls)] ++ [stats([R|Ps], Stats)]
                       | A]
              end,
              [],
              PeerD).

%% Only a config entry for a listening transport: use it.
transport([[{type, listen}, _] = L]) ->
    L ++ [{accept, []}];

%% Only one config or peer entry for a connecting transport: use it.
transport([[{type, connect} | _] = L]) ->
    L;

%% Peer entries: discard config. Note that the peer entries have
%% length at least 3.
transport([[_,_] | L]) ->
    transport(L);

%% Possibly many peer entries for a listening transport. Note that all
%% have the same options by construction, which is not terribly space
%% efficient.
transport([[{type, accept}, {options, Opts} | _] | _] = Ls) ->
    [{type, listen},
     {options, Opts},
     {accept, [lists:nthtail(2,L) || L <- Ls]}].

peer_dict(#state{watchdogT = WatchdogT, peerT = PeerT}, Dict0) ->
    try ets:tab2list(WatchdogT) of
        L ->
            lists:foldl(fun(T,A) -> peer_acc(PeerT, A, T) end, Dict0, L)
    catch
        error: badarg -> Dict0  %% service has gone down
    end.

peer_acc(PeerT, Acc, #watchdog{pid = Pid,
                               type = Type,
                               ref = Ref,
                               options = Opts,
                               state = WS,
                               started = At,
                               peer = TPid}) ->
    dict:append(Ref,
                [{type, Type},
                 {options, Opts},
                 {watchdog, {Pid, At, WS}}
                 | info_peer(PeerT, TPid, WS)],
                Acc).
info_peer(PeerT, TPid, WS)
  when is_pid(TPid), WS /= ?WD_DOWN ->
    try ets:lookup(PeerT, TPid) of
        T -> info_peer(T)
    catch
        error: badarg -> []  %% service has gone down
    end;
info_peer(_, _, _) ->
    [].

%% The point of extracting the config here is so that 'transport' info
%% has one entry for each transport ref, the peer table only
%% containing entries that have a living watchdog.

config_dict(#state{service_name = SvcName}) ->
    lists:foldl(fun config_acc/2,
                dict:new(),
                diameter_config:lookup(SvcName)).

config_acc({Ref, T, Opts}, Dict)
  when T == listen;
       T == connect ->
    dict:store(Ref, [[{type, T}, {options, Opts}]], Dict);
config_acc(_, Dict) ->
    Dict.

info_peer([#peer{pid = Pid, apps = SApps, caps = Caps, started = T}]) ->
    [{peer, {Pid, T}},
     {apps, SApps},
     {caps, info_caps(Caps)}
     | try [{port, info_port(Pid)}] catch _:_ -> [] end];
info_peer([] = No) ->
    No.

%% Extract information that the processes involved are expected to
%% "publish" in their process dictionaries. Simple but backhanded.

info_port(Pid) ->
    {_, PD} = process_info(Pid, dictionary),
    {_, T} = lists:keyfind({diameter_peer_fsm, start}, 1, PD),
    {TPid, {_Type, TMod, _Cfg}} = T,
    {_, TD} = process_info(TPid, dictionary),
    {_, Data} = lists:keyfind({TMod, info}, 1, TD),
    [{owner, TPid},
     {module, TMod}
     | try TMod:info(Data) catch _:_ -> [] end].

%% Use the field names from diameter_caps instead of
%% diameter_base_CER to distinguish between the 2-tuple values
%% compared to the single capabilities values. Note also that the
%% returned list is tagged 'caps' rather than 'capabilities' to
%% emphasize the difference.

info_caps(#diameter_caps{} = C) ->
    lists:zip(record_info(fields, diameter_caps), tl(tuple_to_list(C))).

info_apps(#state{service = #diameter_service{applications = Apps}}) ->
    lists:map(fun mk_app/1, Apps).

mk_app(#diameter_app{} = A) ->
    lists:zip(record_info(fields, diameter_app), tl(tuple_to_list(A))).

%% info_pending/1
%%
%% One entry for each outgoing request whose answer is outstanding.
info_pending(#state{} = S) ->
    diameter_traffic:pending(transports(S)).

%% info_connections/1
%%
%% One entry per transport connection. Statistics for each entry are
%% for the peer pid only.

info_connections(S) ->
    ConnL = conn_list(S),
    Stats = diameter_stats:read([P || L <- ConnL, {peer, {P,_}} <- L]),
    [L ++ [stats([P], Stats)] || L <- ConnL, {peer, {P,_}} <- L].

conn_list(S) ->
    lists:append(dict:fold(fun conn_acc/3, [], peer_dict(S, dict:new()))).

conn_acc(Ref, Peers, Acc) ->
    [[[{ref, Ref} | L] || L <- Peers, lists:keymember(peer, 1, L)]
     | Acc].

stats(Refs, Stats) ->
    {statistics, dict:to_list(lists:foldl(fun(R,D) ->
                                                  stats_acc(R, D, Stats)
                                          end,
                                          dict:new(),
                                          Refs))}.

stats_acc(Ref, Dict, Stats) ->
    lists:foldl(fun({C,N}, D) -> dict:update_counter(C, N, D) end,
                Dict,
                proplists:get_value(Ref, Stats, [])).

%% info_peers/1
%%
%% One entry per peer Origin-Host. Statistics for each entry are
%% accumulated values for all peer pids.

info_peers(S) ->
    {PeerD, RefD} = lists:foldl(fun peer_acc/2,
                                {dict:new(), dict:new()},
                                conn_list(S)),
    Refs = lists:append(dict:fold(fun(_, Rs, A) -> [Rs|A] end,
                                  [],
                                  RefD)),
    Stats = diameter_stats:read(Refs),
    dict:fold(fun(OH, Cs, A) ->
                      Rs = dict:fetch(OH, RefD),
                      [{OH, [{connections, Cs}, stats(Rs, Stats)]} | A]
              end,
              [],
              PeerD).

peer_acc(Peer, {PeerD, RefD}) ->
    [{TPid, _}, [{origin_host, {_, OH}} | _]]
        = [proplists:get_value(K, Peer) || K <- [peer, caps]],
    {dict:append(OH, Peer, PeerD), dict:append(OH, TPid, RefD)}.

%% info_options/1

info_options(S) ->
    S#state.options.
From bryan@REDACTED Tue May 21 19:59:56 2013
From: bryan@REDACTED (Bryan Fink)
Date: Tue, 21 May 2013 13:59:56 -0400
Subject: [erlang-bugs] Supervisor terminate_child race
In-Reply-To: <2302FD2F-B3F4-4514-88B0-17082D781D1A@gmail.com>
References: <816241784.113785065.1368613048244.JavaMail.root@erlang-solutions.com> <2302FD2F-B3F4-4514-88B0-17082D781D1A@gmail.com>
Message-ID:

On Wed, May 15, 2013 at 11:11 AM, Tim Watson wrote:
> On 15 May 2013, at 14:54, Siri Hansen wrote:
>
> Then again... it is up to the child's start function to create the link, and
> from the supervisor's point of view, the only place to add the monitor would
> be when the start function returns - which would be just another place to
> get a race :(
>
> Well quite. *sigh*

My apologies for dropping out of this conversation. I've been on vacation.

Before vacation, monitoring from the spawn was the best solution I had come up with as well. But, as has already been pointed out, if it's not done atomically (which I think can be done with a flag in spawn_opt, no?), it's just another place for a race. It has also already been pointed out that changing the supervisor-child contract for startup isn't really an option anyway.

The only other possibility I see is to guarantee that if an EXIT message will be delivered, it is always delivered before any DOWN message. If this were the case, all receive expressions could have clauses for both EXIT and DOWN, and simply use whichever arrived first. Tim's method of checking for EXIT after receiving DOWN would also work in this case. I assume the problem with this guarantee is that these messages are generated by different processes, so typical mailbox ordering rules apply.

Fortunately for my use case, I think that simply linking my children to their creating process instead of a supervisor may be a viable option.
All of these children are dynamic under a simple_one_for_one supervisor, and I don't care about restart policies.

-Bryan

From mjtruog@REDACTED Tue May 21 21:45:23 2013
From: mjtruog@REDACTED (Michael Truog)
Date: Tue, 21 May 2013 12:45:23 -0700
Subject: [erlang-bugs] escript file operations fail on halt
In-Reply-To: <51980DD2.7060501@gmail.com>
References: <51980DD2.7060501@gmail.com>
Message-ID: <519BCED3.3050508@gmail.com>

Just as an update: I found that my issue with file operations was not related to the file port driver failing to complete async thread jobs, which made more logical sense. The fact remains that two things would have helped make the situation clearer:
1) clearly document the default flush operation for the erlang:halt/1 function
2) add an escript:exit/1 function which just calls erlang:halt/2 with flush == true as a convenience function (so that people are able to have simpler source code and not care about the halt/flush details)

Thanks,
Michael

On 05/18/2013 04:25 PM, Michael Truog wrote:
> Hi,
>
> There is an odd type of failure when:
> 1) async threads are enabled by default for the Erlang VM
> 2) an escript is used to spawn the Erlang VM
> 3) erlang:halt/1 is used to terminate the escript with a known error code
>
> The erlang:halt/1 and erlang:halt/2 code here:
> https://github.com/erlang/otp/blob/maint/erts/emulator/beam/bif.c#L3937
> makes the default flush parameter false! The default flush parameter is currently undocumented. So, when an escript performs a file operation that depends on the async thread pool (based on the internal Erlang code and configuration) and then attempts to do erlang:halt(integer()), the file operations may not complete, or may only partially complete. In my particular use case, I can observe a rename file operation getting stuck before the actual completion of the rename (and I am not using anything but a normal/default Linux filesystem, not NFS).
> It seems important to change the default erlang:halt/1 behaviour for escript usage so that flush is true (I understand fail-fast probably means normal Erlang VM usage shouldn't have flush default to true). An alternative is a new escript function that sets the flush option for the user (which is probably an easier solution to agree on), e.g., escript:exit/1.
>
> Thanks,
> Michael

From watson.timothy@REDACTED Fri May 24 16:28:42 2013
From: watson.timothy@REDACTED (Tim Watson)
Date: Fri, 24 May 2013 15:28:42 +0100
Subject: [erlang-bugs] Strange application shutdown deadlock
Message-ID:

We came across this at a customer's site, where one of the nodes was apparently in the process of stopping and had been in that state for at least 24 hours. The short version is that an application_master appears to be stuck waiting for a child pid (is that the X process, or the root supervisor?) which is *not* linked to it...

The application controller is in the process of stopping an application, during which process a `get_child' message appears to have come in to that application's application_master from somewhere - we are *not* running appmon, so I'm really confused how this can happen, as the only other place where I see (indirect) calls is via the sasl release_handler!? At the bottom of this email is a dump for the application_controller and the application_master for the app it is trying to shut down. I can verify that the pid which the application_master is waiting on is definitely not linked to it - i.e., process_info(links, AppMasterPid) doesn't contain the process <0.256.0> that the master appears to be waiting on.

My reading of the code is that the application_master cannot end up in get_child_i unless a get_child request was made which arrives whilst it is in its terminate loop. As I said, we're not using appmon, therefore I assume this originated in the sasl application's release_handler_1, though I'm not sure quite which route would take us there.
The relevant bit of code in application_master appears to be:

get_child_i(Child) ->
    Child ! {self(), get_child},
    receive
        {Child, GrandChild, Mod} -> {GrandChild, Mod}
    end.

This in turn originates, I'd guess, in the third receive clause of terminate_loop/2. Anyway, should that code not be dealing with a potentially dead pid for Child, either by handling links effectively - perhaps there is an EXIT signal in the mailbox already which is being ignored here in get_child_i/1 - or by some other means?

What follows below is the trace/dump output. Feel free to poke me for more info as needed.

Cheers,
Tim

[TRACE/DUMP]

pid: <6676.7.0>
registered name: application_controller
stacktrace: [{application_master,call,2,
                 [{file,"application_master.erl"},{line,75}]},
             {application_controller,stop_appl,3,
                 [{file,"application_controller.erl"},{line,1393}]},
             {application_controller,handle_call,3,
                 [{file,"application_controller.erl"},{line,810}]},
             {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]}]
-------------------------
Program counter: 0x00007f9bf9a53720 (application_master:call/2 + 288)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007f9bd7948360 Return addr 0x00007f9bfb97de40 (application_controller:stop_appl/3 + 176)
y(0)     #Ref<0.0.20562.258360>
y(1)     #Ref<0.0.20562.258361>
y(2)     []

0x00007f9bd7948380 Return addr 0x00007f9bfb973c68 (application_controller:handle_call/3 + 1392)
y(0)     temporary
y(1)     rabbitmq_web_dispatch

0x00007f9bd7948398 Return addr 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
y(0)
{state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
y(1)     rabbitmq_web_dispatch
y(2)     [{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}]
y(3)     [{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}]

0x00007f9bd79483c0 Return addr 0x00000000008827d8 ()
y(0)     application_controller
y(1)
{state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]}
y(2)     application_controller
y(3)     <0.2.0>
y(4)     {stop_application,rabbitmq_web_dispatch}
y(5)     {<0.5864.275>,#Ref<0.0.20562.258345>}
y(6)     Catch 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272)
-------------------------

pid: <6676.255.0>
registered name: none
stacktrace: [{application_master,get_child_i,1,
                 [{file,"application_master.erl"},{line,392}]},
             {application_master,handle_msg,2,
                 [{file,"application_master.erl"},{line,216}]},
             {application_master,terminate_loop,2,
                 [{file,"application_master.erl"},{line,206}]},
             {application_master,terminate,2,
                 [{file,"application_master.erl"},{line,227}]},
             {application_master,handle_msg,2,
                 [{file,"application_master.erl"},{line,219}]},
             {application_master,main_loop,2,
                 [{file,"application_master.erl"},{line,194}]},
             {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]
-------------------------
Program counter: 0x00007f9bf9a570e0 (application_master:get_child_i/1 + 120)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007f9c1adc3dc8 Return addr 0x00007f9bf9a54eb0 (application_master:handle_msg/2 + 280)
y(0) <0.256.0> 0x00007f9c1adc3dd8 Return addr 0x00007f9bf9a54d20 (application_master:terminate_loop/2 + 520) y(0) #Ref<0.0.20562.258362> y(1) <0.9596.275> y(2) {state,<0.256.0>,{appl_data,rabbitmq_web_dispatch,[],undefined,{rabbit_web_dispatch_app,[]},[rabbit_web_dispatch,rabbit_web_dispatch_app,rabbit_web_dispatch_registry,rabbit_web_dispatch_sup,rabbit_web_dispatch_util,rabbit_webmachine],[],infinity,infinity},[],0,<0.29.0>} 0x00007f9c1adc3df8 Return addr 0x00007f9bf9a55108 (application_master:terminate/2 + 192) y(0) <0.256.0> 0x00007f9c1adc3e08 Return addr 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) y(0) [] y(1) normal 0x00007f9c1adc3e20 Return addr 0x00007f9bf9a54a60 (application_master:main_loop/2 + 1600) y(0) <0.7.0> y(1) #Ref<0.0.20562.258360> y(2) Catch 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) 0x00007f9c1adc3e40 Return addr 0x00007f9bfb969420 (proc_lib:init_p_do_apply/3 + 56) y(0) <0.7.0> 0x00007f9c1adc3e50 Return addr 0x00000000008827d8 () y(0) Catch 0x00007f9bfb969440 (proc_lib:init_p_do_apply/3 + 88) ------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 235 bytes Desc: Message signed with OpenPGP using GPGMail URL: From mononcqc@REDACTED Fri May 24 16:45:47 2013 From: mononcqc@REDACTED (Fred Hebert) Date: Fri, 24 May 2013 10:45:47 -0400 Subject: [erlang-bugs] Strange application shutdown deadlock In-Reply-To: References: Message-ID: <20130524144546.GB14817@ferdair.local> Quick question: are you running a release? If so, last time I've seen deadlocks like that was solved by making sure *all* my applications did depend on stdlib and kernel in their app file. When I skipped them, sometimes I'd find that things would lock up. 
My guess was that dependencies from stdlib or kernel got unloaded before my app and broke something, but I'm not sure -- In my case, I wasn't able to inspect the node as it appeared to be 100% blocked. Adding the apps ended up fixing the problem on the next shutdown. I'm not sure if it might be a good fix for you, but it's a stab in the dark, Regards, Fred. On 05/24, Tim Watson wrote: > We came across this at a customer's site, where one of the nodes was apparently in the process of stopping and had been in that state for at least 24 hours. The short version is that an application_master appears to be stuck waiting for a child pid (is that the X process, or the root supervisor?) which is *not* linked to it... > > The application controller is in the process of stopping an application, during which process a `get_child' message appears to have come in to that application's application_master from somewhere - we are *not* running appmon, so I'm really confused how this can happen, as the only other place where I see (indirect) calls are via the sasl release_handler!? At the bottom of this email is a dump for the application_controller and the application_master for the app it is trying to shut down. I can verify that the pid which the application_master is waiting on is definitely not linked to it - i.e., process_info(links, AppMasterPid) doesn't contain the process <0.256.0> that the master appears to be waiting on. > > My reading of the code is that the application_master cannot end up in get_child_i unless a get_child request was made which arrives whilst it is in its terminate loop. As I said, we're not using appmon, therefore I assume this originated in the sasl application's release_handler_1, though I'm not sure quite which route would take us there. The relevant bit of code in application_master appears to be: > > get_child_i(Child) -> > Child ! {self(), get_child}, > receive > {Child, GrandChild, Mod} -> {GrandChild, Mod} > end. 
> > This in turn originates, I'd guess, in the third receive clause of terminate_loop/2. Anyway, should that code not be dealing with a potentially dead pid for Child, either by handling links effectively - perhaps there is an EXIT signal in the mailbox already which is being ignored here in get_child_i/1 - or by some other means? > > What follows below is the trace/dump output. Feel free to poke me for more info as needed. > > Cheers, > Tim > > [TRACE/DUMP] > > pid: <6676.7.0> > registered name: application_controller > stacktrace: [{application_master,call,2, > [{file,"application_master.erl"},{line,75}]}, > {application_controller,stop_appl,3, > [{file,"application_controller.erl"}, > {line,1393}]}, > {application_controller,handle_call,3, > [{file,"application_controller.erl"}, > {line,810}]}, > {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]}] > ------------------------- > Program counter: 0x00007f9bf9a53720 (application_master:call/2 + 288) > CP: 0x0000000000000000 (invalid) > arity = 0 > > 0x00007f9bd7948360 Return addr 0x00007f9bfb97de40 (application_controller:stop_appl/3 + 176) > y(0) #Ref<0.0.20562.258360> > y(1) #Ref<0.0.20562.258361> > y(2) [] > > 0x00007f9bd7948380 Return addr 0x00007f9bfb973c68 (application_controller:handle_call/3 + 1392) > y(0) temporary > y(1) rabbitmq_web_dispatch > > 0x00007f9bd7948398 Return addr 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272) > y(0) 
{state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]} > y(1) rabbitmq_web_dispatch > y(2) [{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}] > y(3) [{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}] > > 0x00007f9bd79483c0 Return addr 0x00000000008827d8 () > y(0) application_controller > y(1) 
{state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]} > y(2) application_controller > y(3) <0.2.0> > y(4) {stop_application,rabbitmq_web_dispatch} > y(5) {<0.5864.275>,#Ref<0.0.20562.258345>} > y(6) Catch 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272) > ------------------------- > > pid: <6676.255.0> > registered name: none > stacktrace: [{application_master,get_child_i,1, > [{file,"application_master.erl"},{line,392}]}, > {application_master,handle_msg,2, > [{file,"application_master.erl"},{line,216}]}, > {application_master,terminate_loop,2, > [{file,"application_master.erl"},{line,206}]}, > {application_master,terminate,2, > [{file,"application_master.erl"},{line,227}]}, > {application_master,handle_msg,2, > [{file,"application_master.erl"},{line,219}]}, > {application_master,main_loop,2, > [{file,"application_master.erl"},{line,194}]}, > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}] > ------------------------- > Program counter: 0x00007f9bf9a570e0 (application_master:get_child_i/1 + 120) > CP: 0x0000000000000000 (invalid) > arity = 0 > > 0x00007f9c1adc3dc8 Return addr 
0x00007f9bf9a54eb0 (application_master:handle_msg/2 + 280) > y(0) <0.256.0> > > 0x00007f9c1adc3dd8 Return addr 0x00007f9bf9a54d20 (application_master:terminate_loop/2 + 520) > y(0) #Ref<0.0.20562.258362> > y(1) <0.9596.275> > y(2) {state,<0.256.0>,{appl_data,rabbitmq_web_dispatch,[],undefined,{rabbit_web_dispatch_app,[]},[rabbit_web_dispatch,rabbit_web_dispatch_app,rabbit_web_dispatch_registry,rabbit_web_dispatch_sup,rabbit_web_dispatch_util,rabbit_webmachine],[],infinity,infinity},[],0,<0.29.0>} > > 0x00007f9c1adc3df8 Return addr 0x00007f9bf9a55108 (application_master:terminate/2 + 192) > y(0) <0.256.0> > > 0x00007f9c1adc3e08 Return addr 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) > y(0) [] > y(1) normal > > 0x00007f9c1adc3e20 Return addr 0x00007f9bf9a54a60 (application_master:main_loop/2 + 1600) > y(0) <0.7.0> > y(1) #Ref<0.0.20562.258360> > y(2) Catch 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) > > 0x00007f9c1adc3e40 Return addr 0x00007f9bfb969420 (proc_lib:init_p_do_apply/3 + 56) > y(0) <0.7.0> > > 0x00007f9c1adc3e50 Return addr 0x00000000008827d8 () > y(0) Catch 0x00007f9bfb969440 (proc_lib:init_p_do_apply/3 + 88) > ------------------------- > > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs From watson.timothy@REDACTED Fri May 24 16:55:43 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Fri, 24 May 2013 15:55:43 +0100 Subject: [erlang-bugs] Strange application shutdown deadlock In-Reply-To: <20130524144546.GB14817@ferdair.local> References: <20130524144546.GB14817@ferdair.local> Message-ID: <463AD101-7287-486D-926C-4108C35483B0@gmail.com> Hi Fred, On 24 May 2013, at 15:45, Fred Hebert wrote: > Quick question: are you running a release? > > If so, last time I've seen deadlocks like that was solved by making sure > *all* my applications did depend on stdlib and kernel in their app file. 
> When I skipped them, sometimes I'd find that things would lock up.
>

No, unfortunately RabbitMQ doesn't run as part of a release.

> My guess was that dependencies from stdlib or kernel got unloaded before
> my app and broke something, but I'm not sure -- In my case, I wasn't
> able to inspect the node as it appeared to be 100% blocked.
>

I suppose it's possible that that could happen to us, for a different set of apps. I can't see how the release handler would be involved though, since we start our nodes with start_sasl and launch applications by hand... The code we use to shut applications down explicitly calculates the dependency order itself, so perhaps there's something wrong in there. What we do is essentially this:

stop() ->
    case whereis(rabbit_boot) of
        undefined -> ok;
        _         -> await_startup()
    end,
    rabbit_log:info("Stopping RabbitMQ~n"),
    ok = app_utils:stop_applications(app_shutdown_order()).

stop_and_halt() ->
    try
        stop()
    after
        rabbit_misc:local_info_msg("Halting Erlang VM~n", []),
        init:stop()
    end,
    ok.

app_shutdown_order() ->
    Apps = ?APPS ++ rabbit_plugins:active(),
    app_utils:app_dependency_order(Apps, true).

And that app_utils shutdown order is calculated thus:

app_dependency_order(RootApps, StripUnreachable) ->
    {ok, G} = rabbit_misc:build_acyclic_graph(
                fun (App, _Deps) -> [{App, App}] end,
                fun (App, Deps) -> [{Dep, App} || Dep <- Deps] end,
                [{App, app_dependencies(App)} ||
                    {App, _Desc, _Vsn} <- application:loaded_applications()]),
    try
        case StripUnreachable of
            true  -> digraph:del_vertices(G, digraph:vertices(G) --
                                             digraph_utils:reachable(RootApps, G));
            false -> ok
        end,
        digraph_utils:topsort(G)
    after
        true = digraph:delete(G)
    end.

So even if we've shut things down in the wrong order - which I don't think we have - I still don't see where the `get_child' request comes from if the release_handler isn't involved...
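[For what it's worth, the blocking receive in the get_child_i/1 code quoted earlier in the thread can be made robust against a dead Child pid with a monitor. This is only a sketch of the idea, not the actual OTP code or fix; the {undefined, undefined} fallback value is invented for illustration:]

```erlang
%% Sketch only (not the application_master code): monitoring the child
%% before the call turns a dead pid into a 'DOWN' message instead of an
%% eternal receive.
get_child_i(Child) ->
    Ref = erlang:monitor(process, Child),
    Child ! {self(), get_child},
    receive
        {Child, GrandChild, Mod} ->
            erlang:demonitor(Ref, [flush]),
            {GrandChild, Mod};
        {'DOWN', Ref, process, Child, _Reason} ->
            {undefined, undefined}   %% illustrative fallback, not OTP's choice
    end.
```

If Child is already dead, the monitor delivers {'DOWN', Ref, process, Child, noproc} immediately, so the caller can never hang the shutdown sequence.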
Cheers,
Tim

From daniel.goertzen@REDACTED Fri May 24 17:33:05 2013
From: daniel.goertzen@REDACTED (Daniel Goertzen)
Date: Fri, 24 May 2013 10:33:05 -0500
Subject: [erlang-bugs] printing NaN causes exception
Message-ID:

I am working with a C++-generated floating point data stream that encodes certain events as NaN (not a number). When I try to print out this number, io_lib_format crashes.

Here is a [C++11] NIF that creates a NaN:

static ERL_NIF_TERM quiet_nan(ErlNifEnv* env, int, const ERL_NIF_TERM argv[])
{
    double num = std::numeric_limits<double>::quiet_NaN();
    cerr << "quiet_nan: iostream prints num as " << num << endl;
    return enif_make_double(env, num);
}

... and then when I run this interactively I get:

Erlang R15B01 (erts-5.9.1) [source] [smp:3:3] [async-threads:0]

Eshell V5.9.1 (abort with ^G)
1> channel_nif:quiet_nan().
quiet_nan: iostream prints num as nan
** exception error: no case clause matching <<127,248,0,0,0,0,0,0>>
     in function io_lib_format:mantissa_exponent/1 (io_lib_format.erl, line 374)
     in call from io_lib_format:fwrite_g/1 (io_lib_format.erl, line 365)
2>

Expected behaviour is to just print "NaN" or something similar. For my use case I can work around this problem by just using binary representations.

For reference, there's a thread about inf and nan from about 28 Feb 2012 that focuses on doing math with these numbers. (I do not want to do math, just move them around and print them without crashing.)

Cheers,
Dan.

From bob@REDACTED Fri May 24 18:33:02 2013
From: bob@REDACTED (Bob Ippolito)
Date: Fri, 24 May 2013 09:33:02 -0700
Subject: [erlang-bugs] printing NaN causes exception
In-Reply-To:
References:
Message-ID:

Erlang's floating point type doesn't allow for NaN or Inf. The only way to safely get float specials in and out is to leave them as binary, or match the bit patterns for them and convert to special atoms. It's a pretty sad state.
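[Bob's suggestion - match the special bit patterns and convert them to atoms - can be sketched in a few clauses. This is illustrative only, not library code; the function name and the atoms chosen are made up:]

```erlang
%% Illustrative sketch: decode an IEEE-754 double from an 8-byte binary,
%% mapping the special bit patterns (exponent of all ones) to atoms
%% instead of letting a float match fail.  Specials must be matched
%% before the plain 64/float clause.
decode_double(<<Sign:1, 2047:11, 0:52>>) ->           % all-ones exponent, zero mantissa
    case Sign of 0 -> pos_infinity; 1 -> neg_infinity end;
decode_double(<<_Sign:1, 2047:11, _Mantissa:52>>) ->  % nonzero mantissa: NaN
    nan;
decode_double(<<F:64/float>>) ->                      % ordinary finite double
    F.
```

For example, the binary from the crash report above, <<127,248,0,0,0,0,0,0>>, decodes to the atom nan (sign 0, exponent 16#7FF, quiet-NaN mantissa).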
On Friday, May 24, 2013, Daniel Goertzen wrote:
> [original message quoted in full; snipped]

From watson.timothy@REDACTED Fri May 24 19:41:34 2013
From: watson.timothy@REDACTED (Tim Watson)
Date: Fri, 24 May 2013 18:41:34 +0100
Subject: [erlang-bugs] Strange application shutdown deadlock
In-Reply-To: <20130524144546.GB14817@ferdair.local>
References: <20130524144546.GB14817@ferdair.local>
Message-ID:

Gah, sorry folks - this has nothing to do with release handling, that was a red herring. Someone just pointed out that the call to get_child originates in a status check in our code.
This still looks like a bug to me though, since if you're going to handle "other" messages in terminate_loop you ought to ensure they can't deadlock the vm's shutdown sequence. Cheers, Tim On 24 May 2013, at 15:45, Fred Hebert wrote: > Quick question: are you running a release? > > If so, last time I've seen deadlocks like that was solved by making sure > *all* my applications did depend on stdlib and kernel in their app file. > When I skipped them, sometimes I'd find that things would lock up. > > My guess was that dependencies from stdlib or kernel got unloaded before > my app and broke something, but I'm not sure -- In my case, I wasn't > able to inspect the node as it appeared to be 100% blocked. > > Adding the apps ended up fixing the problem on the next shutdown. I'm > not sure if it might be a good fix for you, but it's a stab in the dark, > > Regards, > Fred. > > On 05/24, Tim Watson wrote: >> We came across this at a customer's site, where one of the nodes was apparently in the process of stopping and had been in that state for at least 24 hours. The short version is that an application_master appears to be stuck waiting for a child pid (is that the X process, or the root supervisor?) which is *not* linked to it... >> >> The application controller is in the process of stopping an application, during which process a `get_child' message appears to have come in to that application's application_master from somewhere - we are *not* running appmon, so I'm really confused how this can happen, as the only other place where I see (indirect) calls are via the sasl release_handler!? At the bottom of this email is a dump for the application_controller and the application_master for the app it is trying to shut down. I can verify that the pid which the application_master is waiting on is definitely not linked to it - i.e., process_info(links, AppMasterPid) doesn't contain the process <0.256.0> that the master appears to be waiting on. 
>> [remainder of quoted message and crash dump snipped]
>
>> _______________________________________________
>> erlang-bugs mailing list
>> erlang-bugs@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-bugs
>

From n.oxyde@REDACTED Fri May 24 21:10:42 2013
From: n.oxyde@REDACTED (Anthony Ramine)
Date: Fri, 24 May 2013 21:10:42 +0200
Subject: [erlang-bugs] Fix renaming of bs_put_string instructions
Message-ID:

Hello,

If an Erlang module is compiled to BEAM assembly and the result contains a bs_put_string instruction, the output can't be compiled to binary anymore and the compiler crashes with the following error:

$ erlc prs.S
Function: compress/1
prs.S:none: internal error in beam_block;
crash reason: {{case_clause,
                   {'EXIT',
                       {function_clause,
                           [{beam_utils,live_opt,
[[{bs_put_string,1,{string,[0]}},
  {bs_init,
      {f,0},
      {bs_append,0,8,{field_flags,[]}},
      0,
      [{integer,8},{x,0}],
      {x,1}},
  {label,2}],
 2,
 {1,{1,1,nil,nil}},
 [{block,
      [{'%live',2},
       {set,[{x,0}],[{x,1}],move},
       {'%live',1}]},
  return]],
[{file,"beam_utils.erl"},{line,639}]},
{beam_utils,live_opt,1,
    [{file,"beam_utils.erl"},{line,205}]},
{beam_block,function,2,
    [{file,"beam_block.erl"},{line,38}]},
{lists,mapfoldl,3,
    [{file,"lists.erl"},{line,1329}]},
{beam_block,module,2,
    [{file,"beam_block.erl"},{line,29}]},
{compile,'-select_passes/2-anonymous-2-',2,
    [{file,"compile.erl"},{line,476}]},
{compile,'-internal_comp/4-anonymous-1-',2,
    [{file,"compile.erl"},{line,276}]},
{compile,fold_comp,3,
    [{file,"compile.erl"},{line,294}]}]}}},
[{compile,'-select_passes/2-anonymous-2-',2,
    [{file,"compile.erl"},{line,476}]},
{compile,'-internal_comp/4-anonymous-1-',2,
    [{file,"compile.erl"},{line,276}]},
{compile,fold_comp,3,[{file,"compile.erl"},{line,294}]},
{compile,internal_comp,4,[{file,"compile.erl"},{line,278}]},
{compile,'-do_compile/2-anonymous-0-',2,
    [{file,"compile.erl"},{line,152}]}]}

The clause was probably commented-out because at this point in the code, no bs_put_string instruction has been generated yet when compiling from Erlang.

This bug was reported by Loïc Hoguin.

git fetch https://github.com/nox/otp.git fix-bs_put_string-renaming

https://github.com/nox/otp/compare/erlang:maint...fix-bs_put_string-renaming
https://github.com/nox/otp/compare/erlang:maint...fix-bs_put_string-renaming.patch

Regards,

--
Anthony Ramine

From bgustavsson@REDACTED Mon May 27 09:47:28 2013
From: bgustavsson@REDACTED (Björn Gustavsson)
Date: Mon, 27 May 2013 09:47:28 +0200
Subject: [erlang-bugs] Fix renaming of bs_put_string instructions
In-Reply-To:
References:
Message-ID:

Looks good.
Will be graduated after a few days of testing in our daily builds.

On Fri, May 24, 2013 at 9:10 PM, Anthony Ramine wrote:
> [original message and crash dump snipped]
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
>

--
Björn Gustavsson, Erlang/OTP, Ericsson AB

From anders.otp@REDACTED Mon May 27 10:40:27 2013
From: anders.otp@REDACTED (Anders Svensson)
Date: Mon, 27 May 2013 10:40:27 +0200
Subject: [erlang-bugs] Memory leak in diameter_service module in diameter app (otp_R16B)
In-Reply-To: <5199F0AC.9090800@comarch.pl>
References: <5199E1CA.6020209@comarch.pl> <5199F0AC.9090800@comarch.pl>
Message-ID:

Hi Aleksander.

Yes, it is indeed a bug that was introduced in R16B. The fix was merged into maint on April 12, in this commit:

https://github.com/erlang/otp/commit/656b37f1b6fbc3611f5e0f8b8c0e4f61bef9092b

The commit for the fix itself points at the one that introduced the error:

https://github.com/erlang/otp/commit/c609108ce017069a77708f80dae9e89c45ff222d

So, fetch maint and the problem should be solved. Sorry for the slow reply: I've been on vacation.

Anders

On Mon, May 20, 2013 at 11:45 AM, Aleksander Nycz wrote:
> Hello,
>
> I think there is a memory leak in the diameter_service
> module.
>
> This module is a gen_server whose state contains the field watchdogT ::
> ets:tid().
> This ets table contains info about watchdogs.
> > Diameter app service cfg is: > > [{'Origin-Host', HostName}, > {'Origin-Realm', Realm}, > {'Vendor-Id', ...}, > {'Product-Name', ...}, > {'Auth-Application-Id', [?DCCA_APP_ID]}, > {'Supported-Vendor-Id', [...]}, > {application, [{alias, diameterNode}, > {dictionary, dictionaryDCCA}, > {module, dccaCallback}]}, > {restrict_connections, false}] > > After start dimeter app, adding service and transport, diameter_service > state is: > >> diameter_service:state(diameterNode). > #state{id = {1369,41606,329900}, > service_name = diameterNode, > service = #diameter_service{pid = <0.1011.0>, > capabilities = #diameter_caps{...}, > applications = [#diameter_app{...}]}, > watchdogT = 4194395,peerT = 4259932,shared_peers = 4325469, > local_peers = 4391006,monitor = false, > options = [{sequence,{0,32}}, > {share_peers,false}, > {use_shared_peers,false}, > {restrict_connections,false}]} > > and ets 4194395 has one record: > >> ets:tab2list(4194395). > [#watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,41606,330086}, > peer = false}] > > Next I run very simple test using seagull symulator. Test scenario is > following: > > 1. seagull: send CER > 2. seagull: recv CEA > 3. seagull: send CCR (init) > 4. seagull: recv CCA (init) > 5. seagull: send CCR (update) > 6. seagull: recv CCR (update) > 7. seagull: send CCR (terminate) > 8. seagull: recv CCA (terminate) > > Durring test there are two watchdogs in ets: > >> ets:tab2list(4194395). 
> [#watchdog{pid = <0.1816.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,41823,711370}, > peer = false}, > #watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = okay, > started = {1369,41606,330086}, > peer = <0.1014.0>}] > > After the test, but before the tw timer elapsed, there are still two watchdogs, and this > is ok: > >> ets:tab2list(4194395). > [#watchdog{pid = <0.1816.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,41823,711370}, > peer = false}, > #watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = down, > started = {1369,41606,330086}, > peer = <0.1014.0>}] > > But when the tw timer elapsed, the transport and watchdog processes are finished: > >> erlang:is_process_alive(list_to_pid("<0.1014.0>")). > false >> erlang:is_process_alive(list_to_pid("<0.1013.0>")). > false > > and two watchdogs are still in the ets table: > >> ets:tab2list(4194395).
> [#watchdog{pid = <0.1816.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,41823,711370}, > peer = false}, > #watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = down, > started = {1369,41606,330086}, > peer = <0.1014.0>}] > > I think watchdog <0.1013.0> should be removed when watchdog process is being > finished. > > I run next test and now there are 3 watchdogs in ets: > >> ets:tab2list(4194395). > [#watchdog{pid = <0.1816.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = down, > started = {1369,41823,711370}, > peer = <0.1817.0>}, > #watchdog{pid = <0.1013.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = down, > started = {1369,41606,330086}, > peer = <0.1014.0>}, > #watchdog{pid = <0.3533.0>,type = accept, > ref = #Ref<0.0.0.1696>, > options = [{transport_module,diameter_tcp}, > {transport_config,[{reuseaddr,true}, > {ip,{0,0,0,0}}, > {port,4068}]}, > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > state = initial, > started = {1369,42342,845898}, > peer = false}] > > Watchdog and transport process are not alive: > >> erlang:is_process_alive(list_to_pid("<0.1816.0>")). 
> false >> erlang:is_process_alive(list_to_pid("<0.1817.0>")). > false > > > I suggest the following code change to correct this problem (file > diameter_service.erl): > > $ diff diameter_service.erl diameter_service.erl_ok > 1006c1006 > < connection_down(#watchdog{state = WS, > --- >> connection_down(#watchdog{state = ?WD_OKAY, > 1015,1017c1015,1021 > < ?WD_OKAY == WS > < andalso > < connection_down(Wd, fetch(PeerT, TPid), S). > --- >> connection_down(Wd, fetch(PeerT, TPid), S); >> >> connection_down(#watchdog{}, >> To, >> #state{}) >> when is_atom(To) -> >> ok. > > You can find this solution in the attachment. > > Regards > Aleksander Nycz > > > -- > Aleksander Nycz > Senior Software Engineer > Telco_021 BSS R&D > Comarch SA > Phone: +48 12 646 1216 > Mobile: +48 691 464 275 > website: www.comarch.pl > From anders.otp@REDACTED Mon May 27 17:12:44 2013 From: anders.otp@REDACTED (Anders Svensson) Date: Mon, 27 May 2013 17:12:44 +0200 Subject: [erlang-bugs] Problem with tw timer support in diameter app (otp_R16B) In-Reply-To: <5199E1CA.6020209@comarch.pl> References: <5199E1CA.6020209@comarch.pl> Message-ID: Thanks for the report. The fix should be in the maint branch (destined for R16B01) by the end of the week. Anders On Mon, May 20, 2013 at 10:41 AM, Aleksander Nycz wrote: > Hello, > > I changed the default value of the restrict_connections param from 'nodes' to > 'false'. > After that I ran a very simple test using the Seagull simulator. The test scenario was as > follows: > > 1. seagull: send CER > 2. seagull: recv CEA > 3. seagull: send CCR (init) > 4. seagull: recv CCA (init) > 5. seagull: send CCR (update) > 6. seagull: recv CCA (update) > 7. seagull: send CCR (terminate) > 8. seagull: recv CCA (terminate) > > After step 8.
seagull does't send DPR, but just closes transport connection > (TCP) > > On server side every think looks good, but 30 sec. after CCR (terminate) > when tw elapsed, following error message appears in log: > > > 13:40:58.187129: <0.5046.0>: error: error_logger: --:--/--: ** Generic > server <0.5046.0> terminating > ** Last message in was {timeout,#Ref<0.0.0.14845>,tw} > ** When Server state == {watchdog,down,false,30000,0,<0.1009.0>,undefined, > #Ref<0.0.0.14845>,diameter_gen_base_rfc3588, > {recvdata,4259932,diameterNode, > [{diameter_app,diameterNode,dictionaryDCCA, > [dccaCallback], > diameterNode,4,false, > [{answer_errors,report}, > {request_errors,answer_3xxx}]}], > {0,32}}, > {0,32}, > {false,false}, > false} > ** Reason for termination == > ** {function_clause, > [{diameter_watchdog,set_watchdog, > [stop], > [{file,"base/diameter_watchdog.erl"},{line,451}]}, > {diameter_watchdog,handle_info,2, > [{file,"base/diameter_watchdog.erl"},{line,211}]}, > {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]} > > 13:40:58.187500: <0.5046.0>: error: error_logger: --:--/--: > [crash_report][[[{initial_call,{diameter_watchdog,init,['Argument__1']}}, > {pid,<0.5046.0>}, > {registered_name,[]}, > > {error_info,{exit,{function_clause,[{diameter_watchdog,set_watchdog,[stop],[{file,"base/diameter_watchdog.erl"},{line,451}]}, > > {diameter_watchdog,handle_info,2,[{file,"base/diameter_watchdog.erl"},{line,211}]}, > > {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, > > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}, > > [{gen_server,terminate,6,[{file,"gen_server.erl"},{line,737}]}, > > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}, > {ancestors,[diameter_watchdog_sup,diameter_sup,<0.946.0>]}, > {messages,[]}, > {links,[<0.954.0>]}, > {dictionary,[{random_seed,{15047,18051,14647}}, > {{diameter_watchdog,restart}, > {{accept,#Ref<0.0.0.1696>}, > 
[{transport_module,diameter_tcp}, > > {transport_config,[{reuseaddr,true},{ip,{0,0,0,0}},{port,4068}]}, > > {capabilities_cb,[#Fun]}, > {watchdog_timer,30000}, > {reconnect_timer,60000}], > {diameter_service,<0.1009.0>, > > {diameter_caps,"zyndram.krakow.comarch","krakow.comarch",[],25429,"Comarch > DIAMETER Server",[], > > [12645,10415,8164], > [4], > > [],[],[],[],[]}, > > [{diameter_app,diameterNode,dictionaryDCCA, > > [dccaCallback], > > diameterNode,4,false, > > [{answer_errors,report},{request_errors,answer_3xxx}]}]}}}, > {{diameter_watchdog,dwr}, > > ['DWR',{'Origin-Host',"zyndram.krakow.comarch"},{'Origin-Realm',"krakow.comarch"},{'Origin-State-Id',[]}]}]}, > {trap_exit,false}, > {status,running}, > {heap_size,75025}, > {stack_size,24}, > {reductions,294}], > []]] > 13:40:58.189060: <0.954.0>: error: error_logger: --:--/--: > [supervisor_report][[{supervisor,{local,diameter_watchdog_sup}}, > {errorContext,child_terminated}, > > {reason,{function_clause,[{diameter_watchdog,set_watchdog,[stop],[{file,"base/diameter_watchdog.erl"},{line,451}]}, > > {diameter_watchdog,handle_info,2,[{file,"base/diameter_watchdog.erl"},{line,211}]}, > > {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]}, > > {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}, > {offender,[{pid,<0.5046.0>}, > {name,diameter_watchdog}, > > {mfargs,{diameter_watchdog,start_link,undefined}}, > {restart_type,temporary}, > {shutdown,1000}, > {child_type,worker}]}]] > > You can check, that function set_watchdog should be called with param > #watchdog{}, but 'stop' param is used instead. > As a result function_clause exception is thrown. 
> > I suggest the following code change to correct this problem (file > diameter_watchdog.erl): > > $ diff diameter_watchdog.erl_org diameter_watchdog.erl > 385a386,393 >> transition({timeout, TRef, tw}, #watchdog{tref = TRef, status = T} = S) >> when T == initial; >> T == down -> >> case restart(S) of >> stop -> stop; >> #watchdog{} = NewS -> set_watchdog(NewS) >> end; >> > > You can find this solution in the attachment. > > Best regards > Aleksander Nycz > > -- > Aleksander Nycz > Senior Software Engineer > Telco_021 BSS R&D > Comarch SA > Phone: +48 12 646 1216 > Mobile: +48 691 464 275 > website: www.comarch.pl > From andrew.pennebaker@REDACTED Wed May 29 03:30:21 2013 From: andrew.pennebaker@REDACTED (Andrew Pennebaker) Date: Tue, 28 May 2013 21:30:21 -0400 Subject: [erlang-bugs] erl -s crashes Message-ID: When I try to `erl -s` any module, Erlang crashes. Dump attached. Trace: $ cat hello.erl %% 22 Feb 2011 -module(hello). -author("andrew.pennebaker@REDACTED"). -export([main/1]). main(_) -> io:format("Hello World!~n", []). $ erlc hello.erl $ erl -s hello Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:2:2] [async-threads:0] [hipe] [kernel-poll:false] [dtrace] {"init terminating in do_boot",{undef,[{hello,start,[],[]},{init,start_it,1,[]},{init,start_em,1,[]}]}} Crash dump was written to: erl_crash.dump init terminating in do_boot () System: $ specs erlang os Specs: specs 0.4 https://github.com/mcandre/specs#readme rebar -V rebar 2.1.0-pre R15B03 20130528_213220 git 2.1.0-pre-46-g78fa8fc erl -eval 'erlang:display(erlang:system_info(otp_release)), halt().' -noshell "R15B03" system_profiler SPSoftwareDataType | grep 'System Version' System Version: OS X 10.8.3 (12D78) -- Cheers, Andrew Pennebaker www.yellosoft.us
-------------- next part -------------- A non-text attachment was scrubbed... Name: erl_crash.dump Type: application/octet-stream Size: 341771 bytes Desc: not available From vladdu55@REDACTED Wed May 29 10:51:35 2013 From: vladdu55@REDACTED (Vlad Dumitrescu) Date: Wed, 29 May 2013 10:51:35 +0200 Subject: [erlang-bugs] erl -s crashes In-Reply-To: References: Message-ID: Hi! This part of the error message On Wed, May 29, 2013 at 3:30 AM, Andrew Pennebaker < andrew.pennebaker@REDACTED> wrote: > ,{undef,[{hello,start,[],[]} > tells you that the system tried to execute hello:start(), which is the behaviour when only the module name is specified after -s. From the docs: -s Mod [Func [Arg1, Arg2, ...]] (init flag) Makes init call the specified function. Func defaults to start. If no arguments are provided, the function is assumed to be of arity 0. Otherwise it is assumed to be of arity 1, taking the list [Arg1,Arg2,...] as argument. All arguments are passed as atoms. See init(3). regards, Vlad From a.zhuravlev@REDACTED Wed May 29 10:53:29 2013 From: a.zhuravlev@REDACTED (Alexander Zhuravlev) Date: Wed, 29 May 2013 12:53:29 +0400 Subject: [erlang-bugs] erl -s crashes In-Reply-To: References: Message-ID: <20130529085329.GA8854@zmac.js-kit.local> On Tue, May 28, 2013 at 09:30:21PM -0400, Andrew Pennebaker wrote: > When I try to `erl -s` any module, Erlang crashes. Dump attached. > > Trace: > > $ cat hello.erl > %% 22 Feb 2011 > > -module(hello). > -author("andrew.pennebaker@REDACTED"). > -export([main/1]). > > main(_) -> io:format("Hello World!~n", []).
> > $ erlc hello.erl > > $ erl -s hello > Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:2:2] > [async-threads:0] [hipe] [kernel-poll:false] [dtrace] > > {"init terminating in > do_boot",{undef,[{hello,start,[],[]},{init,start_it,1,[]},{init,start_em,1,[]}]}} > > Crash dump was written to: erl_crash.dump > init terminating in do_boot () You need to check description of the erl's "-s" flag. It accepts a module and an optional function name (by default set to "start" if not specified). zmac:~> cat hello.erl -module(hello). -export([main/0]). main() -> io:format("Hello World!~n", []). zmac:~> erl -s hello main -s erlang halt Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:2:2] [async-threads:10] [hipe] [kernel-poll:false] [dtrace] Hello World! > > System: > > $ specs erlang os > Specs: > > specs 0.4 > https://github.com/mcandre/specs#readme > > rebar -V > rebar 2.1.0-pre R15B03 20130528_213220 git 2.1.0-pre-46-g78fa8fc > > erl -eval 'erlang:display(erlang:system_info(otp_release)), halt().' > -noshell > "R15B03" > > system_profiler SPSoftwareDataType | grep 'System Version' > System Version: OS X 10.8.3 (12D78) > > -- > Cheers, > > Andrew Pennebaker > www.yellosoft.us > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs -- Alexander Zhuravlev From watson.timothy@REDACTED Wed May 29 11:12:24 2013 From: watson.timothy@REDACTED (Tim Watson) Date: Wed, 29 May 2013 10:12:24 +0100 Subject: [erlang-bugs] Strange application shutdown deadlock In-Reply-To: References: <20130524144546.GB14817@ferdair.local> Message-ID: <7E9616D5-BD4E-4F87-9EDA-A3AF2262FE05@gmail.com> Any word from the OTP folks on this one? On 24 May 2013, at 18:41, Tim Watson wrote: > Gah, sorry folks - this has nothing to do with release handling, that was a red herring. Someone just pointed out that the call to get_child originates in a status check in our code. 
> > This still looks like a bug to me though, since if you're going to handle "other" messages in terminate_loop you ought to ensure they can't deadlock the vm's shutdown sequence. > > Cheers, > Tim > > On 24 May 2013, at 15:45, Fred Hebert wrote: > >> Quick question: are you running a release? >> >> If so, last time I've seen deadlocks like that was solved by making sure >> *all* my applications did depend on stdlib and kernel in their app file. >> When I skipped them, sometimes I'd find that things would lock up. >> >> My guess was that dependencies from stdlib or kernel got unloaded before >> my app and broke something, but I'm not sure -- In my case, I wasn't >> able to inspect the node as it appeared to be 100% blocked. >> >> Adding the apps ended up fixing the problem on the next shutdown. I'm >> not sure if it might be a good fix for you, but it's a stab in the dark, >> >> Regards, >> Fred. >> >> On 05/24, Tim Watson wrote: >>> We came across this at a customer's site, where one of the nodes was apparently in the process of stopping and had been in that state for at least 24 hours. The short version is that an application_master appears to be stuck waiting for a child pid (is that the X process, or the root supervisor?) which is *not* linked to it... >>> >>> The application controller is in the process of stopping an application, during which process a `get_child' message appears to have come in to that application's application_master from somewhere - we are *not* running appmon, so I'm really confused how this can happen, as the only other place where I see (indirect) calls are via the sasl release_handler!? At the bottom of this email is a dump for the application_controller and the application_master for the app it is trying to shut down. 
I can verify that the pid which the application_master is waiting on is definitely not linked to it - i.e., process_info(links, AppMasterPid) doesn't contain the process <0.256.0> that the master appears to be waiting on. >>> >>> My reading of the code is that the application_master cannot end up in get_child_i unless a get_child request was made which arrives whilst it is in its terminate loop. As I said, we're not using appmon, therefore I assume this originated in the sasl application's release_handler_1, though I'm not sure quite which route would take us there. The relevant bit of code in application_master appears to be: >>> >>> get_child_i(Child) -> >>> Child ! {self(), get_child}, >>> receive >>> {Child, GrandChild, Mod} -> {GrandChild, Mod} >>> end. >>> >>> This in turn originates, I'd guess, in the third receive clause of terminate_loop/2. Anyway, should that code not be dealing with a potentially dead pid for Child, either by handling links effectively - perhaps there is an EXIT signal in the mailbox already which is being ignored here in get_child_i/1 - or by some other means? >>> >>> What follows below is the trace/dump output. Feel free to poke me for more info as needed. 
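[Editorial note: the defensive handling of a dead Child that Tim asks about above could be sketched with a monitor-guarded receive. This is an illustrative sketch only, not the actual OTP code; the {undefined, undefined} fallback is a hypothetical choice for what to return when the child has already exited.]

```erlang
%% Sketch: a get_child_i/1 that tolerates Child dying before it replies.
%% Without the monitor, the bare receive below can block forever.
get_child_i(Child) ->
    Ref = erlang:monitor(process, Child),
    Child ! {self(), get_child},
    receive
        {Child, GrandChild, Mod} ->
            %% Normal reply: drop the monitor (flush any racing 'DOWN').
            erlang:demonitor(Ref, [flush]),
            {GrandChild, Mod};
        {'DOWN', Ref, process, Child, _Reason} ->
            %% Child exited without replying; return a placeholder
            %% instead of deadlocking the shutdown sequence.
            {undefined, undefined}
    end.
```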
>>> >>> Cheers, >>> Tim >>> >>> [TRACE/DUMP] >>> >>> pid: <6676.7.0> >>> registered name: application_controller >>> stacktrace: [{application_master,call,2, >>> [{file,"application_master.erl"},{line,75}]}, >>> {application_controller,stop_appl,3, >>> [{file,"application_controller.erl"}, >>> {line,1393}]}, >>> {application_controller,handle_call,3, >>> [{file,"application_controller.erl"}, >>> {line,810}]}, >>> {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]}] >>> ------------------------- >>> Program counter: 0x00007f9bf9a53720 (application_master:call/2 + 288) >>> CP: 0x0000000000000000 (invalid) >>> arity = 0 >>> >>> 0x00007f9bd7948360 Return addr 0x00007f9bfb97de40 (application_controller:stop_appl/3 + 176) >>> y(0) #Ref<0.0.20562.258360> >>> y(1) #Ref<0.0.20562.258361> >>> y(2) [] >>> >>> 0x00007f9bd7948380 Return addr 0x00007f9bfb973c68 (application_controller:handle_call/3 + 1392) >>> y(0) temporary >>> y(1) rabbitmq_web_dispatch >>> >>> 0x00007f9bd7948398 Return addr 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272) >>> y(0) {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 
bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]} >>> y(1) rabbitmq_web_dispatch >>> y(2) [{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}] >>> y(3) [{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}] >>> >>> 0x00007f9bd79483c0 Return addr 0x00000000008827d8 () >>> y(0) application_controller >>> y(1) {state,[],[],[],[{ssl,<0.507.0>},{public_key,undefined},{crypto,<0.501.0>},{rabbitmq_web_dispatch,<0.255.0>},{webmachine,<0.250.0>},{mochiweb,undefined},{xmerl,undefined},{inets,<0.237.0>},{amqp_client,<0.233.0>},{mnesia,<0.60.0>},{sasl,<0.34.0>},{stdlib,undefined},{kernel,<0.9.0>}],[],[{ssl,temporary},{public_key,temporary},{crypto,temporary},{rabbitmq_web_dispatch,temporary},{webmachine,temporary},{mochiweb,temporary},{xmerl,temporary},{inets,temporary},{amqp_client,temporary},{mnesia,temporary},{sasl,permanent},{stdlib,permanent},{kernel,permanent}],[],[{rabbit,[{ssl_listeners,[5671]},{ssl_options,[{cacertfile,"/etc/rabbitmq/server.cacrt"},{certfile,"/etc/rabbitmq/server.crt"},{keyfile,"/etc/rabbitmq/server.key"},{verify,verify_none},{fail_if_no_peer_cert,false}]},{default_user,<<2 bytes>>},{default_pass,<<8 bytes>>},{vm_memory_high_watermark,5.000000e-01}]},{rabbitmq_management,[{listener,[{port,15672},{ssl,true}]}]}]} >>> y(2) application_controller >>> y(3) <0.2.0> >>> y(4) {stop_application,rabbitmq_web_dispatch} >>> y(5) {<0.5864.275>,#Ref<0.0.20562.258345>} >>> y(6) Catch 0x00007f9bf9a600c8 (gen_server:handle_msg/5 + 272) >>> ------------------------- >>> >>> pid: 
<6676.255.0> >>> registered name: none >>> stacktrace: [{application_master,get_child_i,1, >>> [{file,"application_master.erl"},{line,392}]}, >>> {application_master,handle_msg,2, >>> [{file,"application_master.erl"},{line,216}]}, >>> {application_master,terminate_loop,2, >>> [{file,"application_master.erl"},{line,206}]}, >>> {application_master,terminate,2, >>> [{file,"application_master.erl"},{line,227}]}, >>> {application_master,handle_msg,2, >>> [{file,"application_master.erl"},{line,219}]}, >>> {application_master,main_loop,2, >>> [{file,"application_master.erl"},{line,194}]}, >>> {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}] >>> ------------------------- >>> Program counter: 0x00007f9bf9a570e0 (application_master:get_child_i/1 + 120) >>> CP: 0x0000000000000000 (invalid) >>> arity = 0 >>> >>> 0x00007f9c1adc3dc8 Return addr 0x00007f9bf9a54eb0 (application_master:handle_msg/2 + 280) >>> y(0) <0.256.0> >>> >>> 0x00007f9c1adc3dd8 Return addr 0x00007f9bf9a54d20 (application_master:terminate_loop/2 + 520) >>> y(0) #Ref<0.0.20562.258362> >>> y(1) <0.9596.275> >>> y(2) {state,<0.256.0>,{appl_data,rabbitmq_web_dispatch,[],undefined,{rabbit_web_dispatch_app,[]},[rabbit_web_dispatch,rabbit_web_dispatch_app,rabbit_web_dispatch_registry,rabbit_web_dispatch_sup,rabbit_web_dispatch_util,rabbit_webmachine],[],infinity,infinity},[],0,<0.29.0>} >>> >>> 0x00007f9c1adc3df8 Return addr 0x00007f9bf9a55108 (application_master:terminate/2 + 192) >>> y(0) <0.256.0> >>> >>> 0x00007f9c1adc3e08 Return addr 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) >>> y(0) [] >>> y(1) normal >>> >>> 0x00007f9c1adc3e20 Return addr 0x00007f9bf9a54a60 (application_master:main_loop/2 + 1600) >>> y(0) <0.7.0> >>> y(1) #Ref<0.0.20562.258360> >>> y(2) Catch 0x00007f9bf9a54f70 (application_master:handle_msg/2 + 472) >>> >>> 0x00007f9c1adc3e40 Return addr 0x00007f9bfb969420 (proc_lib:init_p_do_apply/3 + 56) >>> y(0) <0.7.0> >>> >>> 0x00007f9c1adc3e50 Return addr 
0x00000000008827d8 () >>> y(0) Catch 0x00007f9bfb969440 (proc_lib:init_p_do_apply/3 + 88) >>> ------------------------- >>> >> >> >> >>> _______________________________________________ >>> erlang-bugs mailing list >>> erlang-bugs@REDACTED >>> http://erlang.org/mailman/listinfo/erlang-bugs >> From Anders.Ramsell@REDACTED Wed May 29 20:02:23 2013 From: Anders.Ramsell@REDACTED (Anders.Ramsell@REDACTED) Date: Wed, 29 May 2013 18:02:23 +0000 Subject: [erlang-bugs] Compiler/linter bug breaking unused variable warnings Message-ID: <82DC27D088947C4D943175FDA0DA60F411526021@EXMB13TSTRZ2.tcad.telia.se> When a function creates a record and more than one field is bound to the value of a list comprehension the compiler/linter fails to generate warnings for unused variables in that function. I just tested this on R16B and the problem is still there. I use the following module to test this: --8<---------------------------------------------------------- -module(missing_warning). -export([test_missing_warning/2, test_with_warning1/2, test_with_warning2/2 ]). -record(data, {aList, bList}). test_missing_warning(Data, KeyList) -> %% Data never used - no warning. KeyList2 = filter(KeyList), %% KeyList2 never used - no warning. #data{aList = [Key || Key <- KeyList], bList = [Key || Key <- KeyList]}. test_with_warning1(Data, KeyList) -> %% Data never used - get warning. KeyList2 = filter(KeyList), %% KeyList2 never used - get warning. #data{aList = [Key || Key <- KeyList]}. %% Only one LC in the record. test_with_warning2(Data, KeyList) -> %% Data never used - get warning. KeyList2 = filter(KeyList), %% KeyList2 never used - get warning. {data, [Key || Key <- KeyList], %% Not in a record. [Key || Key <- KeyList]}. filter(L) -> L. --8<---------------------------------------------------------- In all three test functions the variables Data (in the function head) and KeyList2 (in the function body) are unused. Compiling the module should produce six warnings but I only get four. 
You get the same result with other "advanced" calls like lists:map(fun(Key) -> Key end, KeyList) so it's not limited to list comprehensions. If the fields are bound to e.g. the variable KeyList directly the warnings work just fine. /Anders From andrew@REDACTED Wed May 29 21:46:24 2013 From: andrew@REDACTED (Andrew Thompson) Date: Wed, 29 May 2013 15:46:24 -0400 Subject: [erlang-bugs] Eunit, test generators and code:purge() Message-ID: <20130529194624.GE31341@hijacked.us> So, I've been chasing a failure in a test suite for the last couple days. Turns out, the problem is the test suite does this: * Test module A, with a test generator function * Test module B, and meck module A Eunit's runner is holding a reference to something in module A (probably a fun), so when meck does a purge on A as part of test B, the code server kills the eunit test runner process. This bug was actually reported three years ago: http://erlang.org/pipermail/erlang-bugs/2010-June/001844.html But it still affects at least R15B03, which is what I'm using. I have a slightly modified version of b_mod that proves that eunit is holding a ref to something from a_mod: -module(b_mod). -include_lib("eunit/include/eunit.hrl"). second_test() -> ?debugFmt("I am ~p ~p~n", [self(), erlang:process_info(self())]), true = code:delete(a_mod), ?debugFmt("processes using a_mod: ~p~n", [[P || P <- processes(), erlang:check_process_code(P, a_mod)]]), true = code:soft_purge(a_mod), ok. I looked into trying to patch this, but the eunit code is too convoluted for me to understand where it is holding the problematic reference. Andrew From joearms@REDACTED Wed May 29 15:02:26 2013 From: joearms@REDACTED (Joe Armstrong) Date: Wed, 29 May 2013 15:02:26 +0200 Subject: [erlang-bugs] ioi:columns bug Message-ID: io:columns() does not work in a process spawned from the command line -module(bug). -compile(export_all). test() -> io:format("~p~n", [io:columns()]). When I run this in a shell it works 2> c(bug). 
{ok,bug} 3> bug:test(). {ok,132} ok But not from the command line > erl -s bug test Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) 1> {error,enotsup} Something is strange - the group leader of a process launched from the command line is different to the shell group leader. But I can do io:format from a command launched from the command line --- what's happening here? /Joe From ml.jc.campagne@REDACTED Thu May 30 10:17:47 2013 From: ml.jc.campagne@REDACTED (Jean-Charles Campagne) Date: Thu, 30 May 2013 10:17:47 +0200 Subject: [erlang-bugs] ioi:columns bug In-Reply-To: References: Message-ID: <96CFB37C-B16F-40DF-B83A-3E65DDAA6990@gmail.com> Hi Joe, Not sure what is going on here either; I stumbled upon the same issue. I did not have the opportunity to get to the bottom of it though. However, using "-noshell" does not generate an error. $ erl -s bug test -noshell {ok,143} Then again that might be incompatible with what you are trying to achieve. Hope that sheds some light. It worked for me as I did not need to have a shell in the end. Also, I noticed that specifying 'standard_error' as the IoDevice works (but not 'standard_io'), as in: ====================================================== -module(bug_err). -compile(export_all). test_err() -> io:format("~p~n", [io:columns(standard_error)]). ====================================================== $ erl -s bug_err test_err Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:2:2] [async-threads:0] [hipe] [kernel-poll:false] {ok,143} Eshell V5.9.3.1 (abort with ^G) 1> ====================================================== My guess is standard_io somehow is not opened/accessible. My 2 cents. Regards, Jc On 29 May 2013, at 15:02, Joe Armstrong wrote: > io:columns() does not work in a process spawned from the command line > > -module(bug). > -compile(export_all). > > test() -> > io:format("~p~n", [io:columns()]).
> > When I run this in a shell it works > > 2> c(bug). > {ok,bug} > 3> bug:test(). > {ok,132} > ok > > But not from the command line > >> erl -s bug test > Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] > [async-threads:10] [hipe] [kernel-poll:false] > > Eshell V5.10.1 (abort with ^G) > 1> {error,enotsup} > > Something is strange - the group leader of a process launched from the > command line is different to the shell group leader. But I can do io:format > from a command launched from the command line --- what's happening here? > > /Joe > _______________________________________________ > erlang-bugs mailing list > erlang-bugs@REDACTED > http://erlang.org/mailman/listinfo/erlang-bugs From magnus@REDACTED Thu May 30 12:30:28 2013 From: magnus@REDACTED (Magnus Henoch) Date: Thu, 30 May 2013 11:30:28 +0100 Subject: [erlang-bugs] Eunit, test generators and code:purge() In-Reply-To: <20130529194624.GE31341@hijacked.us> (Andrew Thompson's message of "Wed, 29 May 2013 15:46:24 -0400") References: <20130529194624.GE31341@hijacked.us> Message-ID: Andrew Thompson writes: > So, I've been chasing a failure in a test suite for the last couple > days. Turns out, the problem is the test suite does this: > > * Test module A, with a test generator function > * Test module B, and meck module A > > Eunit's runner is holding a reference to something in module A (probably > a fun), so when meck does a purge on A as part of test B, the code > server kills the eunit test runner process. This bug was actually > reported three years ago: > > http://erlang.org/pipermail/erlang-bugs/2010-June/001844.html > > But it still affects at least R15B03, which is what I'm using. > > I have a slightly modified version of b_mod that proves that eunit is > holding a ref to something from a_mod: > > -module(b_mod). > > -include_lib("eunit/include/eunit.hrl"). 
> > second_test() -> > ?debugFmt("I am ~p ~p~n", [self(), erlang:process_info(self())]), > true = code:delete(a_mod), > ?debugFmt("processes using a_mod: ~p~n", [[P || P <- processes(), erlang:check_process_code(P, a_mod)]]), > true = code:soft_purge(a_mod), > ok. > > > I looked into trying to patch this, but the eunit code is too convoluted > for me to understand where it is holding the problematic reference. I've had the same problem, and somehow discovered that it works if the test generator function in A has a title. That is, instead of: my_test_() -> ?_test(do_something()). write: my_test_() -> {"do something", ?_test(do_something())}. That led me to think that Eunit holds on to the fun object as a "name" if the test has no explicit title. Regards, Magnus From andrew@REDACTED Thu May 30 15:14:28 2013 From: andrew@REDACTED (Andrew Thompson) Date: Thu, 30 May 2013 09:14:28 -0400 Subject: [erlang-bugs] Eunit, test generators and code:purge() In-Reply-To: References: <20130529194624.GE31341@hijacked.us> Message-ID: <20130530131428.GF31341@hijacked.us> On Thu, May 30, 2013 at 11:30:28AM +0100, Magnus Henoch wrote: > I've had the same problem, and somehow discovered that it works if the > test generator function in A has a title. Interesting idea. Unfortunately, the test I have is using a setup fixture (with named tests), so your workaround doesn't seem to apply here. 
Andrew

From magnus@REDACTED Thu May 30 15:48:45 2013
From: magnus@REDACTED (Magnus Henoch)
Date: Thu, 30 May 2013 14:48:45 +0100
Subject: [erlang-bugs] Eunit, test generators and code:purge()
In-Reply-To: <20130530131428.GF31341@hijacked.us> (Andrew Thompson's message of "Thu, 30 May 2013 09:14:28 -0400")
References: <20130529194624.GE31341@hijacked.us> <20130530131428.GF31341@hijacked.us>
Message-ID:

Andrew Thompson writes:

> On Thu, May 30, 2013 at 11:30:28AM +0100, Magnus Henoch wrote:
>> I've had the same problem, and somehow discovered that it works if the
>> test generator function in A has a title.
>
> Interesting idea. Unfortunately, the test I have is using a setup
> fixture (with named tests), so your workaround doesn't seem to apply
> here.

I found that the same workaround worked for setup fixtures:

    my_test_() ->
        {"this title saves the test", setup, Setup, Cleanup, Tests}.

/m

From andrew@REDACTED Thu May 30 16:21:36 2013
From: andrew@REDACTED (Andrew Thompson)
Date: Thu, 30 May 2013 10:21:36 -0400
Subject: [erlang-bugs] Eunit, test generators and code:purge()
In-Reply-To:
References: <20130529194624.GE31341@hijacked.us> <20130530131428.GF31341@hijacked.us>
Message-ID: <20130530142136.GG31341@hijacked.us>

On Thu, May 30, 2013 at 02:48:45PM +0100, Magnus Henoch wrote:
> I found that the same workaround worked for setup fixtures:
>
>     my_test_() ->
>         {"this title saves the test", setup, Setup, Cleanup, Tests}.
>

Thank you! I had no idea that was even a valid fixture, but it worked!
Andrew

From n.oxyde@REDACTED Fri May 31 00:46:57 2013
From: n.oxyde@REDACTED (Anthony Ramine)
Date: Fri, 31 May 2013 00:46:57 +0200
Subject: [erlang-bugs] Compiler/linter bug breaking unused variable warnings
In-Reply-To: <82DC27D088947C4D943175FDA0DA60F411526021@EXMB13TSTRZ2.tcad.telia.se>
References: <82DC27D088947C4D943175FDA0DA60F411526021@EXMB13TSTRZ2.tcad.telia.se>
Message-ID: <5312CBA2-4C31-46FF-9E8A-74589DB5349D@gmail.com>

Hello,

Smaller test case reproducing the bug, without KeyList2 nor filter/1:

-8<--
-module(missing_warning).

-export([test_missing_warning/2,
         test_with_warning1/2,
         test_with_warning2/2
        ]).

-record(data, {aList, bList}).

test_missing_warning(Data, KeyList) -> %% Data, KeyList never used - no warning.
    #data{aList = [Key || Key <- []],
          bList = [Key || Key <- []]}.

test_with_warning1(Data, KeyList) -> %% Data, KeyList never used - get warning.
    #data{aList = [Key || Key <- []]}. %% Only one LC in the record.

test_with_warning2(Data, KeyList) -> %% Data, KeyList never used - get warning.
    {data,
     [Key || Key <- []], %% Not in a record.
     [Key || Key <- []]}.
-->8-

Regards,

--
Anthony Ramine

On 29 May 2013, at 20:02, wrote:

> When a function creates a record and more than one field is bound to
> the value of a list comprehension the compiler/linter fails to
> generate warnings for unused variables in that function. I just
> tested this on R16B and the problem is still there.
>
> I use the following module to test this:
>
> --8<----------------------------------------------------------
> -module(missing_warning).
>
> -export([test_missing_warning/2,
>          test_with_warning1/2,
>          test_with_warning2/2
>         ]).
>
> -record(data, {aList, bList}).
>
> test_missing_warning(Data, KeyList) -> %% Data never used - no warning.
>     KeyList2 = filter(KeyList), %% KeyList2 never used - no warning.
>     #data{aList = [Key || Key <- KeyList],
>           bList = [Key || Key <- KeyList]}.
>
> test_with_warning1(Data, KeyList) -> %% Data never used - get warning.
>     KeyList2 = filter(KeyList), %% KeyList2 never used - get warning.
>     #data{aList = [Key || Key <- KeyList]}. %% Only one LC in the record.
>
> test_with_warning2(Data, KeyList) -> %% Data never used - get warning.
>     KeyList2 = filter(KeyList), %% KeyList2 never used - get warning.
>     {data,
>      [Key || Key <- KeyList], %% Not in a record.
>      [Key || Key <- KeyList]}.
>
> filter(L) -> L.
> --8<----------------------------------------------------------
>
> In all three test functions the variables Data (in the function head) and
> KeyList2 (in the function body) are unused.
> Compiling the module should produce six warnings but I only get four.
> You get the same result with other "advanced" calls like
>     lists:map(fun(Key) -> Key end, KeyList)
> so it's not limited to list comprehensions.
> If the fields are bound to e.g. the variable KeyList directly the warnings
> work just fine.
>
> /Anders
>
> _______________________________________________
> erlang-bugs mailing list
> erlang-bugs@REDACTED
> http://erlang.org/mailman/listinfo/erlang-bugs
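[Editorial note, not part of the thread.] The six-versus-four warning count in Anders's report can be checked mechanically from the shell. The sketch below is a hypothetical helper (the module name `warn_check` and function `count_unused/1` are inventions for illustration): it compiles a source file in memory with `compile:file/2` and the `return_warnings` option, then counts the `unused_var` warnings that erl_lint produced. Run over the `missing_warning` module above, it should return 6 on a fixed compiler and only 4 on releases exhibiting the bug.

```erlang
%% Hypothetical helper (not from the thread): compile a file and count the
%% unused-variable warnings reported by erl_lint.
-module(warn_check).
-export([count_unused/1]).

count_unused(File) ->
    %% binary: compile in memory, don't write a .beam file;
    %% return_warnings: get {ok, Mod, Bin, Warnings} instead of printing them.
    {ok, _Mod, _Bin, Warnings} = compile:file(File, [binary, return_warnings]),
    %% Warnings is [{SourceFile, [{Location, Module, Descriptor}]}]; an unused
    %% variable shows up as an {unused_var, VarName} descriptor from erl_lint.
    length([V || {_Src, Ws} <- Warnings,
                 {_Loc, erl_lint, {unused_var, V}} <- Ws]).
```

For a quick sanity check on any release, a module with exactly one unused binding should make `count_unused/1` return 1.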