heart does not restart node launched with run_erl

Serge serge@REDACTED
Sun Jan 29 17:03:08 CET 2006


We resolved this issue by handling SIGCHLD in run_erl.  When 
HEART_COMMAND is a run_erl invocation that starts erl with the -heart 
option ('run_erl ... "erl ... -heart"'), the following is observed:

1. run_erl starts erl
2. erl starts heart
3. heart monitors erl

If erl gets killed or exits, then

1. heart restarts HEART_COMMAND
2. new run_erl detects an active UDS (owned by old run_erl) and exits
3. heart gets terminated (since it restarted the HEART_COMMAND)
4. old run_erl gets terminated as well (I don't recall right now what 
triggers its termination)

In the end no Erlang node is left running.  Attached is a patch to 
run_erl that addresses this issue by forcing run_erl to exit upon 
detecting the death of the node started by HEART_COMMAND.  Note that 
this patch also includes the patch provided by Ernie Makris / Jouni Rynö 
(news://news.gmane.org:119/025601c5cf6c$459cd1d0$4601a8c0@hercules) for 
RedHat ES 4.0 and Fedora.

I hope it can be included in the next release.

Regards,

Serge

erlang-questions@REDACTED wrote:
> Hi all,
>   Ran into a weird problem.  I have an embedded application that is started with run_erl from a .sh script.  I also use heart to restart the application. HEART_COMMAND is set to launch the same start.sh script that was used to start the application initially.  At the start, the process tree looks as follows:
> 
>  3196 ?        S      0:00 /home/drpdev/erts-5.4.10/bin/run_erl -daemon /home/drpdev/var/tmp/drp /home/drpdev/var/log/drp -exec /home/drpdev/bin/start_erl
>  3202 pts/2    Ssl+   0:02  _ /home/drpdev/erts-5.4.10/bin/beam -- -root /home/drpdev -progname drip -- -home /home/drpdev -boot /home/drpdev/releases/1.
>  3222 ?        Ss     0:00      _ heart -pid 3202
>  3227 ?        Ss     0:00      _ inet_gethost 4
>  3228 ?        S      0:00      |   _ inet_gethost 4
>  3229 ?        Ss     0:00      _ sh -s disksup
> 
> To test the restart, I kill pid 3202 and see the following:
> 
>  3222 ?        Ss     0:00 heart -pid 3202
>  3196 ?        S      0:00 /home/drpdev/erts-5.4.10/bin/run_erl -daemon /home/drpdev/var/tmp/drp /home/drpdev/var/log/drp -exec /home/drpdev/bin/start_erl
>  3202 ?        Zs     0:02  _ [beam] <defunct>
> 
> 
> Next, heart launches the script:
> 
>  3253 ?        S      0:00    /bin/bash /home/drpdev/bin/drip.sh start
>  3272 ?        S      0:00        _ sleep 3
>  3196 ?        S      0:00 /home/drpdev/erts-5.4.10/bin/run_erl -daemon /home/drpdev/var/tmp/drp /home/drpdev/var/log/drp -exec /home/drpdev/bin/start_erl
>  3202 ?        Zs     0:02  _ [beam] <defunct>
> 
> The sleep 3 is right before it calls the run_erl command to start the embedded application. Note that the old run_erl (pid 3196) is still hanging around although the node itself (pid 3202) is defunct.
> 
> When drip.sh calls run_erl, the old run_erl (pid 3196) goes away, but no new run_erl process appears.  The application is not started either. erlang.log.1 does not show [...]. I see the following in run_erl.log:
> 
> -------
> Pty master read; run_erl [3196] Wed Jan  4 15:59:37 2006
> Pty master read; run_erl [3196] Wed Jan  4 16:00:46 2006
> Pty master read; run_erl [3196] Wed Jan  4 16:00:51 2006
> Pty master read; run_erl [3279] Wed Jan  4 16:00:54 2006
> /home/drpdev/erts-5.4.10/bin/run_erl: pid is : 3279
> run_erl [3196] Wed Jan  4 16:00:54 2006
> FIFO read; run_erl [3196] Wed Jan  4 16:00:54 2006
> OK
> run_erl [3196] Wed Jan  4 16:00:54 2006
> Pty master read; run_erl [3196] Wed Jan  4 16:00:54 2006
> Pty master read; run_erl [3196] Wed Jan  4 16:00:54 2006
> Pty master read; run_erl [3196] Wed Jan  4 16:00:54 2006
> Erlang closed the connection.
> -------
> 
> I am curious why the new run_erl (pid 3279) process did not start. Also, why did the old run_erl (pid 3196) not terminate until the new run_erl attempted to start?  I verified that this is not a coincidence - the old run_erl remains in the process list until a new run_erl is started.
> 
> Please let me know if anyone else has experienced a similar issue. If needed, I can provide additional info/config files, but I am not sure at this point which ones.
> 
> Thank you.
> Dmitry Korsun
> IDT Corp.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: run_erl.patch
URL: <http://erlang.org/pipermail/erlang-patches/attachments/20060129/28cbe8ce/attachment.ksh>

