Error Handling

Definitions
Exit signals are sent when processes crash
Exit Signals propagate through Links
Processes can trap exit signals
Complex Exit signal Propagation
Robust Systems can be made by Layering
Primitives For Exit Signal Handling
A Robust Server
Allocator with Error Recovery
Allocator Utilities

Definitions

Link - A bi-directional propagation path for exit signals.
Exit Signal - A signal that transmits process termination information.
Error trapping - The ability of a process to handle exit signals as if they were messages.

Exit Signals are Sent when Processes Crash

When a process crashes (e.g. a BIF fails or a pattern match fails), exit signals are sent to all processes to which the failing process is currently linked.

Dies and sends signal to linked processes

Exit Signals propagate through Links

Suppose we have a number of processes which are linked together, as in the following diagram. Process A is linked to B, B is linked to C (The links are shown by the arrows).

Now suppose process A fails - exit signals start to propagate through the links:

Exit signals propagating, A to B to C...

These exit signals eventually reach all the processes which are linked together.

The rule for propagating errors is: If the process which receives an exit signal, caused by an error, is not trapping exits then the process dies and sends exit signals to all its linked processes.
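The propagation rule can be tried out with a small sketch. The module name link_chain and the helpers run/0, idle/0 and wait_down/1 are illustrative names, not part of the course material, and monitors are used only so that the caller can observe each exit reason.

```erlang
-module(link_chain).
-export([run/0]).

%% Builds the chain A - B - C, makes A exit with reason boom, and
%% returns the exit reasons observed for A, B and C (in that order).
run() ->
    Parent = self(),
    C = spawn(fun() -> idle() end),
    %% B confirms that its link to C exists before A is started.
    B = spawn(fun() -> link(C), Parent ! {self(), ready}, idle() end),
    receive {B, ready} -> ok end,
    A = spawn(fun() -> link(B), Parent ! {self(), ready}, idle() end),
    receive {A, ready} -> ok end,
    Refs = [erlang:monitor(process, P) || P <- [A, B, C]],
    A ! crash,
    [wait_down(Ref) || Ref <- Refs].

%% None of these processes trap exits, so an incoming non-normal
%% exit signal kills them and is passed on through their links.
idle() ->
    receive
        crash -> exit(boom)
    end.

wait_down(Ref) ->
    receive
        {'DOWN', Ref, process, _Pid, Reason} -> Reason
    after 1000 ->
        still_alive
    end.
```

Calling link_chain:run() should return [boom, boom, boom]: B and C are never sent the crash message, yet both terminate with A's exit reason.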

Processes can trap exit signals

In the following diagram P1 is linked to P2 and P2 is linked to P3. An error occurs in P1 - the error propagates to P2. P2 traps the error and the error is not propagated to P3.

Process traps exit and does not propagate exit

P2 has the following code:

receive
    {'EXIT', P1, Why} ->
        ... exit signals ... ;
    {P3, Msg} ->
        ... normal messages ...
end
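A runnable version of this arrangement might look as follows. It is only a sketch: the module name trap_demo, the function p2_loop/1 and the messages ready, crash and trapped are assumptions made for the example. Note that P2 has to call process_flag(trap_exit, true) before the receive shown above can see 'EXIT' messages.

```erlang
-module(trap_demo).
-export([run/0]).

%% P1 - P2 - P3, where only P2 traps exits. Returns
%% {Why, P2Alive, P3Alive} once P2 has reported P1's exit.
run() ->
    Parent = self(),
    P3 = spawn(fun() -> receive stop -> ok end end),
    P2 = spawn(fun() ->
                   process_flag(trap_exit, true),
                   link(P3),
                   Parent ! {self(), ready},
                   p2_loop(Parent)
               end),
    receive {P2, ready} -> ok end,
    P1 = spawn(fun() ->
                   link(P2),
                   receive crash -> exit(boom) end
               end),
    P1 ! crash,
    Result = receive
                 {trapped, P1, Why} ->
                     {Why, is_process_alive(P2), is_process_alive(P3)}
             after 1000 ->
                 timeout
             end,
    %% Clean up the two survivors.
    exit(P2, kill),
    exit(P3, kill),
    Result.

%% The exit signal arrives as an ordinary message, so P2 keeps
%% running and nothing is propagated to P3.
p2_loop(Parent) ->
    receive
        {'EXIT', From, Why} ->
            Parent ! {trapped, From, Why},
            p2_loop(Parent);
        _Other ->
            p2_loop(Parent)
    end.
```

trap_demo:run() should return {boom, true, true}: P2 sees the reason boom as a message, and both P2 and P3 are still alive.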

Complex Exit signal Propagation

Suppose we have the following set of processes and links:

Bidirectional links in chain of processes

The process marked with a double ring is an error trapping process.

Process that traps exit stops propagation

If an error occurs in any of A, B, or C then all of these processes will die (through propagation of errors). Process D will be unaffected.

Exit Signal Propagation Semantics

When a process terminates it sends an exit signal, either normal or non-normal, to the processes in its link set.
A process which is not trapping exit signals (a normal process) dies if it receives a non-normal exit signal. When it dies it sends a non-normal exit signal to the processes in its link set.
A process which is trapping exit signals converts all incoming exit signals to conventional messages which it can receive in a receive statement.
Errors in BIFs or pattern matching errors automatically send exit signals to the link set of the process where the error occurred.
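The difference between normal and non-normal exit signals can be observed with a sketch like the one below. The names exit_semantics, run/0 and outcome/1 are made up for the example, and the 200 ms timeout is an arbitrary way of deciding that the watcher has survived.

```erlang
-module(exit_semantics).
-export([run/0]).

%% Tries one normal and one non-normal exit reason.
run() ->
    [outcome(normal), outcome(oops)].

%% A watcher that does not trap exits links to a child which
%% terminates immediately with Reason.
outcome(Reason) ->
    {Watcher, Ref} = spawn_monitor(fun() ->
        spawn_link(fun() -> exit(Reason) end),
        receive stop -> ok end
    end),
    receive
        {'DOWN', Ref, process, Watcher, Why} ->
            %% The watcher was killed by the child's exit signal.
            {Reason, watcher_died, Why}
    after 200 ->
        %% No exit seen: the signal was ignored by the watcher.
        erlang:demonitor(Ref, [flush]),
        Watcher ! stop,
        {Reason, watcher_survived}
    end.
```

exit_semantics:run() should return [{normal, watcher_survived}, {oops, watcher_died, oops}].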

Robust Systems can be made by Layering

By building a system in layers we can make a robust system. Level1 traps and corrects errors occurring in Level2. Level2 traps and corrects errors occurring in the application level.

In a well designed system we can arrange that application programmers will not have to write any error handling code since all error handling is isolated to deeper levels in the system.

Hierarchical layered trapping - supervision
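As a rough illustration of layering, the sketch below puts an error trapping process (keeper) underneath an application-level worker that contains no error handling code at all. All names here (layer_demo, keeper/2, worker/0 and the message formats) are hypothetical, and restarting the worker is just one possible way of "correcting" an error.

```erlang
-module(layer_demo).
-export([run/0]).

%% Returns {WorkerWasReplaced, Result, Restarts}.
run() ->
    Parent = self(),
    spawn(fun() ->
              process_flag(trap_exit, true),
              keeper(Parent, 0)
          end),
    W1 = receive {worker, P1} -> P1 end,
    W1 ! {divide, self(), 1, 0},          % crashes the first worker
    W2 = receive {worker, P2} -> P2 end,  % the keeper started a new one
    W2 ! {divide, self(), 10, 2},
    Result = receive {result, R} -> R end,
    W2 ! stop,
    Restarts = receive {done, N} -> N end,
    {W1 =/= W2, Result, Restarts}.

%% Lower layer: starts the worker, traps its exit and restarts it
%% unless it terminated normally.
keeper(Parent, Restarts) ->
    Worker = spawn_link(fun worker/0),
    Parent ! {worker, Worker},
    receive
        {'EXIT', Worker, normal} ->
            Parent ! {done, Restarts};
        {'EXIT', Worker, _Error} ->
            keeper(Parent, Restarts + 1)
    end.

%% Application layer: no error handling. Dividing by zero simply
%% crashes the process.
worker() ->
    receive
        {divide, From, A, B} ->
            From ! {result, A div B},
            worker();
        stop ->
            ok
    end.
```

layer_demo:run() should return {true, 5, 1}. The runtime may also print an error report for the crashed worker, which is expected.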

Primitives For Exit Signal Handling

link(Pid) - Set a bi-directional link between the current process and the process Pid
process_flag(trap_exit, true) - Set the current process to convert exit signals to exit messages. These messages can then be received in a normal receive statement.
exit(Reason) - Terminates the process and generates an exit signal where the process termination information is Reason.

What really happens is as follows: Each process has an associated mailbox - Pid ! Msg sends the message Msg to the mailbox associated with the process Pid.

The receive .. end construct attempts to remove messages from the mailbox of the current process. Exit signals which arrive at a process either cause the process to crash (if the process is not trapping exit signals) or are treated as normal messages and placed in the process mailbox (if the process is trapping exit signals). Exit signals are sent implicitly (for example, as a result of evaluating a BIF with incorrect arguments) or explicitly (using exit(Pid, Reason) or exit(Reason)).

If Reason is the atom normal - the receiving process ignores the signal (if it is not trapping exits). When a process terminates without an error it sends normal exit signals to all linked processes. Don't say you didn't ask!
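The mailbox behaviour described above can be made visible with a short sketch. The module name exit_mailbox and the reason shutdown_request are invented for the example. The trapping process simply reports the first two items it finds in its mailbox.

```erlang
-module(exit_mailbox).
-export([run/0]).

%% Sends an ordinary message and then an explicit exit signal to a
%% process that traps exits, and returns what that process received.
run() ->
    Parent = self(),
    Trapper = spawn(fun() ->
        process_flag(trap_exit, true),
        Parent ! {self(), ready},
        First = receive M1 -> M1 end,
        Second = receive M2 -> M2 end,
        Parent ! {self(), [First, Second]}
    end),
    %% Wait until trapping is switched on before sending anything.
    receive {Trapper, ready} -> ok end,
    Trapper ! hello,
    exit(Trapper, shutdown_request),
    receive {Trapper, Msgs} -> Msgs end.
```

The result should be [hello, {'EXIT', Caller, shutdown_request}], where Caller is the pid of the process that called run/0: the exit signal was queued behind hello as an ordinary message instead of killing the receiver.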

A Robust Server

The following server assumes that a client process will send an alloc message to allocate a resource and then send a release message to deallocate the resource.

This is unreliable - What happens if the client crashes before it sends the release message?

top(Free, Allocated) ->
    receive
        {Pid, alloc} ->
            top_alloc(Free, Allocated, Pid);
        {Pid, {release, Resource}} ->
            Allocated1 = delete({Resource, Pid}, Allocated),
            top([Resource|Free], Allocated1)
    end.

top_alloc([], Allocated, Pid) ->
    Pid ! no,
    top([], Allocated);

top_alloc([Resource|Free], Allocated, Pid) ->
    Pid ! {yes, Resource},
    top(Free, [{Resource, Pid}|Allocated]).

This is the top loop of an allocator with no error recovery. Free is a list of unreserved resources. Allocated is a list of pairs {Resource, Pid} - showing which resource has been allocated to which process.
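To see the problem in practice, the allocator can be wrapped in a module together with a client that crashes while holding the only resource. The server code is the loop above. The module name leaky_alloc and the functions start/1, alloc/1 and run/0 are assumptions added for this sketch.

```erlang
-module(leaky_alloc).
-export([run/0]).

%% Hypothetical client-side helpers, not part of the original server.
start(Resources) ->
    spawn(fun() -> top(Resources, []) end).

alloc(Server) ->
    Server ! {self(), alloc},
    receive
        {yes, Resource} -> {yes, Resource};
        no -> no
    end.

%% A client takes the only resource and dies without releasing it.
%% A later request from another process is then refused.
run() ->
    Server = start([r1]),
    {Client, Ref} = spawn_monitor(fun() ->
        {yes, r1} = alloc(Server),
        exit(crashed)
    end),
    receive {'DOWN', Ref, process, Client, crashed} -> ok end,
    Reply = alloc(Server),
    exit(Server, kill),
    Reply.

%% Allocator without error recovery.
top(Free, Allocated) ->
    receive
        {Pid, alloc} ->
            top_alloc(Free, Allocated, Pid);
        {Pid, {release, Resource}} ->
            Allocated1 = delete({Resource, Pid}, Allocated),
            top([Resource|Free], Allocated1)
    end.

top_alloc([], Allocated, Pid) ->
    Pid ! no,
    top([], Allocated);
top_alloc([Resource|Free], Allocated, Pid) ->
    Pid ! {yes, Resource},
    top(Free, [{Resource, Pid}|Allocated]).

delete(H, [H|T]) ->
    T;
delete(X, [H|T]) ->
    [H|delete(X, T)].
```

leaky_alloc:run() should return no: the resource r1 stays recorded as allocated to a process that no longer exists.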

Allocator with Error Recovery

The following is a reliable server. If a client crashes after it has allocated a resource and before it has released the resource, then the server will automatically release the resource.

The server is linked to the client during the time interval when the resource is allocated. If an exit message comes from the client during this time the resource is released. For the exit message to arrive, the server process must be trapping exits.

top_recover_alloc([], Allocated, Pid) ->
    Pid ! no,
    top_recover([], Allocated);

top_recover_alloc([Resource|Free], Allocated, Pid) ->
    %% Link to the client so that its exit signal reaches the server.
    Pid ! {yes, Resource},
    link(Pid),
    top_recover(Free, [{Resource,Pid}|Allocated]).

top_recover(Free, Allocated) ->
    receive
        {Pid, alloc} ->
            top_recover_alloc(Free, Allocated, Pid);
        {Pid, {release, Resource}} ->
            unlink(Pid),
            Allocated1 = delete({Resource, Pid}, Allocated),
            top_recover([Resource|Free], Allocated1);
        {'EXIT', Pid, Reason} ->
            %% No need to unlink.
            Resource = lookup(Pid, Allocated),
            Allocated1 = delete({Resource, Pid}, Allocated),
            top_recover([Resource|Free], Allocated1)
    end.

Not handled: multiple allocations to the same process. Before doing the unlink(Pid) we should check that the process does not hold more than one resource, and when a client exits, every resource it holds should be recovered.
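One possible way to fill this gap is sketched below. It is not the course's implementation. The module name recover_alloc, the free_count request and the helper functions are assumptions, and the standard lists module replaces the hand-written utilities. The server unlinks only when a client holds nothing else, and an exit message returns every resource held by that client.

```erlang
-module(recover_alloc).
-export([run/0]).

start(Resources) ->
    spawn(fun() ->
              process_flag(trap_exit, true),   % exits arrive as messages
              loop(Resources, [])
          end).

loop(Free, Allocated) ->
    receive
        {Pid, alloc} when Free =:= [] ->
            Pid ! no,
            loop(Free, Allocated);
        {Pid, alloc} ->
            [Resource | Rest] = Free,
            link(Pid),
            Pid ! {yes, Resource},
            loop(Rest, [{Resource, Pid} | Allocated]);
        {Pid, {release, Resource}} ->
            case lists:member({Resource, Pid}, Allocated) of
                true ->
                    Allocated1 = lists:delete({Resource, Pid}, Allocated),
                    %% Unlink only when Pid holds nothing else.
                    case lists:keymember(Pid, 2, Allocated1) of
                        true -> ok;
                        false -> unlink(Pid)
                    end,
                    loop([Resource | Free], Allocated1);
                false ->
                    loop(Free, Allocated)
            end;
        {'EXIT', Pid, _Reason} ->
            %% Recover everything the dead client was holding.
            Held = [R || {R, P} <- Allocated, P =:= Pid],
            Kept = [Pair || {_, P} = Pair <- Allocated, P =/= Pid],
            loop(Held ++ Free, Kept);
        {From, free_count} ->
            From ! {free_count, length(Free)},
            loop(Free, Allocated)
    end.

alloc(Server) ->
    Server ! {self(), alloc},
    receive {yes, R} -> {yes, R}; no -> no end.

release(Server, Resource) ->
    Server ! {self(), {release, Resource}},
    ok.

free_count(Server) ->
    Server ! {self(), free_count},
    receive {free_count, N} -> N end.

%% Polls until N resources are free, or gives up.
await_free(_Server, _N, 0) ->
    timeout;
await_free(Server, N, Tries) ->
    case free_count(Server) of
        N -> ok;
        _ -> timer:sleep(10), await_free(Server, N, Tries - 1)
    end.

%% A client takes two resources, releases one, then crashes.
run() ->
    Server = start([r1, r2, r3]),
    {Client, Ref} = spawn_monitor(fun() ->
        {yes, A} = alloc(Server),
        {yes, _B} = alloc(Server),
        release(Server, A),
        exit(crashed)
    end),
    receive {'DOWN', Ref, process, Client, crashed} -> ok end,
    Result = await_free(Server, 3, 100),
    exit(Server, kill),
    Result.
```

recover_alloc:run() should return ok, meaning that all three resources are free again even though the client released only one of them itself.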

Allocator Utilities

delete(H, [H|T]) ->
    T;
delete(X, [H|T]) ->
    [H|delete(X, T)].

lookup(Pid, [{Resource,Pid}|_]) ->
    Resource;
lookup(Pid, [_|Allocated]) ->
    lookup(Pid, Allocated).