Let some other process fix the error (Long)

Thu Apr 24 20:17:31 CEST 2003

>>>>> "ja" == Joe Armstrong <joe@REDACTED> writes:

ja> You *can't do this with unix process like concurrency* - you can
ja> observe failure but not accurately diagnose the reason for
ja> failure.

Ja, it's sometimes nice to know if there was a particular type of
failure in some other process ... but an OTP supervisor process cares
very little about the type of failure one of its children
experienced.  

Knowing the cause of the failure is important from a logging point of
view.  However, AFAIK, the only time the supervisor itself cares about
the cause is when the dead child was configured to be permitted to die
without restart.
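
A minimal sketch of that rule, written in Python rather than Erlang
and not taken from OTP's source.  It assumes the documented OTP
restart types (permanent, transient, temporary) and the documented
rule that a transient child is restarted only when its exit reason is
something other than normal, shutdown, or {shutdown, Term}.  The
function name and the string/tuple encoding of exit reasons are
invented for illustration.

```python
def should_restart(restart_type, exit_reason):
    """Return True if a supervisor would restart a child that just exited.

    restart_type: "permanent", "transient" or "temporary" (OTP's names).
    exit_reason:  a string such as "normal", or a tuple such as
                  ("shutdown", term); this encoding is an assumption.
    """
    if restart_type == "permanent":
        return True                # exit reason is irrelevant
    if restart_type == "temporary":
        return False               # exit reason is irrelevant
    if restart_type == "transient":
        # The only case in which the exit reason matters.
        if exit_reason in ("normal", "shutdown"):
            return False
        if (isinstance(exit_reason, tuple) and len(exit_reason) == 2
                and exit_reason[0] == "shutdown"):
            return False
        return True
    raise ValueError("unknown restart type: %r" % (restart_type,))
```

Only the transient branch looks at the exit reason at all; for the
other two restart types the reason is useful for the log and nothing
else.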

Last week, I stumbled across the research of George Candea and Armando
Fox at Stanford.  They've been doing research into "crash-only
computing" and "recursively rebootable" software systems.  It takes an
Erlang OTP person just a few minutes to read their work before saying,
"Wow, they're implementing OTP-like supervisor behaviors for Java
systems."  Well, there are a few differences:

* Each recursively-restartable Java component must be running in its
  own JVM -- operating system process separation provides the smallest
  unit of failure.  The same principle could be applied to a
  distributed system of independent OS processes communicating via
  CORBA or some other IPC mechanism.

* Restart behavior isn't configured in individual supervisor
  processes, as an OTP person would expect.  Instead, there's a
  single recovery manager component which maintains a tree of
  component dependencies.  All restartable components are leaves of
  the tree.

  It uses an OTP-like "one_for_all" component restart strategy: if a
  component monitoring agent notifies the recovery manager that a
  component has failed, the recovery manager will attempt to restart
  all components that have the same dependency tree parent.  If that
  doesn't fix the problem, the manager restarts all components with
  the same dependency tree grandparent.  And so on, until the system
  is running without faults.
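
The escalation described above can be sketched roughly as follows.
This is my own Python illustration, not code from the Stanford
project: the Node class, the recover() function, and the restart and
healthy callbacks are all hypothetical names, and it assumes the
failed component is a leaf that has a parent in the tree.

```python
class Node:
    """A node in the (hypothetical) component dependency tree."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def leaves(self):
        """All restartable components (leaves) at or below this node."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def recover(failed, restart, healthy):
    """Restart ever-larger groups until healthy() reports no faults.

    failed:  the leaf Node reported by the monitoring agent.
    restart: callback that restarts one component, given its name.
    healthy: callback returning True when the system runs without faults.
    Returns the name of the subtree root whose restart cured the fault,
    or None if even restarting everything did not help.
    """
    scope = failed.parent
    while scope is not None:
        for leaf in scope.leaves():
            restart(leaf.name)
        if healthy():
            return scope.name
        scope = scope.parent       # escalate: parent, grandparent, ...
    return None
```

Each pass widens the restart group by one level of the tree, so the
cheapest restart is always tried first and a whole-system restart is
the last resort.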

For more info, see:

	http://www.stanford.edu/~candea/research.html
	http://swig.stanford.edu/public/projects/RR

In a past life, I tried preaching this kind of restart logic for a
multi-OS-process, multi-OS-instance system.  My preaching fell upon
deaf ears, much to my chagrin.  It's nice to see that I wasn't a
(utterly, completely) crazy man raving in the wilderness.  :-)

-Scott