Author:
Maria Scott <maria-12648430(at)hnc-agency(dot)org> , Jan Uhlig <juhlig(at)hnc-agency(dot)org>
Status:
Final/24.0 Implemented in OTP release 24.0
Type:
Standards Track
Created:
04-Mar-2021
Erlang-Version:
24.0
Post-History:
08-Mar-2021, 17-Mar-2021, 23-Mar-2021, 31-Mar-2021, https://github.com/erlang/otp/pull/4521

EEP 56: Automatic supervisor shutdown triggered by termination of significant children #

Abstract #

This EEP introduces a way of automatically terminating supervisors based on the termination of specifically marked significant children.

This document is based on the discussion in OTP-PR 4521.

Motivation #

Children under a supervisor often represent a work unit, that means, a group of cooperating processes, as opposed to just a single process. Such work unit supervisors (called group supervisors in the context of this document) are themselves typically hosted by a simple_one_for_one supervisor, via which they are started as needed.

At the time of this writing, however, there is no good, canonical way of stopping such group supervisors once the work unit they represent has finished it’s work and the respective child processes have terminated, meaning the group supervisors will hang around, idle forever unless stopped manually one way or another.

This has been addressed in applications in a variety of ways, none of which can be called truly good, straightforward, or canonical:

  • Passing the Pid of the group supervisor to a child that is responsible for shutting down that supervisor. The shutdown is then achieved by sending an exit signal to the group supervisor. While the appropriate exit signals are documented, it is not for this purpose, and flinging around exit signals can be dangerous.
  • Passing the Pid of the group supervisor and the Pid of the supervisor on top of it to a child that is responsible for shutting down that supervisor. The shutdown is then achieved by telling the supervisor on top to terminate the group supervisor via supervisor:terminate_child/2. As this is a blocking call, a process has to be spawned for it. Also, this will cause the top supervisor to be blocked until the group supervisor has been shut down, so it will not accept other requests until then.

Both of the above approaches suffer from the fact that the children responsible for the shutdown have to know things about their surroundings, namely…:

  • … that it is located under a (group) supervisor in the first place. In the second of the above approaches, even that this group supervisor is again located under yet another supervisor. And, conversely, it is not possible to run this process independently (e.g. for testing), that is, without the supervisor layers on top.
  • … that it may have siblings, and that it is safe to cause their shutdown by shutting down the supervisor. From a programmer’s perspective, it is not obvious just by looking at the supervisor implementation that a certain child may cause a shutdown of the supervisor and all children, so there is a potential for surprises.

This may be tackled by having a dedicated overseer child that watches the other children and acts according to their behavior. However, this requires considerable boilerplate code for tasks that would be better suited in the supervisor. Also, there is the problem that the overseer process must keep the list of children it watches up to date should any of them be restarted, either by enabling the children to register with it on start (for which they in turn must know the overseer process’ pid), or asking the supervisor for it.

Another approach that is often used is to make the children responsible for the shutdown of the group supervisor permanent and the supervisor’s restart intensity to 0. This has the downside that the child will not be restarted but cause the supervisor to shut down if it exits abnormally but could be restarted. Another downside to this approach is that it produces error messages (crash reports), even if the shutdown is intended.

Last but not least, some people have taken the approach to clone the OTP supervisor and customize it to their needs, for reasons outlined here and others.

Rationale #

This EEP provides a means to alleviate the problems outlined in the motivation by introducing a way to mark specific children as significant via a new child spec flag, and a way to configure supervisors to shut down automatically depending on the exit of significant children via a new supervisor flag.

In order to keep backwards compatibility, the new flags will only be usable in the map forms of child specs and supervisor flags, and for the same reason the default values for the new flags are chosen such that, in their absence, the supervisor behaves the same as it does to date.

The new child spec flag is named significant with possible values true and false, with false being the default.

The new supervisor flag is named auto_shutdown with possible values never, any_significant and all_significant, with never being the default.

With the supervisor auto_shutdown flag set to never, the child spec flag significant is not allowed to be true. The never value and the restriction on the significant value is intended as a safety means to defend against unintended automatic shutdowns, for example by the exit of a significant child which was added later via supervisor:start_child/2. As the spec for such a child would not be present in the supervisor:init/1 callback code but somewhere else, debugging such unexplained supervisor shutdowns might be difficult.

Otherwise, the following rules apply when a significant child exits on its own:

  • A transient child will be restarted (not cause a supervisor shutdown) if it exits abnormally. If it exits normally…

    • if the supervisor auto_shutdown flag is any_significant, the supervisor will shut down
    • if the supervisor auto_shutdown flag is all_significant, the supervisor will shut down if the child was the last active significant child
  • A temporary child will never be restarted. If it exits normally or abnormally, the same rules as for transient children apply, in regard to the supervisor auto_shutdown flag.

If the restart type is permanent, the significant flag is not allowed to be true, as this combination does not make sense.

To be clear, the above rules only apply when significant children exit by themselves, that is, not when being terminated manually via supervisor:terminate_child/2, not when other non-significant children exit, and not when being terminated as a consequence of a sibling’s death in the one_for_all or rest_for_one strategies.

The approach proposed here could also be used to the effect of “shutdown when empty” by marking all children as significant and setting the supervisor auto_shutdown flag to all_significant.

It is worth mentioning that the simple_one_for_one strategy poses a special case, as it can have only a single child spec that applies to all children. That means that either all children are significant ones, or none is.

Considerations #

Using temporary significant children in one_for_all and rest_for_one supervisors may lead to an edge case scenario in which an intended automatic shutdown will not happen. Temporary children will not be restarted, not even when their termination was caused by a sibling’s death. On the other hand, the automic shutdown of a supervisor is not triggered when a significant child is terminated as a consequence of a sibling’s death. Thus, a temporary significant child intended to automatically shut down it’s supervisor will be lost if it is terminated as a consequence of a sibling’s death.

Backwards Compatibility #

The changes proposed in this document introduce no incompatible changes, as the new child spec and supervisor flags are optional and default to values that result in the current behavior. Also, all the current workarounds outlined in the Motivation will still work.

Although the proposed changes are backwards compatible, applications using this enhancement may not be compatible when compiled with previous OTP versions unless proper care is taken. Such an application compiled with older OTP versions will leak processes, as the automatic supervisor shutdowns it relies on to remove unused parts of it’s supervision tree will not happen. Taking care of this issue is at the discretion of implementors if they expect an application which uses the significant child behavior to be compiled with an OTP version that predates it’s appearance.

Implementation #

A reference implementation which will be updated to reflect the state of this document can be found in OTP-PR 4638.

Copyright #

This document has been placed in the public domain.