Author: Richard A. O'Keefe <ok(at)cs(dot)otago(dot)ac(dot)nz>
    Status: Draft
    Type: Standards Track
    Created: 09-Feb-2010
    Erlang-Version: OTP_R13B-3
    Post-History:
****
EEP 32: Module-local process names
----

Abstract
========

The process registry in Erlang is convenient, but counts as
a global shared mutable variable, with two major defects:
the possibility of data races (shared mutable variable) and
the impossibility of encapsulation (global).  This EEP
resurrects the old (1997 or earlier) proposal of module-
local process-valued variables, providing a replacement for
node-local uses of the registry with encapsulation and without
races.

Specification
=============

A module (or an instance of a parameterized module) may have
one or more top level pid-valued variables, and if so, has a
lock associated with them.  The directive has the form

    -pid_name(Atom).

where Atom is an atom.  To avoid confusing programmers who
still have to deal with the registry, this Atom may not be
'undefined'.

If there is at least one such directive in a module, the
compiler automatically generates a function called
`pid_name/1`.  In the scope of directives

    -pid_name(pn_1).
    ...
    -pid_name(pn_k).

the `pid_name/1` function is rather like

    pid_name(pn_1) ->
        with_module_lock(read) -> X = *pn_1 end, X;
    ...
    pid_name(pn_k) ->
        with_module_lock(read) -> X = *pn_k end, X.

except that we expect there to be a VM instruction
`get_pid_safely(Address)`, and we expect the compiler to
inline calls to pid_name(Atom) when Atom is known.
On a machine like the `X86` or `X86_64`, this could be a
single locked load instruction.

The value of a `-pid_name` is always a process id.

There is a special process id value which at all times represents
a dead process.  So within a module,

    pid_name(X) ! Message

is legal if and only if X is one of the pid-names declared in
the module, and whether or not the process it names has died.

If there is a need to discover whether a `-pid_name` has within
the recent but unpredictable past been associated with a live
process, that can be found out by combining `pid_name/1` with
`process_info/2`.

As with the registry, a process may have at most one `pid_name`.
For debugging purposes, I suppose that `process_info` could be
extended to return a `{pid_name,{Module,Name}}` tuple.

When a process exits, it is automatically unregistered.
That is, if it was bound to a `-pid_name`, that `-pid_name`
now refers to the conventional dead process.  This draft of
this EEP includes no other way for a process to be unregistered.

The important thing about registering a process is that it
should be atomic.  So there are two new functions

    pid_name_spawn(Name, Fun)
    pid_name_spawn_link(Name, Fun)

We can understand them as

    pid_name_spawn(Name, Fun)
      when is_atom(Name), is_function(Fun, 0) ->
        with_module_lock(write) ->
        P = *Name,
        if P is a live process ->
            P
         ; P is a dead process ->
            Q = spawn(Fun),
            *Name := Q,
            Q
        end
        end.

    pid_name_spawn_link(Name, Fun)
      when is_atom(Name), is_function(Fun, 0) ->
        with_module_lock(write) ->
        P = *Name,
        if P is a live process ->
            P
         ; P is a dead process ->
            Q = spawn(Fun),
            *Name := Q,
            Q
        end
        end.

Here, as earlier, `with_module_lock` is pseudo-code, meant to
suggest some sort of reader-writer locking on a private lock,
existing only inside a module that has declared a `-pid_name`.

These two functions are automatically declared inside the
module, like `pid_name/1`.  The three functions are not functions
automatically inherited from the `erlang:` module but functions
that are logically inside the module, however they might be
actually implemented.  There doesn't seem to be any good
reason for a module to export any of these functions, and the
compiler should at least warn if that is attempted.

Motivation
==========

* Encapsulation.

  The process registry is often used when clients of a module
  need to communicate with one or more servers managed by the
  module, but the interface code is inside the module.  There
  is no advantage, and much risk, in exposing the process.  A
  big reason for this process is to get the benefit of having
  mutable process variables without the loss of encapsulation.

* Efficiency.

  As a shared mutable data structure, the registry has to be
  accessed within the scope of suitable locks.  With this
  approach, each module has its own lock, contention ought
  to be pretty nearly zero, and the commonest use case of
  the registry can, I believe, be a simple load instruction.

* Safety.

  It is actually surprisingly hard to register a process
  safely, and the use of registered names is oddly inconsistent
  with the use of direct process ids.  This interface is meant
  to be simpler to use safely.

Rationale
=========

The old Erlang book describes four functions for dealing with
registered process names.  There are two more main interfaces.

    Name ! Message when is_atom(Name) ->
      % Also available as erlang:send(Name, Message).
      % A 'badarg' exception results if Pid is an atom that is
      % not the registered name of a live local process or port.
        whereis(Name) ! Message.

    register(Name, Pid) when is_atom(Name), is_pid(Pid) ->
      % A 'badarg' exception results if Pid is not a live local
      % process or port, if Name is not an atom or is already in
      % use, if Pid already has a registered name, or if Name is
      % 'undefined'.
        "whereis(Name) := Pid".

    unregister(Name) when is_atom(Name) ->
      % A 'badarg' exception results if Name is not an atom
      % currently in use as the registered name of some process
      % or port.  'undefined' is always an error.
        "whereis(Name) := undefined".

    whereis(Name) when is_atom(Name) ->
      % A 'badarg' exception results if Name is not a name.
      % in effect, a global mutable hash table with
      % atom keys and pid-or-'undefined' values.

    registered() ->
        % yes, I know this is not executable Erlang.
        [Name || is_atom(Name), is_pid(whereis(Name))].

    process_info(Pid, registered_name) when is_pid(Pid) ->
        % yes, I know this is not executable Erlang.
        case [Name || is_atom(Name), whereis(Name) =:= Pid]
          of [N] -> {registered_name,N}
           ; []  -> []
        end.

When a process terminates, for whatever reason, it does the
equivalent of

    case process_info(self(), registered_name)
      of {_,Name} -> unregister(Name)
       ; []       -> ok
    end.

This has an astonishing consequence.

Suppose I do

    Pid = spawn(Fun),
    ...
    Pid ! Message

and between the time the process was created and the time I send
the message to it, the process dies.  In Erlang this is
perfectly ok, and the message just disappears.

Now suppose I do

    register(Name, spawn(Fun)),
    ...
    Name ! Message

and between the time the process was created and the time I send
the message to it, the process dies.  Anyone would expect the
result to be exactly the same: because the `Name` pointed to a
process which has died, this amounts to sending a message to a
dead process, which is perfectly ok, and the message just
disappears.  Most confusingly, that is not what happens, and
instead you get a 'badarg' exception.

Now suppose I do

    send(Pid, Message) when is_pid(Pid) ->
        Pid ! Message;
    send(Name, Message) when is_atom(Name) ->
        case whereis(Name)
          of undefined -> ok
           ; Pid when is_pid(Pid) -> Pid ! Message
        end.
    ...
        register(Name, spawn(Fun)),
        ...
        send(Name, Message)

This works the way we would expect, but why is it necessary?

In Erlang as it stands, `Name ! Message` will raise an error if
`Name` would have referred to the right process but that process
has died.  It might be argued that this is a useful debugging
aid, but nothing helps us if `Name` now refers to the WRONG
process.  Right now, consider

    whereis(Name) ! Message

This will raise an exception if the named process had died
before whereis/1 was called, but consider this timing:

    live           dies
       whereis runs      message sent

A slight change in timing can unpredictably change the
behaviour from silence-on-late-death to error-on-early-death
and vice versa.

    pid_name(Name) ! Message

is *consistently* silent.

The current process registry is also used for ports, which act in
many ways like processes.

The old Erlang book is absolutely right that sometimes you
need a way to talk to a process you haven't been previously
introduced to.  However, it is not true that this must be
done by means of a global hash table.  You could always ask
a module for the information.

Let's take program 5.5 from the book.

    -module(number_analyser).
    -export([start/0,server/1]).
    -export([add_number/2,analyse/1]).

    start() ->
        register(number_analyser,
        spawn(number_analyser, server, [nil])).

    %% The interface functions.

    add_number(Seq, Dest) ->
        request({add_number,Seq,Dest}).

    analyse(Seq) ->
        request({analyse,Seq}).

    request(Req) ->
        number_analyser ! {self(), Req},
        receive
        {number_analyser,Reply} ->
                Reply
        end.

    %% The server.

    server(Analyser_Table) ->
        receive
            {From, {analyse, Seq}} ->
            Result = lookup(Seq, Analyser_Table),
            From ! {number_analyser, Result},
            server(Analyser_Table)
          ; {From, {add_number, Seq, Dest}} ->
            From ! {number_analyser, ack},
            server(insert(Seq, Dest, Analyser_Table))
        end.

The first thing we notice about this is that the registry is used
to allow a process that is a client of this module to communicate
with a process managed by this module through interface functions
in this module.  There is no reason why the process should be
given a GLOBALLY visible name, and every reason why it should NOT.
We would like to ensure that all communication with the server
process goes through the interface functions, and as long as the
process is in a global registry, anything could happen.  The
global process registry thus defeats its own purpose.

Similarly, because the reply messages to the interface functions
are tagged, not with the server's identity, but with its public
name, they are easy to forge.  Both of these problems also apply
to Program 5.6 in the old book.

But there is worse.  It is NEVER safe to call `register/2` or
`unregister/1`.  Recall that the precondition for `register/2`
requires that the `Name` not be in use.  But there is no way to
ever be sure of that.  For example, you might try

    spawn_if_necessary(Name, Fun) ->
        case whereis(Name)        % T1
          of undefined ->
         Pid = spawn(Fun),    % T2
         register(Name, Pid)    % T3
           ; Pid when is_pid(Pid) ->
             ok
        end,
        Pid.

Unfortunately, between time T1, when `whereis/1` reports that the
`Name` is not in use, and time T3, when we try to assign it, some
other process might have been registered.  Also, between time T2,
when the new process is created, and T3, when we use the `Pid`, the
process might have died.

Because the registry is global, it is no use searching existing
code to see whether the `Name` is clobbered; the bug might be
introduced in future code.

There appears to be no way to protect against the possibility of a
process dying between T2 and T3.  The obvious hack,

    Pid = spawn(Fun),
    erlang:suspend_process(Pid),
    register(Name, Pid),
    erlang:resume_process(Pid)

won't work because `erlang:suspend_process/1` is documented as
having the same 'badarg if Pid is not the pid of a live local
process' snafu as `register/2`.  The only really safe way around the
issue would be for the new process to be born suspended, and
there's no way to do that.  There is no 'suspended' option allowed
in the options list of `spawn_opt/[2-5]`.

In practice, of course, the new process WON'T die, typically
because it goes into a loop waiting for a message.  Even so, this
amount of fragility in a primitive is a bit worrying.

Let's take a quick check to see how real all this is.

`sounder.erl` has

    start() ->
        case whereis(sounder) of
            undefined ->
            case file:read_file_info('/dev/audio') of
                {ok, FI} when FI#file_info.access==read_write ->
                register(sounder, spawn(sounder,go,[])),
                ok;
                _Other ->
                register(sounder, spawn(sounder,nosound,[])),
                silent
            end;
            _Pid ->
            ok
        end.

Here's a curious thing:  the first time `sounder:start/0` is
called, it will return different values (ok, silent) depending
on whether sound (is, is not) supported.  Later calls always
return ok.  This contradicts the documentation.  Whoops!
Apart from that, it's a straightforward `spawn_if_necessary`.

`man.erl` has

    start() ->
        case whereis(man) of
            undefined ->
            register(man,Pid=spawn(man,init,[])),
            Pid;
            Pid ->
            Pid
        end.

This is precisely

    start() -> spawn_if_necessary(fun () -> man:init() end).

`tv_table_owner` has

    start() ->
        case whereis(?REGISTERED_NAME) of
            undefined ->
            ServerPid = spawn(?MODULE, init, []),
            case catch register(?REGISTERED_NAME, ServerPid) of
                true ->
                ok;
                {'EXIT', _Reason} ->
                exit(ServerPid, kill),
                timer:sleep(500),
                start()
            end;
            Pid when is_pid(Pid) ->
            ok
        end.

Let's repackage that to see what's going on:

    spawn_if_necessary(Name, Fun) ->
        case whereis(Name)
          of undefined ->
             Pid = spawn(Fun),
             case catch register(Name, Pid)
               of true ->
                  Pid
                ; {'EXIT', _} ->
                  exit(Pid, kill),
                  timer:sleep(500),
                  spawn_if_necessary(Name, Fun)
             end
           ; Pid when is_pid(Pid) ->
         ok
        end.

If there is a live local process registered under `Name`, return its
`Pid`.  Of course, after the function returns to believe that there
is STILL a live local process registered under Name, but that's
just as true of `whereis/1`.

If there is not, then create a new process, regardless of whether
that turns out to be useful.  Try to register it.  The `Pid` will be
the pid of a live local process that is not registered under any
other name, and `Name` must be an atom other than 'undefined', or
`whereis/1` would have crashed.  So it should be that the only thing
that can go wrong is that some other process has snuck in and
swiped the registry slot.  In that case, kill the process, wait a
long time, and try again.

In theory, it is possible for this to loop forever, with just the
right malevolent timing by an adversary.  In practice, I'm sure it
works very well.

The thing is, if the 'primitives' are this fragile, I would rather
not expose beginners to them.  Or for that matter, most people:
there are plenty of uses of `register/1` in the Erlang/OTP sources
that are not this well protected.

The simplest fix to the 'registration race' problem would be to
verify that `spawn_if_necessary/2` is sound, correct it if
necessary, and put it in a library.  However, that does nothing to
fix the globality of the registry.

There is no analogue of registered().  Inside a module, you can
see what names are available; outside the module, you have no
right to know.

This EEP does not propose abolishing the old registry.  There
is a lot of code, and a lot of training material, that still
uses or mentions it.  Above all, the old registry can do one
thing that this EEP cannot do and isn't meant to, and that is
to provide names that can be used in other nodes, in `{Node,Name}`
form.  The aim of this proposal is to provide something that can
replace MOST uses of the registry with something safer, and in
particular to allow gradual migration to per-module registration.

Backwards Compatibility
=======================

The only modules that are affected by the new feature are
those that visibly contain an explicit `-pid_name` directive.

Reference Implementation
========================

None.

Example
=======

Here is the old book's Program 5.5 again, brought up to date.

    -module(number_analyser).
    -export([
        add_number/2,
        analyse/1,
        start/0,
        stop/0
     ]).
    -pid_name(server).

    start() ->
        pid_name_spawn(server, fun () -> server(nil) end).

    stop() ->
        pid_name(server) ! stop.

    add_number(Seq, Dest) ->
        request({add_number,Seq,Dest}).

    analyse(Seq) ->
        request({analyse,Seq}).

    request(Request) ->
        P = pid_name(server),
        P ! {self(), Request},
        receive {P,Reply} -> Reply end.

    server(Analyser_Table) ->
        receive
            {From, {analyse, Seq}} ->
            From ! {self(), lookup(Seq, Analyser_Table)},
            server(Analyser_Table)
          ; {From, {add_number, Seq, Dest}} ->
            From ! {self(), ok},
            server(insert(Seq, Dest, Analyser_Table))
        end.

* It is now possible to use a programming convention where the
  `-pid_name` of every server is 'server'.

* It is no longer possible for code outside the module to send
  messages to the server process.

* It is no longer possible (well, no longer embarrassingly easy)
  for an outsider to forge responses from the server.

Copyright
=========

This document has been placed in the public domain.

[EmacsVar]: <> "Local Variables:"
[EmacsVar]: <> "mode: indented-text"
[EmacsVar]: <> "indent-tabs-mode: nil"
[EmacsVar]: <> "sentence-end-double-space: t"
[EmacsVar]: <> "fill-column: 70"
[EmacsVar]: <> "coding: utf-8"
[EmacsVar]: <> "End:"