9 Troubleshooting, Problems and 'Gotchas'

This section looks at problems which trip up many beginners, starting with some minor syntactic irritations. If you bump into something confusing which isn't listed here, try posting a question to the mailing lists.

9.1 Why can't I write a = 3 (badmatch)?

In Erlang, all variables must start with a capital letter, so you can write A = 3. but not a = 3. In the latter case Erlang complains that this is a "bad match".

In the second version, the 'a' is an atom while 3 is a number. 'a' cannot be pattern matched with 3. The use of capital letters for variables comes from Prolog.

If the terms atom and pattern match are new to you, have a read of the first chapter or two of the Erlang book. (Very roughly, an atom corresponds to an enum in C, and pattern matching in this case is like an assert statement.)
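To see the "assert" behaviour without crashing the shell, here is a minimal sketch. The module and function names are made up for illustration; the function tries to match the atom a against whatever it is given and reports what happened.

```erlang
-module(match_demo).
-export([check/1]).

%% Hypothetical helper: tries to match the atom a against Value.
check(Value) ->
    try
        a = Value,      % a is an atom, so this is a match, not an assignment
        matched
    catch
        %% A failed match raises a badmatch error carrying the offending value.
        error:{badmatch, Bad} -> {badmatch, Bad}
    end.
```

Calling match_demo:check(a) succeeds, while match_demo:check(3) reports the same badmatch that a = 3 produces in the shell.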

9.2 Why can't I change the value of a variable?

Erlang only lets you assign a variable once in a function. Some simple examples of when you bump into this are:

		1> A = 3.
		3
		2> A = 4.
		** exited: {{badmatch,4},[{erl_eval,expr,3}]} **
		3> A = A + 4.
		** exited: {{badmatch,7},[{erl_eval,expr,3}]} **

The usual "workaround" is to use a new variable:

		4> B = A + 4.
		7

(This behaviour is called "single assignment", and it's considered to be a feature, not a bug. Most functional languages, including ML and Haskell, also behave this way. Among other things, it's nice being able to rely on a certain variable always having the same value.)
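In a module, the same idea usually shows up in two forms: each intermediate result gets its own name, and values that "change" are passed along as arguments to a recursive call. The following is only a sketch with invented names (single_assign, add_vat, total), not code from any library.

```erlang
-module(single_assign).
-export([add_vat/1, total/1]).

%% Each intermediate result gets a new variable instead of overwriting one.
add_vat(Net) ->
    Vat = Net * 25 div 100,
    Gross = Net + Vat,
    Gross.

%% A running total is carried in an argument rather than updated in place.
total(Prices) -> total(Prices, 0).

total([], Sum) -> Sum;
total([Price | Rest], Sum) -> total(Rest, Sum + Price).
```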

9.3 Why do lists of numbers get printed incorrectly?

Sometimes the shell (or a program) prints a list of numbers in an unexpected way, for instance:

	1> [65, 66, 67].
	"ABC"

This happens because Erlang represents strings as lists of integers, so when you ask the shell to print a list of integers, it has to guess whether you want to see a list of numbers or a string. The shell bases its guess on whether every element of the list is a printable character, so you can force a list to be printed as numbers by including a non-printable value such as 0:

	5> [0, 65, 66, 67].
	[0,65,66,67]

In Erlang/OTP R16B and above, you can use the function shell:strings/1 to turn off this behaviour in the shell:

	2> shell:strings(false).
	true
	3> [65, 66, 67].
	[65,66,67]

A similar problem occurs with io:fwrite(), but in that case you can take direct control by specifying the appropriate formatting character: "~s" always prints the argument as a string, while "~w" always prints it as a list of numbers.
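The formatting characters can be tried out without printing anything by using io_lib:format/2, which takes the same format strings as io:fwrite(). This sketch (module and function names are invented) also uses io_lib:printable_list/1, which performs a check similar to the guess described above.

```erlang
-module(print_demo).
-export([as_list/1, as_string/1, looks_like_string/1]).

%% "~w" writes the raw term, so the integers stay visible.
as_list(List) -> lists:flatten(io_lib:format("~w", [List])).

%% "~s" treats the argument as a string.
as_string(List) -> lists:flatten(io_lib:format("~s", [List])).

%% true when every element is a printable character.
looks_like_string(List) -> io_lib:printable_list(List).
```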

9.4 Why can't I call arbitrary functions in a guard?

If that was allowed, there would be no guarantee that guards were side-effect free.

Also, it is convenient to be able to program as though guards do not consume any significant amount of execution time. There's a list of BIFs which can be called from within guards in the Erlang book and the standard Erlang spec; some examples are size(), length(), is_integer() and is_record() (the last two were written integer() and record() in old code).

The "problem" often crops up when using if:

	issue_warning() ->
	  if (os:type() == {win32, windows}) ->    %% illegal guard
	    ok = io:fwrite("you are using windows\n");
	  true ->
	    ok = io:fwrite("no problem\n")
	  end.

The solution is usually to use case instead. Case is used much more frequently than if in most Erlang programs:

	issue_warning() ->
	  case os:type() of
	    {win32, windows} -> ok = io:fwrite("you are using windows\n");
	    _ -> ok = io:fwrite("no problem\n")
	  end.
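If you really do want an if, another option is to call the function first and bind the result to a variable, since a guard may compare variables freely. A minimal sketch, with invented module and function names, returning atoms instead of printing so that it is easy to try:

```erlang
-module(os_demo).
-export([describe/0, describe/1]).

%% Call the function outside the guard and bind its result first.
describe() ->
    describe(os:type()).

%% Comparing an already bound variable is a legal guard.
describe(Type) ->
    if
        Type == {win32, windows} -> windows;
        true -> other
    end.
```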

9.5 Why can't I use "or" in a guard?

You can. Since R6A, several guards separated with semicolons perform a logical 'or', for example:

	f(N) when (N - 1) > 3; is_atom(N) -> yes;
	f(_N) -> no.

will return yes for f(5) as well as f(blork). Similarly, the comma is used for logical 'and':

	g(N) when is_integer(N), N > 5 -> yes;
	g(_N) -> no.

Beware! The semicolon doesn't mean exactly the same thing as 'or'. For instance, f(hello) returns yes, whereas evaluating ((hello - 1) > 3) or is_atom(hello) as an ordinary expression fails with a badarith error. The first example, f(), is defined as having the same effect as writing:

	f(N) when (N - 1) > 3 -> yes;
	f(N) when is_atom(N) -> yes;
	f(_N) -> no.
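The difference is easiest to see side by side. In this sketch (the module name and h/1 are invented for the comparison), f/1 uses a semicolon and h/1 uses the 'or' operator inside a single guard. An error inside a guard makes that guard fail quietly, so for h/1 the failing subtraction takes the whole 'or' expression down with it.

```erlang
-module(or_demo).
-export([f/1, h/1]).

%% Two separate guards: if the first one fails, the second is still tried.
f(N) when (N - 1) > 3; is_atom(N) -> yes;
f(_N) -> no.

%% One guard using 'or': an error in either operand fails the whole guard.
%% The extra parentheses matter, because 'or' binds tighter than '>'.
h(N) when ((N - 1) > 3) or is_atom(N) -> yes;
h(_N) -> no.
```

So f(hello) returns yes but h(hello) returns no, while both return yes for 5.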

9.6 Why does 'catch' give me a syntax error?

It seems natural enough to write

	2> A = catch 1/0.
	** 2: syntax error before: 'catch' **

But the parsing rules for Erlang are a bit unexpected here: catch binds less tightly than you might expect. To make it parse the way you want, add parentheses:

	3> A = (catch 1/0).
	{'EXIT',{badarith,[{erl_eval,eval_op,3},
                   {erl_eval,expr,3},
                   {erl_eval,exprs,4},
                   {shell,eval_loop,2}]}}
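In newer code, try ... catch is often used instead of catch, which avoids the precedence question altogether. A small sketch (safe_div/2 and the module name are invented here):

```erlang
-module(catch_demo).
-export([safe_div/2]).

%% Returns a tagged tuple instead of crashing on a bad division.
safe_div(A, B) ->
    try A / B of
        Quotient -> {ok, Quotient}
    catch
        error:badarith -> {error, badarith}
    end.
```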

9.7 What is a sticky directory?

This typically crops up when playing with examples from the old (last century) Erlang book, for instance:

        1> c(sets.erl).
        {error,sticky_directory}

The problem is that there already is a standard library module called 'sets'. The Erlang runtime system is protecting you.

The easiest solution is to rename your module, e.g. to mysets.erl. It is also possible to 'un-stick' the directory containing the library module.
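Before picking a module name, you can ask the code server whether the name is already taken. The sketch below uses code:which/1 and code:is_sticky/1 from the standard code module; the wrapper module and function names are invented. (The function for un-sticking a directory is code:unstick_dir/1.)

```erlang
-module(sticky_demo).
-export([name_taken/1]).

%% Reports whether Module already exists in the code path and,
%% if so, whether it was loaded from a sticky directory.
name_taken(Module) ->
    _ = code:ensure_loaded(Module),
    case code:which(Module) of
        non_existing -> false;
        _Where -> {true, code:is_sticky(Module)}
    end.
```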

9.8 Why won't my distributed Erlang nodes communicate?

For Erlang nodes to be able to communicate, you need:

A working TCP network between the nodes. On Unix systems you can check this by using telnet, though a working telnet doesn't guarantee that enough of your network is working; DNS problems, for example, throw a spanner in Erlang's distribution mechanisms.
The same node naming scheme on all nodes (you cannot have a system where some nodes use fully qualified names and others use short names).
The same "magic security cookie" on all nodes.

Here's an example of how to create two nodes on different machines called martell and grolsch and verify that they're connected. On one machine:

	~ >rlogin martell
	Last login: Sat Feb 5 20:40:52 from super
	~ >erl -sname first_node
	Eshell V4.9.1.1  (abort with ^G)
	(first_node@martell)1> erlang:set_cookie(node(), nocookie).
	true

And on the other

	~ >rlogin grolsch
	Last login: Thu Feb 3 10:54:20 from :0
	~ >erl -sname second_node
	Eshell V4.9.1.1  (abort with ^G)
	(second_node@grolsch)1> erlang:set_cookie(node(), nocookie).
	true
	(second_node@grolsch)2> net_adm:ping(first_node@martell).
	pong
	(second_node@grolsch)3> rpc:call(first_node@martell, os, type, []).
	{unix,sunos}

The pong tells us that the connection works; the result of the ping is pang when the connection isn't working. The rpc:call() command illustrates executing a function on the other node.
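When a ping fails, it helps to look at the local side first. This is a rough sketch (the dist_check module is invented) built on net_adm:ping/1, node/0, nodes/0 and is_alive/0; it reports the local node name and whether the local node is distributed at all.

```erlang
-module(dist_check).
-export([check/1]).

%% Pings Node and returns some context that is useful when it fails.
check(Node) ->
    case net_adm:ping(Node) of
        pong ->
            {connected, nodes()};
        pang ->
            {unreachable, [{local_node, node()},
                           {distributed, is_alive()}]}
    end.
```

If distributed is false, the local node was started without -sname or -name, and no ping can succeed.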

Warning

Cookies are a password for Erlang nodes. The example above sets the password to "nocookie", which removes almost all security. Anyone on your network can use your Erlang node for anything, including deleting all of your files.

9.9 When distribution won't work

One simple reason why distribution won't work is if you're combining different versions of Erlang. The OTP group's aim is to maintain backwards compatibility with the two previous major versions of Erlang/OTP, i.e. R13B works with all R13B minor releases as well as with R12B and R11B.

Beyond that, you need to start digging. Erlang nodes find each other through the epmd daemon, which is started automatically the first time you start a distributed node. For debugging, it's useful to kill it (by running epmd -kill) and then restart it with the debugging flag. You can find epmd in the same directory as the erlang binary:

	~ >/otp/releases/otp_beam_sunos5_r6b/erts-4.9.1/bin/epmd -d -d
	epmd: Sat Feb 5 21:04:39 2000: epmd running - daemon = 0
	epmd: Sat Feb 5 21:04:39 2000: try to initiate listening port 4369
	epmd: Sat Feb 5 21:04:39 2000: starting
	epmd: Sat Feb 5 21:04:39 2000: entering the main select() loop

When you start a distributed node on the same system, you should see a message in the epmd window:

	epmd: Sat Feb 5 21:07:33 2000: registering 'first_node:1', port 53566

Similarly, when you ping the node from a node on another system, you should see some messages go by. If nothing seems to be happening on epmd, it's useful to scan for packets on the Ethernet to see if the information is really getting out on the net. On Unix systems you can do this with tcpdump:

	# /usr/local/sbin/tcpdump port 4369
	tcpdump: listening on le0
	21:10:17.349286 grolsch.37558 > martell.4369: S 747683789:747683789(0)
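You can also ask epmd from inside Erlang which nodes it knows about on the local host, using net_adm:names/0. A small sketch, with an invented wrapper module:

```erlang
-module(epmd_check).
-export([registered/0]).

%% Lists the node names registered with the local epmd,
%% or reports why epmd could not be reached.
registered() ->
    case net_adm:names() of
        {ok, Names} -> {ok, [Name || {Name, _Port} <- Names]};
        {error, Reason} -> {no_epmd, Reason}
    end.
```

If the node you expect is missing from the list, the problem is on the local machine rather than on the network.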

9.10 Avoiding DNS

One fairly common obstacle to getting distribution to work is a broken DNS setup. One way to avoid that, or test for it, is to start distributed Erlang with IP addresses instead of hostnames, e.g. on one machine or window:

	~ >erl -name first_node@127.0.0.1
	Erlang (BEAM) emulator version 5.5.5 [source] [64-bit]
	Eshell V5.5.5  (abort with ^G)
	(first_node@127.0.0.1)1> erlang:set_cookie('second_node@127.0.0.1', nocookie).
	true
	(first_node@127.0.0.1)2> net_adm:ping('second_node@127.0.0.1').
	pong

and in the other:

        ~ >erl -name second_node@127.0.0.1
        Erlang (BEAM) emulator version 5.5.5 [source] [64-bit] [async-threads:0]
        Eshell V5.5.5  (abort with ^G)
        (second_node@127.0.0.1)1> erlang:set_cookie('first_node@127.0.0.1', nocookie).
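To test name resolution on its own, you can ask Erlang to resolve a hostname the way the runtime would. This sketch wraps inet:getaddr/2 and inet:ntoa/1; the module name is invented, and the hostname you pass in is whatever your nodes use.

```erlang
-module(dns_check).
-export([lookup/1]).

%% Resolves Host to an IPv4 address string, or returns the error.
lookup(Host) ->
    case inet:getaddr(Host, inet) of
        {ok, Addr} -> {ok, inet:ntoa(Addr)};
        {error, Reason} -> {error, Reason}
    end.
```

An {error, nxdomain} result for a hostname that appears in a node name is a strong hint that DNS is the culprit.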

9.11 Why does my application die every second time I load new code into it?

Erlang's code replacement system is based around there being (up to) two versions of a module's code loaded at any time, called "current" and "old". When you load new code, the current version becomes "old" and the previous "old" code is thrown away. Any processes still running that previous "old" code are killed. A process that never switches to the newest version therefore survives the first reload but is killed by the second.

You can check if there is any old code for a particular module still running:

	Eshell V4.9.1  (abort with ^G)
	1> l(erl).
	{module,erl}
	2> code:soft_purge(erl).
	false

In this case old code is still running. You can also check if a particular process is running old code, by using erlang:check_process_code(Pid, Module).
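The usual way to keep a long-running process alive across reloads is to make its loop call itself with a fully qualified name (Module:Function), because such a call always goes to the current version of the module. A minimal sketch with invented names:

```erlang
-module(reload_demo).
-export([start/0, loop/1]).

start() ->
    spawn(?MODULE, loop, [0]).

%% loop/1 is exported so that the fully qualified calls below are possible.
loop(Count) ->
    receive
        inc ->
            %% ?MODULE:loop/1 jumps to the newest loaded version of this module.
            ?MODULE:loop(Count + 1);
        {get, From} ->
            From ! {count, Count},
            ?MODULE:loop(Count);
        stop ->
            ok
    end.
```

A plain loop(Count) call would keep the process in whatever version it started in, which is exactly the situation that gets it killed on the second reload.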

9.12 Why can't I open devices (e.g. a serial port) like normal files?

Short answer: because the Erlang runtime system was not designed to do that.

The Erlang runtime system's internal file access system, efile, must avoid blocking, otherwise the whole Erlang system will block. This is not a good thing in a soft real-time system. When accessing regular files, it's generally a reasonable assumption that operations will not block. Devices, on the other hand, are quite likely to block. Some devices, such as serial ports, may block indefinitely.

There are several possible ways to solve this. The Erlang runtime system could be altered, or an external port program could be used to access the device. The topic has been discussed twice on the mailing list.
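The port program approach can be sketched as follows. The module name is invented, and the external program is whatever helper you choose to read the device (for a serial port it would be a small program of your own; that part is an assumption, not something the runtime provides). The Erlang side only opens a port and collects what the helper writes to its standard output.

```erlang
-module(port_demo).
-export([run/2]).

%% Starts Program (looked up in PATH) with Args and returns
%% {ExitStatus, Output} once the program has finished.
run(Program, Args) ->
    Exe = os:find_executable(Program),
    Port = open_port({spawn_executable, Exe},
                     [{args, Args}, binary, exit_status]),
    collect(Port, <<>>).

%% The blocking read happens in the external program, not in the runtime.
collect(Port, Acc) ->
    receive
        {Port, {data, Data}} ->
            collect(Port, <<Acc/binary, Data/binary>>);
        {Port, {exit_status, Status}} ->
            {Status, Acc}
    end.
```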

9.13 Why doesn't the stack backtrace show the right functions for this code:

	-module(erl).
	-export([a/0]).

	a() -> b().
	b() -> c().
	c() -> 3 = 4.           %% will cause badmatch

The stack backtrace only shows function c(), rather than a(), b() and c(). This is because of last-call optimisation: the compiler knows it does not need to keep a stack frame for a() or b(), because the last thing each of them does is call another function. Hence their stack frames do not appear in the stack backtrace.
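You can observe this from code by catching the error together with its stack trace. The sketch below is a variation of the module above with invented names; the functions take an argument only so that the compiler has nothing to complain about. It returns the names of this module's functions that appear in the trace.

```erlang
-module(trace_demo).
-export([frames/0]).

%% Returns the local function names found in the stack trace
%% of the badmatch raised in c/1.
frames() ->
    try a(4)
    catch
        error:{badmatch, _}:Stack ->
            [Name || {?MODULE, Name, _, _} <- Stack]
    end.

a(X) -> b(X).       % last call: no frame for a/1 is kept
b(X) -> c(X).       % last call: no frame for b/1 is kept
c(X) -> 3 = X.      % fails with badmatch when X is not 3
```

The result starts with c and contains neither a nor b.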

9.14 What do I do if I get an error when compiling Open Source Erlang?

Consider using a binary release (Debian GNU/Linux includes them in the woody and potato releases; Windows binaries are on the download page).

Ask for help on the mailing list, providing as many details as you can, including

your system details (OS, OS version, compiler version)
the error messages you get
any errors you got when you ran configure

9.15 Why do errors in the shell kill other processes?

Because the other processes are linked to the shell process.

Each time an error occurs in the Erlang shell, the shell process exits and a new one is started:

	1> self().
	<0.23.0>
	2> x - y.
	** exited: {badarith,[{erl_eval,eval_op,3},
                      {erl_eval,exprs,4},
                      {shell,eval_loop,2}]} **
	3> self().
	<0.40.0>

Thus all processes linked to <0.23.0> will receive exit signals. Unless they trap exits, they will exit too. One way to avoid this is to avoid linking "background" processes to shell processes, for instance by using spawn instead of spawn_link.
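The effect can be reproduced outside the shell. In this sketch (all names invented) a short-lived process plays the role of the shell: it starts a worker with the spawn function you pass in and then crashes. The function reports whether the worker outlived it.

```erlang
-module(link_demo).
-export([survives/1]).

%% Spawn is fun erlang:spawn/1 or fun erlang:spawn_link/1.
survives(Spawn) ->
    Caller = self(),
    %% Stand-in for the shell process: start a worker, then crash.
    spawn(fun() ->
                  Child = Spawn(fun() -> receive stop -> ok end end),
                  Caller ! {worker, Child},
                  exit(crashed)
          end),
    receive {worker, Worker} -> ok end,
    Ref = erlang:monitor(process, Worker),
    receive
        {'DOWN', Ref, process, Worker, _Reason} ->
            false
    after 200 ->
            %% Still alive after the parent crashed: clean up and report.
            erlang:demonitor(Ref, [flush]),
            exit(Worker, kill),
            true
    end.
```

link_demo:survives(fun erlang:spawn_link/1) returns false, while link_demo:survives(fun erlang:spawn/1) returns true.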

9.16 Why do I get incorrect answers for floating-point operations?

Some floating point operations produce results which surprise some people. Here's an example:

  5> 1.001 * 1000.
  1000.9999999999999

This is not an error; it is a property of floating point arithmetic and is not specific to Erlang. Any language which uses binary floating point arithmetic behaves this way.

http://floating-point-gui.de/ is a relaxed introduction to the topic. What Every Computer Scientist Should Know About Floating-Point Arithmetic goes into more depth.

Erlang's output formatting lets you specify the precision, for instance:

  1> io:fwrite("~.3f\n", [1.001 * 1000]).
  1001.000
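The same care applies when comparing floats: test whether two values are close enough rather than exactly equal. A small sketch with invented names; the tolerance is something you choose to suit your data. It also shows float_to_list/2, which rounds to a fixed number of decimals without printing.

```erlang
-module(float_demo).
-export([close_enough/3, to_text/2]).

%% true when A and B differ by at most Tolerance.
close_enough(A, B, Tolerance) ->
    abs(A - B) =< Tolerance.

%% Formats the float X with a fixed number of decimals.
to_text(X, Decimals) ->
    float_to_list(X, [{decimals, Decimals}]).
```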