error: Removing (timedout) connection

Ryan Zezeski rzezeski@REDACTED
Mon Jan 10 20:20:18 CET 2011


Hi guys/gals,

Recently I've been converting my non-distributed Erlang app into a
distributed one and I ran into some troubles.  If you want to skip straight
to the question it's at the end, but I try to give some insight into what
I'm doing below.

First off, I attached a PDF (sorry, PDF was not my choice) which contains a
diagram I drew of the current setup.  I apologize for my utter failure as an
artist.  In this diagram you'll see 3 vertical partitions representing 3
different machines and a horizontal one representing the fact that each
machine has 2 Erlang nodes on it.  3 of the Erlang nodes form a riak
cluster.  The other 3 are the application (or should I say release) I wrote,
and to distribute my app I utilized riak's underlying technology, riak_core
(I use it as an easy way to persist cluster membership and use the ring
metadata to store some data).  These six nodes are fully connected,
i.e. each node has a connection to every other node.
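
(As a quick sanity check, something like the following can verify the
full mesh from any one node; just a rough sketch using stdlib calls,
nothing riak_core-specific:

    %% Every node in a full mesh should see all the others.
    check_mesh() ->
        All = lists:sort([node() | nodes()]),
        {Replies, BadNodes} = rpc:multicall(All, erlang, nodes, []),
        %% any node reporting fewer than length(All) - 1 peers has
        %% an incomplete view of the cluster
        Short = [Ns || Ns <- Replies, length(Ns) < length(All) - 1],
        {BadNodes, Short}.

An empty result on both sides means every node answered and every node
sees all of its peers.)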

Occasionally, I've noticed the following message on any one of the six
nodes:

=ERROR REPORT==== ...
** Node <node> not responding **
** Removing (timedout) connection **

Furthermore, using net_kernel:monitor_nodes(true, [nodedown_reason]) I've
noticed messages like the following:

{nodedown, <node>, [{nodedown_reason, connection_closed}]}
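
The monitoring itself is nothing fancy; it's done from a small
gen_server along these lines (a rough sketch, and node_watcher is just
a made-up name):

    -module(node_watcher).
    -behaviour(gen_server).
    -export([start_link/0]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    init([]) ->
        %% with options, the messages are {nodeup, Node, InfoList}
        %% and {nodedown, Node, InfoList}
        ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
        {ok, []}.

    handle_info({nodeup, Node, _Info}, State) ->
        error_logger:info_msg("node up: ~p~n", [Node]),
        {noreply, State};
    handle_info({nodedown, Node, Info}, State) ->
        Reason = proplists:get_value(nodedown_reason, Info),
        error_logger:error_msg("node down: ~p reason: ~p~n",
                               [Node, Reason]),
        {noreply, State}.

    handle_call(_Req, _From, State) -> {reply, ok, State}.
    handle_cast(_Msg, State) -> {noreply, State}.
    terminate(_Reason, _State) -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.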


You'll notice in the diagram that there is a system process running on
machine A.  It makes a gen_server:cast to three processes to do some
work, and each of these processes calls link (the "L" in the diagram).
Each of the three (gen_server) processes then makes a call, at roughly
the same time, to the riak cluster performing the _same exact_
map/reduce job.  Sometimes I'll see errors where this map/reduce job
times out on one of the nodes.

So at lunch I wondered: is there just too much communication going on
between the nodes, so that the kernel ticks are getting lost or
delayed?  At first I suspected that each node was using the same TCP
connection to talk to every other node, which could have explained my
symptoms.  A few netcats later I realized that there is a dedicated
connection between each pair of nodes, so that theory was blown.
However, I still think the volume of messages being passed back and
forth could be the cause of the problem; could it block the VM in some
way so that the kernel tick can't get through?
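
For what it's worth, my understanding of the tick mechanism is that
each node sends a tick to every connected node roughly every
net_ticktime/4 seconds, and a node that has been silent for
net_ticktime seconds (60 by default) gets exactly the "Removing
(timedout) connection" treatment shown above.  If the answer to my
question below is yes, I assume the workaround is to raise
net_ticktime on every node, something like:

    %% At boot, as a kernel application parameter:
    %%   erl -kernel net_ticktime 120
    %%
    %% Or at run time, on each node in the cluster:
    1> net_kernel:get_net_ticktime().
    60
    2> net_kernel:set_net_ticktime(120).
    change_initiated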


Q: Can a chatty cluster cause the kernel ticks to be lost/delayed thus
causing nodes to disconnect from each other?

Thanks,

-Ryan
