[erlang-questions] System limit bringing down rex and the VM

Thu Sep 9 14:15:15 CEST 2010

I understand your points completely... however, there is certainly a difference between having an Erlang process die and allowing its peers to handle the cleanup, and having the Erlang VM die. The VM has no peer in that way. Yes, Mnesia needs RPC... but so do a lot of things, and if the pattern to follow is that you die and allow the peers to respond... that's not what is happening here. Rex dies and brings the world down with it. Mnesia is unable to respond to the issue. The mnesia code shows that it is prepared for {badrpc,_} errors. Rex failed. OK. Let me (or mnesia) decide what to do next. Why is this such a controversial idea? This isn't about mnesia though. It's about rex. Anything could have triggered the crash. I fail to see why rex is the arbiter of the VM's fate when the process limit is reached. People keep implying it was designed this way, but the behavior is sporadic and inconsistent with similar issues. If such a behavior was designed, it should live in the VM (not that I support this behavior) and not in a process whose failure is, in practice, totally recoverable (i.e. {badrpc,nodedown}).
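For illustration, this is roughly what "let the caller decide" looks like on the calling side. It is only a sketch: the module name rpc_guard and the 5000 ms timeout are made up for the example, and it covers only the caller's view of a failed call, not what the kernel supervision tree does when rex itself dies.

```erlang
-module(rpc_guard).
-export([call/4]).

%% Wrap rpc:call/5 so that a failed call comes back as a tagged error
%% and the caller chooses what to do next.
%% The 5000 ms timeout is an arbitrary value chosen for this sketch.
call(Node, Mod, Fun, Args) ->
    case rpc:call(Node, Mod, Fun, Args, 5000) of
        {badrpc, Reason} ->
            %% e.g. Reason =:= nodedown or timeout
            {error, {rpc_failed, Reason}};
        Result ->
            {ok, Result}
    end.
```

One caveat: a remote function that legitimately returns {badrpc, _} cannot be told apart from a failed call this way.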

-----Original Message-----
From: erlang-questions@REDACTED [mailto:erlang-questions@REDACTED] On Behalf Of Ulf Wiger
Sent: Thursday, September 09, 2010 5:15 AM
To: erlang-questions@REDACTED
Subject: Re: [erlang-questions] System limit bringing down rex and the VM

On 09/09/2010 12:34 AM, bile@REDACTED wrote:
> 
> How many other limits cause the platform to shit the bed? I suspect 
> few.

Actually, hitting the system limits themselves will not cause the VM to crash. OOM is the only exception I can think of right now.

Trying to spawn a process when the max number of processes has been reached will simply raise an exception. However, some (most) application code will not include code to cope with the situation where you can't spawn a process, so most likely, applications will come crashing down when this happens. Which one goes first is mainly up to chance.
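As a minimal sketch of code that does cope: spawn fails with a system_limit error when the process table is full, and that error can be caught like any other. The module and function names here (safe_spawn, try_spawn) are made up for the example.

```erlang
-module(safe_spawn).
-export([try_spawn/1]).

%% Try to spawn Fun. If the process table is full, spawn/1 raises
%% error:system_limit; turn that into a tagged tuple so the caller
%% can decide what to do instead of crashing.
try_spawn(Fun) when is_function(Fun, 0) ->
    try spawn(Fun) of
        Pid -> {ok, Pid}
    catch
        error:system_limit -> {error, system_limit}
    end.
```

A caller can then match on {error, system_limit} and queue, reject, or retry the work, whichever suits the application.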

Erlang is a concurrency-oriented language. The spawn() function is about as fundamental as new() in OO languages. In most other languages, you treat the spawning of processes as something scary that you don't want to do unless you absolutely have to.
Also, the limits are usually pretty low. In Erlang, you can raise the limit to > 200 million processes, if you have enough memory for it. The default limit is fairly low (32,000) mainly for historical reasons, but also because it makes sense to keep memory footprint low by default, and 32K processes is plenty enough for most uses.
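To see where a node stands, the limit and the current count can both be read with erlang:system_info/1. The limit itself is set at startup with the +P flag (for example, erl +P 1000000); the VM may settle on a somewhat larger value than the one requested. The module name proc_limits is just for this sketch.

```erlang
-module(proc_limits).
-export([report/0]).

%% Report the configured process limit, the number of processes
%% currently alive, and the remaining headroom.
%% The limit is fixed at VM start, e.g. with `erl +P 1000000`.
report() ->
    Limit = erlang:system_info(process_limit),
    Count = erlang:system_info(process_count),
    [{limit, Limit}, {count, Count}, {headroom, Limit - Count}].
```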

> Those who defend this behavior are not consistent. The behavior of the 
> core processes are not consistent. Just look at the code.

It would be better if you mention specific instances rather than asking people to "just look at the code". OTOH, we can probably stipulate that the code base is inconsistent in many ways....

In general, application code does not cope with system limits being exhausted. This is in line with Erlang's "let it crash" philosophy as well as the fact that if you want true robustness in the kind of products Erlang was designed for, you have to have a redundant setup anyway.

The thing about redundancy is that it works best if the failing side fails quickly and distinctly, rather than trying in vain to correct the problem locally. This is the essence of "fail-fast" programming. Although this is not 100% consistently implemented in Erlang/OTP either, you might want to keep in mind that Erlang has been breaking new ground in this respect, and most of the people who've worked on OTP components over the years were originally steeped in the same programming mindset as everyone else. ;-)

>>> I shouldn't have to build my own spawn wrapper to keep track of the 
>>> number of processes. The VM already does this. Besides, this problem 
>>> couldn't be fully addressed that way.
>>
>> You don't have to. I suspect you need to do some sort of load 
>> regulation in your system.
> 
> Load regulation? My system is designed to support arbitrary process 
> creation. I was maxing out the processes as a scale test. If for some 
> reason it can't spawn new processes then I want control over what 
> happens next. Rex and the supervisor's behavior takes that control 
> from me. At best I can poll the process count and warn that the system 
> will soon fail but am powerless to do anything about it.

With any programming language or operating environment, you have the responsibility as a developer to understand and respect the fundamental assumptions made when developing the environment.
If your requirements don't match well enough, it might be better to find another language/environment that fits your problem better.

You seem to be saying that you shouldn't have to worry about the user of your application throwing more work at you than the system is capable of handling? This may be a valid requirement in some domains, but Erlang is fundamentally a language for developing messaging systems, which have to cope with overload situations (including Denial-of-Service attacks) in a structured way. When subjected to a DoS attack, you typically don't just want to accept the challenge and likely die honorably as a result. The normal way to handle overload is to push back, or to shed load, so that it doesn't overwhelm the core components in your system.
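A rough sketch of what shedding load at the spawn point could look like: refuse new work while the number of free process slots is at or below a reserve kept for the rest of the system. The names (shed, maybe_spawn) and the Reserve parameter are invented for the example, and the right reserve is something you would have to tune.

```erlang
-module(shed).
-export([maybe_spawn/2]).

%% Spawn Fun only if more than Reserve process slots remain free.
%% Reserve is a hypothetical tuning value: slots left for system
%% processes such as rex and the supervisors.
maybe_spawn(Fun, Reserve)
  when is_function(Fun, 0), is_integer(Reserve), Reserve >= 0 ->
    Limit = erlang:system_info(process_limit),
    Count = erlang:system_info(process_count),
    case Limit - Count > Reserve of
        true ->
            %% The check above is not atomic with the spawn, so the
            %% limit can still be hit here; catch it as well.
            try spawn(Fun) of
                Pid -> {ok, Pid}
            catch
                error:system_limit -> {error, overload}
            end;
        false ->
            {error, overload}
    end.
```

The caller pushes back on {error, overload} (reject the request, ask the client to retry later) rather than letting the spawn failure surface somewhere less convenient.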

You might compare this with the recurring discussions about active vs passive sockets. Sockets in POSIX are by design passive, as you have to explicitly read data from the buffer, but in Erlang, the default is that the VM empties the buffer and delivers the data to the socket owner asynchronously. I think this was the wrong default, and the recommendation is to use passive, or {active,once} to avoid being swamped by input from the network. This is in line with the idea of not accepting more work than you can cope with.
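For reference, a minimal {active,once} receive loop might look like the following. It assumes the socket was opened with {active, false} and is owned by the calling process; the module name once_reader and the Handler callback are placeholders for this sketch.

```erlang
-module(once_reader).
-export([loop/2]).

%% Ask the VM for exactly one message from the socket, handle it, then
%% re-arm. No new data is delivered until the previous chunk has been
%% processed, so the mailbox cannot be flooded by the network.
%% Handler is a hypothetical fun(binary() | string()) -> any().
loop(Socket, Handler) ->
    case inet:setopts(Socket, [{active, once}]) of
        ok ->
            receive
                {tcp, Socket, Data} ->
                    Handler(Data),
                    loop(Socket, Handler);
                {tcp_closed, Socket} ->
                    ok;
                {tcp_error, Socket, Reason} ->
                    {error, Reason}
            end;
        {error, Reason} ->
            {error, Reason}
    end.
```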

> The failure of
> a non-essential component of the system should not cause the VM to 
> fail, just as a bad process in an OS should not cause it to halt. Could 
> one of you please explain to me how that analogy is incorrect?

RPC is not a non-essential component, especially not to mnesia.
Mnesia assumes that rpc will work, unless something really bad has happened. The recommended behaviour when something really bad happens in Erlang is to die and let the rest of the system take over.

BR,
Ulf W
