[erlang-questions] data sharing is outside the semantics of Erlang, but it sure is useful

Thu Sep 17 03:02:53 CEST 2009

> And can't reasonably be *expected* to do.  It is reasonable
> to expect Erlang to *preserve* sharing, as when sending a term
> to another process, because failing to do so can make space use
> blow up in a rather sickening way which it's hard for a
> programmer to detect.
I wasn't suggesting that Erlang create sharing where there is none, just
that it preserve existing sharing unless asked not to.

> I sometimes think that for every use case there is an equal
> and opposite use case.  In the case of memory, for example,
> we've got *space* issues and *cache* issues.  Looking for
> existing copies of stuff can save you space, but it can
> do terrible things to your cache (bringing in stuff that it
> turns out you don't want).  The tradeoffs depend on how much
> space you may save, how likely the saving is, and how well you
> can avoid looking at irrelevant stuff while looking for an
> existing copy.  The programmer is in a better position to know
> these things than the Erlang compiler or runtime system.
I'm not suggesting that we do it for every single piece of data.  We
already sort of do it for atoms, and most numbers are small enough that
it's not a big win, so the only real question is for lists, tuples, and
binaries.

The win for lists and binaries is pretty huge.  Binaries are rarely
small.  Lists get huge too.  I'm not suggesting a full-blown hash-cons
solution, but some way to prevent invisible expansion is pretty
critical.
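
To make "invisible expansion" concrete, here is a minimal sketch that compares the size of a term with its sharing intact against the size it would have if every reference were expanded into its own copy. It relies on erts_debug:size/1 and erts_debug:flat_size/1, which are undocumented debugging helpers rather than a supported API, so treat the numbers as indicative only. The module name sharing_demo and the function sizes/0 are made up for illustration.

```erlang
-module(sharing_demo).
-export([sizes/0]).

%% Returns {SharedWords, FlatWords} for a list of ten tuples that all
%% refer to the same 100-element list, twice per tuple.
sizes() ->
    %% Built at runtime, so it lives on the process heap exactly once.
    Name = lists:duplicate(100, $a),
    %% Every tuple points at the same Name; nothing is copied here.
    Edges = [{Name, Name} || _ <- lists:seq(1, 10)],
    %% size/1 counts each shared subterm once; flat_size/1 counts it
    %% every time it is referenced, as a flat copy would.
    {erts_debug:size(Edges), erts_debug:flat_size(Edges)}.
```

Calling sharing_demo:sizes() in a shell should give a flat size many times larger than the shared size, and that difference is the expansion a flat copy pays for.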

> One thing I didn't quite understand was why the original data
> source is emitting stuff with lots of duplication in the first
> place.  Fixing the duplication problem at the source has the
> added benefit of reducing the cost of getting the data into an
> Erlang process to start with.
I've run into this when working with a simple graph algorithm.   
Representing edges as {source,dest} was great for atoms and horrible  
for strings.  All of my tests used atoms, but at runtime, the strings  
were being duplicated (because I was messaging them around).  It was  
noticeable.
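
A rough sketch of the edge case described above, assuming a runtime that flattens terms when copying them into another process's heap (which, as far as I know, is the default behaviour). The module name edge_copy_demo and the function run/0 are hypothetical, and erts_debug:size/1 is an undocumented debugging helper. With atoms as node names there is nothing to duplicate; with strings, each edge gets its own copy of both endpoints on the receiving side.

```erlang
-module(edge_copy_demo).
-export([run/0]).

%% Returns {WordsBeforeSend, WordsAfterSend} for a list of edges whose
%% endpoints all share one string, or the atom timeout.
run() ->
    Node = lists:duplicate(50, $x),                  % a string node name
    Edges = [{Node, Node} || _ <- lists:seq(1, 100)],
    Parent = self(),
    %% The child measures the term as it arrived on its own heap and
    %% reports only the resulting integer back to the parent.
    Pid = spawn(fun() ->
        receive
            {edges, Received} ->
                Parent ! {self(), erts_debug:size(Received)}
        end
    end),
    Pid ! {edges, Edges},
    receive
        {Pid, SizeAfterSend} ->
            {erts_debug:size(Edges), SizeAfterSend}
    after 5000 ->
        timeout
    end.
```

On a runtime that copies messages flat, the second number should be far larger than the first. Replacing Node with an atom should make the two numbers equal, which is why tests written with atoms never show the problem.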

Another problem I had was with a backend for the Linux Network Block
Device.  I was tossing around disk blocks (4 KB binaries) and hit
pathological memory usage really quickly.

Real development has real problems with unnecessary data duplication.   
This is not a matter of optimization.  Someone needs to finish one of  
the alternate heap implementations.  Really.

-- 
Jayson Vantuyl
kagato@REDACTED