[erlang-questions] Managing a huge binary file

Dmitry Kolesnikov dmkolesnikov@REDACTED
Tue Apr 30 14:16:20 CEST 2013


Hello,

Performance of you solution is bound on disk I/O. 
Just to validate here, are you able to consume data with desirable speed from the storage using other run-time e.g. C? 

Open the file in raw mode with read ahead flag should be sufficient for many use-cases (e.g. I am using that method for late CSV intake and the speed is acceptable). What is an average line per second you are  getting in you application?

Best Reagrads, Dmitry

On Apr 30, 2013, at 3:06 PM, Salvador Tamarit <stamarit@REDACTED> wrote:

> 2013/4/30 Jesper Louis Andersen <jesper.louis.andersen@REDACTED>:
>> 
>> On Apr 30, 2013, at 3:23 AM, Salvador Tamarit <stamarit@REDACTED> wrote:
>> 
>>> Suppose that we have a huge binary file (8GB or so) containing all the
>>> integers from 0 to a given (very big) number. The numbers are unsorted
>>> and some of them (70 numbers for instance) has been removed.
>>> 
>>> The goal is to find the the Nth removed number (in order). For
>>> instance, if the file contains the numbers from 0 to 99, and then
>>> number 101, the 1st removed would be 100.
>> 
>> Quick shot at a possible solution: Variant of a Fenwick-tree.
>> 
>> Essentially, the idea is to count bit-patterns in the numbers and let each bit correspond to a slot in an array. Now you can "count" how many times a given pattern is there. With a bit of mangling, it may be possible to walk over all the data building up the counts in the array and then finally scrutinize it for 0'es and then work backwards what missing pattern you have.
>> 
>> Another variant: Radix tree or trie structure on bit-patterns since this greatly compresses the in-memory representation. Hash Array Mapped Tries comes to mind here. You can easily walk it in order to determine "holes" in the structure afterwards.
>> 
> 
> Both solutions seems very interesting. Thanks Jesper and Ian for your
> proposals. However, I can't load the file (in a reasonable time)
> because it's too big, so your solutions will be useful for my second
> step :)
> 
>> Jesper Louis Andersen
>>  Erlang Solutions Ltd., Copenhagen
>> 
>> 
>> 
>> _______________________________________________
>> erlang-questions mailing list
>> erlang-questions@REDACTED
>> http://erlang.org/mailman/listinfo/erlang-questions
> _______________________________________________
> erlang-questions mailing list
> erlang-questions@REDACTED
> http://erlang.org/mailman/listinfo/erlang-questions




More information about the erlang-questions mailing list