
Python: Compress And Save/load Large Data From/into Memory

I have a huge dictionary with numpy arrays as values which consumes almost all of my RAM. It is not possible to pickle or compress it entirely. I've checked some solutions to rea…

Solution 1:

Your first focus should be on having a sane way to serialize and deserialize your data. We have several constraints on your data, given in the question itself or in comments on it:

  • Your data consists of a dictionary with a very large number of key/value pairs
  • All keys are unicode strings
  • All values are numpy arrays, each individually small enough to fit comfortably in memory at any given time (even allowing multiple copies of any single value), although in aggregate the required storage becomes extremely large.
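
For concreteness, a scaled-down illustration of the kind of data assumed here (the key names and array sizes below are invented purely for illustration):

import numpy

data = {
    "sensor/0001": numpy.random.rand(100_000),
    "sensor/0002": numpy.random.rand(100_000),
    # ... many more entries, far too large in aggregate to pickle in one piece
}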

This suggests a fairly simple implementation:

import io
import struct

import numpy

def serialize(f, content):
    for k, v in content.items():
        # write length of key, followed by key as a UTF-8 byte string
        k_bstr = k.encode('utf-8')
        f.write(struct.pack('L', len(k_bstr)))
        f.write(k_bstr)
        # write length of value, followed by value in numpy.save format
        # note: 'L' uses the platform's native size and byte order, so the
        # resulting files are not portable across platforms
        memfile = io.BytesIO()
        numpy.save(memfile, v)
        f.write(struct.pack('L', memfile.tell()))
        f.write(memfile.getvalue())

def deserialize(f):
    retval = {}
    while True:
        # read the key length; an empty read means we've hit end of stream
        content = f.read(struct.calcsize('L'))
        if not content: break
        k_len = struct.unpack('L', content)[0]
        k_bstr = f.read(k_len)
        k = k_bstr.decode('utf-8')
        # read the value length, then the numpy.save payload
        v_len = struct.unpack('L', f.read(struct.calcsize('L')))[0]
        v_bytes = io.BytesIO(f.read(v_len))
        v = numpy.load(v_bytes)
        retval[k] = v
    return retval

As a simple test:

test_file = io.BytesIO()
serialize(test_file, {
    "First Key": numpy.array([123,234,345]),
    "Second Key": numpy.array([321,432,543]),
})

test_file.seek(0)
print(deserialize(test_file))
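
If everything works, this should print something like:

{'First Key': array([123, 234, 345]), 'Second Key': array([321, 432, 543])}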

...so, we've got that -- now, how do we add compression? Easily.

import gzip

with gzip.open('filename.gz', 'wb') as gzip_file:
    serialize(gzip_file, your_data)

...or, on the decompression side:

with gzip.open('filename.gz', 'rb') as gzip_file:
    your_data = deserialize(gzip_file)

This works because the gzip library already streams data out as it's requested, rather than compressing it or decompressing it all at once. There's no need to do windowing and chunking yourself -- just leave it to the lower layer.
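
Putting the two halves together, a complete round trip using the serialize and deserialize functions above might look like this (the file name data.gz is just an example):

import gzip

import numpy

data = {
    "First Key": numpy.array([123, 234, 345]),
    "Second Key": numpy.array([321, 432, 543]),
}

# compress while writing ...
with gzip.open('data.gz', 'wb') as gzip_file:
    serialize(gzip_file, data)

# ... and decompress while reading back
with gzip.open('data.gz', 'rb') as gzip_file:
    restored = deserialize(gzip_file)

assert all(numpy.array_equal(restored[k], data[k]) for k in data)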

Solution 2:

To write a dictionary to disk, the zipfile module is a good fit.

  • When saving - save each chunk as its own file inside the zip archive.
  • When loading - iterate over the files in the zip archive and rebuild the data (see the sketch below).
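
A minimal sketch of that approach, assuming the same dictionary-of-numpy-arrays layout as in Solution 1 (the names save_zip and load_zip, and the choice of ZIP_DEFLATED compression, are my own, not anything prescribed by the question):

import io
import zipfile

import numpy

def save_zip(path, content):
    # one archive member per key; ZIP_DEFLATED compresses each member individually
    with zipfile.ZipFile(path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        for k, v in content.items():
            memfile = io.BytesIO()
            numpy.save(memfile, v)
            # note: the key must be usable as an archive member name
            zf.writestr(k, memfile.getvalue())

def load_zip(path):
    retval = {}
    with zipfile.ZipFile(path, 'r') as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                retval[name] = numpy.load(io.BytesIO(member.read()))
    return retval

Only one member is held in memory at a time on either side, so the full dictionary never needs to be pickled or compressed in one piece.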
