
Python Pandas MemoryError

I have this environment: python: 2.7.3.final.0, python-bits: 64, OS: Linux, machine: x86_64, processor: x86_64, byteorder: little, pandas: 0.13.1. This is the dataframe info:

Solution 1:

I can also reproduce it on 0.13.1, but the issue does not occur in 0.12 or in 0.14 (released yesterday), so it seems to be a bug in 0.13.
So try upgrading your pandas version. Apart from that, the vectorized way is much faster than the apply (5 s vs >1 min on my machine) and uses less peak memory (200 MB vs 980 MB, measured with %memit) on 0.14.
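For example, the vectorized version just builds the column directly from the string-cast pieces (this is the same expression timed below, assigned to a new _id column):

# Vectorized: cast each column to str once and concatenate, no per-row Python calls
df_train['_id'] = (df_train['Store'].astype(str) + '_'
                   + df_train['Dept'].astype(str) + '_'
                   + df_train['Date_Str'].astype(str))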

Using your sample data repeated 50000 times (giving a DataFrame of 450k rows) and the apply_id function from @jsalonen:

In [23]: pd.__version__ 
Out[23]: '0.14.0'

In [24]: %timeit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
1 loops, best of 3: 5.42 s per loop

In [25]: %timeit df_train.apply(apply_id, 1)
1 loops, best of 3: 1min 11s per loop

In [26]: %load_ext memory_profiler

In [27]: %memit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
peak memory: 201.75 MiB, increment: 0.01 MiB

In [28]: %memit df_train.apply(apply_id, 1)
peak memory: 982.56 MiB, increment: 780.79 MiB
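For completeness, a benchmark frame like the one above can be put together along these lines; the sample values and the use of pd.concat are assumptions for illustration, not the question's actual data:

import pandas as pd

# Hypothetical sample rows standing in for the question's data
sample = pd.DataFrame({'Store': [1, 1, 2],
                       'Dept': [1, 2, 1],
                       'Date_Str': ['2010-02-05', '2010-02-12', '2010-02-19']})

# Repeat the sample many times to get a large frame for the timings above
df_train = pd.concat([sample] * 50000, ignore_index=True)

The %timeit and %memit lines above can then be rerun against this frame (after %load_ext memory_profiler).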

Solution 2:

Try generating the _id field with a DataFrame.apply call:

def apply_id(x):
    # Build the id from the row's Store, Dept and Date_Str values
    x['_id'] = "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str'])
    return x

# The second argument (1) is the axis, i.e. apply the function to each row
df_train = df_train.apply(apply_id, 1)

When using apply, the id generation is performed per row, resulting in minimal memory-allocation overhead.
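If you only need the new column, a slightly more direct variant (a sketch, not from the original answer) assigns the result of a row-wise lambda instead of rebuilding every row:

# Assumes df_train has Store, Dept and Date_Str columns
df_train['_id'] = df_train.apply(
    lambda x: "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str']),
    axis=1)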
