
How To Pass A Large Number Of Dataframe Columns To Numpy Vectorize As Argument

I've got a dataframe with exactly 31 columns and, for example, 100 rows. I need to create a list of 100 dictionaries whose values are processed from the 31 different columns. I am trying to pass all of those columns to np.vectorize as arguments.

Solution 1:

As the np.vectorize documentation says: "The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop."

Therefore, as hpaulj already said, it won't speed up your code.
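A minimal sketch of what that means in practice (add_one and the array size are arbitrary choices here): np.vectorize makes the same per-element Python calls as an explicit loop, while the compiled ufunc path avoids them entirely.

import numpy as np

def add_one(x):          # plain Python scalar function
    return x + 1

arr = np.arange(100_000)
vec = np.vectorize(add_one)

# All three produce the same values, but only the last is compiled:
a = vec(arr)                              # Python call per element
b = np.array([add_one(x) for x in arr])   # explicit Python loop
c = arr + 1                               # true vectorization (ufunc)
assert (a == b).all() and (b == c).all()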

However, if you want to use it anyway, you don't have to type all your columns, just use a list comprehension:

np.vectorize(_build_data)(*[my_df[c] for c in list(my_df)], param1, param2, param3)
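For context, here is a self-contained sketch of that call. The _build_data body and the parameter values are hypothetical stand-ins, and otypes='O' is added so vectorize returns the dicts in an object array:

import numpy as np
import pandas as pd

# Hypothetical stand-in: receives one scalar per column for each row,
# plus the three extra parameters.
def _build_data(*args):
    *cols, param1, param2, param3 = args
    return {f'col{i}': v * param1 for i, v in enumerate(cols)}

my_df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

# The * unpacks one Series per column; the scalars broadcast across rows,
# so _build_data is called once per row.
build = np.vectorize(_build_data, otypes='O')
result = build(*[my_df[c] for c in list(my_df)], 2, 0, 0)
list(result)   # a list with one dict per row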

Solution 2:

I suspect you were trying to use np.vectorize because you read that numpy 'vectorization' is a way of speeding up pandas code.

In [29]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])                  
In [30]: df                                                                                    
Out[30]: 
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

The slow, row by row, approach to taking the row mean:

In [31]: df.apply(lambda row: np.mean(row), axis=1)                                            
Out[31]: 
0     1.0
1     4.0
2     7.0
3    10.0
dtype: float64

The fast numpy method:

In [32]: df.to_numpy()                                                                         
Out[32]: 
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [33]: df.to_numpy().mean(axis=1)                                                            
Out[33]: array([ 1.,  4.,  7., 10.])

That is, we get an array of the dataframe values, and use a fast compiled method to calculate row means.
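To make that concrete, a small self-contained comparison (timings omitted; they will vary by machine):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

slow = df.apply(lambda row: np.mean(row), axis=1)  # Python call per row
fast = df.to_numpy().mean(axis=1)                  # one compiled call
assert np.allclose(slow.to_numpy(), fast)
# pandas' own compiled path gives the same result: df.mean(axis=1)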

But to make something like a dictionary for each row:

In [35]: df.apply(lambda row: {str(k):k for k in row}, axis=1)                                 
Out[35]: 
0        {'0': 0, '1': 1, '2': 2}
1        {'3': 3, '4': 4, '5': 5}
2        {'6': 6, '7': 7, '8': 8}
3    {'9': 9, '10': 10, '11': 11}
dtype: object

We have to iterate on array rows, just like we do with the dataframe apply:

In [36]: [{str(k):k for k in row} for row in df.to_numpy()]                                    
Out[36]: 
[{'0': 0, '1': 1, '2': 2},
 {'3': 3, '4': 4, '5': 5},
 {'6': 6, '7': 7, '8': 8},
 {'9': 9, '10': 10, '11': 11}]

The array approach is faster:

In [37]: timeit df.apply(lambda row: {str(k):k for k in row}, axis=1)                          
1.13 ms ± 702 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [38]: timeit [{str(k):k for k in row} for row in df.to_numpy()]                             
40.8 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

But the apply method returns a Series, not a list. I suspect most of the extra time is spent in that pandas overhead.
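If a plain list is what you need, the Series that apply returns converts directly:

dict_list = df.apply(lambda row: {str(k): k for k in row}, axis=1).tolist()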

np.vectorize (and np.frompyfunc) can also be used to iterate over an array, but by default they iterate over elements, not rows or columns. In general they are slower than the more explicit iteration in [36].
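For completeness, a sketch of the row-wise variant: the signature parameter tells np.vectorize to feed the function whole rows, though the iteration is still done in Python.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

# signature='(n)->()' means: consume one length-n row, emit one scalar.
row_to_dict = np.vectorize(lambda row: {str(k): k for k in row},
                           signature='(n)->()')
row_to_dict(df.to_numpy())
# array([{'0': 0, '1': 1, '2': 2}, {'3': 3, '4': 4, '5': 5},
#        {'6': 6, '7': 7, '8': 8}, {'9': 9, '10': 10, '11': 11}],
#       dtype=object)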

A clumsy way of making a dataframe from the list:

In [53]: %%timeit 
    ...: df1 = pd.DataFrame(['one', 'two', 'three', 'four'], columns=['d'])
    ...: df1['d'] = [{str(k):k for k in row} for row in df.to_numpy()]
572 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
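A tidier construction builds the frame in one step; the placeholder string column in the cell above only served to set the length:

df1 = pd.DataFrame({'d': [{str(k): k for k in row} for row in df.to_numpy()]})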
