How To Pass A Large Number Of Dataframe Columns To Numpy Vectorize As Argument
Solution 1:
As the np.vectorize documentation says: the vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
Therefore, as hpaulj already said, it won't speed up your code.
However, if you want to use it anyway, you don't have to type all your columns, just use a list comprehension:
np.vectorize(_build_data)([my_df[c] for c in list(my_df)], param1, param2, param3)
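The call above hands _build_data all the columns as a single list argument. If the function instead takes one positional argument per column (an assumption, since its signature isn't shown here), the list can be unpacked with `*`. A minimal sketch, where `build_row` and its `scale` parameter are hypothetical stand-ins for _build_data and its extra parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for _build_data: one argument per column, plus a scalar.
def build_row(a, b, c, scale):
    return (a + b + c) * scale

my_df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])

# One array per column; * spreads them out as separate positional arguments.
columns = [my_df[c].to_numpy() for c in my_df.columns]

# The scalar 10 is broadcast against the columns, so build_row runs once per row.
result = np.vectorize(build_row)(*columns, 10)
print(result)
```

This is still a Python-level loop under the hood, so it saves typing, not time.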
Solution 2:
I suspect you were trying to use np.vectorize because you read that numpy 'vectorization' is a way of speeding up pandas code.
In [29]: df = pd.DataFrame(np.arange(12).reshape(4,3), columns=['A','B','C'])
In [30]: df
Out[30]:
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
The slow, row by row, approach to taking the row mean:
In [31]: df.apply(lambda row: np.mean(row), axis=1)
Out[31]:
0     1.0
1     4.0
2     7.0
3    10.0
dtype: float64
The fast numpy method:
In [32]: df.to_numpy()
Out[32]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [33]: df.to_numpy().mean(axis=1)
Out[33]: array([ 1., 4., 7., 10.])
That is, we get an array of the dataframe values, and use a fast compiled method to calculate row means.
But to make something like a dictionary for each row:
In [35]: df.apply(lambda row: {str(k):k for k in row}, axis=1)
Out[35]:
0 {'0': 0, '1': 1, '2': 2}
1 {'3': 3, '4': 4, '5': 5}
2 {'6': 6, '7': 7, '8': 8}
3 {'9': 9, '10': 10, '11': 11}
dtype: object
We have to iterate on array rows, just like we do with the dataframe apply:
In [36]: [{str(k):k for k in row} for row in df.to_numpy()]
Out[36]:
[{'0': 0, '1': 1, '2': 2},
 {'3': 3, '4': 4, '5': 5},
 {'6': 6, '7': 7, '8': 8},
 {'9': 9, '10': 10, '11': 11}]
The array approach is faster:
In [37]: timeit df.apply(lambda row: {str(k):k for k in row}, axis=1)
1.13 ms ± 702 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [38]: timeit [{str(k):k for k in row} for row in df.to_numpy()]
40.8 µs ± 157 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
But the apply method returns a pandas Series (see Out[35]), not a list. I suspect most of the extra time is in that step.
np.vectorize (and np.frompyfunc) can also be used to iterate on an array, but the default is to iterate on elements, not rows or columns. In general they are slower than the more explicit iteration (as I do in [36]).
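To illustrate the difference between element-wise and row-wise iteration, here is a small sketch. It uses the `signature` parameter of np.vectorize to pass whole rows; the lambdas are illustrative only and are not taken from the question:

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

# Default behaviour: the function is applied to each scalar element of the array.
by_element = np.vectorize(lambda x: x + 1)(arr)

# With a signature, each call receives a whole 1-D row and returns a scalar.
row_mean = np.vectorize(lambda row: row.mean(), signature="(n)->()")
by_row = row_mean(arr)

print(by_element.shape)
print(by_row)
```

Either way the looping happens in Python, so for something like a row mean the compiled `arr.mean(axis=1)` remains the better choice.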
A clumsy way of making a dataframe from the list:
In [53]: %%timeit
...: df1 = pd.DataFrame(['one','two','three','four'],columns=['d'])
...: df1['d'] = [{str(k):k for k in row} for row in df.to_numpy()]
572 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
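A possibly tidier alternative, offered as a sketch and not timed here, is to build the column directly from the list and reuse the original index. The column name 'd' follows the example above; `row_dicts` is a name introduced only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["A", "B", "C"])

# Plain Python iteration over the array rows, as in [36].
row_dicts = [{str(k): k for k in row} for row in df.to_numpy()]

# Wrap the list in a Series that shares df's index, then turn it into a one-column frame.
df1 = pd.Series(row_dicts, index=df.index, name="d").to_frame()
print(df1)
```

This avoids creating a placeholder column first, though whether it is actually faster would need to be measured.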