Pandas: How To Limit The Results Of Str.contains?
Solution 1:
Believe it or not, the .str accessor is slow. A list comprehension often performs better.
import numpy as np
import pandas as pd
df = pd.DataFrame({'col2': np.random.choice(['substring', 'midstring', 'nostring', 'substrate'], 100000)})
Test for equality:
all(df['col2'].str.contains('substr', case=True, regex=False) ==
    pd.Series(['substr' in i for i in df['col2']]))
Output:
True
Timings:
%timeit df['col2'].str.contains('substr', case=True, regex=False)
10 loops, best of 3: 37.9 ms per loop
versus
%timeit pd.Series(['substr' in i for i in df['col2']])
100 loops, best of 3: 19.1 ms per loop
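One caveat with the list comprehension: `'substr' in i` raises a TypeError if the column holds missing values such as NaN, and a Series built from a plain list gets a fresh 0..N-1 index rather than the original one. The following is a minimal sketch of a guarded variant. The helper name `contains_substring` and the sample data are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def contains_substring(series, substring):
    """Return a boolean Series marking entries that contain `substring`.

    Non-string entries (for example NaN) are treated as non-matches.
    """
    mask = [isinstance(value, str) and substring in value for value in series]
    # Reuse the original index so the mask aligns with the source Series.
    return pd.Series(mask, index=series.index)


sample = pd.Series(["substring", np.nan, "nostring", "substrate"],
                   index=[10, 11, 12, 13])
mask = contains_substring(sample, "substr")
print(mask.tolist())  # [True, False, False, True]
```

The extra `isinstance` check costs a little time per element, so it is probably only worth adding when the column may contain non-string values.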
Solution 2:
You can speed it up with:
matching = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows = df['col1'].head(n)[matching==True]
However, this retrieves only the matches found within the first n rows, not the first n matching results.
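To make the difference concrete, here is a small sketch on a toy frame. The data and the variable names are invented for illustration; only the two selection patterns come from this answer.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "b", "c", "d", "e", "f"],
    "col2": ["x", "has substr", "y", "substr too", "z", "more substr"],
})
n = 3

# Matches that happen to fall inside the first n rows.
head_mask = df["col2"].head(n).str.contains("substr", regex=False)
within_first_n_rows = df["col1"].head(n)[head_mask]

# The first n matches anywhere in the column.
full_mask = df["col2"].str.contains("substr", regex=False)
first_n_matching = df["col1"][full_mask].head(n)

print(within_first_n_rows.tolist())  # ['b']
print(first_n_matching.tolist())     # ['b', 'd', 'f']
```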
If you actually want the first n matching results, use:
rows = df['col1'][df['col2'].str.contains("substr")==True].head(n)
This option is much slower, of course, because the whole column is searched before head(n) is applied.
Inspired by @ScottBoston's answer, you can use the following approach for a faster complete solution:
rows = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)
This is faster, although limiting the output to n rows saves little compared with returning every match this way, since the whole column is still scanned. With this solution you get the first n matching results.
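If the matches tend to occur early in the column, one way to avoid scanning every row is to stop as soon as n matches have been found. The sketch below does that with a generator and `itertools.islice`. It is a different technique from the ones timed in this answer, it has not been benchmarked here, and the helper name `first_n_matches` and its parameters are hypothetical.

```python
from itertools import islice

import pandas as pd


def first_n_matches(df, n, substring, search_col="col2", result_col="col1"):
    """Return up to `n` values of `result_col` whose `search_col` contains `substring`."""
    # Lazily yield the positions of matching rows; nothing is scanned until consumed.
    positions = (
        pos
        for pos, value in enumerate(df[search_col])
        if isinstance(value, str) and substring in value
    )
    # islice stops pulling from the generator once n positions were produced.
    picked = list(islice(positions, n))
    return df[result_col].iloc[picked]


df = pd.DataFrame({
    "col1": ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"] * 3,
    "col2": ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"] * 3,
})
result = first_n_matches(df, 3, "substr")
print(result.index.tolist())  # [4, 5, 13]
print(result.tolist())        # ['for', 'this', 'for']
```

When fewer than n rows match, the loop still walks the entire column, so the gain depends entirely on where the matches sit.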
With the test code below we can see how fast each solution is, and its results:
import pandas as pd
import time
n = 10
a = ["Result", "from", "first", "column", "for", "this", "matching", "test", "end"]
b = ["This", "is", "a", "test", "has substr", "also has substr", "end", "of", "test"]
col1 = a*1000000
col2 = b*1000000
df = pd.DataFrame({"col1":col1,"col2":col2})
# Original option
start_time = time.time()
matching = df['col2'].str.contains('substr', case=True, regex=False)
rows = df[matching].col1.drop_duplicates()
print("--- %s seconds ---" % (time.time() - start_time))
# Faster option
start_time = time.time()
matching_fast = df['col2'].head(n).str.contains('substr', case=True, regex=False)
rows_fast = df['col1'].head(n)[matching_fast==True]
print("--- %s seconds for fast solution ---" % (time.time() - start_time))
# Other option
start_time = time.time()
rows_other = df['col1'][df['col2'].str.contains("substr")==True].head(n)
print("--- %s seconds for other solution ---" % (time.time() - start_time))
# Complete option
start_time = time.time()
rows_complete = df['col1'][pd.Series(['substr' in i for i in df['col2']])==True].head(n)
print("--- %s seconds for complete solution ---" % (time.time() - start_time))
This would output:
>>>
--- 2.33899998665 seconds ---
--- 0.302999973297 seconds for fast solution ---
--- 4.56700015068 seconds for other solution ---
--- 1.61599993706 seconds for complete solution ---
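Single `time.time()` measurements like these can vary from run to run. As a rough sketch of a steadier comparison, the standard-library `timeit.repeat` can run each variant several times and keep the best result. The function names, the seed, and the repeat counts below are arbitrary choices for illustration, and the numbers it prints will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "col2": rng.choice(["substring", "midstring", "nostring", "substrate"], 100_000)
})


def with_str_accessor():
    return df["col2"].str.contains("substr", case=True, regex=False)


def with_list_comprehension():
    return pd.Series(["substr" in value for value in df["col2"]])


for func in (with_str_accessor, with_list_comprehension):
    # Best of 3 repeats with 5 calls each, reported as seconds per call.
    best = min(timeit.repeat(func, number=5, repeat=3)) / 5
    print(f"{func.__name__}: {best:.4f} s per call")
```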
And the resulting Series would be:
>>> rows
4     for
5    this
Name: col1, dtype: object
>>> rows_fast
4     for
5    this
Name: col1, dtype: object
>>> rows_other
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object
>>> rows_complete
4      for
5     this
13     for
14    this
22     for
23    this
31     for
32    this
40     for
41    this
Name: col1, dtype: object