Sorting Entire CSV By Frequency Of Occurrence In One Column
Solution 1:
This seems to do what you want: add a count column by performing a groupby and transform with value_counts, then sort on that column:
In [22]:
df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort('count', ascending=False)
Out[22]:
  CompanyName HighPriority QualityIssue  count
5   Customer3           No         User      4
3   Customer3           No    Equipment      4
7   Customer3          Yes    Equipment      4
6   Customer3          Yes         User      4
0   Customer1          Yes         User      3
4   Customer1           No      Neither      3
1   Customer1          Yes         User      3
8   Customer4           No         User      1
2   Customer2           No         User      1
You can drop the extraneous column using df.drop:
In [24]:
df.drop('count', axis=1)
Out[24]:
  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User
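The three steps above (add a count column, sort on it, drop it) can also be chained into one expression. The following is only a sketch, assuming a reasonably recent pandas in which DataFrame.sort is no longer available; the sample frame mirrors the question's data, and the use of transform("size") and a mergesort (stable) sort are my choices rather than part of the original answer.

```python
import pandas as pd

# Sample data mirroring the question (assumed here for illustration).
df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
                    "Customer3", "Customer3", "Customer3", "Customer4"],
    "HighPriority": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

# transform("size") broadcasts each group's row count back onto its rows.
counts = df.groupby("CompanyName")["CompanyName"].transform("size")

result = (
    df.assign(count=counts)                                      # temporary helper column
      .sort_values("count", ascending=False, kind="mergesort")   # most frequent first
      .drop(columns="count")                                     # remove the helper again
)
print(result)
```

Because the helper column only exists inside the chain, the original df is left unchanged.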
Solution 2:
The top-voted answer needs a minor addition: sort was deprecated (and later removed) in favour of sort_values and sort_index. sort_values will work like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
df['count'] = \
df.groupby('a')['a']\
.transform(pd.Series.value_counts)
df.sort_values('count', inplace=True, ascending=False)
print('df sorted: \n{}'.format(df))
df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
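To make the split between the two replacement methods concrete, here is a small sketch using the same toy frame: sort_values orders rows by a column's contents, while sort_index orders them by their index labels, which restores the original row order here. The transform("size") call and the variable names by_value and by_index are my own choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
# Number of rows sharing each value of 'a', broadcast back to every row.
df['count'] = df.groupby('a')['a'].transform('size')

# Order rows by the contents of the 'count' column, largest first.
by_value = df.sort_values('count', ascending=False, kind='mergesort')

# Order rows by their index labels, which undoes the reordering above.
by_index = by_value.sort_index()

print(by_value)
print(by_index)
```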
Solution 3:
I think there must be a better way to do it, but this should work:
Preparing the data:
import io
import pandas as pd
data = """
CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
df = pd.read_table(io.StringIO(data), sep=r"\s+")
And doing the transformation:
# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())
# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")
# output the original data frame in the order of the new index.
df.reindex(new_index.index)
The output:
  CompanyName HighPriority QualityIssue
3   Customer3           No    Equipment
5   Customer3           No         User
6   Customer3          Yes         User
7   Customer3          Yes    Equipment
0   Customer1          Yes         User
1   Customer1          Yes         User
4   Customer1           No      Neither
8   Customer4           No         User
2   Customer2           No         User
It's probably not intuitive what happens here, but at the moment I cannot think of a better way to do it. I tried to comment as much as possible.
The tricky part here is that the index of count_df is the (unique) occurrences of the customers. Therefore, I join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").
The magic here is that count_df is already sorted by the number of occurrences, which is why no explicit sorting is needed. All we have to do is reorder the rows of the original data frame to follow the rows of the joined data frame, and we get the expected result.
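If the merge feels too indirect, the same idea (letting value_counts supply the ordering) can be sketched with the key argument of sort_values. This is an alternative to the merge and reindex approach, not a restatement of it, and it assumes pandas 1.1 or later, where key is available; the names counts and ordered are illustrative.

```python
import io
import pandas as pd

data = """
CompanyName HighPriority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+")

# Occurrences per customer, indexed by customer name.
counts = df["CompanyName"].value_counts()

# The key function replaces each name with its count for sorting purposes only;
# the data frame itself gains no extra column.
ordered = df.sort_values(
    "CompanyName",
    key=lambda col: col.map(counts),
    ascending=False,
    kind="mergesort",  # stable sort, chosen so tied rows are not shuffled arbitrarily
)
print(ordered)
```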