
Sorting Entire CSV By Frequency Of Occurrence In One Column

I have a large CSV file, which is a log of caller data. A short snippet of my file:

CompanyName  High Priority  QualityIssue
Customer1    Yes            User
Customer1

Solution 1:

This seems to do what you want: basically, add a count column by performing a groupby and transform with value_counts, and then sort on that column:

In [22]:

df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort('count', ascending=False)
Out[22]:
  CompanyName HighPriority QualityIssue  count
5   Customer3           No         User      4
3   Customer3           No    Equipment      4
7   Customer3          Yes    Equipment      4
6   Customer3          Yes         User      4
0   Customer1          Yes         User      3
4   Customer1           No      Neither      3
1   Customer1          Yes         User      3
8   Customer4           No         User      1
2   Customer2           No         User      1

You can drop the extraneous column using df.drop:

In [24]:

df.drop('count', axis=1)

Out[24]:
  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User
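On current pandas releases the same idea can be written as a single chained expression. The following is only a sketch of an equivalent, not part of the original answer; it uses transform('count') instead of pd.Series.value_counts to build the helper column, and sort_values in place of the removed sort:

# Sketch of an equivalent on current pandas (sort was removed; sort_values replaces it).
# transform('count') gives each row the number of occurrences of its CompanyName.
out = (
    df.assign(count=df.groupby('CompanyName')['CompanyName'].transform('count'))
      .sort_values('count', ascending=False)
      .drop(columns='count')
)

Because assign builds the helper column on a copy, df itself is left untouched.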

Solution 2:

The top-voted answer needs a minor addition: sort was deprecated in favour of sort_values and sort_index.

sort_values will work like this:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
df['count'] = df.groupby('a')['a'].transform(pd.Series.value_counts)
df.sort_values('count', inplace=True, ascending=False)
print('df sorted:\n{}'.format(df))
df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
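If you are on pandas 1.1 or newer, sort_values also accepts a key callable, which lets you sort by frequency without materialising a count column at all. A minimal sketch, assuming df is the caller-data frame from the question:

# Assumes pandas >= 1.1, where sort_values gained the key= argument.
# Each CompanyName value is mapped to its own frequency, and the rows are
# sorted by that mapping, most frequent company first.
sorted_df = df.sort_values(
    'CompanyName',
    key=lambda s: s.map(s.value_counts()),
    ascending=False,
    kind='mergesort',   # a stable sort keeps the original row order within each company
)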

Solution 3:

I think there must be a better way to do it, but this should work:

Preparing the data:

import io
import pandas as pd
data = """
CompanyName  HighPriority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User
"""
df = pd.read_table(io.StringIO(data), sep=r"\s+")

And doing the transformation:

# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())

# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")

# output the original data frame in the order of the new index.
df.reindex(new_index.index)

The output:

  CompanyName HighPriority QualityIssue
3   Customer3           No    Equipment
5   Customer3           No         User
6   Customer3          Yes         User
7   Customer3          Yes    Equipment
0   Customer1          Yes         User
1   Customer1          Yes         User
4   Customer1           No      Neither
8   Customer4           No         User
2   Customer2           No         User

What happens here is probably not intuitive, but at the moment I cannot think of a better way to do it. I have tried to comment as much as possible.

The tricky part here is that the index of count_df holds the unique customer names (and its values are their occurrence counts). Therefore, I join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").

The magic here is that count_df is already sorted by the number of occurrences, which is why no explicit sorting is needed. So all we have to do is reorder the rows of the original data frame according to the rows of the joined data frame, and we get the expected result.
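To make the two paragraphs above a little more concrete, this small sketch (reusing the count_df and new_index variables from the answer) prints the intermediate objects so you can see the frequency-ordered index that the final reindex relies on:

# For illustration only, using the variables defined in the answer above.
# count_df is indexed by the unique company names, already ordered by frequency.
print(count_df)

# new_index carries df's original row labels, but in that frequency order,
# which is exactly what df.reindex(new_index.index) uses to reorder the rows.
print(new_index.index)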
