Skip to content Skip to sidebar Skip to footer

Finding Duplicates In A List Of Lists

I am using Python 2.7 and am trying to de-duplicate a list of lists and merge the values of the duplicates. Right now I have: original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1

Solution 1:

Lots of good answers, but they all use rather more code than I would for this, so here's my take, for what it's worth:

totals = {}
for k,v in original_list:
  totals[k] = totals.get(k,0) + v

# totals = {'a': 2, 'c': 2, 'b': 7}

Once you have a dict like that, from any of these answers, you can use items to get a list of tuples:

totals.items()
# => [('a', 2), ('c', 2), ('b', 7)]

And map list across the tuples to get a list of lists:

map(list, totals.items())
# => [['a', 2], ['c', 2], ['b', 7]]

And sort if you want them in order:

sorted(map(list, totals.items()))
# => [['a', 2], ['b', 7], ['c', 2]]

Solution 2:

>>> from collections import Counter
>>> lst = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
>>> c = Counter(x for x, c in lst for _ in xrange(c))

Counter({'b': 7, 'a': 2, 'c': 2})

>>> map(list, c.iteritems())
[['a', 2], ['c', 2], ['b', 7]]

Or alternatively, without repeating each item (a, b) b times (@hcwhsa):

>>> from collections import Counter
>>> lst = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
>>> c = sum((Counter(**{k:v}) for k, v in lst), Counter())

Counter({'b': 7, 'a': 2, 'c': 2})

>>> map(list, c.iteritems())
[['a', 2], ['c', 2], ['b', 7]]

Solution 3:

SOLUTION

Use collections.Counter:

from collections import Counter
original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
result = Counter()
for k, v in original_list:
     result.update({k:v})

map(list, result.items())
# [['a', 2], ['c', 2], ['b', 7]]

FINDINGS

So, lot of answers, views and upvotes. I even earned my first Nice answer out of nothing (in last 2 days I made lot of answers worth of more research and efforts). In view of this, I decided to do at least some research and test solutions performance with a simple script written from scratch. Do not include code directly in answer for the sake of size.

Each function is named for it's author an easily can be found in question. thefourtheye's solution now equals one of Mark Reed and is evaluated in original form, thefourtheye2 states for itertools.groupby based solution.

Each was tested several times (samples), each sample in turn invoked several function iterations. I evaluated min, max and standard deviation for samples times.

Here we go, running probing test for 10 times.

testing: thefourtheye, kroolik2, void, kroolik, alko, reed, visser
   10 samples
   10 iterations each
         author   min     avg     max    stddev
           reed 0.000000.000000.000000.00000
         visser 0.000000.001500.015000.00450
   thefourtheye 0.000000.001600.016000.00480
  thefourtheye2 0.000000.003100.016000.00620
           alko 0.000000.006300.016000.00772void0.015000.015400.016000.00049
       kroolik2 0.047000.064300.078000.00831
        kroolik 0.328000.343800.375000.01716

Look at bottom two rows: at this point kroolik solutions were disqualified since with it any reasonable amount of samples*iterations will be performed for hours. Here goes final tests. I manually added number of upvotes to ouptut:

testing: thefourtheye, kroolik2, void, kroolik, alko, reed, visser
   100 samples
  1000 iterations each
         author  upvotes   min     avg     max    stddev
           reed  [20]    0.062000.081740.156000.01841
   thefourtheye   [5]    0.062000.099710.203000.01911
         visser   [6]    0.109000.123920.235000.02263
  thefourtheye2          0.250000.296740.890000.07183
           alko  [11]    0.562000.623091.047000.08438void   [3]    1.500001.654802.391000.18721
        kroolik  [14]     [DSQ]

Solution 4:

If the order doesnt matter, you can use this

original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
myDict = {}
forfirst, second in original_list:
    myDict[first] = myDict.get(first, 0) + second
result = [[key, value] forkey, value in myDict.items()]
print result

Or you can use groupby and the code becomes a oneliner

original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
from itertools import groupby
print [[key, sum(item[1] for item inlist(group))]
       for key, group in groupby(sorted(original_list), lambda x:x[0])]

Output

[['a', 2], ['b', 7], ['c', 2]]

Solution 5:

You can use collections.defaultdict:

original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
import collections
data = collections.defaultdict(list)
for item in original_list:
    data[item[0]].append(item[1])

output = {key: sum(values) for key, values indata.items()}
print output
# gives: {'a': 2, 'c': 2, 'b': 7}

Post a Comment for "Finding Duplicates In A List Of Lists"