
Iterating Through String Word At A Time In Python

I have a string buffer of a huge text file. I have to search for given words/phrases in the string buffer. What's the efficient way to do it? I tried using re module matches, but…

Solution 1:

Iterating word-by-word through the contents of a file (the Wizard of Oz from Project Gutenberg, in my case), three different ways:

from __future__ import with_statement
import time
import re
from cStringIO import StringIO

def word_iter_std(filename):
    # read line by line and split on whitespace
    start = time.time()
    with open(filename) as f:
        for line in f:
            for word in line.split():
                yield word
    print 'iter_std took %0.6f seconds' % (time.time() - start)

def word_iter_re(filename):
    # read the whole file, then let the regex walk it
    start = time.time()
    with open(filename) as f:
        txt = f.read()
    for word in re.finditer(r'\w+', txt):
        yield word
    print 'iter_re took %0.6f seconds' % (time.time() - start)

def word_iter_stringio(filename):
    # read the whole file into a StringIO, then iterate its lines
    start = time.time()
    with open(filename) as f:
        io = StringIO(f.read())
    for line in io:
        for word in line.split():
            yield word
    print 'iter_io took %0.6f seconds' % (time.time() - start)

woo = '/tmp/woo.txt'
for word in word_iter_std(woo): pass
for word in word_iter_re(woo): pass
for word in word_iter_stringio(woo): pass

Resulting in:

% python /tmp/junk.py
iter_std took 0.016321 seconds
iter_re took 0.028345 seconds
iter_io took 0.016230 seconds

Solution 2:

This sounds like the sort of problem where a trie would really help. You should probably use some sort of compressed trie like a Patricia/radix trie. As long as you can fit the whole dictionary of words/phrases you are looking for into the trie, this will greatly reduce the time complexity. How it works is that you take the beginning of a word and descend the trie until you find the longest match, then increment the counter in that node. This might mean you have to ascend the trie if a partial match doesn't pan out. Then you proceed to the beginning of the next word and do it again. The advantage of the trie is that each look-up searches the whole dictionary at once and should take about O(m), where m is the average length of a word/phrase in your dictionary.

If you can't fit the whole dictionary into one trie, then you could split the dictionary into a few tries (one for all words/phrases starting with a-l, one for m-z for instance) and do a sweep through the whole corpus for each trie.
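For illustration, here is a minimal sketch of the idea using a plain (uncompressed) dict-based trie; the class name and phrases are made up for the example, and a real Patricia/radix trie would additionally collapse chains of single-child nodes:

class PhraseTrie:
    """Toy trie that counts the longest dictionary match starting at a position."""
    def __init__(self):
        self.root = {}      # char -> child node
        self.counts = {}    # phrase -> number of hits

    def add(self, phrase):
        node = self.root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node['$'] = phrase  # marks the end of a dictionary entry
        self.counts[phrase] = 0

    def count_longest_match(self, text, start):
        node, best, i = self.root, None, start
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if '$' in node:
                best = node['$']    # remember the longest match so far
        if best is not None:
            self.counts[best] += 1

trie = PhraseTrie()
for phrase in ('emerald city', 'emerald', 'scarecrow'):
    trie.add(phrase)

corpus = 'the scarecrow walked to the emerald city'
for pos in range(len(corpus)):          # in practice, start only at word boundaries
    trie.count_longest_match(corpus, pos)
print(trie.counts)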

Solution 3:

If the re module can't do it fast, you're going to be hard-pressed to do it any faster. Either way you need to read the entire file. You might consider fixing your regular expression (can you provide one?). Maybe give some background on what you are trying to accomplish, too.
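The asker's regex isn't shown, so purely as an illustration: if the goal is to count many fixed words/phrases, one common shape is to compile a single alternation of all the (escaped) terms once and then scan the buffer in one pass. The terms and buffer below are placeholders:

import re
from collections import Counter

terms = ['lion', 'tin man', 'scarecrow']           # placeholder search terms
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, terms)))

buffer_text = 'the lion, the scarecrow and the tin man'
counts = Counter(m.group(0) for m in pattern.finditer(buffer_text))
print(counts)       # each placeholder term is found once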

Solution 4:

You could try doing it the other way around: instead of processing the text corpus 2,000,000 times (once for each word), process it only once. For every single word in the corpus, increment its count in a hash table or similar structure. A simple example in pseudocode:

word_counts = new hash<string,int>
for each word in corpus:
  if exists(word_counts[word]):
    word_counts[word]++
  else:
    word_counts[word] = 1

You might be able to speed it up by initializing word_counts ahead of time with the full list of words, thus avoiding the if statement... not sure.
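As a concrete (hypothetical) Python version of that pseudocode, collections.Counter already handles the missing-key branch, so the if disappears; corpus_words below stands in for however you tokenize the buffer:

from collections import Counter

corpus_words = 'the wizard of oz the wizard'.split()   # placeholder tokenization
word_counts = Counter(corpus_words)                    # single pass over the corpus

# afterwards, each of the 2,000,000 search words is a cheap dict lookup
print(word_counts['wizard'])    # -> 2
print(word_counts['toto'])      # -> 0 (missing keys default to 0)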

Solution 5:

As xyld said, I do not think you can beat the speed of the re module, although it would help if you posted your regexes and possibly the code as well. All I can add is: try profiling before optimizing. You may be quite surprised to see where most of the processing time goes. I use hotshot to profile my code and am quite happy with it. You can find a good introduction to Python profiling here: http://onlamp.com/pub/a/python/2005/12/15/profiling.html
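hotshot is Python 2 only; as a generic sketch (not the author's setup), the standard-library cProfile/pstats pair does the same job. process_corpus below is a placeholder for whatever function actually does the searching:

import cProfile
import pstats

def process_corpus():
    # placeholder for the real search over the buffer
    with open('/tmp/woo.txt') as f:
        return sum(len(line.split()) for line in f)

cProfile.run('process_corpus()', '/tmp/profile.out')    # write raw stats to a file
stats = pstats.Stats('/tmp/profile.out')
stats.sort_stats('cumulative').print_stats(10)          # show the 10 biggest time sinks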
