
Create A Dictionary With 'word Groups'

I would like to do some text analysis on job descriptions and was going to use nltk. I can build a dictionary and remove the stopwords, which is part of what I want. However in add

Solution 1:

Sounds like what you want to do is use collocations from nltk.

Solution 2:

Tokenize your multi-word expressions into tuples, then put them in a set for easy lookup. The easiest way is to use nltk.ngrams which allows you to iterate directly over the ngrams in your text. Since your sample data includes a trigram, here's a search for n up to 3.

import nltk

raw_keywords = ['data scientist', 'machine learning', 'natural language processing',
                'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
                'language', 'processing']
# Store each keyword as a tuple of words so it can be compared with ngrams.
keywords = set(tuple(term.split()) for term in raw_keywords)

# `text` is the string you want to scan, e.g. a job description.
tokens = nltk.word_tokenize(text.lower())
# Scan text once for each ngram size.
for n in 1, 2, 3:
    for ngram in nltk.ngrams(tokens, n):
        if ngram in keywords:
            print(ngram)

If you have huge amounts of text, you could check whether you get a speed-up by iterating over maximal ngrams only (with the option pad_right=True to avoid missing small ngram sizes at the end of the text). The number of lookups is the same both ways, so I doubt it will make much difference, except in the order of returned results.

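n = 3  # size of the longest keyword
for ngram in nltk.ngrams(tokens, n, pad_right=True):
    # Check every prefix of the maximal ngram: 1 word, 2 words, ... n words.
    for k in range(n):
        if ngram[:k+1] in keywords:
            print(ngram[:k+1])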
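To make the "same lookups, different order" point concrete, here is a minimal sketch that runs both scanning strategies on a short token list and compares the results. It uses only the standard library: the `ngrams` helper is a hypothetical stand-in that mimics what `nltk.ngrams` does with `pad_right=True` (padding with `None`), and `scan_by_size` / `scan_maximal` are names made up for this illustration.

```python
from collections import Counter


def ngrams(tokens, n, pad_right=False):
    """Yield n-word tuples (stand-in for nltk.ngrams; pads the end with None)."""
    seq = list(tokens)
    if pad_right:
        seq = seq + [None] * (n - 1)
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])


def scan_by_size(tokens, keywords, max_n):
    """One pass over the text per ngram size."""
    found = []
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram in keywords:
                found.append(gram)
    return found


def scan_maximal(tokens, keywords, max_n):
    """Single pass over maximal ngrams, checking every prefix of each one."""
    found = []
    for gram in ngrams(tokens, max_n, pad_right=True):
        for k in range(max_n):
            if gram[:k + 1] in keywords:
                found.append(gram[:k + 1])
    return found


tokens = "as a data scientist you will focus on machine learning".split()
keywords = {("data",), ("scientist",), ("focus",),
            ("data", "scientist"), ("machine", "learning")}

by_size = scan_by_size(tokens, keywords, 3)
maximal = scan_maximal(tokens, keywords, 3)

# Same matches overall, but grouped by size in one case and by position in the other.
print(Counter(by_size) == Counter(maximal))
print(by_size)
print(maximal)
```

The first variant reports all unigrams before any bigram; the second reports matches in the order they start in the text, which may be more convenient if you care about position.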

As for stopword removal: If you remove them, you'll produce ngrams where there were none before, e.g., "sewing machine and learning center" will match "machine learning" after stopword removal. You'll have to decide if this is something you want, or not. If it were me I would remove punctuation before the keyword scan, but leave the stopwords in place.
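The effect is easy to reproduce. The following is only a sketch: the stopword set is a tiny made-up list rather than NLTK's, and `bigrams` is a hypothetical helper built on `zip` so the example runs without any downloads.

```python
# Tiny illustrative stopword list (an assumption, not nltk.corpus.stopwords).
STOPWORDS = {"and", "the", "a", "of"}


def bigrams(tokens):
    """Return all adjacent word pairs as tuples."""
    return list(zip(tokens, tokens[1:]))


tokens = "sewing machine and learning center".split()
filtered = [t for t in tokens if t not in STOPWORDS]

target = ("machine", "learning")
in_original = target in bigrams(tokens)   # "machine and" / "and learning": no match
in_filtered = target in bigrams(filtered)  # "and" is gone, so the two words touch

print(in_original, in_filtered)
```

If spurious matches like this are a problem for your data, scanning the unfiltered tokens and dropping stopwords only from the final counts avoids them.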

Solution 3:

Thanks @Batman, I played around a bit with collocations and ended up only needing a couple of lines of code. (Obviously 'meaningful text' should be a lot longer to find actual collocations)

import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

meaningful_text = ('As a Data Scientist, you will focus on machine '
                   'learning and Natural Language Processing')

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(meaningful_text))
# Score every bigram by its raw frequency, then list the most frequent first.
scored = finder.score_ngrams(bigram_measures.raw_freq)
sorted(scored, key=lambda s: s[1], reverse=True)
