Create A Dictionary With 'word Groups'
Solution 1:
Sounds like what you want to do is use collocations from nltk.
Solution 2:
Tokenize your multi-word expressions into tuples, then put them in a set for easy lookup. The easiest way is to use nltk.ngrams, which allows you to iterate directly over the ngrams in your text. Since your sample data includes a trigram, here's a search for n up to 3.
import nltk

raw_keywords = ['data scientist', 'machine learning', 'natural language processing',
                'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
                'language', 'processing']
keywords = set(tuple(term.split()) for term in raw_keywords)
tokens = nltk.word_tokenize(text.lower())  # text: the input string to search

# Scan text once for each ngram size.
for n in 1, 2, 3:
    for ngram in nltk.ngrams(tokens, n):
        if ngram in keywords:
            print(ngram)
If you have huge amounts of text, you could check whether you get a speed-up by iterating over maximal ngrams only (with the option pad_right=True to avoid missing small ngram sizes). The number of lookups is the same both ways, so I doubt it will make much difference, except in the order of returned results.
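n = 3  # largest ngram size to look for
for ngram in nltk.ngrams(tokens, n, pad_right=True):
    # Check every prefix of the maximal ngram: length 1, 2, ..., n.
    for k in range(n):
        if ngram[:k+1] in keywords:
            print(ngram[:k+1])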
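To see the difference in result order concretely, here is a minimal sketch that runs both scans on a toy token list. The `ngrams` helper is a hand-rolled stand-in for nltk.ngrams (right padding with None only) so the snippet runs without NLTK; `scan_per_size`, `scan_maximal`, and the sample data are names made up for this illustration.

```python
from collections import Counter

def ngrams(tokens, n, pad_right=False):
    """Stand-in for nltk.ngrams: tuples of n consecutive tokens."""
    seq = list(tokens)
    if pad_right:
        seq = seq + [None] * (n - 1)  # pad so short ngrams at the end survive
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def scan_per_size(tokens, keywords, max_n=3):
    """One pass over the text per ngram size."""
    found = []
    for n in range(1, max_n + 1):
        for gram in ngrams(tokens, n):
            if gram in keywords:
                found.append(gram)
    return found

def scan_maximal(tokens, keywords, max_n=3):
    """One pass over padded maximal ngrams, checking every prefix."""
    found = []
    for gram in ngrams(tokens, max_n, pad_right=True):
        for k in range(max_n):
            if gram[:k + 1] in keywords:
                found.append(gram[:k + 1])
    return found

keywords = {("machine", "learning"), ("machine",), ("learning",), ("data",)}
tokens = "data on machine learning".split()

per_size = scan_per_size(tokens, keywords)
maximal = scan_maximal(tokens, keywords)
print(per_size)  # grouped by ngram size
print(maximal)   # grouped by position in the text
```

Both lists contain the same matches; the per-size scan groups them by length, while the maximal scan reports them in text order.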
As for stopword removal: if you remove them, you'll produce ngrams where there were none before, e.g., "sewing machine and learning center" will match "machine learning" after stopword removal. You'll have to decide whether this is something you want. If it were me, I would remove punctuation before the keyword scan, but leave the stopwords in place.
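The stopword effect is easy to reproduce. The sketch below uses a tiny made-up stopword set (`STOPWORDS` is hypothetical, not NLTK's list) and a plain `zip` to build bigrams, so it runs without NLTK.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"and", "the", "a", "of"}

def bigrams(tokens):
    """Adjacent token pairs, in text order."""
    return list(zip(tokens, tokens[1:]))

tokens = "sewing machine and learning center".split()
filtered = [t for t in tokens if t not in STOPWORDS]

target = ("machine", "learning")
match_before = target in bigrams(tokens)    # "and" separates the two words
match_after = target in bigrams(filtered)   # removing "and" makes them adjacent
print(match_before, match_after)
```

With the stopwords left in, "machine" and "learning" are never adjacent, so the keyword is not reported; after filtering, the spurious match appears.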
Solution 3:
Thanks @Batman, I played around a bit with collocations and ended up only needing a couple of lines of code. (Obviously 'meaningful text' should be a lot longer to find actual collocations.)
import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

meaningful_text = ('As a Data Scientist, you will focus on machine '
                   'learning and Natural Language Processing')
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(meaningful_text))
scored = finder.score_ngrams(bigram_measures.raw_freq)
# Sort bigrams by score, highest first.
sorted(scored, key=lambda s: s[1], reverse=True)
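For intuition about what a frequency-based ranking does, here is a rough plain-Python approximation: count adjacent word pairs and sort by count. It is only a sketch of the idea, not NLTK's implementation; it uses a whitespace split instead of word_tokenize, ranks by raw counts rather than normalized scores, and the sample text is made up.

```python
from collections import Counter

# Made-up sample text with some repeated word pairs.
text = "machine learning needs data and machine learning needs practice"
tokens = text.lower().split()  # simple whitespace split, not word_tokenize

# Count each pair of adjacent tokens.
pair_counts = Counter(zip(tokens, tokens[1:]))

# Most frequent pairs first, mirroring the sort in the snippet above.
ranked = sorted(pair_counts.items(), key=lambda s: s[1], reverse=True)
for pair, count in ranked[:3]:
    print(pair, count)
```

On a text this short nearly every pair occurs once, which is why a longer input is needed before the top of the list looks like real collocations.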