
Find Different Realization Of A Word In A Sentence String - Python

(This question is with regards to string checking in general and not Natural Language Processing per se, but if you view it as an NLP problem, imagine it's not a language that curr

Solution 1:

I recommend having a look at the stem package of NLTK: http://nltk.org/api/nltk.stem.html

Using it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for eg. grammatical role, tense, derivational morphology leaving only the stem of the word."

If your language is not currently covered by NLTK, consider extending NLTK. If you really need something simple and don't want to bother with NLTK, you should still write your code as a collection of small, easy-to-combine utility functions, for example:

import string

def variation(stem, word):
    # True if word is the stem itself or the stem plus 's'/'es' (case-insensitive).
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    # Yield (index, word) for every word in the sentence that is a form of stem.
    sentence = cleanPunctuation(sentence).split()
    return ((i, w) for i, w in enumerate(sentence) if variation(stem, w))

def cleanPunctuation(sentence):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    # Return the first (index, word) match, or None if there is none.
    for i, w in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."
print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i, w in variations(sentence, 'coach')]))
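
If the endings in your language are not just 's' and 'es', the same idea can take the suffixes as a parameter. This is only a sketch: the names `find_forms` and `suffixes` are made up here for illustration and are not part of NLTK. Unlike the helpers above, it strips punctuation only from the ends of each token, which is an assumption about how the text is tokenised.

```python
import string


def find_forms(sentence, stem, suffixes=("", "s", "es")):
    """Return (index, word) pairs for words equal to stem + one of the suffixes.

    `suffixes` is a hypothetical parameter; replace it with your language's endings.
    """
    # All accepted forms, lower-cased so the comparison is case-insensitive.
    forms = {stem.lower() + suffix for suffix in suffixes}
    # Strip ASCII punctuation from both ends of each whitespace-separated token.
    words = [w.strip(string.punctuation) for w in sentence.split()]
    return [(i, w) for i, w in enumerate(words) if w.lower() in forms]


sentence = "First coach, here another two coaches. Coaches are nice."
print(find_forms(sentence, "coach"))
# prints [(1, 'coach'), (5, 'coaches'), (6, 'Coaches')]
```

Passing a different tuple, e.g. `suffixes=("es",)`, restricts the match to those endings only.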

Solution 2:

Morphology is typically a finite-state phenomenon, so regular expressions are the perfect tool to handle it. Build an RE that matches all of the cases with a function like:

import re

def inflect(stem):
    """Returns an RE that matches all inflected forms of stem."""
    pat = "^[%s%s]%s(?:e?s)$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)

Usage:

>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]

If the inflection rules get more complicated than this, consider using Python's verbose REs.
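
As a rough sketch of what a verbose RE could look like, the pattern can be spread over several lines with a comment per alternative by compiling it with `re.VERBOSE`. The function name `inflect_verbose` and the extra endings `ing` and `ed` are assumptions added purely for illustration; substitute the endings of your own language.

```python
import re


def inflect_verbose(stem):
    """Return a compiled verbose RE matching stem plus one of a few endings."""
    pat = r"""
        ^
        [%s%s]          # first letter, lower- or upper-case
        %s              # rest of the stem, escaped
        (?:             # exactly one of the endings below is required
            e?s         #   -s / -es
          | ing         #   hypothetical extra ending, for illustration
          | ed          #   hypothetical extra ending, for illustration
        )
        $
    """ % (re.escape(stem[0].lower()), re.escape(stem[0].upper()), re.escape(stem[1:]))
    # re.VERBOSE makes the RE ignore the layout whitespace and the comments.
    return re.compile(pat, re.VERBOSE)


sentence = "The Coaches coached while coaching a coach"
target = inflect_verbose("coach")
print([(i, w) for i, w in enumerate(sentence.split()) if target.match(w)])
# prints [(1, 'Coaches'), (2, 'coached'), (4, 'coaching')]
```

As in the shorter version, the bare stem ("coach") is not matched because an ending is required; making the group optional with a trailing `?` would include it.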
