Finding Word Stems

When we use a web search engine, we usually don't mind (or even notice) if the words in the document differ from our search terms in having different endings. A query for laptops finds documents containing laptop and vice versa. Indeed, laptop and laptops are just two forms of the same dictionary word (or lemma). For some language processing tasks we want to ignore word endings, and just deal with word stems.

There are various ways we can pull out the stem of a word. Here's a simple-minded approach that just strips off anything that looks like a suffix:

>>> def stem(word):
...     for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
...         if word.endswith(suffix):
...             return word[:-len(suffix)]
...     return word
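
A quick check of this function on a couple of words (these calls are our own illustration, not part of the original example):

>>> stem('processing')
'process'
>>> stem('processes')
'process'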

Although we will ultimately use NLTK's built-in stemmers, it's interesting to see how we can use regular expressions for this task. Our first step is to build up a disjunction of all the suffixes. We need to enclose it in parentheses in order to limit the scope of the disjunction.

>>> import re
>>> re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['ing']

Here, re.findall() just gave us the suffix even though the regular expression matched the entire word. This is because the parentheses have a second function, to select substrings to be extracted. If we want to use the parentheses to specify the scope of the disjunction, but not to select the material to be output, we have to add ?:, which is just one of many arcane subtleties of regular expressions. Here's the revised version.

>>> re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
['processing']

However, we'd actually like to split the word into stem and suffix. So we should just parenthesize both parts of the regular expression:

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing')
[('process', 'ing')]

This looks promising, but still has a problem. Let's look at a different word, processes:

>>> re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('processe', 's')]

The regular expression incorrectly found an -s suffix instead of an -es suffix. This demonstrates another subtlety: the star operator is "greedy" and so the .* part of the expression tries to consume as much of the input as possible. If we use the "non-greedy" version of the star operator, written *?, we get what we want:

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')
[('process', 'es')]

This works even when we allow an empty suffix, by making the content of the second parentheses optional:

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language')
[('language', '')]
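
As a quick sanity check (our own addition, not in the original text), the optional suffix group still yields the right split when a suffix is present:

>>> re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'processes')
[('process', 'es')]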

This approach still has many problems (can you spot them?), but we will move on to define a function to perform stemming, and apply it to a whole text:

>>> def stem(word):
...     regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
...     stem, suffix = re.findall(regexp, word)[0]
...     return stem

>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""

>>> import nltk
>>> tokens = nltk.word_tokenize(raw)
>>> [stem(t) for t in tokens]

['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']

Notice that our regular expression removed the s from ponds but also from is and basis. It produced some non-words, such as distribut and deriv, but these are acceptable stems in some applications.
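For comparison, here is a minimal sketch using one of NLTK's built-in stemmers mentioned earlier. These illustrative calls are our own addition and assume a recent NLTK version; the Porter stemmer handles several cases our regular expression got wrong, though it is still no substitute for full lemmatization:

>>> porter = nltk.PorterStemmer()
>>> porter.stem('lying')         # better than our stem 'ly'
'lie'
>>> porter.stem('distributing')  # non-word stems still occur
'distribut'
>>> porter.stem('women')         # irregular forms are left alone
'women'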
