Searching Tokenized Text

You can use a special kind of regular expression for searching across multiple words in a text (where a text is a list of tokens). For example, "<a> <man>" finds all instances of a man in the text. The angle brackets are used to mark token boundaries, and any whitespace between the angle brackets is ignored (behaviors that are unique to NLTK's findall() method for texts). In the following example, we include <.*> ❶, which will match any single token, and enclose it in parentheses so only the matched word (e.g., monied) and not the matched phrase (e.g., a monied man) is produced. The second example finds three-word phrases ending with the word bro ❷. The last example finds sequences of three or more words starting with the letter l ❸.

>>> import nltk
>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
>>> moby.findall(r"<a> (<.*>) <man>") ❶
monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
>>> chat = nltk.Text(nps_chat.words())
>>> chat.findall(r"<.*> <.*> <bro>") ❷
you rule bro; telling you bro; u twizted bro
>>> chat.findall(r"<l.*>{3,}") ❸
lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
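Under the hood, this kind of search joins the tokens into a single string with angle-bracket delimiters and rewrites the pattern before handing it to Python's re module. The following is a simplified sketch of that idea — not NLTK's actual implementation, which does additional work to anchor matches to token boundaries:

```python
import re

def token_findall(tokens, pattern):
    # Join the tokens into one string, wrapping each in <...>.
    text = "".join("<%s>" % t for t in tokens)
    # Whitespace in the pattern is ignored, and '.' is narrowed to
    # [^>] so it cannot stray across a token boundary.
    pattern = re.sub(r"\s+", "", pattern)
    pattern = re.sub(r"(?<!\\)\.", "[^>]", pattern)
    # Strip the brackets back out of each match (or matched group).
    return [" ".join(re.findall(r"[^<>]+", m))
            for m in re.findall(pattern, text)]

tokens = ['a', 'monied', 'man', 'met', 'a', 'white', 'man']
print(token_findall(tokens, r"<a> (<.*>) <man>"))  # ['monied', 'white']
```

Because '.' is rewritten to [^>], the pattern <.*> matches exactly one token, just as in the examples above.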

Your Turn: Consolidate your understanding of regular expression patterns and substitutions using nltk.re_show(p, s), which annotates the string s to show every place where pattern p was matched, and nltk.app.nemo(), which provides a graphical interface for exploring regular expressions. For more practice, try some of the exercises on regular expressions at the end of this chapter.
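The annotation that re_show performs is easy to picture: every match is wrapped in braces. Here is a rough equivalent using plain re (a sketch of the behavior, not NLTK's source; unlike NLTK's version it also returns the annotated string for convenience):

```python
import re

def re_show(p, s, left="{", right="}"):
    # Wrap every non-overlapping match of pattern p in braces,
    # print the result, and return it.
    annotated = re.sub(p, lambda m: left + m.group() + right, s)
    print(annotated)
    return annotated

re_show(r"l\w+", "lol and lmao")  # prints: {lol} and {lmao}
```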

It is easy to build search patterns when the linguistic phenomenon we're studying is tied to particular words. In some cases, a little creativity will go a long way. For instance, searching a large text corpus for expressions of the form x and other ys allows us to discover hypernyms (see Section 2.5):

>>> from nltk.corpus import brown

>>> hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
>>> hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals
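The matched phrases can also be turned into candidate (hyponym, hypernym) pairs programmatically. A minimal sketch using plain re over bracketed tokens — the final-'s' stripping is deliberately naive and only illustrative; real de-pluralization would need a stemmer or lemmatizer:

```python
import re

def hypernym_pairs(tokens):
    # Join tokens with <...> delimiters, as in the searches above.
    text = "".join("<%s>" % t for t in tokens)
    # Capture x and ys from 'x and other ys'; strip the plural 's'.
    return [(x, ys[:-1]) for x, ys in
            re.findall(r"<(\w+)><and><other><(\w+s)>", text)]

tokens = ['pearls', 'and', 'other', 'jewels', ',',
          'iron', 'and', 'other', 'metals']
print(hypernym_pairs(tokens))  # [('pearls', 'jewel'), ('iron', 'metal')]
```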

With enough text, this approach would give us a useful store of information about the taxonomy of objects, without the need for any manual labor. However, our search results will usually contain false positives, i.e., cases that we would want to exclude. For example, the result demands and other factors suggests that demand is an instance of the type factor, but this sentence is actually about wage demands. Nevertheless, we could construct our own ontology of English concepts by manually correcting the output of such searches.

This combination of automatic and manual processing is the most common way for new corpora to be constructed. We will return to this in Chapter 11.

Searching corpora also suffers from the problem of false negatives, i.e., omitting cases that we would want to include. It is risky to conclude that some linguistic phenomenon doesn't exist in a corpus just because we couldn't find any instances of a search pattern. Perhaps we just didn't think carefully enough about suitable patterns.

Your Turn: Look for instances of the pattern as x as y to discover information about entities and their properties.
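One plausible pattern for this exercise would be hobbies_learned.findall(r"<as> <\w*> <as> <\w*>"). The same bracketed-token trick shows the shape of the matches on a toy example (the tokens here are illustrative, not corpus output):

```python
import re

# Toy demonstration of searching for 'as x as y' over bracketed tokens.
tokens = ['as', 'white', 'as', 'snow', 'and', 'as', 'hard', 'as', 'iron']
text = "".join("<%s>" % t for t in tokens)
print(re.findall(r"<as><(\w+)><as><(\w+)>", text))
# [('white', 'snow'), ('hard', 'iron')]
```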
