Chunking with Regular Expressions

The first rule matches an optional determiner or possessive pronoun, zero or more adjectives, and then a noun. The second rule matches one or more proper nouns. We also define an example sentence to be chunked, and run the chunker on this input.

Example 7-2. Simple noun phrase chunker.

import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and nouns
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))

The $ symbol is a special character in regular expressions, and must be backslash-escaped (written \$) in order to match the tag PP$.
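The effect of the escape can be checked with Python's standard re module alone. The sketch below only illustrates the underlying regular-expression behaviour: it tests a bare tag string rather than going through nltk.RegexpParser, and the variable names are ours, not part of the example above.

```python
import re

tag = "PP$"

# Unescaped: "$" is an end-of-string anchor, so this pattern means
# "PP, then end of input" and cannot consume the literal dollar sign.
unescaped = re.fullmatch(r"PP$", tag)

# Escaped: "\$" matches a literal dollar sign.
escaped = re.fullmatch(r"PP\$", tag)

print(unescaped)         # None
print(escaped.group(0))  # PP$
```

The unescaped pattern would instead match the plain string "PP", which is why a rule written without the backslash silently fails to pick up possessive pronouns tagged PP$.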

If a tag pattern matches at overlapping locations, the leftmost match takes precedence. For example, if we apply a rule that matches two consecutive nouns to a text containing three consecutive nouns, then only the first two nouns will be chunked:

>>> nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
>>> grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(nouns))
(S (NP money/NN market/NN) fund/NN)

Once we have created the chunk for money market, we have removed the context that would have permitted fund to be included in a chunk. This issue would have been avoided with a more permissive chunk rule, e.g., NP: {<NN>+}.
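The same left-to-right behaviour can be reproduced with Python's re module on a string of tags. This is a simplified model under our own assumptions: the tags are joined into a single string, and the helper name chunk_spans is hypothetical. It is not presented as the chunker's actual implementation, only as a way to see why the third noun is left out by the two-noun rule and picked up by the more permissive one.

```python
import re

def chunk_spans(tag_pattern, tags):
    """Return (start, end) token index pairs for leftmost, non-overlapping matches."""
    # Encode the tag sequence as one string, e.g. "<NN><NN><NN>".
    tag_string = "".join(f"<{t}>" for t in tags)
    spans = []
    for m in re.finditer(tag_pattern, tag_string):
        # Convert character offsets to token indices by counting "<" markers.
        start = tag_string[:m.start()].count("<")
        end = tag_string[:m.end()].count("<")
        spans.append((start, end))
    return spans

tags = ["NN", "NN", "NN"]  # money market fund

two_nouns = chunk_spans(r"<NN><NN>", tags)      # exactly two consecutive nouns
one_or_more = chunk_spans(r"(?:<NN>)+", tags)   # one or more nouns

print(two_nouns)    # [(0, 2)]
print(one_or_more)  # [(0, 3)]
```

With the two-noun pattern, the search resumes after the first match and finds only one noun left, so no second chunk is formed. The one-or-more pattern consumes all three nouns in a single match.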

We have added a comment to each of our chunk rules. These are optional; when they are present, the chunker prints these comments as part of its tracing output.
