Exploring Text Corpora

In Section 5.2, we saw how we could interrogate a tagged corpus to extract phrases matching a particular sequence of part-of-speech tags. We can do the same work more easily with a chunker, as follows:

>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')

(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)

(CHUNK seems/VBZ to/TO overtake/VB)
(CHUNK want/VB to/TO buy/VB)
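The listing above shows only the construction of the chunker; the step that applies it to tagged sentences and prints the matching subtrees is not shown. A minimal sketch of that step follows, assuming NLTK 3 (where a tree's label is read with `label()`). The two hand-tagged sentences are made up for illustration and stand in for the corpus, so the sketch runs without downloading any data.

```python
import nltk

# Chunk grammar from the text: a verb, then "to", then another verb.
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')

# Illustrative stand-in for a tagged corpus (invented sentences).
# To search the Brown corpus instead, use nltk.corpus.brown.tagged_sents()
# after running nltk.download('brown').
tagged_sents = [
    [('They', 'PPSS'), ('wanted', 'VBD'), ('to', 'TO'), ('wait', 'VB'), ('.', '.')],
    [('The', 'AT'), ('plan', 'NN'), ('seems', 'VBZ'), ('to', 'TO'), ('work', 'VB'), ('.', '.')],
]

chunks = []
for sent in tagged_sents:
    tree = cp.parse(sent)              # chunk one tagged sentence
    for subtree in tree.subtrees():    # includes the root, which is labelled 'S'
        if subtree.label() == 'CHUNK':
            chunks.append(subtree)

for chunk in chunks:
    print(chunk)
```

Collecting the subtrees in a list before printing makes it easy to count or filter the matches afterwards.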

Your Turn: Encapsulate the previous example inside a function find_chunks() that takes a chunk string like "CHUNK: {<V.*> <TO> <V.*>}" as an argument. Use it to search the corpus for several other patterns, such as four or more nouns in a row, e.g., "NOUNS: {<N.*>{4,}}".
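One possible starting point for this exercise is sketched below. It makes two assumptions that go beyond the exercise statement: the function takes the tagged sentences as a second argument (so it can be tried on a small sample before the full corpus), and it reads the chunk label from the text before the first colon, which only suits a single-rule grammar. The sample sentence is invented for illustration.

```python
import nltk

def find_chunks(chunk_string, tagged_sents):
    """Return every chunk that the grammar chunk_string finds in tagged_sents."""
    cp = nltk.RegexpParser(chunk_string)
    # Assumption: a single-rule grammar such as "NOUNS: {...}",
    # so the chunk label is whatever precedes the first colon.
    label = chunk_string.split(':', 1)[0].strip()
    found = []
    for sent in tagged_sents:
        for subtree in cp.parse(sent).subtrees():
            if subtree.label() == label:
                found.append(subtree)
    return found

# Made-up tagged sentence containing a run of five nouns.
sample = [[
    ('the', 'AT'), ('county', 'NN'), ('school', 'NN'), ('board', 'NN'),
    ('election', 'NN'), ('results', 'NNS'), ('arrived', 'VBD'),
]]

noun_runs = find_chunks('NOUNS: {<N.*>{4,}}', sample)
for chunk in noun_runs:
    print(chunk)
```

Passing `nltk.corpus.brown.tagged_sents()` as the second argument would run the same search over the Brown corpus, provided the corpus data has been downloaded.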
