Reading IOB Format and the CoNLL Chunking Corpus

Using the corpora module we can load Wall Street Journal text that has been tagged and then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP, and PP. As we have seen, each sentence is represented using multiple lines, one token per line, giving the word, its part-of-speech tag, and its chunk tag, as shown here:

he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
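To make the notation concrete, here is a minimal sketch in plain Python that reads lines like the ones above and groups the tokens into chunks. It is not how NLTK does it internally; the helper names parse_iob and iob_chunks are hypothetical and chosen only for this illustration. A B- tag opens a new chunk, an I- tag continues the current chunk of the same type, and O marks a token outside any chunk.

```python
# Plain-Python sketch of reading IOB lines. The helper names parse_iob and
# iob_chunks are hypothetical; they are not part of NLTK.

def parse_iob(text):
    """Turn 'word POS chunk-tag' lines into (word, pos, chunk_tag) triples."""
    triples = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, pos, chunk_tag = line.split()
        triples.append((word, pos, chunk_tag))
    return triples


def iob_chunks(triples):
    """Group triples into (chunk_type, [words]) pairs, skipping O tokens."""
    chunks = []
    current = None  # the chunk being built, or None when outside a chunk
    for word, _pos, tag in triples:
        if tag == "O":
            current = None
            continue
        prefix, chunk_type = tag.split("-", 1)
        if prefix == "I" and current is not None and current[0] == chunk_type:
            current[1].append(word)  # I- tag continues the open chunk
        else:
            current = (chunk_type, [word])  # B- tag (or stray I-) opens a chunk
            chunks.append(current)
    return chunks


sample = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
"""

chunks = iob_chunks(parse_iob(sample))
for chunk_type, words in chunks:
    print(chunk_type, " ".join(words))
```

Treating a stray I- tag as the start of a new chunk is a design choice of this sketch, made so that malformed input still yields a result instead of an error.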

A conversion function chunk.conllstr2tree() builds a tree representation from one of these multiline strings. Moreover, it permits us to choose any subset of the three chunk types to use, here just for NP chunks:

>>> import nltk
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''

>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
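The effect of selecting a subset of chunk types can be pictured at the tag level: tokens whose chunk type is not selected are treated as if they were tagged O, so they end up outside any chunk. The following plain-Python sketch illustrates that idea on the first few tokens of the sentence above. The function name keep_chunk_types is hypothetical, and this is only an approximation of the behavior, not NLTK's own code.

```python
# Sketch of chunk-type selection at the tag level. keep_chunk_types is a
# hypothetical helper, not an NLTK function.

def keep_chunk_types(triples, chunk_types):
    """Retag as 'O' every token whose chunk type is not in chunk_types."""
    kept = []
    for word, pos, tag in triples:
        # Tags look like 'B-NP' or 'I-VP'; the part after '-' is the type.
        if tag != "O" and tag.split("-", 1)[1] not in chunk_types:
            tag = "O"
        kept.append((word, pos, tag))
    return kept


tagged = [
    ("he", "PRP", "B-NP"),
    ("accepted", "VBD", "B-VP"),
    ("the", "DT", "B-NP"),
    ("position", "NN", "I-NP"),
    ("of", "IN", "B-PP"),
    ("vice", "NN", "B-NP"),
    ("chairman", "NN", "I-NP"),
]

np_only = keep_chunk_types(tagged, {"NP"})
for word, pos, tag in np_only:
    print(word, pos, tag)
```

With only NP selected, "accepted" and "of" lose their VP and PP tags, while the NP chunks are left untouched.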

We can use the NLTK corpus module to access a larger amount of chunked text. The CoNLL-2000 Chunking Corpus contains 270k words of Wall Street Journal text, divided into "train" and "test" portions, annotated with part-of-speech tags and chunk tags in the IOB format. We can access the data using nltk.corpus.conll2000. Here is an example that reads the 100th sentence of the "train" portion of the corpus:

>>> from nltk.corpus import conll2000

>>> print(conll2000.chunked_sents('train.txt')[99])

As you can see, the CoNLL-2000 Chunking Corpus contains three chunk types: NP chunks, which we have already seen; VP chunks, such as "has already delivered"; and PP chunks, such as "because of". Since we are only interested in the NP chunks right now, we can use the chunk_types argument to select them:

>>> print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
(S
  Over/IN
  (NP Mr./NNP Stone/NNP)
  told/VBD
