As befits their intermediate status between tagging and parsing (Chapter 8), chunk structures can be represented using either tags or trees. The most widespread file representation uses IOB tags. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g., B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O. An example of this scheme is shown in Figure 7-3.
IOB tags have become the standard way to represent chunk structures in files, and we will also be using this format. Here is how the information in Figure 7-3 would appear in a file:
We PRP B-NP saw VBD O the DT B-NP little JJ I-NP yellow JJ I-NP dog NN I-NP
In this representation there is one token per line, each with its part-of-speech tag and chunk tag. This format permits us to represent more than one chunk type, so long as the chunks do not overlap. As we saw earlier, chunk structures can also be represented using trees. These have the benefit that each chunk is a constituent that can be manipulated directly. An example is shown in Figure 7-4.
NLTK uses trees for its internal representation of chunks, but provides methods for converting between such trees and the IOB format.
Was this article helpful?