Information Extraction Architecture

Figure 7-1 shows the architecture for a simple information extraction system. It begins by processing a document using several of the procedures discussed in Chapters 3 and 5: first, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.

Figure 7-1. Simple pipeline architecture for an information extraction system. This system takes the raw text of a document as its input, and generates a list of (entity, relation, entity) tuples as its output. For example, given a document that indicates that the company Georgia-Pacific is located in Atlanta, it might generate the tuple ([ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']).

To perform the first three tasks, we can define a function that simply connects together NLTK's default sentence segmenter (1), word tokenizer (2), and part-of-speech tagger (3):

>>> def ie_preprocess(document):
...     sentences = nltk.sent_tokenize(document)                      # (1)
...     sentences = [nltk.word_tokenize(sent) for sent in sentences]  # (2)
...     sentences = [nltk.pos_tag(sent) for sent in sentences]        # (3)
...     return sentences
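To make the shape of the data at each stage concrete, here is a minimal sketch that mirrors the three steps with toy stand-ins written in plain Python. The regex-based splitter and tokenizer and the tiny `TOY_TAGS` lookup are assumptions made for illustration only; they are not NLTK's implementations and are far less accurate than `nltk.sent_tokenize`, `nltk.word_tokenize`, and `nltk.pos_tag`.

```python
import re

# Hypothetical lookup table standing in for a real part-of-speech tagger.
TOY_TAGS = {"we": "PRP", "saw": "VBD", "the": "DT", "yellow": "JJ",
            "dog": "NN", "it": "PRP", "barked": "VBD", ".": "."}

def toy_sent_tokenize(document):
    # Split after sentence-final punctuation that is followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]

def toy_word_tokenize(sentence):
    # Words (optionally hyphenated) or single punctuation marks.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

def toy_pos_tag(tokens):
    # Words missing from the table fall back to NN.
    return [(tok, TOY_TAGS.get(tok.lower(), "NN")) for tok in tokens]

def toy_ie_preprocess(document):
    # Same three-step structure as ie_preprocess, with toy components.
    sentences = toy_sent_tokenize(document)
    sentences = [toy_word_tokenize(s) for s in sentences]
    sentences = [toy_pos_tag(s) for s in sentences]
    return sentences

tagged = toy_ie_preprocess("We saw the yellow dog. It barked.")
for sentence in tagged:
    print(sentence)
```

The result is a list of sentences, each of which is a list of (word, tag) tuples. This is the structure that the later entity and relation steps consume.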

[Figure: the sentence "We saw the yellow dog", tagged PRP VBD DT JJ NN at the token level, with NP chunks over "We" and "the yellow dog".]

Figure 7-2. Segmentation and labeling at both the Token and Chunk levels.

Remember that our program samples assume you begin your interactive session or your program with import nltk, re, pprint.

Next, in named entity recognition, we segment and label the entities that might participate in interesting relations with one another. Typically, these will be definite noun phrases such as the knights who say "ni", or proper names such as Monty Python. In some tasks it is also useful to consider indefinite nouns or noun chunks, such as every student or cats, although these do not necessarily refer to entities in the same way as definite NPs and proper names.
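The "segment and label" idea can be sketched with a simple gazetteer lookup: scan the tokens, find the longest run that matches a known name, and record its label and span. This is only an illustration under stated assumptions. The `GAZETTEER` table, the `toy_find_entities` function, and the example sentence are all made up here, and practical systems usually rely on trained chunkers or classifiers rather than a fixed list.

```python
# Hypothetical gazetteer: maps a tuple of tokens to an entity label.
GAZETTEER = {
    ("Georgia-Pacific",): "ORG",
    ("Atlanta",): "LOC",
    ("New", "York"): "LOC",
}

def toy_find_entities(tokens):
    """Return (label, start, end) spans, preferring the longest match."""
    max_len = max(len(key) for key in GAZETTEER)
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so ("New", "York") beats ("New",).
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(tokens[i:i + n]))
            if label:
                spans.append((label, i, i + n))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return spans

# Made-up example sentence, pre-tokenized on whitespace.
tokens = "Georgia-Pacific has offices in Atlanta and New York".split()
entities = toy_find_entities(tokens)
for label, start, end in entities:
    print(label, " ".join(tokens[start:end]))
```

Each span records both the segmentation (where the entity starts and ends) and the label, which is the pair of decisions that entity recognition has to make.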

Finally, in relation extraction, we search for specific patterns between pairs of entities that occur near one another in the text, and use those patterns to build tuples recording the relationships between the entities.
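As a rough sketch of this last step, the code below looks at neighbouring entity pairs, checks whether the words between them match a pattern, and emits (entity, relation, entity) tuples like the one in the Figure 7-1 caption. The input format, the `toy_extract_relations` function, and the `max_gap` parameter are assumptions introduced for illustration; the entity spans are written by hand here, whereas a real system would take them from the named entity recognition step.

```python
import re

# Assumed input: tokens plus entity spans as (label, start, end) tuples.
tokens = ["Georgia-Pacific", "is", "located", "in", "Atlanta"]
entities = [("ORG", 0, 1), ("LOC", 4, 5)]

# Pattern for the words that link the two entities.
IN_PATTERN = re.compile(r"\bin\b")

def toy_extract_relations(tokens, entities, subj="ORG", obj="LOC", max_gap=5):
    relations = []
    # Consider only adjacent entity pairs, in text order.
    for (l1, s1, e1), (l2, s2, e2) in zip(entities, entities[1:]):
        # Skip pairs with the wrong labels or too many tokens in between.
        if (l1, l2) != (subj, obj) or s2 - e1 > max_gap:
            continue
        between = " ".join(tokens[e1:s2])
        if IN_PATTERN.search(between):
            relations.append(((l1, " ".join(tokens[s1:e1])), "in",
                              (l2, " ".join(tokens[s2:e2]))))
    return relations

relations = toy_extract_relations(tokens, entities)
print(relations)
```

Restricting the search to entities that occur near one another keeps the pattern from linking names that merely share a sentence. The gap of five tokens is an arbitrary choice for this sketch.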
