Processing Feature Structures

In this section, we will show how feature structures can be constructed and manipulated in NLTK. We will also discuss the fundamental operation of unification, which allows us to combine the information contained in two different feature structures.

Feature structures in NLTK are declared with the FeatStruct() constructor. Atomic feature values can be strings or integers.

>>> import nltk
>>> fs1 = nltk.FeatStruct(TENSE='past', NUM='sg')
>>> print(fs1)
[ NUM   = 'sg'   ]
[ TENSE = 'past' ]

A feature structure is actually just a kind of dictionary, and so we access its values by indexing in the usual way. We can use our familiar syntax to assign values to features:

>>> fs1 = nltk.FeatStruct(PER=3, NUM='pl', GND='fem')
>>> fs1['CASE'] = 'acc'
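The dictionary-like behavior described above can be sketched as follows. This is a minimal illustration, assuming NLTK is installed; the variable names (agr, gender, and so on) are hypothetical and chosen only for this example.

```python
import nltk

# Hypothetical agreement features, used only for illustration.
agr = nltk.FeatStruct(PER=3, NUM='pl', GND='fem')

gender = agr['GND']                  # read a value by indexing
agr['CASE'] = 'acc'                  # assign a new feature by indexing
has_tense = 'TENSE' in agr           # membership test, as with a dict
feature_names = sorted(agr.keys())   # feature names, as with a dict
```

Because the structure behaves like a dictionary, looking up a feature that is absent raises KeyError, just as it would for an ordinary dict.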

We can also define feature structures that have complex values, as discussed earlier.

>>> fs2 = nltk.FeatStruct(POS='N', AGR=fs1)
>>> print(fs2)
[       [ CASE = 'acc' ] ]
[ AGR = [ GND  = 'fem' ] ]
[       [ NUM  = 'pl'  ] ]
[       [ PER  = 3     ] ]
[                        ]
[ POS = 'N'              ]
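Values inside a complex feature can be reached by indexing repeatedly. The following is a small sketch, assuming NLTK is installed; the names agr, noun, inner, and person are hypothetical.

```python
import nltk

agr = nltk.FeatStruct(PER=3, NUM='pl', GND='fem')
noun = nltk.FeatStruct(POS='N', AGR=agr)

inner = noun['AGR']           # the embedded feature structure
person = noun['AGR']['PER']   # index twice to reach an atomic value
```

Note that the embedded value is the very object that was passed in, not a copy, so later changes to agr are visible through noun['AGR'].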

An alternative method of specifying feature structures is to use a bracketed string consisting of feature-value pairs in the format feature=value, where values may themselves be feature structures:

>>> print(nltk.FeatStruct("[POS='N', AGR=[PER=3, NUM='pl', GND='fem']]"))
[ [ PER = 3 ]
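To check that the bracketed-string notation and the keyword-argument notation describe the same structure, we can build it both ways and compare. This is a sketch, assuming NLTK is installed; the variable names are hypothetical.

```python
import nltk

# The same structure, written as a bracketed string ...
from_string = nltk.FeatStruct("[POS='N', AGR=[PER=3, NUM='pl', GND='fem']]")

# ... and built from keyword arguments with a nested FeatStruct.
from_kwargs = nltk.FeatStruct(
    POS='N',
    AGR=nltk.FeatStruct(PER=3, NUM='pl', GND='fem'),
)

same = (from_string == from_kwargs)
```

The string form is usually more compact for deeply nested structures, while keyword arguments are convenient when the values are already held in Python variables.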

Feature structures are not inherently tied to linguistic objects; they are general-purpose structures for representing knowledge. For example, we could encode information about a person in a feature structure:

>>> print(nltk.FeatStruct(name=
[ age = 33 ]

In the next couple of pages, we are going to use examples like this to explore standard operations over feature structures. This will briefly divert us from processing natural language, but we need to lay the groundwork before we can get back to talking about grammars. Hang on tight!

It is often helpful to view feature structures as graphs, more specifically, as directed acyclic graphs (DAGs). (21) is equivalent to the preceding AVM.

The feature names appear as labels on the directed arcs, and feature values appear as labels on the nodes that are pointed to by the arcs.
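One way to make the graph view concrete is to list the arcs of a feature structure. The helper below is a hypothetical sketch (arcs is not part of NLTK), assuming NLTK is installed and that the structure contains no cycles; the example values are for illustration only.

```python
import nltk

def arcs(fs, path=()):
    """Yield (source_path, arc_label, target) triples for a feature structure.

    A complex target is reported by its own path; an atomic target by its value.
    """
    for label in sorted(fs.keys()):
        value = fs[label]
        if isinstance(value, nltk.FeatStruct):
            yield (path, label, path + (label,))
            # Recurse into the node that this arc points to.
            yield from arcs(value, path + (label,))
        else:
            yield (path, label, value)

# Hypothetical example values, used only for illustration.
person = nltk.FeatStruct("[NAME='Lee', AGE=33]")
person_arcs = list(arcs(person))
```

Each triple corresponds to one labeled arc: the empty tuple stands for the root node, the label is the feature name, and the target is the node the arc points to.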

Just as before, feature values can be complex:

[Figure (22): DAG with arcs labeled NUMBER and STREET; the STREET arc leads to the node 'rue Pascal'.]

When we look at such graphs, it is natural to think in terms of paths through the graph. A feature path is a sequence of arcs that can be followed from the root node. We will represent paths as tuples of arc labels. Thus, ('ADDRESS', 'STREET') is a feature path whose value in (22) is the node labeled 'rue Pascal'.
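NLTK feature structures accept such tuples directly as indexes. The following sketch assumes NLTK is installed; the structure reuses the Lee example, and the variable names are hypothetical.

```python
import nltk

lee = nltk.FeatStruct("[NAME='Lee', ADDRESS=[NUMBER=74, STREET='rue Pascal']]")

# Follow the path ADDRESS, then STREET, from the root node.
street = lee['ADDRESS', 'STREET']      # same as lee[('ADDRESS', 'STREET')]
number = lee[('ADDRESS', 'NUMBER')]

# A path lookup gives the same result as indexing one step at a time.
step_by_step = lee['ADDRESS']['STREET']
```

Indexing with a path that does not exist in the structure raises KeyError, as with a missing dictionary key.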

Now let's consider a situation where Lee has a spouse named Kim, and Kim's address is the same as Lee's. We might represent this as (23).

However, rather than repeating the address information in the feature structure, we can "share" the same sub-graph between different arcs:

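[Figure (24): DAG in which a single sub-graph, containing the node 'rue Pascal', is shared between different arcs.]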

In other words, the value of the path ('ADDRESS') in (24) is identical to the value of the path ('SPOUSE', 'ADDRESS'). DAGs such as (24) are said to involve structure sharing or reentrancy. When two paths have the same value, they are said to be equivalent.
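Structure sharing can be sketched in Python by placing one and the same FeatStruct object under two paths. This is an illustration, assuming NLTK is installed; the variable names and the changed house number are hypothetical.

```python
import nltk

# One address object, reached from two different paths.
address = nltk.FeatStruct(NUMBER=74, STREET='rue Pascal')
lee = nltk.FeatStruct(
    NAME='Lee',
    ADDRESS=address,
    SPOUSE=nltk.FeatStruct(NAME='Kim', ADDRESS=address),
)

# Both paths lead to the same node, not to two equal copies.
shared = lee['ADDRESS'] is lee['SPOUSE', 'ADDRESS']

# A change made via one path is therefore visible via the other.
lee['ADDRESS']['NUMBER'] = 75   # hypothetical new value
seen_via_spouse = lee['SPOUSE', 'ADDRESS', 'NUMBER']
```

This is the practical difference between sharing and repetition: with two separate copies of the address, the update would affect only one of them.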

In order to indicate reentrancy in our matrix-style representations, we will prefix the first occurrence of a shared feature structure with an integer in parentheses, such as (1). Any later reference to that structure will use the notation ->(1), as shown here.

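>>> print(nltk.FeatStruct("""[NAME='Lee', ADDRESS=(1)[NUMBER=74, STREET='rue Pascal'],
...                          SPOUSE=[NAME='Kim', ADDRESS->(1)]]"""))
[ ADDRESS = (1) [ NUMBER = 74           ] ]
[               [ STREET = 'rue Pascal' ] ]
[                                         ]
[ NAME    = 'Lee'                         ]
[                                         ]
[ SPOUSE  = [ ADDRESS -> (1)  ]           ]
[           [ NAME    = 'Kim' ]           ]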

The bracketed integer is sometimes called a tag or a coindex. The choice of integer is not significant. There can be any number of tags within a single feature structure.

>>> print(nltk.FeatStruct("[A='a', B=(1)[C='c'], D->(1), E->(1)]"))
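The effect of tags can be checked directly: every reference to a tag should yield the same object, and renumbering the tag should not change the structure. The sketch below assumes NLTK is installed; the tag number 7 and the variable names are hypothetical.

```python
import nltk

fs = nltk.FeatStruct("[A='a', B=(1)[C='c'], D->(1), E->(1)]")

# B, D and E all point to one shared node.
all_shared = fs['B'] is fs['D'] is fs['E']

# The same structure written with a different (arbitrary) tag number.
renumbered = nltk.FeatStruct("[A='a', B=(7)[C='c'], D->(7), E->(7)]")
tag_choice_irrelevant = (fs == renumbered)
```

Only the pattern of sharing matters; the integer merely links a reference to the structure it names.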
