## Well Formed Substring Tables

The simple parsers discussed in the previous sections suffer from limitations in both completeness and efficiency. In order to remedy these, we will apply the algorithm design technique of dynamic programming to the parsing problem. As we saw in Section 4.7, dynamic programming stores intermediate results and reuses them when appropriate, achieving significant efficiency gains. This technique can be applied to syntactic parsing, allowing us to store partial solutions to the parsing task and then look them up as necessary in order to efficiently arrive at a complete solution. This approach to parsing is known as chart parsing. We introduce the main idea in this section; see the online materials available for this chapter for more implementation details.

Dynamic programming allows us to build the PP in my pajamas just once. The first time we build it we save it in a table, then we look it up when we need to use it as a subconstituent of either the object NP or the higher VP. This table is known as a well-formed substring table, or WFST for short. (The term "substring" refers to a contiguous sequence of words within a sentence.) We will show how to construct the WFST bottom-up so as to systematically record what syntactic constituents have been found.

Let's set our input to be the sentence in (2). The numerically specified spans of the WFST are reminiscent of Python's slice notation (Section 3.2). Another way to think about the data structure is shown in Figure 8-6, a data structure known as a chart.
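The correspondence with slice notation can be made concrete with a small illustration (the variable names here are ours): a span (i, j) in the WFST covers exactly the words tokens[i:j].

```python
# The WFST span (i, j) covers exactly the slice tokens[i:j];
# a single word occupies the span (i, i+1).
tokens = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
start, end = 1, 2                          # the span for 'shot'
assert tokens[start:end] == ['shot']
assert tokens[2:4] == ['an', 'elephant']   # the span for the object NP
```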

I shot an elephant in my pajamas

Figure 8-6. The chart data structure: Words are the edge labels of a linear graph structure.

In a WFST, we record the position of the words by filling in cells in a triangular matrix: the vertical axis denotes the start position of a substring, while the horizontal axis denotes the end position (thus shot will appear in the cell with coordinates (1, 2)). To simplify this presentation, we will assume each word has a unique lexical category, and we will store this (not the word) in the matrix. So cell (1, 2) will contain the entry V. More generally, if our input string is a1 a2 ... an, and our grammar contains a production of the form A → ai, then we add A to the cell (i-1, i).
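As a minimal sketch of this lexical step, using a plain dictionary in place of a real grammar (the category assignments mirror the grammar described in the text, e.g. the word I is listed directly as an NP):

```python
tokens = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
# Assumed lexical categories, following the grammar in the text.
lexicon = {'I': 'NP', 'shot': 'V', 'an': 'Det', 'elephant': 'N',
           'in': 'P', 'my': 'Det', 'pajamas': 'N'}
n = len(tokens)
wfst = [[None] * (n + 1) for _ in range(n + 1)]
for i, word in enumerate(tokens):
    wfst[i][i + 1] = lexicon[word]   # word i occupies span (i, i+1)
assert wfst[1][2] == 'V'             # 'shot' is a verb
```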

So, for every word in text, we can look up in our grammar what category it belongs to.

```python
>>> text = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
>>> groucho_grammar.productions(rhs=text[1])
[V -> 'shot']
```

For our WFST, we create an (n+1) x (n+1) matrix as a list of lists in Python (for n tokens, the positions between and around the words are numbered 0 to n), and initialize it with the lexical categories of each token in the init_wfst() function in Example 8-3. We also define a utility function display() to pretty-print the WFST for us. As expected, there is a V in cell (1, 2).

Example 8-3. Acceptor using well-formed substring table.

```python
def init_wfst(tokens, grammar):
    numtokens = len(tokens)
    wfst = [[None for i in range(numtokens+1)] for j in range(numtokens+1)]
    for i in range(numtokens):
        productions = grammar.productions(rhs=tokens[i])
        wfst[i][i+1] = productions[0].lhs()
    return wfst

def complete_wfst(wfst, tokens, grammar, trace=False):
    index = dict((p.rhs(), p.lhs()) for p in grammar.productions())
    numtokens = len(tokens)
    for span in range(2, numtokens+1):
        for start in range(numtokens+1-span):
            end = start + span
            for mid in range(start+1, end):
                nt1, nt2 = wfst[start][mid], wfst[mid][end]
                if nt1 and nt2 and (nt1, nt2) in index:
                    wfst[start][end] = index[(nt1, nt2)]
                    if trace:
                        print "[%s] %3s [%s] %3s [%s] ==> [%s] %3s [%s]" % \
                              (start, nt1, mid, nt2, end, start, index[(nt1, nt2)], end)
    return wfst

def display(wfst, tokens):
    print '\nWFST ' + ' '.join([("%-4d" % i) for i in range(1, len(wfst))])
    for i in range(len(wfst)-1):
        print "%d   " % i,
        for j in range(1, len(wfst)):
            print "%-4s" % (wfst[i][j] or '.'),
        print
```

```python
>>> tokens = "I shot an elephant in my pajamas".split()
>>> wfst0 = init_wfst(tokens, groucho_grammar)
>>> wfst1 = complete_wfst(wfst0, tokens, groucho_grammar)
```

Returning to our tabular representation, given that we have Det in cell (2, 3) for the word an, and N in cell (3, 4) for the word elephant, what should we put into cell (2, 4) for an elephant? We need to find a production of the form A → Det N. Consulting the grammar, we know that we can enter NP in cell (2, 4).

More generally, we can enter A in (i, j) if there is a production A → B C, and we find non-terminal B in (i, k) and C in (k, j). The program in Example 8-3 uses this rule to complete the WFST. By setting trace to True when calling the function complete_wfst(), we see tracing output that shows the WFST being constructed:
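The completion rule can also be sketched independently of NLTK, with categories as plain strings and only the binary productions of the grammar described in the text (a simplified illustration, not the book's implementation):

```python
# Binary productions only, keyed by righthand side.
RULES = {('Det', 'N'): 'NP', ('V', 'NP'): 'VP', ('VP', 'PP'): 'VP',
         ('P', 'NP'): 'PP', ('NP', 'VP'): 'S'}

def complete(wfst):
    """Apply A -> B C whenever B is found in (i, k) and C in (k, j)."""
    n = len(wfst) - 1
    for span in range(2, n + 1):
        for start in range(n + 1 - span):
            end = start + span
            for mid in range(start + 1, end):
                pair = (wfst[start][mid], wfst[mid][end])
                if None not in pair and pair in RULES:
                    wfst[start][end] = RULES[pair]
    return wfst

# Initialize with the lexical categories of "I shot an elephant in my pajamas".
cats = ['NP', 'V', 'Det', 'N', 'P', 'Det', 'N']
n = len(cats)
wfst = [[None] * (n + 1) for _ in range(n + 1)]
for i, c in enumerate(cats):
    wfst[i][i + 1] = c
wfst = complete(wfst)
# wfst[0][7] is now 'S': the whole string is recognized as a sentence.
```

Note that the spans are visited smallest first, so by the time a larger span (start, end) is considered, every smaller constituent it might be built from is already in the table.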

```python
>>> wfst1 = complete_wfst(wfst0, tokens, groucho_grammar, trace=True)
[2] Det [3]   N [4] ==> [2]  NP [4]
[5] Det [6]   N [7] ==> [5]  NP [7]
[1]   V [2]  NP [4] ==> [1]  VP [4]
[4]   P [5]  NP [7] ==> [4]  PP [7]
[0]  NP [1]  VP [4] ==> [0]   S [4]
[1]  VP [4]  PP [7] ==> [1]  VP [7]
[0]  NP [1]  VP [7] ==> [0]   S [7]
```

For example, the first line says that since we found Det at wfst[2][3] and N at wfst[3][4], we can add NP to wfst[2][4].

To help us easily retrieve productions by their righthand sides, we create an index for the grammar. This is an example of a space-time trade-off: we do a reverse lookup on the grammar, instead of having to check through the entire list of productions each time we want to look something up by its righthand side.
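In plain Python the trade-off looks like this, with illustrative (lhs, rhs) tuples standing in for nltk production objects:

```python
# Building the index costs one pass over the productions, after which
# each righthand-side lookup is an O(1) dict access on average rather
# than a scan over the whole production list.
productions = [('S', ('NP', 'VP')), ('PP', ('P', 'NP')),
               ('NP', ('Det', 'N')), ('VP', ('V', 'NP')),
               ('VP', ('VP', 'PP'))]
index = dict((rhs, lhs) for (lhs, rhs) in productions)
assert index[('Det', 'N')] == 'NP'
```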

We conclude that there is a parse for the whole input string once we have constructed an S node in cell (0, 7), showing that we have found a sentence that covers the whole input. The final state of the WFST is depicted in Figure 8-7.

Notice that we have not used any built-in parsing functions here. We've implemented a complete primitive chart parser from the ground up!

WFSTs have several shortcomings. First, as you can see, the WFST is not itself a parse tree, so the technique is strictly speaking recognizing that a sentence is admitted by a grammar, rather than parsing it. Second, it requires every non-lexical grammar production to be binary. Although it is possible to convert an arbitrary CFG into this form, we would prefer to use an approach without such a requirement. Third, as a bottom-up approach it is potentially wasteful, being able to propose constituents in locations that would not be licensed by the grammar.

Figure 8-7. The chart data structure: Non-terminals are represented as extra edges in the chart.

Finally, the WFST did not represent the structural ambiguity in the sentence (i.e., the two verb phrase readings). The VP in cell (1, 7) was actually entered twice, once for a V NP reading, and once for a VP PP reading. These are different hypotheses, and the second overwrote the first (as it happens, this didn't matter since the lefthand side was the same). Chart parsers use a slightly richer data structure and some interesting algorithms to solve these problems (see Section 8.8).
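One simple remedy, sketched below, is to let each cell hold a set of analyses rather than a single category, so that competing hypotheses accumulate instead of overwriting one another. This is purely illustrative, not nltk's chart API; real chart parsers use richer edge structures.

```python
# Each cell holds a set of (category, midpoint) hypotheses, so a second
# analysis of the same span is recorded alongside the first rather than
# overwriting it.
def add_hypothesis(chart, start, end, category, mid):
    chart.setdefault((start, end), set()).add((category, mid))

chart = {}
add_hypothesis(chart, 1, 7, 'VP', 4)   # the VP PP reading
add_hypothesis(chart, 1, 7, 'VP', 2)   # the V NP reading
# chart[(1, 7)] now records both verb phrase analyses
```

Keeping the midpoint as a backpointer is also what turns a recognizer into a parser: by following the midpoints back down, the trees for each hypothesis can be reconstructed.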

Your Turn: Try out the interactive chart parser application nltk.app.chartparser().