## What's the Use of Syntax? Beyond n-grams

We gave an example in Chapter 2 of how to use the frequency information in bigrams to generate text that seems perfectly acceptable for small sequences of words but rapidly degenerates into nonsense. Here's another pair of examples that we created by computing the bigrams over the text of a children's story, The Adventures of Buster Brown (included in the Project Gutenberg Selection Corpus):

(4) a. He roared with me the pail slip down his back
    b. The worst part and clumsy looking for whoever heard light
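As a rough illustration of how such sequences arise, here is a minimal sketch of bigram-based generation. It uses only the standard library, and the one-line `text` is a made-up stand-in for the story, not the actual corpus; the helper name `generate` is likewise hypothetical. At each step it follows the most frequent successor of the previous word, so every adjacent pair is attested, yet nothing ensures the sequence as a whole makes sense.

```python
from collections import Counter, defaultdict

# Toy text standing in for the story (an assumption, not the real corpus).
text = "the bear saw the trout and the bear ate the trout in the brook"
tokens = text.split()

# For each word, count the words that immediately follow it.
successors = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    successors[first][second] += 1

def generate(start, length):
    """Build a sequence by always following the most frequent successor."""
    words = [start]
    for _ in range(length - 1):
        options = successors.get(words[-1])
        if not options:
            break  # the last word was never seen with a successor
        words.append(options.most_common(1)[0][0])
    return " ".join(words)

print(generate("the", 8))
```

Each bigram in the output occurs in the toy text, but the generator has no notion of phrases, which is the gap the rest of this section addresses.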

You intuitively know that these sequences are "word-salad," but you probably find it hard to pin down what's wrong with them. One benefit of studying grammar is that it provides a conceptual framework and vocabulary for spelling out these intuitions. Let's take a closer look at the sequence *the worst part and clumsy looking*. This looks like a coordinate structure, where two phrases are joined by a coordinating conjunction such as *and*, *but*, or *or*. Here's an informal (and simplified) statement of how coordination works syntactically:

Coordinate Structure: if v1 and v2 are both phrases of grammatical category X, then v1 and v2 is also a phrase of category X.

Here are a couple of examples. In the first, two NPs (noun phrases) have been conjoined to make an NP, while in the second, two APs (adjective phrases) have been conjoined to make an AP.

(5) a. The book's ending was (NP the worst part and the best part) for me.
    b. On land they are (AP slow and clumsy looking).

What we can't do is conjoin an NP and an AP, which is why the worst part and clumsy looking is ungrammatical. Before we can formalize these ideas, we need to understand the concept of constituent structure.
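The coordination rule, and the way the NP-plus-AP case fails it, can be sketched in a few lines. This is only an illustration: a phrase is represented as a hypothetical `(category, words)` pair, and the function name `coordinate` is our own.

```python
def coordinate(left, conjunction, right):
    """Join two (category, words) phrases, provided their categories match."""
    left_cat, left_words = left
    right_cat, right_words = right
    if left_cat != right_cat:
        raise ValueError(f"cannot conjoin {left_cat} with {right_cat}")
    # The result has the same category X as both inputs.
    return (left_cat, f"{left_words} {conjunction} {right_words}")

np1 = ("NP", "the worst part")
np2 = ("NP", "the best part")
ap1 = ("AP", "slow")
ap2 = ("AP", "clumsy looking")

print(coordinate(np1, "and", np2))  # two NPs give an NP
print(coordinate(ap1, "and", ap2))  # two APs give an AP
try:
    coordinate(np1, "and", ap2)     # NP and AP: categories differ
except ValueError as err:
    print("rejected:", err)
```

The sketch takes the category labels as given; working out which word sequences count as phrases of a category in the first place is the job of constituent structure.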

Constituent structure is based on the observation that words combine with other words to form units. The evidence that a sequence of words forms such a unit is given by substitutability: that is, a sequence of words in a well-formed sentence can be replaced by a shorter sequence without rendering the sentence ill-formed. To clarify this idea, consider the following sentence:

(6) The little bear saw the fine fat trout in the brook.

The fact that we can substitute He for The little bear indicates that the latter sequence is a unit. By contrast, we cannot replace little bear saw in the same way. (We use an asterisk at the start of a sentence to indicate that it is ungrammatical.)

(7) a. He saw the fine fat trout in the brook.
    b. *The he the fine fat trout in the brook.

In Figure 8-1, we systematically substitute longer sequences by shorter ones in a way which preserves grammaticality. Each sequence that forms a unit can in fact be replaced by a single word, and we end up with just two elements.
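The mechanics of repeated substitution can be sketched as follows. The helper `substitute` and the `rounds` list are our own illustration; the code only performs the replacements and makes no attempt to judge grammaticality, which still rests on the reader's intuition.

```python
def substitute(tokens, old, new):
    """Replace the first occurrence of the token sequence `old` with `new`."""
    n = len(old)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == old:
            return tokens[:i] + new + tokens[i + n:]
    raise ValueError(f"sequence not found: {' '.join(old)}")

sentence = "the little bear saw the fine fat trout in the brook".split()

# Each inner list is one round of (longer unit, shorter stand-in) pairs.
rounds = [
    [("little bear", "bear"), ("fine fat trout", "trout"), ("the brook", "it")],
    [("the bear", "He"), ("the trout", "it"), ("in it", "there")],
    [("saw it", "ran")],
    [("ran there", "ran")],
]

rows = [" ".join(sentence)]
tokens = sentence
for substitutions in rounds:
    for old, new in substitutions:
        tokens = substitute(tokens, old.split(), new.split())
    rows.append(" ".join(tokens))

for row in rows:
    print(row)
```

Each printed row is one stage of the reduction, ending with the two-word sentence.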

Figure 8-1 (rows from top to bottom):
the little bear saw the fine fat trout in the brook
the bear saw the trout in it
He saw it there
He ran there
He ran

Figure 8-2 (rows from top to bottom):
(Det the) (Adj little) (N bear) (V saw) (Det the) (Adj fine) (Adj fat) (N trout) (P in) (Det the) (N brook)
(Det the) (Nom bear) saw (Det the) (Nom trout) (P in) (NP it)
(NP He) saw (NP it) there
(NP He) (VP ran) there
He (VP ran)

Figure 8-2. Substitution of word sequences plus grammatical categories: This diagram reproduces Figure 8-1 along with grammatical categories corresponding to noun phrases (NP), verb phrases (VP), prepositional phrases (PP), and nominals (Nom).

In Figure 8-2, we have added grammatical category labels to the words we saw in the earlier figure. The labels NP, VP, and PP stand for noun phrase, verb phrase, and prepositional phrase, respectively.

If we now strip out the words apart from the topmost row, add an S node, and flip the figure over, we end up with a standard phrase structure tree, shown in (8). Each node in this tree (including the words) is called a constituent. The immediate constituents of S are NP and VP.
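To make the idea of constituents and immediate constituents concrete, here is a minimal sketch that encodes a phrase structure tree as nested tuples. The bracketing is our own reconstruction from the category labels in Figure 8-2 and may differ in detail from the tree in (8); the helper names (`label`, `children`, `leaves`, `height`) are likewise hypothetical.

```python
# A tree is (label, child, child, ...); a word is a plain string.
# The bracketing below is an assumption based on the labels in Figure 8-2.
tree = (
    "S",
    ("NP", ("Det", "the"), ("Nom", ("Adj", "little"), ("N", "bear"))),
    ("VP",
        ("VP",
            ("V", "saw"),
            ("NP", ("Det", "the"),
                   ("Nom", ("Adj", "fine"), ("Adj", "fat"), ("N", "trout")))),
        ("PP", ("P", "in"), ("NP", ("Det", "the"), ("N", "brook")))),
)

def label(node):
    """Category label of a node, or the word itself for a leaf."""
    return node if isinstance(node, str) else node[0]

def children(node):
    """Immediate constituents of a node (none for a word)."""
    return [] if isinstance(node, str) else list(node[1:])

def leaves(node):
    """Words of the tree, read left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in children(node) for word in leaves(child)]

def height(node):
    """Number of labeled levels on the longest path from this node to a word."""
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in children(node))

print([label(child) for child in children(tree)])  # immediate constituents of S
print(" ".join(leaves(tree)))
print(height(tree))
```

Reading off the leaves recovers the original sentence, while `children` exposes one level of structure at a time and `height` reports how deeply the constituents nest.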

As we saw in Section 8.1, sentences can have arbitrary length. Consequently, phrase structure trees can have arbitrary depth. The cascaded chunk parsers we saw in Section 7.4 can only produce structures of bounded depth, so chunking methods aren't applicable here.

As we will see in the next section, a grammar specifies how the sentence can be subdivided into its immediate constituents, and how these can be further subdivided until we reach the level of individual words.