Well-Formed Substring Tables

The simple parsers discussed in the previous sections suffer from limitations in both completeness and efficiency. In order to remedy these, we will apply the algorithm design technique of dynamic programming to the parsing problem. As we saw in Section 4.7, dynamic programming stores intermediate results and reuses them when appropriate, achieving significant efficiency gains. This technique can be applied to syntactic parsing, allowing us to store partial solutions to the parsing task and...
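As a concrete illustration, here is a minimal sketch of a well-formed substring table filled bottom-up, CYK-style. The grammar, lexicon, and the helper `build_wfst` are invented for this example (the grammar is restricted to binary rules); this is not NLTK's implementation.

```python
# Toy binary grammar: a pair of adjacent categories maps to its parent category.
toy_grammar = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
lexicon = {"the": "Det", "dog": "N", "saw": "V", "a": "Det", "man": "N"}

def build_wfst(tokens, grammar, lexicon):
    n = len(tokens)
    # wfst[i][j] stores the category spanning tokens[i:j], if one exists;
    # each cell is computed once and reused, never recomputed.
    wfst = [[None] * (n + 1) for _ in range(n + 1)]
    for i, word in enumerate(tokens):
        wfst[i][i + 1] = lexicon[word]
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                left, right = wfst[start][mid], wfst[mid][end]
                if left and right and (left, right) in grammar:
                    wfst[start][end] = grammar[(left, right)]
    return wfst

wfst = build_wfst("the dog saw a man".split(), toy_grammar, lexicon)
```

Here `wfst[0][5]` holds `"S"`, showing that the whole string is admitted; the smaller constituents (`wfst[0][2]` is `"NP"`, `wfst[2][5]` is `"VP"`) are the stored partial solutions that make this efficient.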

Context-Free Grammar: A Simple Grammar

Let's start off by looking at a simple context-free grammar (CFG). By convention, the lefthand side of the first production is the start-symbol of the grammar, typically S, and all well-formed trees must have this symbol as their root label. In NLTK, context-free grammars are defined in the nltk.grammar module. In Example 8-1 we define a grammar and show how to parse a simple sentence admitted by the grammar.

Example 8-1. A simple context-free grammar.

VP -> V NP | V NP PP
PP -> P NP
V -> ...
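To make the idea concrete without relying on NLTK, here is a pure-Python sketch of a CFG as a mapping from each nonterminal to its righthand-side alternatives, together with a depth-bounded `generate` helper. The productions and the terminal words (John, Mary, Bob, saw, ate, with) are invented for illustration.

```python
grammar = {
    "S":  [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "NP": [["John"], ["Mary"], ["Bob"]],
    "V":  [["saw"], ["ate"]],
    "P":  [["with"]],
}

def generate(symbol, depth=4):
    """Yield every word sequence derivable from `symbol` within a depth bound."""
    if symbol not in grammar:            # no productions: a terminal word
        yield [symbol]
        return
    if depth == 0:
        return
    for rhs in grammar[symbol]:
        # expand each righthand-side symbol and combine the expansions
        expansions = [[]]
        for sym in rhs:
            expansions = [e + sub for e in expansions
                          for sub in generate(sym, depth - 1)]
        yield from expansions

sentences = {" ".join(s) for s in generate("S")}
```

Every string in `sentences` is admitted by the grammar, and each is rooted in the start-symbol S, matching the convention described above.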

Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', ...

Searching Text

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing monstrous in parentheses:

>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . This came towards us ,
ON OF THE PSALMS . Touching that monstrous...
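A concordance view is easy to sketch in plain Python: scan the token list, and for each match print the word with a few tokens of context on either side. The `concordance` function and the sample text below are invented for illustration and are much simpler than NLTK's version.

```python
def concordance(tokens, word, width=30):
    """Print each occurrence of `word` with up to `width` characters of
    context on either side (a simplified concordance view)."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - 5):i])
            right = " ".join(tokens[i + 1:i + 6])
            lines.append(f"{left[-width:]:>{width}} {token} {right[:width]}")
    print(f"Displaying {len(lines)} matches:")
    for line in lines:
        print(line)
    return lines

sample = "one was of a most monstrous size and a monstrous club".split()
matches = concordance(sample, "monstrous")
```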

Dynamic Programming

Dynamic programming is a general technique for designing algorithms which is widely used in natural language processing. The term programming is used in a different sense from what you might expect: it means planning or scheduling. Dynamic programming is used when a problem contains overlapping subproblems. Instead of computing solutions to these subproblems repeatedly, we simply store them in a lookup table. In the remainder of this section, we will introduce dynamic programming, but in a rather...
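The lookup-table idea can be shown in a few lines. Here is a minimal sketch using Fibonacci numbers, a standard example of overlapping subproblems (the names `lookup` and `fib` are our own, not from the text):

```python
lookup = {0: 0, 1: 1}   # table of already-solved subproblems

def fib(n):
    if n not in lookup:                      # overlapping subproblem: solve once...
        lookup[n] = fib(n - 1) + fib(n - 2)
    return lookup[n]                         # ...then reuse the stored solution
```

Without the table, `fib(30)` would recompute the same subproblems millions of times; with it, each value is computed exactly once.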

General NGram Tagging

When we perform a language processing task based on unigrams, we are using one item of context. In the case of tagging, we consider only the current token, in isolation from any larger context. Given such a model, the best we can do is tag each word with its a priori most likely tag. This means we would tag a word such as wind with the same tag, regardless of whether it appears in the context the wind or to wind. An n-gram tagger is a generalization of a unigram tagger whose context is the...
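A minimal sketch of the n=2 case (a bigram tagger): the context is the previous tag plus the current word. The tiny training corpus below is invented, and the `tag` function falls back to None on unseen contexts rather than backing off to a simpler tagger.

```python
from collections import Counter, defaultdict

train = [[("the", "DET"), ("wind", "NOUN")],
         [("to", "PRT"), ("wind", "VERB")],
         [("the", "DET"), ("wind", "NOUN"), ("blew", "VERB")]]

# Count which tag follows each (previous tag, current word) context.
counts = defaultdict(Counter)
for sent in train:
    prev = "<s>"
    for word, tag_ in sent:
        counts[(prev, word)][tag_] += 1
        prev = tag_

def tag(sentence):
    tags, prev = [], "<s>"
    for word in sentence:
        best = counts[(prev, word)].most_common(1)
        prev = best[0][0] if best else None   # None when the context is unseen
        tags.append(prev)
    return list(zip(sentence, tags))
```

Unlike a unigram tagger, this model tags wind differently in the two contexts: NOUN after the, VERB after to.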

Defensive Programming

In order to avoid some of the pain of debugging, it helps to adopt some defensive programming habits. Instead of writing a 20-line program and then testing it, build the program bottom-up out of small pieces that are known to work. Each time you combine these pieces to make a larger unit, test it carefully to see that it works as expected. Consider adding assert statements to your code, specifying properties of a variable, e.g., assert isinstance(text, list). If the value of the text variable...
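Here is a minimal sketch of that habit; `tag_words` is an invented placeholder function, and the point is only the assert on its argument:

```python
def tag_words(text):
    # Defensive check: fail immediately, at the right place, if the caller
    # passes something other than a list of words.
    assert isinstance(text, list), "tag_words expects a list of words"
    return [(w, "NN") for w in text]   # placeholder tagging, for illustration
```

Calling `tag_words("hello")` raises an AssertionError at the call site, instead of producing a confusing result (one tagged pair per character) that would surface as a bug much later.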

Shift Reduce Parsing

A simple kind of bottom-up parser is the shift-reduce parser. In common with all bottom-up parsers, a shift-reduce parser tries to find sequences of words and phrases that correspond to the righthand side of a grammar production, and replace them with the lefthand side, until the whole sentence is reduced to an S. The shift-reduce parser repeatedly pushes the next input word onto a stack (Section 4.1); this is the shift operation. If the top n items on the stack match the n items on the righthand...
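The shift and reduce operations can be sketched directly. The toy grammar and the greedy control strategy below (always reduce when possible, otherwise shift) are our own simplification; a greedy shift-reduce parser can get stuck on sentences a smarter parser would accept.

```python
productions = [
    ("S",   ["NP", "VP"]),
    ("NP",  ["Det", "N"]),
    ("VP",  ["V", "NP"]),
    ("Det", ["the"]),
    ("N",   ["dog"]),
    ("N",   ["cat"]),
    ("V",   ["saw"]),
]

def shift_reduce(tokens):
    stack, remaining = [], list(tokens)
    while remaining or stack != ["S"]:
        # reduce: if the top of the stack matches a righthand side,
        # replace it with the lefthand side
        for lhs, rhs in productions:
            if stack[-len(rhs):] == rhs:
                stack[len(stack) - len(rhs):] = [lhs]
                break
        else:
            if not remaining:
                return None                   # stuck: cannot reduce or shift
            stack.append(remaining.pop(0))    # shift the next input word
    return stack[0]
```

On "the dog saw the cat" the stack is successively reduced to Det N, then NP, and finally to a single S, at which point parsing succeeds.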

The WordNet Hierarchy

WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, and Event; these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific. A small portion of a concept hierarchy is illustrated in Figure 2-8. Figure 2-8. Fragment of WordNet concept hierarchy. Nodes correspond to synsets; edges indicate...
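The hierarchy can be pictured as a tiny toy structure; the `hypernym` edges below are invented (real code would query nltk.corpus.wordnet), with each synset pointing to its more general hypernym.

```python
# Toy fragment of a concept hierarchy: synset -> hypernym.
hypernym = {
    "hatchback": "car",
    "gas_guzzler": "car",
    "car": "motor_vehicle",
    "motor_vehicle": "artifact",
    "artifact": "entity",      # "entity" has no hypernym: a root synset
}

def root_path(synset):
    """Follow hypernym edges up to a unique beginner (root synset)."""
    path = [synset]
    while path[-1] in hypernym:
        path.append(hypernym[path[-1]])
    return path
```

`root_path("hatchback")` walks from the very specific hatchback up through car and motor_vehicle to the unique beginner entity.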

Querying a Database

Suppose we have a program that lets us type in a natural language question and gives us back the right answer:

(1) a. Which country is Athens in?
    b. Greece.

How hard is it to write such a program? And can we just use the same techniques that we've encountered so far in this book, or does it involve something new? In this section, we will show that solving the task in a restricted domain is pretty straightforward. But we will also see that to address the problem in a more general way, we have to open...
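To show why the restricted-domain case is straightforward, here is a minimal sketch: a fixed question pattern is translated into a SQL query over a tiny invented city table. The table contents, the `city_table` schema, and the naive `which_country` pattern-matcher are all assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_table (city TEXT, country TEXT)")
conn.executemany("INSERT INTO city_table VALUES (?, ?)",
                 [("athens", "greece"), ("paris", "france")])

def which_country(question):
    # Naive pattern: assumes the form "Which country is X in?"
    city = question.rstrip("?").split()[-2].lower()
    row = conn.execute("SELECT country FROM city_table WHERE city = ?",
                       (city,)).fetchone()
    return row[0] if row else None
```

`which_country("Which country is Athens in?")` returns "greece"; the hard part, of course, is everything this sketch assumes away, starting with the fixed question pattern.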

Counting Vocabulary

The most obvious fact about texts that emerges from the preceding examples is that they differ in the vocabulary they use. In this section, we will see how to use the computer to count the words in a text in a variety of useful ways. As before, you will jump right in and experiment with the Python interpreter, even though you may not have studied Python systematically yet. Test your understanding by modifying the examples, and trying the exercises at the end of the chapter. Let's begin by...
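The basic counts are one-liners in Python. The sample sentence below is our own; the distinction being drawn is between tokens (every occurrence) and types (distinct words):

```python
text = "to be or not to be that is the question".split()

num_tokens = len(text)        # every word occurrence counts as a token
num_types = len(set(text))    # a set keeps only the distinct word types
diversity = num_types / num_tokens   # ratio of types to tokens
```

Here the text has 10 tokens but only 8 types, because to and be each occur twice; the ratio gives a crude measure of lexical diversity.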

Propositional Logic

A logical language is designed to make reasoning formally explicit. As a result, it can capture aspects of natural language which determine whether a set of sentences is consistent. As part of this approach, we need to develop logical representations of a sentence φ that formally capture the truth-conditions of φ. We'll start off with a simple example: (8) Klaus chased Evi and Evi ran away. Let's replace the two sub-sentences in (8) by φ and ψ respectively, and put & for the logical operator...
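The truth-conditions of a conjunction can be sketched directly: a valuation assigns truth values to the atomic sentences, and φ & ψ is true just in case both parts are. The tuple encoding of formulas and the `evaluate` function are our own minimal representation, not NLTK's logic package.

```python
# phi = "Klaus chased Evi", psi = "Evi ran away"
valuation = {"phi": True, "psi": True}

def evaluate(formula, val):
    """Evaluate a tiny formula language: a proposition letter, or (f, "&", g)."""
    if isinstance(formula, str):
        return val[formula]
    left, op, right = formula
    if op == "&":
        # conjunction is true iff both conjuncts are true
        return evaluate(left, val) and evaluate(right, val)
    raise ValueError(f"unknown operator: {op}")

conjunction = ("phi", "&", "psi")
```

Under the valuation above the conjunction is true; make either conjunct false and the whole formula becomes false, exactly as the truth-conditions require.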

Recursive Descent Parsing

The simplest kind of parser interprets a grammar as a specification of how to break a high-level goal into several lower-level subgoals. The top-level goal is to find an S. The S -> NP VP production permits the parser to replace this goal with two subgoals: find an NP, then find a VP. Each of these subgoals can be replaced in turn by sub-subgoals, using productions that have NP and VP on their lefthand side. Eventually, this expansion process leads to subgoals such as find the word telescope. Such...
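The goal-expansion process can be sketched as a recursive function: each nonterminal goal is expanded via its productions, each terminal goal is matched against the input, and alternatives are explored by yielding every position the goal can reach. The grammar and helper names below are invented for illustration.

```python
grammar = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "N", "PP"]],
    "VP":  [["V", "NP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"]],
    "N":   [["man"], ["telescope"]],
    "V":   [["saw"]],
    "P":   [["with"]],
}

def parse(goal, tokens, pos):
    """Yield every input position reachable after satisfying `goal` from `pos`."""
    if goal not in grammar:                      # terminal subgoal: match a word
        if pos < len(tokens) and tokens[pos] == goal:
            yield pos + 1
        return
    for rhs in grammar[goal]:                    # try each production in turn
        positions = [pos]
        for subgoal in rhs:                      # satisfy the subgoals left to right
            positions = [q for p in positions
                         for q in parse(subgoal, tokens, p)]
        yield from positions

def recognize(sentence):
    tokens = sentence.split()
    return len(tokens) in parse("S", tokens, 0)
```

A sentence is recognized when expanding the top-level S goal consumes exactly the whole input.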

Spoken Dialogue Systems

In the history of artificial intelligence, the chief measure of intelligence has been a linguistic one, namely the Turing Test: can a dialogue system, responding to a user's text input, perform so naturally that we cannot distinguish it from a human-generated response? In contrast, today's commercial dialogue systems are very limited, but still perform useful functions in narrowly defined domains, as we see here:

U: When is Saving Private Ryan playing?
S: Saving Private Ryan is not playing at the...

Quantifier Ambiguity Revisited

One important limitation of the methods described earlier is that they do not deal with scope ambiguity. Our translation method is syntax-driven, in the sense that the semantic representation is closely coupled with the syntactic analysis, and the scope of the quantifiers in the semantics therefore reflects the relative scope of the corresponding NPs in the syntactic parse tree. Consequently, a sentence like (26), repeated here, will always be translated as (53a), not (53b).

(53) a. all x. girl x -...

Naive Bayes Classifiers

In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking the frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. The label whose likelihood estimate is...
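The two steps, prior from label frequency, then one contribution per feature, can be sketched in a few lines. The tiny training set (a name-gender toy with a single last_letter feature) is invented, and add-one smoothing is our own choice to keep unseen features from zeroing out a label.

```python
import math
from collections import Counter, defaultdict

train = [({"last_letter": "a"}, "female"),
         ({"last_letter": "k"}, "male"),
         ({"last_letter": "a"}, "female"),
         ({"last_letter": "e"}, "female"),
         ({"last_letter": "k"}, "male")]

label_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)
for features, label in train:
    for name, value in features.items():
        feature_counts[label][(name, value)] += 1

def classify(features):
    scores = {}
    for label, count in label_counts.items():
        # start from the prior probability of the label...
        log_prob = math.log(count / len(train))
        for name, value in features.items():
            # ...then add each feature's contribution (add-one smoothed)
            f = feature_counts[label][(name, value)] + 1
            log_prob += math.log(f / (count + 2))
        scores[label] = log_prob
    # the label with the highest likelihood estimate wins
    return max(scores, key=scores.get)
```

Working in log space avoids underflow when many features each contribute a small probability.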

Reading IOB Format and the CoNLL Chunking Corpus

Using the corpora module we can load Wall Street Journal text that has been tagged then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP, and PP. As we have seen, each sentence is represented using multiple lines, as shown here:

he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP

A conversion function chunk.conllstr2tree builds a tree representation from one of these multiline strings. Moreover, it permits us to choose any subset of the three chunk...
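The format itself is simple to read by hand, which is a useful way to understand what the conversion function is doing. The helpers below are our own sketch, not NLTK's reader: one splits each line into a (word, POS tag, chunk tag) triple, the other regroups B-NP/I-NP tags into NP chunks.

```python
iob_string = """he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP"""

def read_iob(s):
    """One (word, pos, chunk) triple per whitespace-separated line."""
    return [tuple(line.split()) for line in s.splitlines()]

def np_chunks(triples):
    """Collect maximal B-NP ... I-NP runs into word lists."""
    chunks, current = [], []
    for word, pos, chunk in triples:
        if chunk == "B-NP":            # B- begins a new NP chunk
            if current:
                chunks.append(current)
            current = [word]
        elif chunk == "I-NP" and current:
            current.append(word)       # I- continues the current chunk
        else:                          # O or a non-NP tag ends any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

triples = read_iob(iob_string)
```

On the sample sentence this recovers two NP chunks, [he] and [the position], with accepted left outside (it opens a VP chunk instead).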

Precision and Recall

Another instance where accuracy scores can be misleading is in search tasks, such as information retrieval, where we are attempting to find documents that are relevant to a particular task. Since the number of irrelevant documents far outweighs the number of relevant documents, the accuracy score for a model that labels every document as irrelevant would be very close to 100%. It is therefore conventional to employ a different set of measures for search tasks, based on the number of items in...
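The two standard measures fall out of comparing the set of retrieved documents with the set of relevant ones. The document ids below are invented:

```python
relevant = {"d1", "d2", "d3", "d4"}    # documents that actually matter
retrieved = {"d3", "d4", "d5"}         # documents the system returned

true_positives = relevant & retrieved
# precision: what fraction of what we retrieved is relevant
precision = len(true_positives) / len(retrieved)
# recall: what fraction of the relevant documents we managed to retrieve
recall = len(true_positives) / len(relevant)
```

A model that retrieves nothing trivially avoids false positives but has zero recall, which is exactly the failure mode that raw accuracy hides.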

Exercises

1. ○ Create a variable phrase containing a list of words. Experiment with the operations described in this chapter, including addition, multiplication, indexing, slicing, and sorting.
2. ○ Use the corpus module to explore austen-persuasion.txt. How many word tokens does this book have? How many word types?
3. ○ Use the Brown Corpus reader nltk.corpus.brown.words or the Web Text Corpus reader nltk.corpus.webtext.words to access some sample text in two different genres.
4. ○ Read in the texts of...

1. ○ Find out more about sequence objects using Python's help facility. In the interpreter, type help(str), help(list), and help(tuple). This will give you a full list of the functions supported by each type. Some functions have special names flanked with underscores; as the help documentation shows, each such function corresponds to something more familiar. For example, x.__getitem__(y) is just a long-...
2. ○ Identify three operations that can be performed on both tuples and lists. Identify three...

Entropy and Information Gain

As was mentioned before, there are several methods for identifying the most informative feature for a decision stump. One popular alternative, called information gain, measures how much more organized the input values become when we divide them up using a given feature. To measure how disorganized the original set of input values is, we calculate the entropy of their labels, which will be high if the input values have highly varied labels, and low if many input values all have the same label. In...
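The entropy of a list of labels is a short computation; the function below is a minimal sketch of the standard definition, with invented sample labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels: zero when all labels
    agree, and maximal when the labels are evenly split."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

For example, four identical labels have entropy 0, while an even male/female split has entropy 1 bit; information gain is then the drop in this quantity after splitting on a feature.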

Transitive Verbs

Our next challenge is to deal with sentences containing transitive verbs, such as (46). The output semantics that we want to build is exists x.(dog(x) & chase(angus, x)). Let's look at how we can use λ-abstraction to get this result. A significant constraint on possible solutions is to require that the semantic representation of a dog be independent of whether the NP acts as subject or object of the sentence. In other words, we want to get the formula just shown as our output while sticking to...

Named Entity Recognition

At the start of this chapter, we briefly introduced named entities (NEs). Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. Table 7-3 lists some of the more commonly used types of NEs. These should be self-explanatory, except for FACILITY (human-made artifacts in the domains of architecture and civil engineering) and GPE (geo-political entities such as city, state/province, and country). Table 7-3. Commonly used...