Train A Unigram Tagger And Run It On Some New Text

1. Search the Web for spoof newspaper headlines, to find such gems as British Left Waffles on Falkland Islands, and Juvenile Court to Try Shooting Defendant. Manually tag these headlines to see whether knowledge of the part-of-speech tags removes the ambiguity.
2. Working with someone else, take turns picking a word that can be either a noun or a verb (e.g., contest); the opponent has to predict which one is likely to be the most frequent in the Brown Corpus. Check the opponent's prediction,...

Well-Formed Substring Tables

The simple parsers discussed in the previous sections suffer from limitations in both completeness and efficiency. In order to remedy these, we will apply the algorithm design technique of dynamic programming to the parsing problem. As we saw in Section 4.7, dynamic programming stores intermediate results and reuses them when appropriate, achieving significant efficiency gains. This technique can be applied to syntactic parsing, allowing us to store partial solutions to the parsing task and...
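The idea of storing and reusing partial results can be sketched with a minimal CYK-style well-formed substring table. This is an illustrative sketch, not NLTK's implementation; the toy grammar (in Chomsky normal form) and lexicon are assumptions made for the example.

```python
from itertools import product

# Toy grammar in Chomsky normal form: binary rules and a small lexicon.
BINARY = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "man": "N", "saw": "V"}

def wfst_parse(tokens):
    """Fill a well-formed substring table: table[i][j] holds the set of
    categories spanning tokens[i:j].  Each cell is computed once and then
    reused by every larger span that contains it."""
    n = len(tokens)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(tokens):
        table[i][i + 1].add(LEXICON[word])
    for span in range(2, n + 1):           # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):      # split point: reuse smaller spans
                for left, right in product(table[i][k], table[k][j]):
                    if (left, right) in BINARY:
                        table[i][j].add(BINARY[(left, right)])
    return table

tokens = "the dog saw a man".split()
table = wfst_parse(tokens)
print(table[0][len(tokens)])   # {'S'}: the whole string is a sentence
```

Because each cell is filled exactly once, the table avoids the repeated re-derivation of substructures that plagues the simple parsers above.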

Context-Free Grammar: A Simple Grammar

Let's start off by looking at a simple context-free grammar (CFG). By convention, the lefthand side of the first production is the start-symbol of the grammar, typically S, and all well-formed trees must have this symbol as their root label. In NLTK, context-free grammars are defined in the nltk.grammar module. In Example 8-1 we define a grammar and show how to parse a simple sentence admitted by the grammar.

Example 8-1. A simple context-free grammar.

    VP -> V NP | V NP PP
    PP -> P NP
    V -> ...
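In recent NLTK versions a grammar like this can be built with nltk.CFG.fromstring; the same productions can also be sketched in plain Python as a dictionary, which makes the notation concrete. The lexical rules below are illustrative stand-ins for the truncated part of Example 8-1, not the book's full grammar.

```python
# CFG productions as a dict: nonterminal -> list of alternative expansions.
# VP -> V NP | V NP PP and PP -> P NP come from Example 8-1; the rest are
# assumed lexical rules for illustration.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "VP":  [["V", "NP"], ["V", "NP", "PP"]],
    "PP":  [["P", "NP"]],
    "NP":  [["Mary"], ["Bob"], ["Det", "N"]],
    "V":   [["saw"]],
    "P":   [["with"]],
    "Det": [["the"]],
    "N":   [["dog"]],
}

def expand(symbol):
    """Leftmost expansion taking the first alternative at every step;
    returns the yield (list of words) of the resulting derivation tree."""
    if symbol not in GRAMMAR:              # terminal symbol: a word
        return [symbol]
    words = []
    for child in GRAMMAR[symbol][0]:
        words.extend(expand(child))
    return words

print(" ".join(expand("S")))   # Mary saw Mary
```

Every sentence produced this way is, by construction, admitted by the grammar, with S as the root of its tree.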

Shoebox and Toolbox Lexicons

Perhaps the single most popular tool used by linguists for managing data is Toolbox, previously known as Shoebox since it replaces the field linguist's traditional shoebox full of file cards. Toolbox is freely downloadable from http://www.sil.org/computing/toolbox. A Toolbox file consists of a collection of entries, where each entry is made up of one or more fields. Most fields are optional or repeatable, which means that this kind of lexical resource cannot be treated as a table or spreadsheet....
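Toolbox entries are lines of backslash-prefixed field markers. A minimal parser can be sketched in plain Python; this is a simplified stand-in for NLTK's toolbox support, and the sample entry (using the standard \lx lexeme, \ps part-of-speech, and \ge gloss markers) is illustrative.

```python
def parse_toolbox(text):
    """Parse Toolbox-style text into a list of entries, where each entry
    is a list of (marker, value) pairs.  Because fields are optional and
    repeatable, a list of pairs is used rather than a dict."""
    entries, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        marker, _, value = line.partition(" ")
        if marker == "\\lx" and current:   # \lx starts a new entry
            entries.append(current)
            current = []
        current.append((marker.lstrip("\\"), value))
    if current:
        entries.append(current)
    return entries

sample = """\\lx kaa
\\ps V
\\ge gag
\\lx kaakaaro
\\ps N
\\ge mixture"""

entries = parse_toolbox(sample)
print(len(entries))    # 2
print(entries[0][0])   # ('lx', 'kaa')
```

Keeping fields as an ordered list of pairs preserves repeated markers, which a spreadsheet-style representation would lose.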

Searching Text


There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing monstrous in parentheses:

    >>> text1.concordance("monstrous")
    Building index...
    Displaying 11 of 11 matches:
    ong the former , one was of a most monstrous size . This came towards us ,
    ON OF THE PSALMS . Touching that monstrous...
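A concordance view can be sketched in plain Python. This is a simplified stand-in for NLTK's concordance method, and the sample tokens are made up for the example.

```python
def concordance(tokens, word, width=15):
    """Show every occurrence of `word` with up to `width` characters of
    context on each side, in the style of a concordance view."""
    matches = []
    target = word.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            matches.append(f"{left:>{width}} {tok} {right}")
    print(f"Displaying {len(matches)} matches:")
    for line in matches:
        print(line)
    return matches

sample = "a most monstrous size this came towards us that monstrous bulk".split()
lines = concordance(sample, "monstrous")
```

Aligning the keyword in a fixed column is what makes the recurring contexts easy to scan by eye.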

Exercises

1. Can you come up with grammatical sentences that probably have never been uttered before? Take turns with a partner. What does this tell you about human language?
2. Recall Strunk and White's prohibition against using a sentence-initial however to mean although. Do a web search for however used at the start of the sentence. How widely used is this construction?
3. Consider the sentence Kim arrived or Dana left and everyone cheered. Write down the parenthesized forms to show the relative...

Defensive Programming

In order to avoid some of the pain of debugging, it helps to adopt some defensive programming habits. Instead of writing a 20-line program and then testing it, build the program bottom-up out of small pieces that are known to work. Each time you combine these pieces to make a larger unit, test it carefully to see that it works as expected. Consider adding assert statements to your code, specifying properties of a variable, e.g., assert isinstance(text, list). If the value of the text variable...
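The habit can be illustrated with a short example; tag_words is a hypothetical function invented for this sketch.

```python
def tag_words(text):
    """Pair each word with its length.  The assert both documents and
    enforces the precondition that `text` is a list of strings."""
    assert isinstance(text, list), "text should be a list of strings"
    return [(w, len(w)) for w in text]

print(tag_words(["the", "cat"]))   # [('the', 3), ('cat', 3)]

try:
    tag_words("the cat")           # a string, not a list: fails fast
except AssertionError as e:
    print("caught:", e)
```

Failing immediately at the assert, with a clear message, is far easier to debug than a wrong answer surfacing several function calls later.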

Pronouncing Dictionary

A slightly richer kind of lexical resource is a table (or spreadsheet), containing a word plus some properties in each row. NLTK includes the CMU Pronouncing Dictionary for U.S. English, which was designed for use by speech synthesizers.

    >>> entries = nltk.corpus.cmudict.entries()
    >>> for entry in entries[39943:39951]:
    ...     print entry
    ('fir', ['F', 'ER1'])
    ('fire', ['F', 'AY1', 'ER0'])
    ('fire', ['F', 'AY1', 'R'])
    ('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M'])
    ('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M'])
    ...

Shift-Reduce Parsing

A simple kind of bottom-up parser is the shift-reduce parser. In common with all bottom-up parsers, a shift-reduce parser tries to find sequences of words and phrases that correspond to the righthand side of a grammar production, and replace them with the lefthand side, until the whole sentence is reduced to an S. The shift-reduce parser repeatedly pushes the next input word onto a stack (Section 4.1); this is the shift operation. If the top n items on the stack match the n items on the righthand...
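The shift and reduce operations can be sketched in a few lines. This is an illustrative recognizer, not NLTK's ShiftReduceParser; the grammar and lexicon are assumptions, and reductions are applied greedily whenever the top of the stack matches a righthand side.

```python
PRODUCTIONS = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("Det", ("the",)), ("Det", ("a",)),
    ("N", ("dog",)), ("N", ("man",)),
    ("V", ("saw",)),
]

def try_reduce(stack):
    """If the top n stack items match the righthand side of a production,
    replace them with its lefthand side (the reduce operation)."""
    for lhs, rhs in PRODUCTIONS:
        n = len(rhs)
        if n <= len(stack) and tuple(stack[-n:]) == rhs:
            stack[-n:] = [lhs]
            return True
    return False

def shift_reduce(tokens):
    stack, buffer = [], list(tokens)
    while True:
        if try_reduce(stack):             # reduce whenever possible
            continue
        if buffer:
            stack.append(buffer.pop(0))   # otherwise shift the next word
        else:
            return stack == ["S"]         # success: everything reduced to S

print(shift_reduce("the dog saw a man".split()))   # True
```

A real shift-reduce parser must also decide between shifting and reducing when both are possible; the greedy policy here happens to work for this grammar but fails on others, which is one of the limitations of this parsing strategy.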

The WordNet Hierarchy


WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, and Event; these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific. A small portion of a concept hierarchy is illustrated in Figure 2-8.

Figure 2-8. Fragment of WordNet concept hierarchy: nodes correspond to synsets; edges indicate...

Querying a Database

Suppose we have a program that lets us type in a natural language question and gives us back the right answer:

(1) a. Which country is Athens in?
    b. Greece.

How hard is it to write such a program? And can we just use the same techniques that we've encountered so far in this book, or does it involve something new? In this section, we will show that solving the task in a restricted domain is pretty straightforward. But we will also see that to address the problem in a more general way, we have to open...
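The restricted-domain version can be sketched with the standard library's sqlite3 module: a toy city/country table and a single fixed question pattern translated into SQL. The table name, data, and pattern are assumptions made for this sketch, not the book's database.

```python
import re
import sqlite3

# A toy geographic table standing in for a real city database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_table (city TEXT, country TEXT)")
conn.executemany("INSERT INTO city_table VALUES (?, ?)",
                 [("athens", "greece"), ("paris", "france")])

def answer(question):
    """Translate one fixed question pattern into SQL and run it."""
    m = re.match(r"Which country is (\w+) in\??", question)
    if not m:
        return None                        # question not in our tiny grammar
    cur = conn.execute("SELECT country FROM city_table WHERE city = ?",
                       (m.group(1).lower(),))
    row = cur.fetchone()
    return row[0] if row else None

print(answer("Which country is Athens in?"))   # greece
```

The brittleness is the point: the regular expression only recognizes one sentence shape, which is exactly why the general problem requires something more than the techniques seen so far.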

Propositional Logic

A logical language is designed to make reasoning formally explicit. As a result, it can capture aspects of natural language which determine whether a set of sentences is consistent. As part of this approach, we need to develop logical representations of a sentence φ that formally capture the truth-conditions of φ. We'll start off with a simple example:

(8) Klaus chased Evi and Evi ran away.

Let's replace the two sub-sentences in (8) by φ and ψ respectively, and put & for the logical operator...
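The truth-conditions of a conjunction like φ & ψ can be checked mechanically. Below is a minimal sketch of propositional evaluation; the nested-tuple formula representation is an assumption for this example, not NLTK's logic package.

```python
from itertools import product

def evaluate(formula, valuation):
    """Evaluate a formula built from proposition names, '&' (and),
    '|' (or), and '-' (not), represented as nested tuples."""
    if isinstance(formula, str):
        return valuation[formula]          # look up an atomic proposition
    op = formula[0]
    if op == "-":
        return not evaluate(formula[1], valuation)
    left = evaluate(formula[1], valuation)
    right = evaluate(formula[2], valuation)
    return (left and right) if op == "&" else (left or right)

# phi = "Klaus chased Evi", psi = "Evi ran away"; sentence (8) is phi & psi.
conjunction = ("&", "phi", "psi")
for phi, psi in product([True, False], repeat=2):
    print(phi, psi, "->", evaluate(conjunction, {"phi": phi, "psi": psi}))
```

The printed truth table shows that (8) is true in exactly one of the four situations: when both sub-sentences are true.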

Recursive Descent Parsing

The simplest kind of parser interprets a grammar as a specification of how to break a high-level goal into several lower-level subgoals. The top-level goal is to find an S. The S -> NP VP production permits the parser to replace this goal with two subgoals: find an NP, then find a VP. Each of these subgoals can be replaced in turn by sub-subgoals, using productions that have NP and VP on their lefthand side. Eventually, this expansion process leads to subgoals such as: find the word telescope. Such...
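The goal-expansion process can be sketched as a recursive recognizer. This is an illustrative sketch, not NLTK's RecursiveDescentParser, and the grammar is an assumption for the example.

```python
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["telescope"]],
    "V":   [["saw"]],
}

def parse(goal, tokens, pos):
    """Try to satisfy `goal` starting at tokens[pos]; return the set of
    positions the goal can end at (an empty set means failure)."""
    if goal not in GRAMMAR:                    # terminal: must match a word
        if pos < len(tokens) and tokens[pos] == goal:
            return {pos + 1}
        return set()
    ends = set()
    for expansion in GRAMMAR[goal]:            # try each production in turn
        positions = {pos}
        for subgoal in expansion:              # satisfy subgoals left to right
            positions = {e for p in positions
                         for e in parse(subgoal, tokens, p)}
        ends |= positions
    return ends

def recognize(sentence):
    tokens = sentence.split()
    return len(tokens) in parse("S", tokens, 0)

print(recognize("the dog saw a telescope"))    # True
```

Each call replaces a goal with subgoals exactly as described: S becomes "find an NP, then find a VP", and the recursion bottoms out at goals like "find the word telescope".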

The Structure of TIMIT

Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced selection of dialects, speakers, and materials. For each of eight dialect regions, 50 male and female speakers having a range of ages and educational backgrounds each read 10 carefully chosen sentences. Two sentences, read by all speakers, were designed to bring out dialect variation:

(1) a. she had your dark suit in greasy wash water all year
    b. don't ask me to carry an oily rag like...

Naive Bayes Classifiers


In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking the frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. The label whose likelihood estimate is...
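The prior-plus-feature-contributions calculation can be sketched directly. This is a minimal illustration of the naive Bayes idea (with add-one smoothing in log space), not NLTK's NaiveBayesClassifier; the tiny training set and feature names are invented for the example.

```python
from collections import Counter, defaultdict
from math import log

def train(labeled_features):
    """Collect label frequencies (the priors) and per-label feature counts."""
    label_freq = Counter(label for _, label in labeled_features)
    feature_freq = defaultdict(Counter)
    for features, label in labeled_features:
        feature_freq[label].update(features)
    return label_freq, feature_freq

def classify(features, label_freq, feature_freq):
    """Score each label by log prior plus the sum of log feature
    likelihoods (add-one smoothed); return the highest-scoring label."""
    total = sum(label_freq.values())
    vocab = {f for counts in feature_freq.values() for f in counts}
    best, best_score = None, float("-inf")
    for label, count in label_freq.items():
        score = log(count / total)                       # prior probability
        denom = sum(feature_freq[label].values()) + len(vocab)
        for f in features:                               # feature contributions
            score += log((feature_freq[label][f] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

training = [
    (["ends_in_a"], "female"), (["ends_in_a"], "female"),
    (["ends_in_k"], "male"), (["ends_in_o"], "male"),
]
label_freq, feature_freq = train(training)
print(classify(["ends_in_a"], label_freq, feature_freq))   # female
```

Working in log space avoids underflow when many small probabilities are multiplied, which is the standard trick in naive Bayes implementations.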

Precision and Recall


Another instance where accuracy scores can be misleading is in search tasks, such as information retrieval, where we are attempting to find documents that are relevant to a particular task. Since the number of irrelevant documents far outweighs the number of relevant documents, the accuracy score for a model that labels every document as irrelevant would be very close to 100%. It is therefore conventional to employ a different set of measures for search tasks, based on the number of items in...
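The standard measures can be computed directly from set operations; the document IDs below are invented for the example.

```python
def precision_recall(reference, test):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN), where the
    reference set holds the truly relevant items and the test set holds
    the items the system retrieved."""
    tp = len(reference & test)       # true positives: retrieved and relevant
    fp = len(test - reference)       # false positives: retrieved, irrelevant
    fn = len(reference - test)       # false negatives: relevant, missed
    return tp / (tp + fp), tp / (tp + fn)

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d9"}
p, r = precision_recall(relevant, retrieved)
print(p, r)
```

Note how a model that retrieves nothing would score well on accuracy but has recall of zero, which is exactly the failure these measures are designed to expose.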

Noun Phrase Chunking

We will begin by considering the task of noun phrase chunking, or NP-chunking, where we search for chunks corresponding to individual noun phrases. For example, here is some Wall Street Journal text with NP-chunks marked using brackets:

(2) [ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.

As we can see, NP-chunks are...
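A very crude NP-chunker can be sketched by grouping maximal runs of determiner/adjective/noun tags. This is a simplified stand-in for a real tag-pattern chunker, and it over-chunks (e.g., it would wrongly group a lone adjective), but it shows the basic idea on the Penn Treebank tags used in the example.

```python
def np_chunk(tagged):
    """Group maximal runs of determiner/adjective/noun tags into chunks.
    A deliberately crude sketch: real chunkers use tag patterns."""
    np_tags = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}
    chunks, current = [], []
    for word, tag in tagged:
        if tag in np_tags:
            current.append(word)         # extend the current chunk
        elif current:
            chunks.append(current)       # a non-NP tag closes the chunk
            current = []
    if current:
        chunks.append(current)
    return chunks

tagged = [("the", "DT"), ("market", "NN"), ("for", "IN"),
          ("system-management", "NN"), ("software", "NN"),
          ("for", "IN"), ("Digital", "NNP")]
print(np_chunk(tagged))
```

On this fragment the output matches the bracketing shown above: the/DT market/NN, system-management/NN software/NN, and Digital/NNP each form one chunk.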

Exercises

1. Find out more about sequence objects using Python's help facility. In the interpreter, type help(str), help(list), and help(tuple). This will give you a full list of the functions supported by each type. Some functions have special names flanked with underscores; as the help documentation shows, each such function corresponds to something more familiar. For example, x.__getitem__(y) is just a long-winded way of saying x[y].
2. Identify three operations that can be performed on both tuples and lists. Identify three...
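The correspondence between special names and familiar syntax in exercise 1 can be checked directly in the interpreter:

```python
x = ["a", "b", "c"]

# The special method __getitem__ is what the indexing syntax calls:
print(x.__getitem__(1))   # b
print(x[1])               # b

# The same correspondence holds for other sequence types and operations:
assert "chat".__getitem__(2) == "chat"[2]
assert (10, 20).__len__() == len((10, 20))
```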

Entropy and Information Gain

As was mentioned before, there are several methods for identifying the most informative feature for a decision stump. One popular alternative, called information gain, measures how much more organized the input values become when we divide them up using a given feature. To measure how disorganized the original set of input values is, we calculate the entropy of their labels, which will be high if the input values have highly varied labels, and low if many input values all have the same label. In...
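The entropy of a list of labels is H = -Σ p(l)·log2 p(l), which can be computed in a few lines (here written equivalently as p·log2(1/p)); the sample labels are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a label list: 0 when every label is the same,
    1 bit for an even two-way split, higher for more varied labels."""
    freq = Counter(labels)
    total = len(labels)
    # sum of p * log2(1/p), equivalent to -sum of p * log2(p)
    return sum((n / total) * log2(total / n) for n in freq.values())

print(entropy(["male"] * 4))               # 0.0  (all labels identical)
print(entropy(["male", "female"] * 2))     # 1.0  (evenly split labels)
```

Information gain for a feature is then the original entropy minus the (weighted) entropy of the subsets the feature creates.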

Transitive Verbs

Our next challenge is to deal with sentences containing transitive verbs, such as (46). The output semantics that we want to build is exists x.(dog(x) & chase(angus,x)). Let's look at how we can use λ-abstraction to get this result. A significant constraint on possible solutions is to require that the semantic representation of a dog be independent of whether the NP acts as subject or object of the sentence. In other words, we want to get the formula just shown as our output while sticking to...
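The λ-abstraction strategy can be mimicked with Python lambdas that build formula strings. This string-building encoding is an illustrative assumption, not the book's λ-calculus machinery, but it shows how one NP representation serves in both subject and object position.

```python
# Semantics as string-building functions.  A quantified NP like "a dog"
# denotes a function from a property to a closed formula, so the same
# NP value can be applied in either grammatical position.
dog = lambda x: f"dog({x})"
chase = lambda y: lambda x: f"chase({x},{y})"           # transitive verb
a_dog = lambda P: f"exists x.({dog('x')} & {P('x')})"   # quantified NP
angus = lambda P: P("angus")                            # proper noun as NP

# "Angus chases a dog": the VP applies the object NP to the verb,
# then the subject NP applies to the resulting property.
vp = lambda x: a_dog(lambda y: chase(y)(x))
print(angus(vp))   # exists x.(dog(x) & chase(angus,x))
```

The key point is that a_dog is defined once, with no reference to subject or object position; the composition order alone determines where its variable ends up in the formula.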

Named Entity Recognition

At the start of this chapter, we briefly introduced named entities (NEs). Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. Table 7-3 lists some of the more commonly used types of NEs. These should be self-explanatory, except for FACILITY (human-made artifacts in the domains of architecture and civil engineering) and GPE (geo-political entities such as city, state/province, and country). Table 7-3. Commonly used...