Accessing Text from the Web and from Disk: Electronic Books

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese, and...
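Once you have such a URL, you can fetch the raw text directly. The following is a minimal sketch, using the Python 2 urllib of this book's era (Python 3 moved urlopen to urllib.request); the book number in the URL is illustrative:

    from urllib import urlopen   # Python 3: from urllib.request import urlopen

    url = "http://www.gutenberg.org/files/2554/2554.txt"   # an illustrative e-text URL
    raw = urlopen(url).read()    # the entire book as a single string
    print(len(raw))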

Contemporary Philosophical Divides

The contrasting approaches to NLP described in the preceding section relate back to early metaphysical debates about rationalism versus empiricism and realism versus idealism that occurred in the Enlightenment period of Western philosophy. These debates took place against a backdrop of orthodox thinking in which the source of all knowledge was believed to be divine revelation. During this period of the 17th and 18th centuries, philosophers argued that human reason or sensory experience has...

Functions: The Foundation of Structured Programming

Functions provide an effective way to package and reuse program code, as already explained in Section 2.3. For example, suppose we find that we often want to read text from an HTML file. This involves several steps: opening the file, reading it in, normalizing whitespace, and stripping HTML markup. We can collect these steps into a function, and give it a name such as get_text(), as shown in Example 4-1. Example 4-1. Read text from a file, normalizing whitespace and stripping HTML markup...
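A sketch of such a function, assuming only what the passage lists (strip HTML tags with a regular expression, then collapse runs of whitespace); treat it as a reconstruction of what Example 4-1 describes, not the exact original listing:

    import re

    def get_text(file):
        """Read text from a file, normalizing whitespace and stripping HTML markup."""
        text = open(file).read()
        text = re.sub(r'<.*?>', ' ', text)   # strip HTML tags
        text = re.sub(r'\s+', ' ', text)     # normalize whitespace
        return text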

Fundamental Data Types

Despite its complexity, the TIMIT Corpus contains only two fundamental data types, namely lexicons and texts. As we saw in Chapter 2, most lexical resources can be represented using a record structure, i.e., a key plus one or more fields, as shown in Figure 11-3. A lexical resource could be a conventional dictionary or comparative wordlist, as illustrated. It could also be a phrasal lexicon, where the key field is a phrase rather than a single word. A thesaurus also consists of...

Incrementally Updating a Dictionary

We can employ dictionaries to count occurrences, emulating the method for tallying words shown in Figure 1-3. We begin by initializing an empty defaultdict, then process each part-of-speech tag in the text. If the tag hasn't been seen before, it will have a zero count by default. Each time we encounter a tag, we increment its count using the += operator (see Example 5-3). Example 5-3. Incrementally updating a dictionary, and sorting by value. >>> counts = nltk.defaultdict(int) >>>...
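A sketch of the counting loop that Example 5-3 describes, assuming the Brown Corpus news text as input (the sorting-by-value step operates on the same dictionary):

    import nltk
    from nltk.corpus import brown

    counts = nltk.defaultdict(int)                    # unseen tags default to zero
    for (word, tag) in brown.tagged_words(categories='news'):
        counts[tag] += 1                              # increment this tag's count
    print(sorted(counts.items(), key=lambda item: item[1], reverse=True)[:5])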

Maximizing Entropy

The intuition that motivates Maximum Entropy classification is that we should build a model that captures the frequencies of individual joint-features, without making any unwarranted assumptions. An example will help to illustrate this principle. Suppose we are assigned the task of picking the correct word sense for a given word, from a list of 10 possible senses (labeled A-J). At first, we are not told anything more about the word or the senses. There are many probability distributions that we...
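With no further information, the distribution that avoids unwarranted assumptions gives each of the 10 senses probability 0.1; any preference for one sense lowers the entropy. A quick check, using the standard formula H = -sum(p * log2 p):

    from math import log

    def entropy(dist):
        # Shannon entropy in bits; zero-probability outcomes contribute nothing
        return -sum(p * log(p, 2) for p in dist if p > 0)

    uniform = [0.1] * 10               # assume nothing beyond '10 senses'
    skewed = [0.91] + [0.01] * 9       # an unwarranted preference for sense A
    print(entropy(uniform))            # about 3.32 bits, the maximum for 10 outcomes
    print(entropy(skewed))             # well under 1 bit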

Operating on Every Element

In Section 1.3, we saw some examples of counting items other than words. Let's take a closer look at the notation we used:

    >>> [len(w) for w in text1]
    [1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
    >>> [w.upper() for w in text1]
    ['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]
    >>>

These expressions have the form [f(w) for ...] or [w.f() for ...], where f is a function that operates on a word to compute its length, or to...

Other Python Libraries

There are many other Python libraries, and you can search for them with the help of the Python Package Index at http://pypi.python.org/. Many libraries provide an interface to external software, such as relational databases (e.g., mysql-python) and large document collections (e.g., PyLucene). Many other libraries give access to file formats such as PDF, MSWord, and XML (pypdf, pywin32, xml.etree), RSS feeds (e.g., feedparser), and electronic mail (e.g., imaplib, email).

Processing Search Engine Results

Table: Google hits for collocations. The number of hits for collocations involving the words absolutely or definitely, followed by one of adore, love, like, or prefer (Liberman, in LanguageLog, 2005). Unfortunately, search engines have some significant shortcomings. First, the allowable range of search patterns is severely restricted. Unlike local corpora, where you write programs to search for arbitrarily complex patterns, search engines generally only allow you to search for individual...

Ranges and Closures

The T9 system is used for entering text on mobile phones (see Figure 3-5). Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence? Here we use the regular expression «^[ghi][mno][jlk][def]$»:

    >>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
    ['gold', 'golf', 'hold', 'hole']

The first part of the expression, «^[ghi]»,...

Standards and Tools

For a corpus to be widely useful, it needs to be available in a widely supported format. However, the cutting edge of NLP research depends on new kinds of annotations, which by definition are not widely supported. In general, adequate tools for creation, publication, and use of linguistic data are not widely available. Most projects must develop their own set of tools for internal use, which is no help to others who lack the necessary resources. Furthermore, we do not have adequate, generally...

Text Corpus Structure

We have seen a variety of corpus structures so far; these are summarized in Figure 2-3. The simplest kind lacks any structure: it is just a collection of texts. Often, texts are grouped into categories that might correspond to genre, source, author, language, etc. Sometimes these categories overlap, notably in the case of topical categories, as a text can be relevant to more than one topic. Occasionally, text collections have temporal structure, news collections being the most common example....

The ElementTree Interface

Python's ElementTree module provides a convenient way to access data stored in XML files. ElementTree is part of Python's standard library (since Python 2.5), and is also provided as part of NLTK in case you are using Python 2.4. We will illustrate the use of ElementTree using a collection of Shakespeare plays that have been formatted using XML. Let's load the XML file and inspect the raw data, first at the top of the file, where we see some XML headers and the name of a schema called...
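A sketch of loading one of these plays with the standard-library API (the filename is illustrative; in an NLTK installation you would first locate the file, e.g., with nltk.data.find):

    from xml.etree.ElementTree import ElementTree

    merchant_file = 'merchant.xml'                 # illustrative path to a play
    merchant = ElementTree().parse(merchant_file)  # returns the root element
    print(merchant)                                # the root element of the play
    print(merchant[0])                             # its first child element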

Treebanks and Grammars

The corpus module defines the treebank corpus reader, which contains a 10% sample of the Penn Treebank Corpus. Here we read and display one of its parsed sentences:

    >>> from nltk.corpus import treebank
    >>> t = treebank.parsed_sents('wsj_0001.mrg')[0]
    >>> print t
    (S
      (NP-SBJ
        (NP (NNP Pierre) (NNP Vinken))
        (, ,)
        (ADJP (NP (CD 61) (NNS years)) (JJ old))
        (, ,))
      (VP ...
        (NP (DT a) (JJ nonexecutive) (NN director))
        (NP-TMP (NNP Nov.) (CD 29)))
      (. .))

We can use this data to help develop a grammar. For example, the program in Example 8-4 uses a simple filter to find verbs that take sentential complements. Assuming we...
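The filter of Example 8-4 can be sketched as follows: a subtree qualifies if it is a VP with an S among its children (the Tree attribute is spelled node in the NLTK version this book targets, label() in newer releases):

    def filter(tree):
        child_nodes = [child.node for child in tree
                       if isinstance(child, nltk.Tree)]
        return (tree.node == 'VP') and ('S' in child_nodes)

    # collect VPs that take sentential complements across the sample
    print([subtree for tree in treebank.parsed_sents()
                   for subtree in tree.subtrees(filter)][:3])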

Using XML for Linguistic Structures

Thanks to its flexibility and extensibility, XML is a natural choice for representing linguistic structures. Here's an example of a simple lexical entry:

    <entry>
      <headword>whale</headword>
      <pos>noun</pos>
      <gloss>any of the larger cetacean mammals having a streamlined
      body and breathing through a blowhole on the head</gloss>
    </entry>

It consists of a series of XML tags enclosed in angle brackets. Each opening tag, such as <gloss>, is matched with a closing tag, </gloss>; together they constitute...

Well-Formed Substring Tables

The simple parsers discussed in the previous sections suffer from limitations in both completeness and efficiency. In order to remedy these, we will apply the algorithm design technique of dynamic programming to the parsing problem. As we saw in Section 4.7, dynamic programming stores intermediate results and reuses them when appropriate, achieving significant efficiency gains. This technique can be applied to syntactic parsing, allowing us to store partial solutions to the parsing task and...
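To make this concrete, here is a toy sketch of filling a well-formed substring table (a simplified CYK-style pass, not NLTK's implementation; the mini-grammar is hypothetical): lexical categories seed the diagonal, and longer spans are built by combining stored results for shorter spans rather than recomputing them.

    tokens = ['I', 'shot', 'an', 'elephant']
    lexical = {'I': 'NP', 'shot': 'V', 'an': 'Det', 'elephant': 'N'}
    binary = {('Det', 'N'): 'NP', ('V', 'NP'): 'VP', ('NP', 'VP'): 'S'}

    n = len(tokens)
    wfst = [[None] * (n + 1) for i in range(n + 1)]
    for i, word in enumerate(tokens):              # seed with lexical categories
        wfst[i][i + 1] = lexical[word]
    for span in range(2, n + 1):                   # build longer spans from shorter ones
        for start in range(n + 1 - span):
            end = start + span
            for mid in range(start + 1, end):      # reuse stored partial results
                pair = (wfst[start][mid], wfst[mid][end])
                if pair in binary:
                    wfst[start][end] = binary[pair]
    print(wfst[0][n])                              # 'S' when the whole sentence parses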

What's the Use of Syntax? Beyond n-grams

We gave an example in Chapter 2 of how to use the frequency information in bigrams to generate text that seems perfectly acceptable for small sequences of words but rapidly degenerates into nonsense. Here's another pair of examples that we created by computing the bigrams over the text of a children's story, The Adventures of Buster Brown, included in the Project Gutenberg Selection Corpus: (4) a. He roared with me the pail slip down his back b. The worst part and clumsy looking for whoever heard...
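One simple way to produce text like this, sketched below, is to record every bigram in the story and then repeatedly pick a random attested successor of the current word (an illustration of the general technique, not the exact program used for the examples above):

    import random

    def generate(words, start, length=15):
        # map each word to all words that follow it somewhere in the text
        successors = {}
        for w1, w2 in zip(words[:-1], words[1:]):
            successors.setdefault(w1, []).append(w2)
        out = [start]
        for i in range(length):
            following = successors.get(out[-1])
            if not following:
                break
            out.append(random.choice(following))   # pick any attested successor
        return ' '.join(out)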

Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on. Table 2-1 gives an example document for each genre. Table 2-1. Example document for each section of the Brown Corpus: Christian Science Monitor: Editorials; Underwood: Probing the...
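We can access the corpus and its categories directly with the standard NLTK corpus API:

    from nltk.corpus import brown

    print(brown.categories())                   # ['adventure', 'belles_lettres', ...]
    print(brown.words(categories='news')[:10])  # tokens from the news genre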

Context-Free Grammar: A Simple Grammar

Let's start off by looking at a simple context-free grammar (CFG). By convention, the lefthand side of the first production is the start-symbol of the grammar, typically S, and all well-formed trees must have this symbol as their root label. In NLTK, context-free grammars are defined in the nltk.grammar module. In Example 8-1 we define a grammar and show how to parse a simple sentence admitted by the grammar. Example 8-1. A simple context-free grammar:

    VP -> V NP | V NP PP
    PP -> P NP
    V -> ...
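A sketch of what Example 8-1 describes, assuming the productions above plus a start rule and a small lexicon (nltk.CFG.fromstring is the newer spelling; the NLTK version this book targets calls it nltk.parse_cfg):

    import nltk

    grammar1 = nltk.CFG.fromstring("""
      S -> NP VP
      VP -> V NP | V NP PP
      PP -> P NP
      V -> "saw" | "ate"
      NP -> "Mary" | "Bob" | Det N
      Det -> "a" | "the"
      N -> "dog" | "telescope"
      P -> "with"
      """)
    sent = 'Mary saw a dog'.split()
    for tree in nltk.ChartParser(grammar1).parse(sent):
        print(tree)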

Extracting Encoded Text from Files

Let's assume that we have a small text file, and that we know how it is encoded. For example, polish-lat2.txt, as the name suggests, is a snippet of Polish text from the Polish Wikipedia. This file is encoded as Latin-2, also known as ISO-8859-2. The function nltk.data.find() locates the file for us:

    >>> path = nltk.data.find(...)

The Python codecs module provides functions to read encoded data into Unicode strings, and to write out Unicode strings in encoded form. The codecs.open function takes an encoding...
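A sketch of reading the file with an explicit encoding via codecs.open (the variable names follow the passage; the output handling is illustrative):

    import codecs

    f = codecs.open(path, encoding='latin2')   # decode Latin-2 bytes to Unicode
    for line in f:
        print(line.strip().encode('unicode_escape'))  # show non-ASCII as escapes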

NLTK Feature Grammar Embedded Function

1. What constraints are required to correctly parse word sequences like I am happy and she is happy but not you is happy or they am happy? Implement two solutions for the present tense paradigm of the verb be in English, first taking Grammar 8 as your starting point, and then taking Grammar 20 as the starting point. 2. Develop a variant of the grammar in Example 9-1 that uses a feature COUNT to make the distinctions shown here: 3. Write a function subsumes() that holds of two feature structures...

Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus:

    >>> import nltk
    >>> nltk.corpus.gutenberg.fileids()
    ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', ...
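Picking the first of these, we can ask how many words the text contains (standard NLTK corpus API):

    emma = nltk.corpus.gutenberg.words('austen-emma.txt')
    print(len(emma))   # number of tokens in Jane Austen's Emma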

Case and Gender in German

Compared with English, German has a relatively rich morphology for agreement. For example, the definite article in German varies with case, gender, and number, as shown in Table 9-2. Table 9-2. Morphological paradigm for the German definite article. Subjects in German take the nominative case, and most verbs govern their objects in the accusative case. However, there are exceptions, such as helfen, that govern the dative case...
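The paradigm in question is the standard one for the German definite article:

    Case   Masculine  Feminine  Neuter  Plural
    Nom    der        die       das     die
    Gen    des        der       des     der
    Dat    dem        der       dem     den
    Acc    den        die       das     die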

Searching Text

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing monstrous in parentheses:

    >>> text1.concordance("monstrous")
    Building index...
    Displaying 11 of 11 matches:
    ong the former , one was of a most monstrous size . This came towards us ,
    ON OF THE PSALMS . Touching that monstrous...

Exercises

1. Can you come up with grammatical sentences that probably have never been uttered before? Take turns with a partner. What does this tell you about human language? 2. Recall Strunk and White's prohibition against using a sentence-initial however to mean although. Do a web search for however used at the start of the sentence. How widely used is this construction? 3. Consider the sentence Kim arrived or Dana left and everyone cheered. Write down the parenthesized forms to show the relative...

Choosing the Right Features

Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the learning method's ability to extract a good model. Much of the interesting work in building a classifier is deciding what features might be relevant, and how we can represent them. Although it's often possible to get decent performance by using a fairly simple and obvious set of features, there are usually significant gains to be had by using carefully constructed features based...
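For instance, a feature extractor for a name-gender classifier might encode just the final letter of the word; this tiny example illustrates the encoding step with a fairly simple and obvious feature:

    def gender_features(word):
        # encode a single, obvious feature: the word's last letter
        return {'last_letter': word[-1]}

    print(gender_features('Shrek'))   # {'last_letter': 'k'}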

Linguistic Data and Unlimited Possibilities

Previous chapters have shown you how to process and analyze text corpora, and we have stressed the challenges for NLP in dealing with the vast amount of electronic language data that is growing daily. Let's consider this data more closely, and make the thought experiment that we have a gigantic corpus consisting of everything that has been either uttered or written in English over, say, the last 50 years. Would we be justified in calling this corpus "the language of modern English"? There are a...

Defensive Programming

In order to avoid some of the pain of debugging, it helps to adopt some defensive programming habits. Instead of writing a 20-line program and then testing it, build the program bottom-up out of small pieces that are known to work. Each time you combine these pieces to make a larger unit, test it carefully to see that it works as expected. Consider adding assert statements to your code, specifying properties of a variable, e.g., assert isinstance(text, list). If the value of the text variable...
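A minimal illustration of the idea (a hypothetical helper, not from the book):

    def average_word_length(text):
        # fail early, with a clear signal, if text is not a list of strings
        assert isinstance(text, list), 'text must be a list of word strings'
        return sum(len(w) for w in text) / float(len(text))

    print(average_word_length(['the', 'cat', 'sat']))   # 3.0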

Pronouncing Dictionary

A slightly richer kind of lexical resource is a table (or spreadsheet), containing a word plus some properties in each row. NLTK includes the CMU Pronouncing Dictionary for U.S. English, which was designed for use by speech synthesizers:

    >>> entries = nltk.corpus.cmudict.entries()
    >>> for entry in entries[39943:39951]:
    ...     print entry
    ('fir', ['F', 'ER1'])
    ('fire', ['F', 'AY1', 'ER0'])
    ('fire', ['F', 'AY1', 'R'])
    ('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M'])
    ('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M'])...

Shift-Reduce Parsing

A simple kind of bottom-up parser is the shift-reduce parser. In common with all bottom-up parsers, a shift-reduce parser tries to find sequences of words and phrases that correspond to the righthand side of a grammar production, and replace them with the lefthand side, until the whole sentence is reduced to an S. The shift-reduce parser repeatedly pushes the next input word onto a stack (Section 4.1); this is the shift operation. If the top n items on the stack match the n items on the righthand...
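NLTK includes a demonstration shift-reduce parser; a sketch of running it with a grammar such as grammar1 from the CFG section (in the book-era API, parse() returns a single tree or None; newer versions return an iterator):

    sr_parser = nltk.ShiftReduceParser(grammar1)
    sent = 'Mary saw a dog'.split()
    print(sr_parser.parse(sent))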

The WordNet Hierarchy

WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, and Event; these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific. A small portion of a concept hierarchy is illustrated in Figure 2-8. Figure 2-8. Fragment of WordNet concept hierarchy: nodes correspond to synsets; edges indicate...
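We can navigate this hierarchy directly, moving from a synset down to its hyponyms or up to its hypernyms (standard NLTK WordNet API):

    from nltk.corpus import wordnet as wn

    motorcar = wn.synset('car.n.01')
    print(motorcar.hyponyms()[:3])    # more specific concepts, e.g. hatchback
    print(motorcar.hypernyms())       # more general concepts, e.g. motor vehicle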

Querying a Database

Suppose we have a program that lets us type in a natural language question and gives us back the right answer: (1) a. Which country is Athens in? b. Greece. How hard is it to write such a program? And can we just use the same techniques that we've encountered so far in this book, or does it involve something new? In this section, we will show that solving the task in a restricted domain is pretty straightforward. But we will also see that to address the problem in a more general way, we have to open...
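In a restricted domain, the right answer can be produced by translating the question into a database query; for a toy table of cities and countries, something like SELECT Country FROM city_table WHERE City = 'athens' (the table and column names here are illustrative).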

Counting Vocabulary

The most obvious fact about texts that emerges from the preceding examples is that they differ in the vocabulary they use. In this section, we will see how to use the computer to count the words in a text in a variety of useful ways. As before, you will jump right in and experiment with the Python interpreter, even though you may not have studied Python systematically yet. Test your understanding by modifying the examples, and trying the exercises at the end of the chapter. Let's begin by...
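For example, using the texts loaded by the book's nltk.book module, we can count tokens and distinct word types:

    from nltk.book import *     # loads text1 ... text9

    print(len(text3))           # total tokens in the Book of Genesis
    print(len(set(text3)))      # distinct word types
    print(len(text3) / float(len(set(text3))))   # average uses per type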

Propositional Logic

A logical language is designed to make reasoning formally explicit. As a result, it can capture aspects of natural language which determine whether a set of sentences is consistent. As part of this approach, we need to develop logical representations of a sentence φ that formally capture the truth-conditions of φ. We'll start off with a simple example: (8) Klaus chased Evi and Evi ran away. Let's replace the two sub-sentences in (8) by φ and ψ respectively, and put & for the logical operator...
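This gives us the representation (φ & ψ). NLTK can parse such formulas; a sketch using the logic parser of the NLTK version this book targets (newer releases use nltk.sem.Expression.fromstring), with illustrative predicate names:

    import nltk

    lp = nltk.LogicParser()
    expr = lp.parse('(chase(klaus, evi) & run_away(evi))')
    print(expr)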

Senses and Synonyms

If we replace the word motorcar in (1a) with automobile, to get (1b), the meaning of the sentence stays pretty much the same: (1) a. Benz is credited with the invention of the motorcar. b. Benz is credited with the invention of the automobile. Since everything else in the sentence has remained unchanged, we can conclude that the words motorcar and automobile have the same meaning, i.e., they are synonyms. We can explore these words with the help of WordNet:

    >>> from nltk.corpus import wordnet as wn
    >>> wn.synsets('motorcar')
    [Synset('car.n.01')]...

Recursive Descent Parsing

The simplest kind of parser interprets a grammar as a specification of how to break a high-level goal into several lower-level subgoals. The top-level goal is to find an S. The S -> NP VP production permits the parser to replace this goal with two subgoals: find an NP, then find a VP. Each of these subgoals can be replaced in turn by sub-subgoals, using productions that have NP and VP on their lefthand side. Eventually, this expansion process leads to subgoals such as: find the word telescope. Such...
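NLTK ships a recursive descent parser we can try on grammar1 from the CFG section (a sketch; in the NLTK version this book targets, the method is nbest_parse rather than parse):

    rd_parser = nltk.RecursiveDescentParser(grammar1)
    sent = 'Mary saw a dog'.split()
    for tree in rd_parser.parse(sent):
        print(tree)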

Procedural Versus Declarative Style

We have just seen how the same task can be performed in different ways, with implications for efficiency. Another factor influencing program development is programming style. Consider the following program to compute the average length of words in the Brown Corpus:

    >>> tokens = nltk.corpus.brown.words(categories='news')
    >>> count = 0
    >>> total = 0
    >>> for token in tokens:
    ...     count += 1
    ...     total += len(token)
    >>> print total / count

In this program we use the variable count to keep track of the number of tokens seen, and total to store the combined length of all words. This is a low-level style, not far removed from machine code, the primitive...
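The declarative alternative computes the same statistic by stating what is wanted rather than how to loop over it; this one-liner is a standard rendering of that contrast:

    total = sum(len(t) for t in tokens)
    print(total / float(len(tokens)))   # same average, higher-level style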

The Structure of TIMIT

Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced selection of dialects, speakers, and materials. For each of eight dialect regions, 50 male and female speakers having a range of ages and educational backgrounds each read 10 carefully chosen sentences. Two sentences, read by all speakers, were designed to bring out dialect variation: (1) a. she had your dark suit in greasy wash water all year b. don't ask me to carry an oily rag like...

Spoken Dialogue Systems

In the history of artificial intelligence, the chief measure of intelligence has been a linguistic one, namely the Turing Test: can a dialogue system, responding to a user's text input, perform so naturally that we cannot distinguish it from a human-generated response? In contrast, today's commercial dialogue systems are very limited, but still perform useful functions in narrowly defined domains, as we see here: U: When is Saving Private Ryan playing? S: Saving Private Ryan is not playing at the...

Naive Bayes Classifiers

In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking the frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. The label whose likelihood estimate is...
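In NLTK this pipeline is available directly; a sketch with a toy training set of (featureset, label) pairs (the features and labels here are illustrative):

    import nltk

    train = [({'last_letter': 'a'}, 'female'), ({'last_letter': 'k'}, 'male'),
             ({'last_letter': 'e'}, 'female'), ({'last_letter': 'o'}, 'male')]
    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify({'last_letter': 'a'}))   # label with highest likelihood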

Simple Evaluation and Baselines

Now that we can access a chunked corpus, we can evaluate chunkers. We start off by establishing a baseline for the trivial chunk parser cp that creates no chunks:

    >>> from nltk.corpus import conll2000
    >>> cp = nltk.RegexpParser("")
    >>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    >>> print cp.evaluate(test_sents)
    ChunkParse score:
        IOB Accuracy:  43.4%
        Precision:      0.0%
        Recall:         0.0%

The IOB tag accuracy indicates that more than a third of the words are tagged with O, i.e.,...
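A natural next step beyond this baseline is a naive regular-expression chunker that treats every token whose part-of-speech tag looks nominal as part of an NP; the pattern below follows the book's next example, though the exact scores it produces should be treated as version-dependent:

    grammar = r"NP: {<[CDJNP].*>+}"
    cp = nltk.RegexpParser(grammar)
    print(cp.evaluate(test_sents))    # accuracy rises well above the baseline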

Reading IOB Format and the CoNLL Chunking Corpus

Using the corpora module we can load Wall Street Journal text that has been tagged then chunked using the IOB notation. The chunk categories provided in this corpus are NP, VP, and PP. As we have seen, each sentence is represented using multiple lines, as shown here:

    he PRP B-NP
    accepted VBD B-VP
    the DT B-NP
    position NN I-NP

A conversion function chunk.conllstr2tree() builds a tree representation from one of these multiline strings. Moreover, it permits us to choose any subset of the three chunk...
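A sketch of converting such a string (the function lives under nltk.chunk in the NLTK version this book targets):

    text = '''
    he PRP B-NP
    accepted VBD B-VP
    the DT B-NP
    position NN I-NP
    '''
    tree = nltk.chunk.conllstr2tree(text, chunk_types=['NP'])
    print(tree)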

Precision and Recall

Another instance where accuracy scores can be misleading is in search tasks, such as information retrieval, where we are attempting to find documents that are relevant to a particular task. Since the number of irrelevant documents far outweighs the number of relevant documents, the accuracy score for a model that labels every document as irrelevant would be very close to 100%. It is therefore conventional to employ a different set of measures for search tasks, based on the number of items in...
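These measures are built from true and false positives and negatives; as standard definitions (not specific to NLTK), precision = TP / (TP + FP) is the fraction of identified items that are relevant, and recall = TP / (TP + FN) is the fraction of relevant items that are identified:

    def precision(tp, fp):
        # fraction of items we identified that were actually relevant
        return tp / float(tp + fp)

    def recall(tp, fn):
        # fraction of relevant items that we managed to identify
        return tp / float(tp + fn)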

Python Gematria

1. Find out more about sequence objects using Python's help facility. In the interpreter, type help(str), help(list), and help(tuple). This will give you a full list of the functions supported by each type. Some functions have special names flanked with underscores; as the help documentation shows, each such function corresponds to something more familiar. For example, x.__getitem__(y) is just a long-... 2. Identify three operations that can be performed on both tuples and lists. Identify three...

Transitive Verbs

Our next challenge is to deal with sentences containing transitive verbs, such as (46). The output semantics that we want to build is exists x.(dog(x) & chase(angus, x)). Let's look at how we can use λ-abstraction to get this result. A significant constraint on possible solutions is to require that the semantic representation of a dog be independent of whether the NP acts as subject or object of the sentence. In other words, we want to get the formula just shown as our output while sticking to...
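A sketch of the λ-terms involved, using the book-era LogicParser (newer NLTK spells it nltk.sem.Expression.fromstring); the transitive verb is abstracted first over its object NP, then over its subject:

    import nltk

    lp = nltk.LogicParser()
    tvp = lp.parse(r'\X x.X(\y.chase(x, y))')          # 'chase', awaiting an object NP
    np = lp.parse(r'\P.exists x.(dog(x) & P(x))')      # 'a dog'
    vp = nltk.sem.ApplicationExpression(tvp, np).simplify()
    print(vp)    # \x.exists z.(dog(z) & chase(x, z)), up to variable renaming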

Named Entity Recognition

At the start of this chapter, we briefly introduced named entities (NEs). Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. Table 7-3 lists some of the more commonly used types of NEs. These should be self-explanatory, except for FACILITY (human-made artifacts in the domains of architecture and civil engineering) and GPE (geo-political entities such as city, state/province, and country). Table 7-3. Commonly used...
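NLTK bundles a classifier-based NE chunker; a sketch of running it on a tagged sentence (with binary=True, entities are simply labeled NE rather than given a type):

    sent = nltk.corpus.treebank.tagged_sents()[22]   # an arbitrary tagged sentence
    print(nltk.ne_chunk(sent, binary=True))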