Inaugural Address Corpus

In Section 1.1, we looked at the Inaugural Address Corpus, but treated it as a single text. The graph in Figure 1-2 used "word offset" as one of the axes; this is the numerical index of the word in the corpus, counting from the first word of the first address. However, the corpus is actually a collection of 55 texts, one for each presidential address. An interesting property of this collection is its time dimension:

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ...]
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', ...]

Notice that the year of each text appears in its filename. To get the year out of the filename, we extracted the first four characters, using fileid[:4].
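For instance, since each filename begins with the year, we can recover the period spanned by the corpus (a quick check of our own; with the 55-text collection described above, the addresses run from 1789 to 2005):

>>> years = [int(fileid[:4]) for fileid in inaugural.fileids()]
>>> min(years), max(years)
(1789, 2005)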

Let's look at how the words America and citizen are used over time. The following code converts the words in the Inaugural corpus to lowercase using w.lower(), then checks whether they start with either of the "targets" america or citizen using startswith(). Thus it will count words such as American's and Citizens. We'll learn about conditional frequency distributions in Section 2.2; for now, just consider the output, shown in Figure 2-1.

>>> import nltk
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()

Figure 2-1. Plot of a conditional frequency distribution: All words in the Inaugural Address Corpus that begin with america or citizen are counted; separate counts are kept for each address; these are plotted so that trends in usage over time can be observed; counts are not normalized for document length.

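Since the caption notes that the counts are not normalized for document length, here is a minimal sketch of one way to do that: compute, for each address, the fraction of its words that begin with a given target. The helper rel_freq is our own name, not part of NLTK, and the snippet continues the interactive session above:

>>> def rel_freq(target, fileid):
...     """Fraction of words in one address starting with target."""
...     words = [w.lower() for w in inaugural.words(fileid)]
...     return sum(w.startswith(target) for w in words) / len(words)
...
>>> [(fileid[:4], rel_freq('america', fileid))
...  for fileid in inaugural.fileids()]

Plotting these relative frequencies instead of raw counts would make addresses of different lengths directly comparable.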
