Gutenberg Corpus

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at We begin by getting the Python interpreter to load the NLTK package, then ask to see nltk.corpus.gutenberg.fileids(), the file identifiers in this corpus: >>> import nltk

>>> nltk.corpus.gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Let's pick out the first of these texts—Emma by Jane Austen—and give it a short name, emma, then find out how many words it contains:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')


In Section 1.1, we showed how you could carry out concordancing of a text such as text1 with the command text1.concordance(). However, this assumes that you are using one of the nine texts obtained as a result of doing from import *. Now that you have started examining data from nltk.corpus, as in the previous example, you have to employ the following pair of statements to perform concordancing and other tasks from Section 1.1:

>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt')) >>> emma.concordance("surprize")

When we defined emma, we invoked the words() function of the gutenberg object in NLTK's corpus package. But since it is cumbersome to type such long names all the time, Python provides another version of the import statement, as follows:

>>> from nltk.corpus import gutenberg >>> gutenberg.fileids()

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', ...] >>> emma = gutenberg.words('austen-emma.txt')

Let's write a short program to display other information about each text, by looping over all the values of fileid corresponding to the gutenberg file identifiers listed earlier and then computing statistics for each text. For a compact output display, we will make sure that the numbers are all integers, using int().

>>> for fileid in gutenberg.fileids(): ... num_chars = len(gutenberg.raw(fileid)) O ... num_words = len(gutenberg.words(fileid)) ... num_sents = len(gutenberg.sents(fileid))

... num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)])) ... print int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid

4 21 26 austen-emma.txt 4 23 16 austen-persuasion.txt 4 24 22 austen-sense.txt 4 33 79 bible-kjv.txt 4 18 5 blake-poems.txt 4 17 14 bryant-stories.txt 4 17 12 burgess-busterbrown.txt 4 16 12 carroll-alice.txt 4 17 11 chesterton-ball.txt 4 19 11 chesterton-brown.txt 4 16 10 chesterton-thursday.txt 4 18 24 edgeworth-parents.txt 4 24 15 melville-moby_dick.txt 4 52 10 milton-paradise.txt 4 12 8 shakespeare-caesar.txt 4 13 7 shakespeare-hamlet.txt 4 13 6 shakespeare-macbeth.txt 4 35 12 whitman-leaves.txt

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it has a recurrent value of 4. (in fact, the average word length is really 3, not 4, since the num_chars variable counts space characters.) By contrast average sentence length and lexical diversity appear to be characteristics of particular authors.

The previous example also showed how we can access the "raw" text of the book O, not split up into tokens. The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt') tells us how many letters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words:

>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt') >>> macbeth_sentences

[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...] >>> macbeth_sentences[1037]

['Double', ',', 'double', ',', 'toile', 'and', 'trouble', ';', 'Fire', 'burne', ',', 'and', 'Cauldron', 'bubble'] >>> longest_len = max([len(s) for s in macbeth_sentences]) >>> [s for s in macbeth_sentences if len(s) == longest_len]

[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', ...], ...]

Most NLTK corpus readers include a variety of access methods apart from words(), raw(), and sents(). Richer linguistic content is available from some corpora, such as part-of-speech tags, dialogue tags, syntactic trees, and so forth; we will see these in later chapters.

Was this article helpful?

0 -2


  • Freddie
    How to find what are the attributes available in corpus.gutenberg?
    1 year ago
  • johanna
    How to see books in nltk gutenberg?
    19 days ago

Post a comment