Frequency Distributions

How can we automatically identify the words of a text that are most informative about the topic and genre of the text? Imagine how you might go about finding the 50 most frequent words of a book. One method would be to keep a tally for each vocabulary item, like that shown in Figure 1-3. The tally would need thousands of rows, and it would be an exceedingly laborious process—so laborious that we would rather assign the task to a machine.
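To make the manual method concrete, here is a minimal sketch of such a tally using a plain Python dictionary; it assumes text1, the token list for Moby Dick that from nltk.book import * provides (introduced below):

>>> tally = {}
>>> for word in text1:                 # text1 is the token list for Moby Dick
...     tally[word] = tally.get(word, 0) + 1
...
>>> top50 = sorted(tally, key=tally.get, reverse=True)[:50]   # 50 most frequent words

This is exactly the bookkeeping that NLTK's FreqDist, introduced next, does for us.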

[Figure: a tally sheet with one row per word (e.g. the, been, persevere, nation), each followed by its tally marks.]

Figure 1-3. Counting words appearing in a text (a frequency distribution).

The table in Figure 1-3 is known as a frequency distribution, and it tells us the frequency of each vocabulary item in the text. (In general, it could count any kind of observable event.) It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let's use a FreqDist to find the 50 most frequent words of Moby Dick. Try to work out what is going on here, then read the explanation that follows.

>>> fdist1 = FreqDist(text1)       # [1]
>>> fdist1                         # [2]
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys()    # [3]
>>> vocabulary1[:50]               # [4]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906

When we first invoke FreqDist, we pass the name of the text as an argument [1]. We can inspect the total number of words ("outcomes") that have been counted up [2]: 260,819 in the case of Moby Dick. The expression keys() gives us a list of all the distinct types in the text [3], and we can look at the first 50 of these by slicing the list [4].
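Note that in NLTK 3.x, FreqDist is a subclass of collections.Counter, so keys() no longer returns the vocabulary sorted by frequency; most_common() is the portable way to get the top items. A minimal sketch (the counts shown are from one build of the corpus data and may differ slightly):

>>> fdist1.most_common(5)          # (word, count) pairs, most frequent first
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024)]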

Your Turn: Try the preceding frequency distribution example for yourself, for text2. Be careful to use the correct parentheses and uppercase letters. If you get an error message NameError: name 'FreqDist' is not defined, you need to start your work with from nltk.book import *.

Do any words produced in the last example help us grasp the topic or genre of this text? Only one word, whale, is slightly informative! It occurs over 900 times. The rest of the words tell us nothing about the text; they're just English "plumbing." What proportion of the text is taken up with such words? We can generate a cumulative frequency plot for these words, using fdist1.plot(50, cumulative=True), to produce the graph in Figure 1-4. These 50 words account for nearly half the book!
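You can also check this proportion directly, without the plot; a minimal sketch using most_common() and fdist1.N(), the total number of tokens counted:

>>> top50_total = sum(count for _, count in fdist1.most_common(50))
>>> top50_total / fdist1.N() > 0.4     # the 50 most frequent types cover nearly half the text
True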

Figure 1-4. Cumulative frequency plot for the 50 most frequently used words in Moby Dick, which account for nearly half of the tokens.

If the frequent words don't help us, how about the words that occur once only, the so-called hapaxes? View them by typing fdist1.hapaxes(). This list contains lexicographer, cetological, contraband, expostulations, and about 9,000 others. It seems that there are too many rare words, and without seeing the context we probably can't guess what half of the hapaxes mean in any case! Since neither frequent nor infrequent words help, we need to try something else.
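The hapaxes() method belongs to FreqDist; a quick sketch of working with its result:

>>> hapaxes = fdist1.hapaxes()     # every word whose count is exactly 1
>>> 'lexicographer' in hapaxes     # one of the examples named above
True
>>> 'whale' in hapaxes             # frequent words are, by definition, not hapaxes
False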
