Incrementally Updating a Dictionary

We can employ dictionaries to count occurrences, emulating the method for tallying words shown in Figure 1-3. We begin by initializing an empty defaultdict, then process each part-of-speech tag in the text. If the tag hasn't been seen before, it will have a zero count by default. Each time we encounter a tag, we increment its count using the += operator (see Example 5-3).

Example 5-3. Incrementally updating a dictionary, and sorting by value.

>>> counts = nltk.defaultdict(int)
>>> from nltk.corpus import brown
>>> for (word, tag) in brown.tagged_words(categories='news'):
...     counts[tag] += 1
...

>>> list(counts)
['FW', 'DET', 'WH', "''", 'VBZ', 'VB+PPO', "'", ')', 'ADJ', 'PRO', '*', '-', ...]

>>> from operator import itemgetter
>>> sorted(counts.items(), key=itemgetter(1), reverse=True)
[('N', 22226), ('P', 10845), ('DET', 10648), ('NP', 8336), ('V', 7313), ...]
>>> [t for t, c in sorted(counts.items(), key=itemgetter(1), reverse=True)]
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]

The listing in Example 5-3 illustrates an important idiom for sorting a dictionary by its values, here used to show tags in decreasing order of frequency. The first parameter of sorted() is the items to sort, which is a list of tuples, each consisting of a POS tag and a frequency. The second parameter specifies the sort key using the function itemgetter(). In general, itemgetter(n) returns a function that can be called on some other sequence object to obtain the nth element:

>>> pair = ('NP', 8336)
>>> pair[1]
8336
>>> itemgetter(1)(pair)
8336

The last parameter of sorted() specifies that the items should be returned in reverse order, i.e., decreasing values of frequency.
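
The sort-by-value idiom can also be tried without downloading a corpus. The following is a minimal sketch, assuming a small hand-made list of tags (the list `tags` is hypothetical toy data, not Brown Corpus output) and using `collections.defaultdict` from the standard library in place of `nltk.defaultdict`:

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical toy data: a hand-made tag sequence, not drawn from any corpus.
tags = ['DET', 'N', 'V', 'DET', 'N', 'P', 'DET', 'N', 'V', 'N']

counts = defaultdict(int)          # unseen tags start with a count of 0
for tag in tags:
    counts[tag] += 1

# itemgetter(1) picks the count out of each (tag, count) pair,
# and reverse=True puts the largest counts first.
by_frequency = sorted(counts.items(), key=itemgetter(1), reverse=True)
print(by_frequency)    # [('N', 4), ('DET', 3), ('V', 2), ('P', 1)]

# Keep only the tags, most frequent first.
ranked_tags = [t for t, c in by_frequency]
print(ranked_tags)     # ['N', 'DET', 'V', 'P']
```

The counts in the toy data are all distinct, so the resulting order does not depend on how ties are broken.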

There's a second useful programming idiom at the beginning of Example 5-3, where we initialize a defaultdict and then use a for loop to update its values. Here's a schematic version:

>>> my_dictionary = nltk.defaultdict(function to create default value)
>>> for item in sequence:
...     my_dictionary[item_key] is updated with information about item

Here's another instance of this pattern, where we index words according to their last two letters:

>>> last_letters = nltk.defaultdict(list)
>>> words = nltk.corpus.words.words('en')
>>> for word in words:
...     key = word[-2:]
...     last_letters[key].append(word)
...
>>> last_letters['ly']
['abactinally', 'abandonedly', 'abasedly', 'abashedly', 'abashlessly', 'abbreviately', 'abdominally', 'abhorrently', 'abidingly', 'abiogenetically', 'abiologically', ...]
>>> last_letters['zy']
['blazy', 'bleezy', 'blowzy', 'boozy', 'breezy', 'bronzy', 'buzzy', 'Chazy', ...]
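
For readers who want to run the last-two-letters index without the words corpus installed, here is a minimal sketch. It assumes a short hypothetical word list in place of `nltk.corpus.words.words('en')` and uses the standard library's `collections.defaultdict`:

```python
from collections import defaultdict

# Hypothetical toy word list standing in for the NLTK words corpus.
words = ['boozy', 'breezy', 'quickly', 'slowly', 'table', 'cable', 'buzzy']

last_letters = defaultdict(list)   # unseen keys start as an empty list
for word in words:
    key = word[-2:]                # the last two letters of the word
    last_letters[key].append(word)

print(last_letters['zy'])          # ['boozy', 'breezy', 'buzzy']
print(last_letters['ly'])          # ['quickly', 'slowly']
print(sorted(last_letters))        # ['le', 'ly', 'zy']
```

One thing to keep in mind: looking up a missing key with square brackets creates an empty entry for it, so use `in` or `.get()` when you only want to test whether a key is present.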

The following example uses the same pattern to create an anagram dictionary. (You might experiment with the third line to get an idea of why this program works.)

>>> anagrams = nltk.defaultdict(list)
>>> for word in words:
...     key = ''.join(sorted(word))
...     anagrams[key].append(word)
...
>>> anagrams['aeilnrt']
['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']
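
To see why the key works, it may help to look at what sorting the letters does to a few words. In this small sketch the helper name `anagram_key` is our own, introduced only for illustration:

```python
def anagram_key(word):
    """Return the word's letters in sorted order, joined into one string."""
    return ''.join(sorted(word))

# sorted() on a string gives a list of its characters in order.
print(sorted('latrine'))   # ['a', 'e', 'i', 'l', 'n', 'r', 't']

# Words that are anagrams of each other all collapse to the same key.
keys = [anagram_key(w) for w in ['entrail', 'latrine', 'retinal']]
print(keys)                # ['aeilnrt', 'aeilnrt', 'aeilnrt']
```

Because every anagram of a word contains exactly the same letters, they all share one key and so end up in the same list.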

Since accumulating words like this is such a common task, NLTK provides a more convenient way of creating a defaultdict(list), in the form of nltk.Index():

>>> anagrams = nltk.Index((''.join(sorted(w)), w) for w in words)
>>> anagrams['aeilnrt']
['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']

nltk.Index is a defaultdict(list) with extra support for initialization. Similarly, nltk.FreqDist is essentially a defaultdict(int) with extra support for initialization (along with sorting and plotting methods).
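
To make "extra support for initialization" concrete, here is a rough sketch of what such a class could look like. The class `SimpleIndex` is a hypothetical stand-in written for this illustration and should not be read as NLTK's actual source; the word list is toy data. For the counting case, the standard library's `collections.Counter` plays a comparable role to a `defaultdict(int)` that can be initialized from a sequence and sorted by count:

```python
from collections import Counter, defaultdict

class SimpleIndex(defaultdict):
    """Hypothetical stand-in for an index: a defaultdict(list)
    that is filled from (key, value) pairs when it is created."""
    def __init__(self, pairs):
        super().__init__(list)
        for key, value in pairs:
            self[key].append(value)

# Toy data, not the NLTK words corpus.
words = ['latrine', 'retinal', 'silent', 'listen', 'dog']
anagrams = SimpleIndex((''.join(sorted(w)), w) for w in words)
print(anagrams['aeilnrt'])        # ['latrine', 'retinal']
print(anagrams['eilnst'])         # ['silent', 'listen']

# Counter: counts built at initialization, with a built-in sort by count.
tag_counts = Counter(['N', 'DET', 'N', 'V', 'N', 'DET'])
print(tag_counts.most_common())   # [('N', 3), ('DET', 2), ('V', 1)]
```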
