Default Dictionaries

If we try to access a key that is not in a dictionary, we get an error. However, it's often useful if a dictionary can automatically create an entry for this new key and give it a default value, such as zero or the empty list. Since Python 2.5, a special kind of dictionary called a defaultdict has been available. (It is provided as nltk.defaultdict for the benefit of readers who are using Python 2.4.) In order to use it, we have to supply a parameter which can be used to create the default value, e.g., int, float, str, list, dict, tuple.

>>> frequency = nltk.defaultdict(int)
>>> frequency['colorless'] = 4
>>> frequency['ideas']
0

>>> pos = nltk.defaultdict(list)
>>> pos['sleep'] = ['N', 'V']

These default values are actually functions that convert other objects to the specified type (e.g., int("2"), list("2")). When they are called with no parameter—say, int(), list()—they return 0 and [] respectively.
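We can verify these zero-argument calls directly (a minimal sketch using only built-in types):

```python
# Each type called with no arguments yields its "empty" value,
# which is what defaultdict supplies for a missing key.
print(int())    # 0
print(float())  # 0.0
print(str())    # ''
print(list())   # []
```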

The preceding examples specified the default value of a dictionary entry to be the default value of a particular data type. However, we can specify any default value we like, simply by providing the name of a function that can be called with no arguments to create the required value. Let's return to our part-of-speech example, and create a dictionary whose default value for any entry is 'N'. When we access a non-existent entry, it is automatically added to the dictionary.

>>> pos = nltk.defaultdict(lambda: 'N')
>>> pos['colorless'] = 'ADJ'
>>> pos['blog']
'N'
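Inspecting the dictionary's items confirms that merely looking up 'blog' inserted it with the default value (a sketch using the standard library's defaultdict, for which nltk.defaultdict is a stand-in):

```python
from collections import defaultdict

pos = defaultdict(lambda: 'N')
pos['colorless'] = 'ADJ'
pos['blog']                 # looking up a missing key...
print(sorted(pos.items()))  # ...inserts it with the default value
# [('blog', 'N'), ('colorless', 'ADJ')]
```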

This example used a lambda expression, introduced in Section 4.4. This lambda expression specifies no parameters, so we call it using parentheses with no arguments. Thus, the following definitions of f and g are equivalent:
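For instance (a minimal sketch; the default value 'N' follows the running part-of-speech example):

```python
# A zero-argument lambda and an ordinary function definition are
# interchangeable ways to supply a default factory.
f = lambda: 'N'

def g():
    return 'N'

print(f())  # N
print(g())  # N
```

Either f or g could be passed to defaultdict; what matters is only that the function returns the desired value when called with no arguments.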

Let's see how default dictionaries could be used in a more substantial language processing task. Many language processing tasks—including tagging—struggle to correctly process the hapaxes of a text. They can perform better with a fixed vocabulary and a guarantee that no new words will appear. We can preprocess a text to replace low-frequency words with a special "out of vocabulary" token, UNK, with the help of a default dictionary. (Can you work out how to do this without reading on?)

We need to create a default dictionary that maps each word to its replacement. The most frequent n words will be mapped to themselves. Everything else will be mapped to UNK.

>>> alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> vocab = nltk.FreqDist(alice)
>>> v1000 = [word for (word, _) in vocab.most_common(1000)]
>>> mapping = nltk.defaultdict(lambda: 'UNK')
>>> for v in v1000:
...     mapping[v] = v
...
>>> alice2 = [mapping[v] for v in alice]
>>> alice2[:100]

['UNK', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'UNK', 'UNK',
'UNK', 'UNK', 'CHAPTER', 'I', '.', 'UNK', 'the', 'Rabbit', '-', 'UNK', 'Alice',
'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her',
'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do',
':', 'once', 'or', 'twice', 'she', 'had', 'UNK', 'into', 'the', 'book', 'her',
'sister', 'was', 'UNK', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'UNK',
'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'",
'thought', 'Alice', "'", 'without', 'pictures', 'or', 'conversation', "?'", ...]
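The same idea can be sketched without NLTK, using the standard library's Counter and defaultdict (a toy word list stands in for the Alice corpus, and n is kept small for illustration):

```python
from collections import Counter, defaultdict

words = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'dog', 'ran']

# Keep only the n most frequent words; everything else becomes 'UNK'.
n = 3
vocab = [w for w, _ in Counter(words).most_common(n)]

mapping = defaultdict(lambda: 'UNK')
for w in vocab:
    mapping[w] = w          # frequent words map to themselves

words2 = [mapping[w] for w in words]
print(words2)
```

Only 'the' occurs more than once here, so it is certainly kept; the other two vocabulary slots are filled by ties in first-encountered order, as documented for Counter.most_common.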
