Pronouncing Dictionary

A slightly richer kind of lexical resource is a table (or spreadsheet), containing a word plus some properties in each row. NLTK includes the CMU Pronouncing Dictionary for U.S. English, which was designed for use by speech synthesizers.

>>> entries = nltk.corpus.cmudict.entries()
>>> len(entries)
127012
>>> for entry in entries[39943:39951]:
...     print entry
...
('fir', ['F', 'ER1'])
('fire', ['F', 'AY1', 'ER0'])
('fire', ['F', 'AY1', 'R'])
('firearm', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M'])
('firearm', ['F', 'AY1', 'R', 'AA2', 'R', 'M'])
('firearms', ['F', 'AY1', 'ER0', 'AA2', 'R', 'M', 'Z'])
('firearms', ['F', 'AY1', 'R', 'AA2', 'R', 'M', 'Z'])
('fireball', ['F', 'AY1', 'ER0', 'B', 'AO2', 'L'])

For each word, this lexicon provides a list of phonetic codes—distinct labels for each contrastive sound—known as phones. Observe that fire has two pronunciations (in U.S. English): the one-syllable F AY1 R, and the two-syllable F AY1 ER0. The symbols in the CMU Pronouncing Dictionary are from the Arpabet, described in more detail at http://en.wikipedia.org/wiki/Arpabet.
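To get a feel for the phone inventory itself, we can strip the stress digits and collect the distinct symbols. The following is a quick sketch (the variable name phoneset is our own); CMUdict uses a 39-phone subset of the Arpabet:

>>> phoneset = set(ph.rstrip('012') for word, pron in entries for ph in pron)
>>> sorted(phoneset)[:10]
['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'B', 'CH', 'D', 'DH']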

Each entry consists of two parts, and we can process these individually using a more complex version of the for statement. Instead of writing for entry in entries:, we replace entry with two variable names, word and pron. Now, each time through the loop, word is assigned the first part of the entry, and pron is assigned the second part of the entry:

>>> for word, pron in entries:
...     if len(pron) == 3:
...         ph1, ph2, ph3 = pron
...         if ph1 == 'P' and ph3 == 'T':
...             print word, ph2,
...
pait EY1 pat AE1 pate EY1 patt AE1 peart ER1 peat IY1 peet IY1 peete IY1
pert ER1 pet EH1 pete IY1 pett EH1 piet IY1 piette IY1 pit IH1 pitt IH1
pot AA1 pote OW1 pott AA1 pout AW1 puett UW1 purt ER1 put UH1 putt AH1

The program just shown scans the lexicon looking for entries whose pronunciation consists of three phones. If the condition is true, it assigns the contents of pron to three new variables: ph1, ph2, and ph3. Notice the unusual form of the statement that does that work.
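The statement ph1, ph2, ph3 = pron is an example of Python's sequence unpacking: a sequence of three items is assigned to three variables in one step. Here is a small illustration with made-up values:

>>> ph1, ph2, ph3 = ['P', 'UH1', 'T']
>>> ph2
'UH1'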

Here's another example of the same for statement, this time used inside a list comprehension. This program finds all words whose pronunciation ends with a syllable sounding like nicks. You could use this method to find rhyming words.

>>> syllable = ['N', 'IH0', 'K', 'S']
>>> [word for word, pron in entries if pron[-4:] == syllable]

["atlantic's", 'audiotronics', 'avionics', 'beatniks', 'calisthenics', 'centronics', 'chetniks', "clinic's", 'clinics', 'conics', 'cynics', 'diasonics', "dominic's", 'ebonics', 'electronics', "electronics'", 'endotronics', "endotronics'", 'enix', ...]

Notice that the one pronunciation is spelled in several ways: nics, niks, nix, and even ntic's with a silent t, for the word atlantic's. Let's look for some other mismatches between pronunciation and writing. Can you summarize the purpose of the following examples and explain how they work?

>>> [w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']
['autumn', 'column', 'condemn', 'damn', 'goddamn', 'hymn', 'solemn']
>>> sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))
['gn', 'kn', 'mn', 'pn']
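Returning to the idea of finding rhymes, we could package the earlier pattern as a small helper function. This is only a sketch (rhymes() and its parameter n are our own, not part of NLTK): it collects the final n phones of each pronunciation of a word, then returns the other words whose pronunciations end the same way.

>>> def rhymes(word, n=3):
...     tails = [pron[-n:] for w, pron in entries if w == word]
...     return sorted(set(w for w, pron in entries
...                       if pron[-n:] in tails and w != word))

For example, rhymes('fire', 2) would gather words whose pronunciations share the last two phones of either pronunciation of fire.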

The phones contain digits to represent primary stress (1), secondary stress (2), and no stress (0). As our final example, we define a function to extract the stress digits and then scan our lexicon to find words having a particular stress pattern.

>>> def stress(pron):
...     return [char for phone in pron for char in phone if char.isdigit()]
>>> [w for w, pron in entries if stress(pron) == ['0', '1', '0', '2', '0']]
['abbreviated', 'abbreviating', 'accelerated', 'accelerating', 'accelerator',
'accentuated', 'accentuating', 'accommodated', 'accommodating', 'accommodative',
'accumulated', 'accumulating', 'accumulative', 'accumulator', 'accumulators', ...]
>>> [w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']]
['abbreviation', 'abbreviations', 'abomination', 'abortifacient', 'abortifacients',
'academicians', 'accommodation', 'accommodations', 'accreditation', 'accreditations',
'accumulation', 'accumulations', 'acetylcholine', 'acetylcholine', 'adjudication', ...]

A subtlety of this program is that our user-defined function stress() is invoked inside the condition of a list comprehension. There is also a doubly nested for loop. There's a lot going on here, and you might want to return to this once you've had more experience using list comprehensions.
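To see what that doubly nested loop is doing, here is an equivalent of stress() written with explicit for statements (the name stress_loops is our own):

>>> def stress_loops(pron):
...     digits = []
...     for phone in pron:          # outer loop of the comprehension
...         for char in phone:      # inner loop
...             if char.isdigit():  # the comprehension's condition
...                 digits.append(char)
...     return digits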

We can use a conditional frequency distribution to help us find minimally contrasting sets of words. Here we find all the p words consisting of three sounds, and group them according to their first and last sounds.

>>> p3 = [(pron[0]+'-'+pron[2], word)
...       for (word, pron) in entries
...       if pron[0] == 'P' and len(pron) == 3]
>>> cfd = nltk.ConditionalFreqDist(p3)
>>> for template in cfd.conditions():
...     if len(cfd[template]) > 10:
...         words = cfd[template].keys()
...         wordlist = ' '.join(words)
...         print template, wordlist[:70] + "..."
...
P-CH perch puche poche peach petsche poach pietsch putsch pautsch piche pet...
P-K pik peek pic pique paque polk perc poke perk pac pock poch purk pak pa...
P-L pil poehl pille pehl pol pall pohl pahl paul perl pale paille perle po...
P-N paine payne pon pain pin pawn pinn pun pine paign pen pyne pane penn p...
P-P pap paap pipp paup pape pup pep poop pop pipe paape popp pip peep pope...
P-R paar poor par poore pear pare pour peer pore parr por pair porr pier...
P-S pearse piece posts pasts peace perce pos pers pace puss pesce pass pur...
P-T pot puett pit pete putt pat purt pet peart pott pett pait pert pote pa...
P-Z pays p.s pao's pais paws p.'s pas pez paz pei's pose poise peas paiz p...
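We can also inspect one of these contrast sets directly by indexing the conditional frequency distribution with a template. Given the three-phone p...t words listed earlier, we would expect something like:

>>> sorted(cfd['P-T'])[:8]
['pait', 'pat', 'pate', 'patt', 'peart', 'peat', 'peet', 'peete']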

Rather than iterating over the whole dictionary, we can also access it by looking up particular words. We will use Python's dictionary data structure, which we will study systematically in Section 5.3. We look up a dictionary by specifying its name, followed by a key (such as the word 'fire') inside square brackets.

>>> prondict = nltk.corpus.cmudict.dict()
>>> prondict['fire']
[['F', 'AY1', 'ER0'], ['F', 'AY1', 'R']]
>>> prondict['blog']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'blog'
>>> prondict['blog'] = [['B', 'L', 'AA1', 'G']]
>>> prondict['blog']
[['B', 'L', 'AA1', 'G']]

If we try to look up a non-existent key, we get a KeyError. This is similar to what happens when we index a list with an integer that is too large, producing an IndexError. The word blog is missing from the pronouncing dictionary, so we tweak our version by assigning a value for this key (this has no effect on the NLTK corpus; next time we access it, blog will still be absent).
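When processing arbitrary text we often want to avoid the KeyError altogether. Python's standard dict.get() method returns a default value for missing keys; a minimal sketch (lookup() is our own helper, not part of NLTK):

>>> def lookup(word):
...     # Return the list of pronunciations, or [] if the word is absent.
...     return prondict.get(word, [])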

We can use any lexical resource to process a text, e.g., to filter out words having some lexical property (like nouns), or to map every word of the text. For example, the following text-to-speech function looks up each word of the text in the pronouncing dictionary:

>>> text = ['natural', 'language', 'processing']
>>> [ph for w in text for ph in prondict[w][0]]
['N', 'AE1', 'CH', 'ER0', 'AH0', 'L', 'L', 'AE1', 'NG', 'G', 'W', 'AH0', 'JH',
'P', 'R', 'AA1', 'S', 'EH0', 'S', 'IH0', 'NG']
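If the text contained a word missing from the dictionary, the expression above would raise a KeyError. One way to make it more robust, sketched here with our own function name phonemes(), is simply to skip out-of-vocabulary words:

>>> def phonemes(text):
...     # Use the first listed pronunciation; skip unknown words.
...     return [ph for w in text if w in prondict for ph in prondict[w][0]]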
