Wordlist Corpora

NLTK includes some corpora that are nothing more than wordlists. The Words Corpus is the /usr/dict/words file from Unix, used by some spellcheckers. We can use it to find unusual or misspelled words in a text corpus, as shown in Example 2-3.

Example 2-3. Filtering a text: This program computes the vocabulary of a text, then removes all items that occur in an existing wordlist, leaving just the uncommon or misspelled words.

def unusual_words(text):

text_vocab = set(w.lower() for w in text if w.isalpha()) english_vocab = set(w.lower() for w in nltk.corpus.words.words()) unusual = text_vocab.difference(english_vocab) return sorted(unusual)

>>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')) ['abbeyland', 'abhorrence', 'abominably', 'abridgement', 'accordant', 'accustomary', 'adieus', 'affability', 'affectedly', 'aggrandizement', 'alighted', 'allenham', 'amiably', 'annamaria', 'annuities', 'apologising', 'arbour', 'archness', ...] >>> unusual_words(nltk.corpus.nps_chat.words())

['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abou', 'abourted', 'abs', 'ack', 'acros', 'actualy', 'adduser', 'addy', 'adoted', 'adreniline', 'ae', 'afe', 'affari', 'afk', 'agaibn', 'agurlwithbigguns', 'ahah', 'ahahah', 'ahahh', 'ahahha', 'ahem', 'ahh', ...]

There is also a corpus of stopwords, that is, high-frequency words such as the, to, and also that we sometimes want to filter out of a document before further processing. Stopwords usually have little lexical content, and their presence in a text fails to distinguish it from other texts.

>>> from nltk.corpus import stopwords >>> stopwords.words('english')

['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across',

'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', ...]

Let's define a function to compute what fraction of words in a text are not in the stop-words list:

>>> def content_fraction(text):

... stopwords = nltk.corpus.stopwords.words('english')

... content = [w for w in text if w.lower() not in stopwords]

>>> content_fraction(nltk.corpus.reuters.words()) 0.65997695393285261

Thus, with the help of stopwords, we filter out a third of the words of the text. Notice that we've combined two different kinds of corpus here, using a lexical resource to filter the content of a text corpus.

How many words of four letters or more can you make from those shown here? Each letter may be used once per word. Each word must contain the center letter and there must be at least one nine-letter word. No plurals ending in "s"; no foreign words; no pr oper names. 21 words, good; 32 words, very good; 42 words, excellent.

Figure 2-6. A word puzzle: A grid of randomly chosen letters with rules for creating words out of the letters; this puzzle is known as "Target."

A wordlist is useful for solving word puzzles, such as the one in Figure 2-6. Our program iterates through every word and, for each one, checks whether it meets the conditions. It is easy to check obligatory letter © and length O constraints (and we'll only look for words with six or more letters here). It is trickier to check that candidate solutions only use combinations of the supplied letters, especially since some of the supplied letters appear twice (here, the letter v). The FreqDist comparison method © permits us to check that the frequency of each letter in the candidate word is less than or equal to the frequency of the corresponding letter in the puzzle.

>>> puzzle_letters = nltk.FreqDist('egivrvonl') >>> obligatory = 'r'

>>> wordlist = nltk.corpus.words.words() >>> [w for w in wordlist if len(w) >= 6 ... and obligatory in w

['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 'revolving', 'ringle', 'roving', 'violer', 'virole']

One more wordlist corpus is the Names Corpus, containing 8,000 first names categorized by gender. The male and female names are stored in separate files. Let's find names that appear in both files, i.e., names that are ambiguous for gender:

>>> female_names = names.words('female.txt')

>>> [w for w in male_names if w in female_names]

['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel', 'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]

It is well known that names ending in the letter a are almost always female. We can see this and some other patterns in the graph in Figure 2-7, produced by the following code.

Remember that name[-1] is the last letter of name.

>>> cfd = nltk.ConditionalFreqDist( ... (fileid, name[-1])

1800

1600

1400

1200

w 1000 n o

(OpQUTJO)''-! Ol ,C —>-ii — gco > £ x p^ «

Samples

Figure 2-7. Conditional frequency distribution: This plot shows the number of female and male names ending with each letter of the alphabet; most names ending with a, e, or i are female; names ending in h and l are equally likely to be male or female; names ending in k, o, r, s, and t are likely to be male.

(OpQUTJO)''-! Ol ,C —>-ii — gco > £ x p^ «

Samples

Figure 2-7. Conditional frequency distribution: This plot shows the number of female and male names ending with each letter of the alphabet; most names ending with a, e, or i are female; names ending in h and l are equally likely to be male or female; names ending in k, o, r, s, and t are likely to be male.

Was this article helpful?

0 0

Post a comment