Gender Identification

In Section 2.4, we saw that male and female names have some distinctive characteristics. Names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and t are likely to be male. Let's build a classifier to model these differences more precisely.

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name:

>>> def gender_features(word):
...     return {'last_letter': word[-1]}
>>> gender_features('Shrek')
{'last_letter': 'k'}

The dictionary that is returned by this function is called a feature set and maps from features' names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature. Feature values are values with simple types, such as Booleans, numbers, and strings.

Note: Most classification methods require that features be encoded using simple value types, such as Booleans, numbers, and strings. But note that just because a feature has a simple type, this does not necessarily mean that the feature's value is simple to express or compute; indeed, it is even possible to use very complex and informative values, such as the output of a second supervised classifier, as features.

Now that we've defined a feature extractor, we need to prepare a list of examples and corresponding class labels:

>>> from nltk.corpus import names
>>> import random
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)

Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new "naive Bayes" classifier.

>>> import nltk
>>> featuresets = [(gender_features(n), g) for (n, g) in names]
>>> train_set, test_set = featuresets[500:], featuresets[:500]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
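Because the names are shuffled randomly, the exact split (and therefore any accuracy figure) changes from run to run. The following is a minimal sketch of the same shuffle-and-split step with a fixed seed, so that the split is repeatable. The toy `labeled_names` list and the `held_out` size are hypothetical stand-ins for the names corpus and the 500 held-out items used in the text.

```python
import random

def gender_features(word):
    """Feature extractor from the text: the final letter of the name."""
    return {'last_letter': word[-1]}

# Hypothetical toy data standing in for the names corpus.
labeled_names = [('Anna', 'female'), ('Mark', 'male'), ('Maria', 'female'),
                 ('Peter', 'male'), ('Julie', 'female'), ('Hugo', 'male')]

rng = random.Random(42)   # fixed seed, so the shuffle is repeatable
rng.shuffle(labeled_names)

featuresets = [(gender_features(n), g) for (n, g) in labeled_names]

held_out = 2              # the text holds out 500; 2 suits the toy data
test_set, train_set = featuresets[:held_out], featuresets[held_out:]
```

The first `held_out` items become the test set and the remainder the training set, so no item appears in both.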

We will learn more about the naive Bayes classifier later in the chapter. For now, let's just test it out on some names that did not appear in its training data:

>>> classifier.classify(gender_features('Neo'))
'male'
>>> classifier.classify(gender_features('Trinity'))
'female'

Observe that these character names from The Matrix are correctly classified. Although this science fiction movie is set in 2199, it still conforms with our expectations about names and genders. We can systematically evaluate the classifier on a much larger quantity of unseen data:

>>> print(nltk.classify.accuracy(classifier, test_set))
0.758
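The accuracy reported here is simply the fraction of test items whose predicted label matches the correct label. As a rough sketch of that calculation, the code below computes it by hand. The `accuracy` helper, the rule-based `toy_classify` function, and the four-item `toy_test_set` are all hypothetical illustrations, not part of NLTK.

```python
def accuracy(classify, gold):
    """Fraction of (featureset, label) pairs that classify() labels correctly."""
    correct = sum(1 for (features, label) in gold
                  if classify(features) == label)
    return correct / len(gold)

# Hypothetical stand-in classifier: guess 'female' for names ending in a, e, or i.
def toy_classify(features):
    return 'female' if features['last_letter'] in 'aei' else 'male'

# Hypothetical labeled test items, already converted to feature sets.
toy_test_set = [({'last_letter': 'a'}, 'female'),
                ({'last_letter': 'k'}, 'male'),
                ({'last_letter': 'e'}, 'male'),
                ({'last_letter': 'y'}, 'female')]

print(accuracy(toy_classify, toy_test_set))  # 2 of 4 correct: prints 0.5
```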

Finally, we can examine the classifier to determine which features it found most effective for distinguishing the names' genders:

>>> classifier.show_most_informative_features(5)
Most Informative Features
    last_letter = 'a'    female : male = 38.3 : 1.0

This listing shows that the names in the training set that end in a are female 38 times more often than they are male, but names that end in k are male 31 times more often than they are female. These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.
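To make the idea of a likelihood ratio concrete, here is a minimal sketch that computes one from counts. It assumes plain, unsmoothed relative frequencies and a hypothetical toy training set; NLTK's classifier smooths its estimates, so its reported figures will not match a calculation like this exactly.

```python
from collections import Counter

# Hypothetical toy training data: (last letter, gender) pairs.
toy_train = ([('a', 'female')] * 30 + [('e', 'female')] * 10 +
             [('a', 'male')] * 2 + [('k', 'male')] * 18)

label_counts = Counter(label for (_, label) in toy_train)
pair_counts = Counter(toy_train)

def likelihood(letter, label):
    """Unsmoothed estimate of P(last_letter = letter | label)."""
    return pair_counts[(letter, label)] / label_counts[label]

# How much more likely is a final 'a' among female names than among male names?
ratio = likelihood('a', 'female') / likelihood('a', 'male')
print(round(ratio, 1))
```

In this toy data, 30 of the 40 female names and 2 of the 20 male names end in a, so the ratio is 0.75 to 0.1, or roughly 7.5 to 1.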

Your Turn: Modify the gender_features() function to provide the classifier with features encoding the length of the name, its first letter, and any other features that seem like they might be informative. Retrain the classifier with these new features, and test its accuracy.
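As one possible starting point for this exercise, the sketch below adds the name's length, its first letter, and a vowel-ending flag alongside the final letter. The function name `gender_features_extended` and the `ends_in_vowel` feature are our own illustrative choices; whether any of these features actually improves accuracy is something to check by retraining and evaluating.

```python
def gender_features_extended(name):
    """Hypothetical extended extractor: several simple-typed features per name."""
    return {
        'last_letter': name[-1].lower(),              # string feature
        'first_letter': name[0].lower(),              # string feature
        'length': len(name),                          # numeric feature
        'ends_in_vowel': name[-1].lower() in 'aeiou', # Boolean feature
    }

print(gender_features_extended('Shrek'))
```

Each value is still a simple type (a string, a number, or a Boolean), so the resulting feature sets can be passed to the classifier in the same way as before.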

When working with large corpora, constructing a single list that contains the features of every instance can use up a large amount of memory. In these cases, use the function nltk.classify.apply_features, which returns an object that acts like a list but does not store all the feature sets in memory:

>>> from nltk.classify import apply_features
>>> train_set = apply_features(gender_features, names[500:])
>>> test_set = apply_features(gender_features, names[:500])
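To illustrate the general idea of an object that "acts like a list" without storing every feature set, here is a minimal sketch of a lazy wrapper. The `LazyFeatureList` class is a hypothetical illustration of the technique and is not how NLTK implements apply_features; it simply computes each feature set when that item is requested.

```python
def gender_features(word):
    """Feature extractor from the text: the final letter of the name."""
    return {'last_letter': word[-1]}

class LazyFeatureList:
    """List-like view that computes (featureset, label) pairs on demand."""

    def __init__(self, extractor, labeled_items):
        self._extractor = extractor
        self._items = labeled_items   # only the raw (item, label) pairs are stored

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns another lazy view rather than computing features.
            return LazyFeatureList(self._extractor, self._items[index])
        item, label = self._items[index]   # raises IndexError past the end
        return (self._extractor(item), label)

lazy = LazyFeatureList(gender_features, [('Neo', 'male'), ('Trinity', 'female')])
print(len(lazy))
print(lazy[0])
```

Because indexing past the end raises IndexError, the object can also be iterated with an ordinary for loop, with each feature set built only when the loop reaches it.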
