Semantic Similarity

We have seen that synsets are linked by a complex network of lexical relations. Given a particular synset, we can traverse the WordNet network to find synsets with related meanings. Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term such as vehicle will match documents containing specific terms such as limousine.

Recall that each synset has one or more hypernym paths that link it to a root hypernym such as entity.n.01. Two synsets linked to the same root may have several hypernyms in common (see Figure 2-8). If two synsets share a very specific hypernym—one that is low down in the hypernym hierarchy—they must be closely related.

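>>> right = wn.synset('right_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.lowest_common_hypernyms(minke)
[Synset('baleen_whale.n.01')]
>>> right.lowest_common_hypernyms(orca)
[Synset('whale.n.02')]
>>> right.lowest_common_hypernyms(tortoise)
[Synset('vertebrate.n.01')]
>>> right.lowest_common_hypernyms(novel)
[Synset('entity.n.01')]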

Of course we know that whale is very specific (and baleen whale even more so), whereas vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:

>>> wn.synset('baleen_whale.n.01').min_depth()
14
>>> wn.synset('whale.n.02').min_depth()
13
>>> wn.synset('vertebrate.n.01').min_depth()
8
>>> wn.synset('entity.n.01').min_depth()
0
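To make the link between depth and specificity concrete, here is a minimal sketch that runs on a hand-built toy hierarchy rather than on WordNet itself. The dictionary TOY_HYPERNYMS, its concept names, and the helper functions are all assumptions made for illustration. Each toy concept has exactly one parent, whereas a WordNet synset can have more than one hypernym path, so the depths below will not match the WordNet values.

```python
# Toy hypernym tree: each concept maps to its single parent (None marks the root).
# The names and the tree shape are illustrative assumptions, not WordNet data.
TOY_HYPERNYMS = {
    "entity": None,
    "animal": "entity",
    "vertebrate": "animal",
    "whale": "vertebrate",
    "baleen_whale": "whale",
    "right_whale": "baleen_whale",
    "minke_whale": "baleen_whale",
    "orca": "whale",
    "tortoise": "vertebrate",
    "artifact": "entity",
    "novel": "artifact",
}


def path_to_root(concept):
    """Return the chain [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = TOY_HYPERNYMS[concept]
    return path


def depth(concept):
    """Number of hypernym links between the concept and the root."""
    return len(path_to_root(concept)) - 1


def lowest_common_hypernym(a, b):
    """Return the deepest concept that lies above (or equals) both a and b."""
    above_a = set(path_to_root(a))
    # Walking upward from b, the first node also above a is the deepest shared one.
    for node in path_to_root(b):
        if node in above_a:
            return node
    return None


for other in ["minke_whale", "orca", "tortoise", "novel"]:
    shared = lowest_common_hypernym("right_whale", other)
    print(other, "->", shared, "at depth", depth(shared))
```

In this toy tree the shared hypernym moves from depth 4 (baleen_whale) up to depth 0 (entity) as the second concept gets further from right_whale, mirroring the pattern in the WordNet session.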

Several similarity measures defined over the collection of WordNet synsets incorporate this insight. For example, path_similarity assigns a score in the range 0-1 based on the shortest path that connects the concepts in the hypernym hierarchy (older NLTK releases return -1 when no path can be found; recent releases return None instead). Comparing a synset with itself will return 1. Consider the following similarity scores, relating right whale to minke whale, orca, tortoise, and novel. Although the absolute numbers won't mean much on their own, they decrease as we move away from the semantic space of sea creatures to inanimate objects.

>>> right.path_similarity(minke)
0.25
>>> right.path_similarity(orca)
0.16666666666666666
>>> right.path_similarity(tortoise)
0.076923076923076927
>>> right.path_similarity(novel)
0.043478260869565216
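The scores above are consistent with the formula 1 / (d + 1), where d is the number of links on the shortest path between the two synsets: 0.25 = 1/4, 0.1666... = 1/6, 0.0769... = 1/13, and 0.0434... = 1/23. The following minimal sketch applies that formula to a hand-built toy hierarchy. TOY_HYPERNYMS and the function names are assumptions made for illustration, and the toy tree is far shallower than WordNet, so its scores differ from the real ones.

```python
# Toy hypernym tree: each concept maps to its single parent (None marks the root).
# The names and the tree shape are illustrative assumptions, not WordNet data.
TOY_HYPERNYMS = {
    "entity": None,
    "animal": "entity",
    "vertebrate": "animal",
    "whale": "vertebrate",
    "baleen_whale": "whale",
    "right_whale": "baleen_whale",
    "minke_whale": "baleen_whale",
    "orca": "whale",
    "tortoise": "vertebrate",
    "artifact": "entity",
    "novel": "artifact",
}


def ancestors_with_distance(concept):
    """Map each node on the path to the root to its distance from `concept`."""
    distances = {}
    steps = 0
    while concept is not None:
        distances[concept] = steps
        concept = TOY_HYPERNYMS[concept]
        steps += 1
    return distances


def shortest_path_length(a, b):
    """Fewest hypernym links joining a and b, or None if they share no ancestor."""
    up_from_a = ancestors_with_distance(a)
    up_from_b = ancestors_with_distance(b)
    shared = set(up_from_a) & set(up_from_b)
    if not shared:
        return None
    return min(up_from_a[node] + up_from_b[node] for node in shared)


def toy_path_similarity(a, b):
    """Score in (0, 1]: 1 for identical concepts, smaller for longer paths."""
    d = shortest_path_length(a, b)
    if d is None:
        return None
    return 1.0 / (d + 1)


for other in ["right_whale", "minke_whale", "orca", "tortoise", "novel"]:
    print(other, toy_path_similarity("right_whale", other))
```

As with the WordNet scores, comparing a concept with itself gives 1, and the score shrinks as the path to the other concept grows longer.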

Several other similarity measures are available; you can type help(wn) for more information. NLTK also includes VerbNet, a hierarchical verb lexicon linked to WordNet. It can be accessed with nltk.corpus.verbnet.
