Getting Close Enough with difflib

People often make mistakes when entering text. The difflib module gives you a way to find strings that are close but not exact matches to a given string:

>>> import difflib

>>> right = 'The quick brown fox' >>> wrong = 'THe quack brown fix'

>>> matcher = difflib.SequenceMatcher(None, right, wrong)

0.842105263158

The ratio() method returns a floating point number between 0 and 1 that indicates how close the match is. Higher numbers indicate a closer match. A match higher than 0.6 is usually considered "good" (maybe not by medieval manuscript-illuminating monks, but good enough for the modern world).

To compare a single word against a list of words, use the difflib module's get_close_matches() method. It finds the words with the highest match ratio:

>>> import difflib

>>> mylist = ['quick', 'quack'

, 'quark

', 'quart']

>>> difflib.get close matches(

'quote',

mylist)

['quart'] _ _

DSU stands for "decorate, sort, undecorate," and it's a clever trick for sorting sequence objects by one of their elements.

TECHNICAL This use of the word decorate doesn't have anything to do with the STUFF decorator feature discussed in Chapter 16. Python has just gotten so big that it's having to recycle the words it uses to describe tools!

Suppose you have a list of address tuples, like this:

addresses = [

('123

Main St',

'Anytown', '12345'),

('123

Second St'

, 'Othertown', '54321'),

('456

Second St'

, 'Nowhereville', '11111'),

]

('777

Morris St'

, 'Filenes Basement', '99999')

Now suppose you want to sort them by zip code. The sort() method of lists lets you pass a comparison function, but we don't recommend that because it's clumsy and slow. A better way is to create a new list that is decorated with the zip field, so that each tuple becomes a 2-tuple where the first element is the zip code and the second element is the old tuple:

After you have the list of 2-tuples, you can sort the list and then undecorate it (remove the extra zip field).

These two functions do DSU on such a list. The first one uses a list comprehension to create the tmp list and the return value. (See Chapter 16 for more on list comprehensions.)

def dsu zip(addresses):

Here's what happens when you run the function:

[('456 Second St', 'Nowhereville', '11111'), ('123 Main St', 'Anytown', '12345'), ('123 Second St', 'Othertown', '54321'), ('777 Morris St', 'Filenes Basement', '99999')]

Tip There's another way to perform a DSU sort. Python 2.4 adds a new key parameter to the sort() method and the sorted() built-in function. Use this with the itemgetter() class from the operator module for a simpler DSU-type solution:

>>> import operator

>>> third item = operator.itemgetter(2) # get 3rd item from tuple >>> addresses.sort(key=third item) >>> addresses

[('456 Second St', 'Nowhereville', '11111'), ('123 Main St', 'Anytown', '12345'), ('123 Second St', 'Othertown', '54321'), ('777 Morris St', 'Filenes Basement', '99999')]

TECHNICAL The key parameter requires a function, which itemgetter() STUFF supplies. A similar attrgetter() function works with class instances to retrieve attribute values based on their names.

Was this article helpful?

+3 -1

Responses

  • richard
    What is considered a good difflib ratio?
    8 years ago

Post a comment