To see how sets are used, suppose we have several files recording observations of birds in the Canadian Arctic and we want to know which species we have seen. Each file is formatted like this:
canada goose canada goose long-tailed jaeger canada goose snow goose canada goose canada goose northern fulmar
Method Call set1 .difference(set2) set1 .intersection(set2) set1 .issubset(set2) set1 .issuperset(set2) set1 .union(set2)
Operator set1 - set2 set1 & set2 set1 <= set2 set1 >= set2 set1 | set2 set1 Aset2
Figure 9.3: Set operations
Here's our program:
# Find the different bird types observed, birds = set()
for filename in sys.argv[1:]: infile = open(filename, 'r') for line in infile:
name = line.stripO birds.add(name) infile.close()
# Print the birds. for b in birds:
The first statement after the import creates a new empty set and assigns it to the variable birds. The loop then reads the names of birds from each of the input files specified on the command line. Any whitespace before or after the bird's name is stripped away by line.stripO; the program then uses birds.add(name) to make sure the new name is in the set. If the name is not already there, set.add puts it in the set; otherwise, the set isn't changed.
Once all the names have been read, the program loops over the values in the set to print them. This works exactly like a loop over the values in a list, except that the order in which values are encountered is arbitrary: there is no guarantee that they will come out in the order in which they were added, in alphabetical order, in order by length, or anything else.
A-H I-O f P-Z
A-H I-O / P-Z
Marie Curie (nee Sklodowska)
Should be filed under A-H
Figure 9.4: Moving items in a filing cabinet
Sets are stored in a data structure called a hash table. Each time an item is added to a set, Python calculates a hash code for the item, which is an integer that is guaranteed to be the same for items with equal values:
Download setdict/hashcode.cmd >>> help(hash)
Help on built-in function hash in module _builtin_:
hash(object) -> integer
Return a hash value for the object. Two objects with the same value have the same hash value. The reverse is not necessarily true, but likely.
To see whether a value is in a set, Python simply recalculates the hash code for that value and checks the corresponding location in the hash table. This is a lot faster than searching the whole set every time a value needs to be looked up. However, this scheme works only if an item's hash code cannot change after it has been added to the hash table. To see why, consider what would happen if you filed patient records in alphabetical order but then went back and changed someone's name (see Figure 9.4). Their record would now be out of place, so if someone else tried to look it up, they wouldn't find it.
Because hash codes are computed from values, if a list's contents are changed, the hash code for that list will change. For this reason, Python allows sets to contain only immutable values (see Section 5.2, Modifying Lists, on page 85). Again, Booleans, numbers, strings, and tuples are allowed, but lists are not:
Traceback (most recent call last):
File "<stdin>", line 1, in <module> TypeError: list objects are unhashable
This is actually one of the reasons tuples were invented: they allow multipart values like ('Albert', 'Einstein') to be added to sets.
But this means that we can't store a set of sets. Sets themselves can't be immutable, since we need to add and remove values, so a set can't contain another one. To solve this problem, Python has another data type called a frozen set. As the name implies, frozen sets are sets whose contents can't be changed, just as tuples are lists whose values can't be modified. An empty frozen set (which isn't particularly useful) is created using frozenset(); to create a frozen set that contains some values, use frozenset(values), where values is a list, tuple, set, or other collection.
Was this article helpful?