Space-Time Trade-offs

We can sometimes significantly speed up the execution of a program by building an auxiliary data structure, such as an index. The listing in Example 4-7 implements a simple text retrieval system for the Movie Reviews Corpus. By indexing the document collection up front, it answers queries much faster than rescanning every file for each query would.

Example 4-7. A simple text retrieval system.

import re
import nltk

def raw(file):
    contents = open(file).read()
    contents = re.sub(r'<.*?>', ' ', contents)   # strip HTML markup
    contents = re.sub(r'\s+', ' ', contents)     # normalize whitespace
    return contents

def snippet(doc, term): # buggy: matches term anywhere, even inside longer words
    text = ' '*30 + raw(doc) + ' '*30            # pad so the slice never falls off the ends
    pos = text.index(term)
    return text[pos-30:pos+30]

print "Building Index..."
files = nltk.corpus.movie_reviews.abspaths()
# nltk.Index maps each word to the list of files it occurs in
idx = nltk.Index((w, f) for f in files for w in raw(f).split())

query = raw_input("query> ")
if query in idx:
    for doc in idx[query]:
        print snippet(doc, query)
else:
    print "Not found"

A more subtle example of a space-time trade-off involves replacing the tokens of a corpus with integer identifiers. We create a vocabulary for the corpus, a collection in which each word is stored exactly once, then invert it so that we can look up any word to find its identifier. Each document is preprocessed, so that a list of words becomes a list of integers. Any language model can now work with integers. See the listing in Example 4-8 for an example of how to do this for a tagged corpus.

Example 4-8. Preprocess tagged corpus data, converting all words and tags to integers.

def preprocess(tagged_corpus):
    words = set()
    tags = set()
    for sent in tagged_corpus:
        for word, tag in sent:
            words.add(word)
            tags.add(tag)
    wm = dict((w, i) for (i, w) in enumerate(words))   # word -> integer id
    tm = dict((t, i) for (i, t) in enumerate(tags))    # tag -> integer id
    return [[(wm[w], tm[t]) for (w, t) in sent] for sent in tagged_corpus]
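As a quick sanity check (assuming the Brown corpus has been downloaded with nltk.download('brown')), we can encode a small sample of tagged sentences and inspect one token before and after preprocessing:

import nltk

tagged = nltk.corpus.brown.tagged_sents()[:100]   # a small sample of tagged sentences
encoded = preprocess(tagged)
print tagged[0][0]    # ('The', 'AT')
print encoded[0][0]   # a pair of integer ids; exact values depend on set iteration order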

Another example of a space-time trade-off is maintaining a vocabulary list. If you need to process an input text to check that all words are in an existing vocabulary, the vocabulary should be stored as a set, not a list. A set is implemented as a hash table, so membership tests take roughly constant time regardless of the set's size, whereas testing membership of a list requires scanning its elements one by one.

We can test this claim using the timeit module. The Timer class takes two parameters: a statement that is executed multiple times, and setup code that is executed once at the beginning. We will simulate a vocabulary of 100,000 items using a list or a set of integers. The test statement will generate a random item that has a 50% chance of being in the vocabulary.

>>> from timeit import Timer
>>> vocab_size = 100000
>>> setup_list = "import random; vocab = range(%d)" % vocab_size
>>> setup_set = "import random; vocab = set(range(%d))" % vocab_size
>>> statement = "random.randint(0, %d) in vocab" % (vocab_size * 2)
>>> print Timer(statement, setup_list).timeit(1000)
2.78092288971
>>> print Timer(statement, setup_set).timeit(1000)
0.0037260055542

Performing 1,000 list membership tests takes a total of 2.8 seconds, whereas the equivalent tests on a set take a mere 0.0037 seconds: roughly 750 times, or nearly three orders of magnitude, faster!
