Figure 19.1 The Page Indexer application
The algorithm we will use for indexing is this: For each HTML file that is encountered, its text is read, entities are converted to the equivalent Unicode character, and the HTML tags are stripped out. Then the text is split into words and each word of 3-25 characters in length inclusive that isn't in the set of common words is added to the filenamesForWords default dictionary. Each of the dictionary's keys is a unique word, and each associated value is a set of the filenames where the word occurs. If any word occurs in more than 250 files, it is deleted from the dictionary and added to the set of common words. This ensures that the dictionary is kept to a reasonable size and means that searches for words like "and" and "the" won't work, which is a good thing, since such words are likely to match in thousands of files, far too many to be useful.
We will begin by looking at two extracts from the application's main form, which is in file chap19/pageindexer.pyw.
        self.fileCount = 0
        self.filenamesForWords = collections.defaultdict(set)
        self.commonWords = set()
        self.lock = QReadWriteLock()
        self.path = QDir.homePath()
The fileCount variable is used to keep track of how many files have been indexed so far. The filenamesForWords default dictionary's keys are words and its values are sets of filenames. The commonWords set holds words that have occurred in at least 250 files. The read/write lock is used to ensure that access to the filenamesForWords dictionary and to the commonWords set is protected, since they will be read in the primary thread and read and written in the secondary thread. The QDir.homePath() method returns the user's home directory; we use it to set an initial search path.
        self.walker = walker.Walker(self.lock, self)
        self.connect(self.walker, SIGNAL("indexed(QString)"), self.indexed)
        self.connect(self.walker, SIGNAL("finished(bool)"), self.finished)
        self.connect(self.pathButton, SIGNAL("clicked()"), self.setPath)
        self.connect(self.findEdit, SIGNAL("returnPressed()"), self.find)
The secondary thread is in the walker module (so named because it walks the filesystem), and the QThread subclass is called Walker. Whenever the thread indexes a new file it emits a signal with the filename. It also emits a finished() signal when it has indexed all the files in the path it was given when it was started.
Signals emitted in one thread that are intended for another work asynchronously, that is, they don't block. But they work only if there is an event loop at the receiving end. This means that secondary threads can pass information to the primary thread using signals, but not the other way around, unless we run a separate event loop in a secondary thread (which is possible). Behind the scenes, when cross-thread signals are emitted, instead of calling the relevant method directly as is done for signals emitted and received in the same thread, PyQt puts an event onto the receiving thread's event queue with any data that was passed. When the receiver's event loop gets around to reading the event, it responds to it by calling the relevant method with any data that was passed.
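The queuing behavior behind cross-thread signals can be imitated in plain Python. The sketch below is only an analogy and not how PyQt is implemented: it uses the standard library's queue.Queue in place of the receiving thread's event queue, and the names worker, run_event_loop, and the two filenames are invented for illustration.

```python
import queue
import threading


def worker(events):
    # Secondary thread: instead of calling the receiver's methods directly,
    # post (signal name, data) pairs onto the receiver's queue.
    for name in ("a.html", "b.html"):
        events.put(("indexed", name))
    events.put(("finished", True))


def run_event_loop(events):
    # Receiving thread: read events in arrival order and respond to each.
    received = []
    while True:
        signal, data = events.get()  # blocks until an event is available
        received.append((signal, data))
        if signal == "finished":
            return received


events = queue.Queue()
thread = threading.Thread(target=worker, args=(events,))
thread.start()
received = run_event_loop(events)
thread.join()
```

The sender never waits for the receiver to handle an event, which mirrors the non-blocking behavior described above; without a loop reading the queue, nothing would ever be delivered.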
Figure 19.2 A schematic of typical PyQt inter-thread communication
As Figure 19.2 shows, the primary thread normally passes information to secondary threads using method calls, and secondary threads pass information to the primary thread using signals. Another communication mechanism, used by both primary and secondary threads, is to use shared data. Such data must have accesses protected—for example, by mutexes or read/write locks.
If the user clicks the Set Path button, the setPath() method is called, and if the user presses Enter in the find line edit, the find() method is called.
The Form class is a QDialog, but we have designed it so that if the user presses Esc while the indexing is ongoing, the indexing will stop, and if the user presses Esc when the indexing has finished (or been stopped), the application will terminate. We will see how this is done when we look at the accept() and reject() reimplementations.
    def setPath(self):
        self.pathButton.setEnabled(False)
        if self.walker.isRunning():
            self.walker.stop()
            self.walker.wait()
        path = QFileDialog.getExistingDirectory(self,
                    "Choose a Path to Index", self.path)
        if path.isEmpty():
            self.statusLabel.setText("Click the 'Set Path' "
                                     "button to start indexing")
            self.pathButton.setEnabled(True)
            return
        self.path = QDir.toNativeSeparators(path)
        self.fileCount = 0
        self.filenamesForWords = collections.defaultdict(set)
        self.commonWords = set()
        self.walker.initialize(unicode(self.path),
                self.filenamesForWords, self.commonWords)
        self.walker.start()
When the user clicks Set Path, we begin by disabling the button and then stopping the thread if it is running. The stop() method is a custom one of our own. The wait() method is one inherited from QThread; it blocks until the thread has finished running, that is, until the run() method returns. In the stop() method, we indirectly ensure that the run() method finishes as soon as possible after stop() has been called, as we will see in the next section.
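The stop-then-wait sequence can be sketched with the standard library's threading module. This is a generic illustration under stated assumptions, not the Walker code (which is covered in the next section): StoppableWorker and its attributes are hypothetical names, threading.Event stands in for whatever flag the real stop() sets, and join() plays the role of QThread.wait().

```python
import threading
import time


class StoppableWorker(threading.Thread):
    """Hypothetical worker that checks a stop flag between units of work."""

    def __init__(self):
        super().__init__()
        self._stop_requested = threading.Event()
        self.processed = 0

    def stop(self):
        # Only requests termination; run() notices the flag and returns.
        self._stop_requested.set()

    def run(self):
        while not self._stop_requested.is_set():
            self.processed += 1   # stands in for indexing one file
            time.sleep(0.001)


worker = StoppableWorker()
worker.start()
worker.stop()   # ask the thread to finish as soon as possible
worker.join()   # block until run() has returned
```

Because the flag is checked once per unit of work, the delay between stop() and the thread finishing is bounded by the time taken to process a single item.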
Next we get the path the user chose (or return, if they canceled). We have used QDir.toNativeSeparators() since internally PyQt always uses "/" to separate paths, but on Windows we want to show "\"s instead. The toNativeSeparators() method was introduced with Qt 4.2; for earlier versions use QDir.convertSeparators() instead. By default, getExistingDirectory() shows only directories because there is an optional fourth argument with a default value of QFileDialog.ShowDirsOnly; if we want filenames to be visible, we can clear this flag by passing QFileDialog.Options().
The user interface is set up by moving the keyboard focus to the find line edit, setting the path label to the chosen path, and clearing the status label that is used to keep the user informed about progress. The files list widget lists those files that contain the word in the find line edit. We don't need to protect access to the filenamesForWords default dictionary or to the commonWords set since the only thread running at this point is the primary thread.
We finish off by initializing the walker thread with the path and references to the data structures we want it to populate, and then we call start() to start it executing.
    def indexed(self, fname):
        self.statusLabel.setText(fname)
        self.fileCount += 1
        if self.fileCount % 25 == 0:
            self.filesIndexedLCD.display(self.fileCount)
            try:
                self.lock.lockForRead()
                indexedWordCount = len(self.filenamesForWords)
                commonWordCount = len(self.commonWords)
            finally:
                self.lock.unlock()
            self.wordsIndexedLCD.display(indexedWordCount)
            self.commonWordsLCD.display(commonWordCount)
        elif self.fileCount % 101 == 0:
            self.commonWordsListWidget.clear()
            try:
                self.lock.lockForRead()
                words = self.commonWords.copy()
            finally:
                self.lock.unlock()
            self.commonWordsListWidget.addItems(sorted(words))
Whenever the walker thread finishes indexing a file, it emits an indexed() signal with the filename; this signal is connected to the Form.indexed() method shown earlier. We update the status label to show the name of the file that has just been indexed, and every 25 files we also update the file count, words indexed, and common words LCD widgets. We use a read lock to ensure that the shared data structures are safe to read from, and we do the minimum amount of work inside the context of the lock, updating the LCD widgets only after the lock has been released.
For every 101st file processed we update the common words list widget. Again we use a read lock, and we use set.copy() to ensure that we do not refer to the shared data once the lock has been released.
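The "copy under the lock, work on the copy afterward" idea can be shown in isolation. The sketch below makes two assumptions: Python's standard library has no read/write lock, so a plain threading.Lock stands in for QReadWriteLock, and the names common_words and snapshot_common_words are invented for the example.

```python
import threading

lock = threading.Lock()              # stand-in for the QReadWriteLock
common_words = {"the", "and", "of"}  # shared data, normally written by another thread


def snapshot_common_words():
    lock.acquire()
    try:
        # Do the minimum inside the lock: take a private copy.
        words = common_words.copy()
    finally:
        lock.release()
    # Sorting happens on the copy, after the lock has been released.
    return sorted(words)


listing = snapshot_common_words()
common_words.add("to")  # a later change to the shared set does not affect the copy
```

Holding the lock only for the duration of the copy keeps the other thread from being blocked while the (potentially slower) user interface update takes place.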
    def finished(self, completed):
        self.statusLabel.setText("Indexing complete" \
                                 if completed else "Stopped")
        self.finishedIndexing()
When the thread has been stopped or has finished, it emits a finished() signal, connected to this method and passing a Boolean to indicate whether it completed. We update the status label and call our finishedIndexing() method to update the user interface.
    def finishedIndexing(self):
        self.walker.wait()
        self.filesIndexedLCD.display(self.fileCount)
        self.wordsIndexedLCD.display(len(self.filenamesForWords))
        self.commonWordsLCD.display(len(self.commonWords))
        self.pathButton.setEnabled(True)
When the indexing has finished we call QThread.wait() to make sure that the thread's run() method has finished. Then we update the user interface based on the current values of the shared data structures. We don't need to protect access to the dictionary or the set because the walker thread is not running.
Using a Context Manager for Unlocking
In this chapter, we use try ... finally blocks to ensure that locks are unlocked after use. Python 2.6 offers an alternative approach using the new with keyword, in conjunction with a context manager. Context managers are explained in http://www.python.org/dev/peps/pep-0343; suffice it to say that we can make a context manager by creating a class that has two special methods: __enter__() and __exit__(). Then, instead of writing code like this:
    try:
        self.lock.lockForRead()
        found = word in self.commonWords
    finally:
        self.lock.unlock()
we can write something much simpler and shorter:
    with ReadLocker(self.lock):
        found = word in self.commonWords
This works because the semantics of the object given to the with statement (at its simplest) are:
    ContextManager.__enter__()
    try:
        # statements, e.g., found = word in self.commonWords
    finally:
        ContextManager.__exit__()
The ReadLocker context manager class itself is also easy to implement, assuming it is passed a QReadWriteLock object:
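A minimal sketch of such a class follows. So that the example can run without PyQt, it uses FakeReadWriteLock, a hypothetical stand-in that merely records the calls made to it; with PyQt available, a real QReadWriteLock would be passed instead, since only its lockForRead() and unlock() methods are used.

```python
class FakeReadWriteLock:
    """Hypothetical stand-in for QReadWriteLock that records calls."""

    def __init__(self):
        self.calls = []

    def lockForRead(self):
        self.calls.append("lockForRead")

    def unlock(self):
        self.calls.append("unlock")


class ReadLocker:
    """Context manager that holds a read lock for the body of a with block."""

    def __init__(self, lock):
        self.lock = lock

    def __enter__(self):
        self.lock.lockForRead()

    def __exit__(self, exc_type, exc_value, traceback):
        # Always unlock; returning None lets any exception propagate.
        self.lock.unlock()


lock = FakeReadWriteLock()
common_words = {"the", "and"}
with ReadLocker(lock):
    found = "the" in common_words
```

The unlock happens in __exit__(), which Python calls whether the block finishes normally or raises, so it gives the same guarantee as the try ... finally form.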
In fact, since PyQt 4.1, QReadLocker and QWriteLocker can be used as context managers, so with Python 2.6 (or Python 2.5 with a from __future__ import with_statement), we don't need to use try ... finally to guarantee unlocking, and can instead write code like this:
    with QReadLocker(self.lock):
        found = word in self.commonWords
The files pageindexer_26.pyw and walker_26.py in chap19 use this approach.
At any time during the indexing, the user can interact with the user interface with no freezing or performance degradation. In particular, they can enter text in the find line edit and press Enter to populate the files list widget with those files that contain the word they typed. If they press Enter more than once with a bit of time between presses, the list of files may change, because in the interval more files may have been indexed. The find() method is slightly long, so we will review it in two parts.
    def find(self):
        word = unicode(self.findEdit.text())
        if not word:
            self.statusLabel.setText("Enter a word to find in files")
            return
        self.statusLabel.clear()
        self.filesListWidget.clear()
        word = word.lower()
        if " " in word:
            word = word.split()[0]
        try:
            self.lock.lockForRead()
            found = word in self.commonWords
        finally:
            self.lock.unlock()
        if found:
            self.statusLabel.setText(
                    "Common words like '%s' are not indexed" % word)
            return
If the user enters a word to find, we clear the status label and the file list widget and look for the word in the set of common words. If the word was found, it is too common to be indexed, so we just give an informative message and return.
        try:
            self.lock.lockForRead()
            files = self.filenamesForWords.get(word, set()).copy()
        finally:
            self.lock.unlock()
        if not files:
            self.statusLabel.setText(
                    "No indexed file contains the word '%s'" % word)
            return
        files = [QDir.toNativeSeparators(name) for name in \
                 sorted(files, key=unicode.lower)]
        self.filesListWidget.addItems(files)
        self.statusLabel.setText(
                "%d indexed files contain the word '%s'" % (
                len(files), word))
If the user's word is not in the set of common words, it might be in the index. We access the filenamesForWords default dictionary using a read lock, and copy the set of files that match the word. The set will be empty if no files have the word, but in either case, the set we have is a copy, so there is no risk of accessing shared data outside the context of a lock. If there are matching files we add them to the files list widget, sorted and using platform-native path separators.
The sorted() function returns a new list containing the items of its first argument (e.g., a list or set) in sorted order. It can be given a comparison function as the second argument, but here we have specified a "key". This has the effect of doing a DSU (decorate, sort, undecorate) sort that is the equivalent of:
    templist = [(fname.lower(), fname) for fname in files]
    templist.sort()
    files = [fname for key, fname in templist]
This is more efficient than using a comparison function because each item is lowercased just once rather than every time it is used in a comparison.
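The equivalence can be checked with a small runnable example. The filenames are invented, and because the sketch targets Python 3 it uses str.lower where the application's Python 2 code uses unicode.lower.

```python
files = {"Zebra.html", "apple.html", "Mango.html"}

# Decorate each name with its lowercased form, sort, then undecorate.
templist = [(fname.lower(), fname) for fname in files]
templist.sort()
dsu_sorted = [fname for key, fname in templist]

# The same ordering using a key function; each name is lowercased once.
key_sorted = sorted(files, key=str.lower)
```

Both lists come out in case-insensitive alphabetical order. The two approaches can differ only when two names have identical keys: the decorated tuples then fall back to comparing the original names, whereas the key version leaves such items in their incoming order.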
    def reject(self):
        if self.walker.isRunning():
            self.walker.stop()
            self.finishedIndexing()
        else:
            self.accept()
If the user presses Esc, the reject() method is called. If indexing is in progress, we call stop() on the thread and then call finishedIndexing(); the finishedIndexing() method calls wait(). Otherwise, indexing has either been stopped by a previous Esc key press or has finished; either way, we call accept() to terminate the application.
    def closeEvent(self, event=None):
        self.walker.stop()
        self.walker.wait()
When the application is terminated, either by the accept() call that occurs in the reject() method, or by other means, such as the user clicking the close X button, the closeEvent() method is called. Here we make sure that indexing has been stopped and that the thread has finished so that a clean termination takes place.
All the indexing work has been done by the walker secondary thread. This thread has been controlled by the primary thread calling its methods (e.g., start() and stop()), and has notified the primary thread of its status (file indexed, indexing finished) through PyQt's signals and slots mechanism. Whenever the shared data has been accessed (for example, when the user has asked which files contain a particular word, or when the data has been updated by the walker thread), this has been done under the protection of a read/write lock. In the following section we will see how the Walker thread is implemented, how it emits its signals, and how it populates the data structures it is given.
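Before moving on, the bookkeeping rule described at the start of this section (words of 3-25 characters go into the dictionary, and a word seen in more than 250 files is moved to the common words set) can be sketched in plain Python. This is a simplified illustration and not the Walker code: the function name index_text, the regular expression used to split words, and the sample texts are assumptions, and file reading, entity conversion, HTML tag stripping, and locking are all left out.

```python
import collections
import re

MIN_LEN, MAX_LEN = 3, 25
COMMON_THRESHOLD = 250  # a word in more than this many files becomes "common"


def index_text(fname, text, filenames_for_words, common_words,
               threshold=COMMON_THRESHOLD):
    """Record which words occur in fname (hypothetical helper)."""
    for word in set(re.findall(r"\w+", text.lower())):
        if not (MIN_LEN <= len(word) <= MAX_LEN) or word in common_words:
            continue
        filenames_for_words[word].add(fname)
        if len(filenames_for_words[word]) > threshold:
            # Too common to be useful: drop it from the index.
            del filenames_for_words[word]
            common_words.add(word)


filenames_for_words = collections.defaultdict(set)
common_words = set()
texts = ["The variant", "the cat", "THE dog"]
for i, text in enumerate(texts):
    # A tiny threshold is used so the effect is visible with three "files".
    index_text("page%d.html" % i, text, filenames_for_words, common_words,
               threshold=2)
```

After the third file, "the" has been seen in more than two files, so it is removed from the dictionary and placed in the common words set, while "variant", "cat", and "dog" each remain indexed against a single filename.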