The workings of the Spider class

To see how the heart of the spider.py program processes information, look at the Spider class. To use this class, you have to create an instance of it.

REMEMBER Instances are objects created from classes (see Chapter 13 for the details):

• A class is like a template.

• An instance is like a document created from that template.

The parts of the Spider class perform the following tasks:

• The run() method does the real work. It's a simple loop that

o Processes new pages (from the self._links_to_process list) until there are no more pages

o Calls the process_page() method each time through the loop, using a new item from the self._links_to_process list

• The process_page() method contains a loop that includes an if statement. (See Chapter 10 to find more about if statements and loops.)

• The loop sends a URL to the get_page() function, which opens the Web page, reads it, and returns its contents. The loop then sends those contents to the find_links() function, which logs and returns a list of all the links on that page.

• The statement if link not in self.URLs and self.url_in_site(link): checks that this link isn't already in the self.URLs set and that it's part of the Web site we're analyzing. (This spider only checks links internal to the Web site of the original URL. If you want to check links to other Web sites, you need some different code here.) If the link passes both tests, it's added to the self.URLs set and the self._links_to_process list.
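To make the pieces above concrete, here is a minimal sketch of how such a class might fit together. It is not the actual spider.py listing. The names Spider, run(), process_page(), url_in_site(), get_page(), find_links(), self.URLs, and self._links_to_process come from the description above; everything else is an assumption made for illustration, including the start_url and fetch parameters, the regular expression used to find links, and the choice to pass the page's URL to find_links() so relative links can be resolved. The hypothetical fetch hook lets the example run against a small in-memory "site" instead of the real Web.

```python
import re
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


def get_page(url):
    """Open the Web page, read it, and return its contents as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_links(base_url, html):
    """Return a list of absolute URLs for the links in the page.

    Assumption: a simple regular expression stands in for real HTML parsing.
    """
    hrefs = re.findall(r'href=["\'](.*?)["\']', html, flags=re.IGNORECASE)
    return [urljoin(base_url, href) for href in hrefs]


class Spider:
    def __init__(self, start_url, fetch=get_page):
        self.start_url = start_url
        self.fetch = fetch                     # hypothetical hook, not in the original
        self.URLs = {start_url}                # every link seen so far
        self._links_to_process = [start_url]   # pages still waiting to be read

    def url_in_site(self, link):
        """Return True if the link is on the same host as the starting URL."""
        return urlparse(link).netloc == urlparse(self.start_url).netloc

    def process_page(self, url):
        html = self.fetch(url)
        for link in find_links(url, html):
            # Keep only links that are new and internal to the site.
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)

    def run(self):
        # Keep going until there are no more pages to process.
        while self._links_to_process:
            self.process_page(self._links_to_process.pop())


# A tiny made-up site, so the sketch runs without touching the network.
pages = {
    "http://example.com/": '<a href="/about">About</a> <a href="http://other.org/">Other</a>',
    "http://example.com/about": '<a href="/">Home</a>',
}

spider = Spider("http://example.com/", fetch=lambda url: pages.get(url, ""))
spider.run()
print(sorted(spider.URLs))
```

In this sketch the link to other.org is never added, because url_in_site() rejects it, and the link back to the home page is skipped because it's already in the self.URLs set. To crawl a real site, you would leave out the fetch argument so the spider uses get_page().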
