Listing: PoliteGet.py

import robotparser
import urlparse
import urllib

def PoliteGet(url):
    """Return an open url-file, or None if URL is forbidden"""
    RoboBuddy = robotparser.RobotFileParser()
    # Grab the host-name from the URL:
    URLTuple = urlparse.urlparse(url)
    RobotURL = "http://" + URLTuple[1] + "/robots.txt"
    RoboBuddy.set_url(RobotURL)
    RoboBuddy.read()
    if RoboBuddy.can_fetch("I,Robot", url):
        return urllib.urlopen(url)
    else:
        return None

URL = "http://www.nexor.com/cgi-bin/rfcsearch/location?2449"
print "Forbidden:", (PoliteGet(URL) == None)
URL = "http://www.yahoo.com/r/sq"
print "Allowed:", (PoliteGet(URL) == None)

You can manually pass a list of robots.txt lines to a RobotFileParser by calling the method parse(lines).
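A minimal sketch of feeding rules to parse directly, assuming Python 3, where the old robotparser module now lives at urllib.robotparser; the rules and the "MyBot" agent name are made up for illustration:

```python
from urllib import robotparser

# Hypothetical robots.txt content, supplied as a list of lines
# instead of being fetched over the network:
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyBot", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("MyBot", "http://example.com/index.html"))         # True
```

This is handy in tests, since no HTTP request is needed to exercise the parser.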

If your parser runs for many days or weeks, you may want to re-read robots.txt periodically. RobotFileParser keeps a "last updated" timestamp. Call the method modified to set the timestamp to the current time. (This is done automatically when you call read or parse.) Call mtime to retrieve the timestamp, as seconds since the epoch.
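The timestamp dance above can be sketched as follows, again assuming Python 3's urllib.robotparser; the one-day staleness threshold is an arbitrary choice for illustration:

```python
import time
from urllib import robotparser

rp = robotparser.RobotFileParser()

# Before any read, parse, or modified call, mtime() is 0 ("never updated"):
print(rp.mtime())  # 0

# Stamp the parser as freshly updated (read and parse do this for you):
rp.modified()

# Decide whether a re-read is due -- here, after more than one day:
stale = (time.time() - rp.mtime()) > 24 * 3600
print(stale)  # False -- we stamped it a moment ago
```

A long-running crawler would check this before each batch of fetches and call read again when stale.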
