■ Tip You can find out more about the HTTP headers, including a short description and a link to the appropriate RFC specification document, at http://www.cs.tut.fi/~jkorpela/http.html.

Another useful method of the response object is geturl(). This method returns the actual URL of the retrieved document. The initial URL may respond with an HTTP redirect, in which case you end up retrieving a page from a completely different URL, so you may want to check where the page actually came from. One common cause of a redirect is an attempt to access restricted content without prior authentication; in that case you will most likely be redirected to the login page of the web site.

>>> res = urllib2.urlopen('http://www.hsbc.co.uk/')
>>> res.geturl()
'http://www.hsbc.co.uk/1/2/'
>>>
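
The redirect check described above can be sketched as a small helper. This is only an illustration under stated assumptions: it is written for Python 3, where urllib2's urlopen lives in urllib.request, and the function names describe_redirect and fetch_and_check are hypothetical, not part of any library. The comparison is done offline on the two URLs from the session above, so no network access is needed to try it.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen (assumption: Python 3)


def describe_redirect(requested_url, final_url):
    """Compare the URL we asked for with the one reported by geturl()."""
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    return {
        # Any difference between the two URLs means we were redirected.
        "redirected": requested_url != final_url,
        # A change of host is usually worth a closer look than a change of path.
        "same_host": requested.hostname == final.hostname,
        "final_path": final.path,
    }


def fetch_and_check(url):
    """Hypothetical helper: open the URL and describe any redirect (needs network access)."""
    res = urlopen(url)
    return describe_redirect(url, res.geturl())


# Offline check using the URLs from the interactive session above.
info = describe_redirect('http://www.hsbc.co.uk/', 'http://www.hsbc.co.uk/1/2/')
print(info)
```

If the host changes, or the final path points at something like a login page, the retrieved content is probably not the document you asked for and should be handled separately.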

The retrieved contents can then be passed to Beautiful Soup for HTML interpretation and parsing. The result is an HTML document object that implements various methods for searching and extracting data from the document. The argument supplied to the BeautifulSoup constructor is just a string, which means you can use any string as an argument, not only the one that you've just retrieved from the web site.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> type(soup)
<class 'BeautifulSoup.BeautifulSoup'>
>>>
