Dealing with HTML

Much of the text on the Web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access this as described in the later section on files. However, if you're going to do this often, it's easiest to get Python to do the work directly. The first step is the same as before, using urlopen. For fun we'll pick a BBC News story called "Blondes to die out in 200 years," an urban legend passed along by the BBC as established scientific fact:

>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
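The session above is written for Python 2, where urlopen lives in the urllib module and read() returns an ordinary string. As a minimal sketch of the same step on Python 3, assuming only the standard library, the download can be wrapped in a small helper. The name fetch_html and its default_encoding parameter are hypothetical, introduced here for illustration only:

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8"):
    """Download a page and return its markup as a str (hypothetical helper)."""
    with urlopen(url) as response:
        raw_bytes = response.read()
        # Prefer the charset declared in the Content-Type header, if any.
        encoding = response.headers.get_content_charset() or default_encoding
    # Replace undecodable bytes rather than failing on a mislabeled page.
    return raw_bytes.decode(encoding, errors="replace")


# Usage (needs network access, and the page may have moved since publication):
# html = fetch_html("http://news.bbc.co.uk/2/hi/health/2284783.stm")
# html[:60]
```

Decoding explicitly matters on Python 3 because read() returns bytes there, and the later tokenizing steps expect text.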

You can type print html to see the HTML content in all its glory, including meta tags, an image map, JavaScript, forms, and tables.

Getting text out of HTML is a sufficiently common task that NLTK provides a helper function nltk.clean_html(), which takes an HTML string and returns raw text. We can then tokenize this to get our familiar text structure:

>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
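NLTK 3 releases no longer implement nltk.clean_html(); calling it raises an error that points to Beautiful Soup instead. To show roughly what such a helper does, here is a minimal sketch built on the standard library's html.parser module. It is not NLTK's implementation, and the names TextExtractor and strip_html are hypothetical:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse the whitespace left by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the pieces are joined with spaces, a word split by inline markup (for example a single bolded letter) comes out as two tokens. That is acceptable for a sketch but is one reason to prefer a dedicated HTML library for real work.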

This still contains unwanted material concerning site navigation and related stories. With some trial and error you can find the start and end indexes of the content, select the tokens of interest, and initialize a text as before:

>>> tokens = tokens[96:399]
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
 they say too few people now carry the gene for blondes to last beyond the next tw
t blonde hair is caused by a recessive gene . In order for a child to have blonde
to have blonde hair , it must have the gene on both sides of the family in the gra
there is a disadvantage of having that gene or by chance . They don ' t disappear
ondes would disappear is if having the gene was a disadvantage and I do not think
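The indexes 96 and 399 were found by hand and will change whenever the page does. One way to make the trial and error less fragile is to search for the token sequences that open and close the story. The sketch below assumes you have already picked such marker phrases by inspecting the token list. Both function names and the marker phrases in the test data are hypothetical:

```python
def find_sequence(tokens, phrase, start=0):
    """Return the index at or after `start` where the list `phrase`
    first occurs inside `tokens`, or -1 if it never occurs."""
    n = len(phrase)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            return i
    return -1


def slice_between(tokens, first_phrase, last_phrase):
    """Keep the tokens from first_phrase through last_phrase, inclusive."""
    begin = find_sequence(tokens, first_phrase)
    if begin == -1:
        raise ValueError("opening phrase not found")
    # Look for the closing phrase only after the opening one.
    end = find_sequence(tokens, last_phrase, begin + len(first_phrase))
    if end == -1:
        raise ValueError("closing phrase not found")
    return tokens[begin:end + len(last_phrase)]
```

The resulting list can be passed to nltk.Text in the same way as the hand-sliced tokens above.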

The Web can be thought of as a huge corpus of unannotated text. Web search engines provide an efficient means of searching this large quantity of text for relevant linguistic examples. The main advantage of search engines is size: since you are searching such a large set of documents, you are more likely to find any linguistic pattern you are interested in. Furthermore, you can make use of very specific patterns, which would match only one or two examples in a smaller corpus, but which might match tens of thousands of examples when run on the Web. A second advantage of web search engines is that they are very easy to use, which makes them a convenient tool for quickly checking whether a theory is reasonable. See Table 3-1 for an example.

> For more sophisticated processing of HTML, use the Beautiful Soup package, available at http://www.crummy.com/software/BeautifulSoup/.
