Accessing Text from the Web and from Disk Electronic Books

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project Gutenberg. You can browse the catalog of 25,000 free online books at log/, and obtain a URL to an ASCII text file. Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian, Portuguese, and Spanish (with more than 100 texts each).

Text number 2554 is an English translation of Crime and Punishment, and we can access it as follows.

>>> from urllib import urlopen

>>> url = ""

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

The read() process will take a few seconds as it downloads this large book. If you're using an Internet proxy that is not correctly detected by Python, you may need to specify the proxy manually as follows:

>>> proxies = {'http': ''} >>> raw = urlopen(url, proxies=proxies).read()

The variable raw contains a string with 1,176,831 characters. (We can see that it is a string, using type(raw).) This is the raw content of the book, including many details we are not interested in, such as whitespace, line breaks, and blank lines. Notice the \r and \n in the opening line of the file, which is how Python displays the special carriage return and line-feed characters (the file must have been created on a Windows machine). For our language processing, we want to break up the string into words and punctuation, as we saw in Chapter 1. This step is called tokenization, and it produces our familiar structure, a list of words and punctuation.


['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

Notice that NLTK was needed for tokenization, but not for any of the earlier tasks of opening a URL and reading it into a string. If we now take the further step of creating an NLTK text from this list, we can carry out all of the other linguistic processing we saw in Chapter 1, along with the regular list operations, such as slicing:

>>> text = nltk.Text(tokens) >>> type(text) <type 'nltk.text.Text'> >>> text[1020:1060]

['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.'] >>> text.collocations()

Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch; Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; Andrey Semyonovitch; old woman; Literary Archive; Dmitri Prokofitch; great deal; United States; Praskovya Pavlovna; Porfiry Petrovitch; ear rings

Notice that Project Gutenberg appears as a collocation. This is because each text downloaded from Project Gutenberg contains a header with the name of the text, the author, the names of people who scanned and corrected the text, a license, and so on. Sometimes this information appears in a footer at the end of the file. We cannot reliably detect where the content begins and ends, and so have to resort to manual inspection of the file, to discover unique strings that mark the beginning and the end, before trimming raw to be just the content and nothing else:

>>> raw.rfind("End of Project Gutenberg's Crime") 1157681

>>> raw = raw[5303:1157681] O >>> raw.find("PART I") 0

The find() and rfind() ("reverse find") methods help us get the right index values to use for slicing the string O. We overwrite raw with this slice, so now it begins with "PART I" and goes up to (but not including) the phrase that marks the end of the content.

This was our first brush with the reality of the Web: texts found on the Web may contain unwanted material, and there may not be an automatic way to remove it. But with a small amount of extra work we can extract the material we need.

Was this article helpful?

0 0

Post a comment