Reading Local Files

In order to read a local file, we need to use Python's built-in open() function, followed by the read() method. Supposing you have a file document.txt, you can load its contents like this:

>>> f = open('document.txt')
>>> raw = f.read()

Your Turn: Create a file called document.txt using a text editor, type in a few lines of text, and save it as plain text. If you are using IDLE, select the New Window command in the File menu, type the required text into this window, and then save the file as document.txt inside the directory that IDLE offers in the pop-up dialogue box. Next, in the Python interpreter, open the file using f = open('document.txt'), then inspect its contents using print(f.read()).

Various things might have gone wrong when you tried this. If the interpreter couldn't find your file, you would have seen an error like this:

>>> f = open('document.txt')
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in -toplevel-
    f = open('document.txt')
IOError: [Errno 2] No such file or directory: 'document.txt'
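If a script should cope with a missing file rather than stop with a traceback, the open() call can be wrapped in a try/except block. The following is a minimal sketch, assuming Python 3 (where IOError is an alias of OSError, and a missing file raises its subclass FileNotFoundError). The helper name read_or_none and the filename no_such_file.txt are hypothetical, chosen only for illustration.

```python
def read_or_none(filename):
    """Return the file's contents as a string, or None if it cannot be opened."""
    try:
        # The with statement closes the file even if read() fails.
        with open(filename) as f:
            return f.read()
    except IOError as err:
        # Errno 2 ("No such file or directory") ends up here.
        print("Could not open %s: %s" % (filename, err.strerror))
        return None

# Hypothetical filename, assumed not to exist in the current directory.
raw = read_or_none('no_such_file.txt')
```

Returning None is one possible design choice; a program that cannot continue without the file may be better off letting the exception propagate.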

To check that the file that you are trying to open is really in the right directory, use IDLE's Open command in the File menu; this will display a list of all the files in the directory where IDLE is running. An alternative is to examine the current directory from within Python:

>>> import os
>>> os.listdir('.')
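The same check can be scripted. The sketch below uses only the standard os module: os.getcwd() reports the directory Python is running in, and os.path.exists() says whether a given name is present there. The helper name report_location is hypothetical.

```python
import os

def report_location(filename):
    """Print where Python is looking and return True if filename is there."""
    cwd = os.getcwd()  # relative paths are resolved against this directory
    found = os.path.exists(os.path.join(cwd, filename))
    print("Current directory: %s" % cwd)
    print("%s found: %s" % (filename, found))
    return found

found = report_location('document.txt')
```

If the file is reported as missing, either move it into the directory shown or pass open() the full path to the file.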

Another possible problem you might have encountered when accessing a text file is the newline convention, which differs between operating systems. The built-in open() function has a second parameter for controlling how the file is opened: open('document.txt', 'rU'). 'r' means to open the file for reading (the default), and 'U' stands for "Universal", which lets us ignore the different conventions used for marking newlines.
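To see the effect of universal newline handling, the sketch below writes a file with Windows-style '\r\n' line endings and reads it back in two ways. It assumes Python 3, where open() in text mode translates line endings to '\n' by default and the 'U' flag is no longer needed (it was deprecated and later removed in Python 3.11). The temporary filename is hypothetical.

```python
import os
import tempfile

# Write the sample text with Windows-style line endings, in binary mode
# so that Python does not alter the bytes.
path = os.path.join(tempfile.mkdtemp(), 'windows_style.txt')  # hypothetical file
with open(path, 'wb') as f:
    f.write(b'Time flies like an arrow.\r\nFruit flies like a banana.\r\n')

# Binary mode shows the bytes exactly as stored on disk.
with open(path, 'rb') as f:
    raw_bytes = f.read()

# Text mode (the default 'r') translates '\r\n' to '\n' on reading.
with open(path) as f:
    raw = f.read()

print(repr(raw_bytes))
print(repr(raw))
```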

Assuming that you can open the file, there are several methods for reading it. The read() method creates a string with the contents of the entire file:

>>> f.read()
'Time flies like an arrow.\nFruit flies like a banana.\n'

Recall that the '\n' characters are newlines; this is equivalent to pressing Enter on a keyboard and starting a new line.

We can also read a file one line at a time using a for loop:

>>> f = open('document.txt')
>>> for line in f:
...     print(line.strip())
Time flies like an arrow.
Fruit flies like a banana.

Here we use the strip() method to remove the newline character at the end of the input line.
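The loop described above can also collect the lines instead of printing them. This is a minimal sketch, assuming Python 3. It first writes the two sample sentences to a temporary file (a hypothetical stand-in for document.txt) so that the example is self-contained.

```python
import os
import tempfile

# Create a stand-in for document.txt in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'document.txt')
with open(path, 'w') as f:
    f.write('Time flies like an arrow.\nFruit flies like a banana.\n')

# Iterate over the file one line at a time; strip() removes the
# trailing newline (and any other surrounding whitespace).
lines = []
with open(path) as f:
    for line in f:
        lines.append(line.strip())

print(lines)
```

Reading line by line avoids holding the whole file in memory at once, which matters for large files.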

NLTK's corpus files can also be accessed using these methods. We simply have to use nltk.data.find() to get the filename for any corpus item. Then we can open and read it in the way we just demonstrated:

>>> path = nltk.data.find('corpora/gutenberg/melville-moby_dick.txt')
>>> raw = open(path, 'rU').read()
