Working with Files

Learning to deal with files is key to processing textual data. Often, text that you have to process is contained in a text file such as a logfile, config file, or application data file. When you need to consolidate the data that you are analyzing, you often need to create a report file of some sort or put it into a text file for further analysis. Fortunately, Python contains an easy-to-use built-in type called file that can help you do all of those things.

Creating files

It may seem counterintuitive, but in order to read an existing file, you have to create a new file object. Don't confuse creating a new file object with creating a new file, though. Writing to a file also requires a new file object, and it might require creating a new file on disk as well, so creating a file object for writing may seem less counterintuitive than creating one for reading. In both cases, the reason you create a file object is so that you can interact with the file on disk.

In order to create a file object, you use the built-in function open(). Here is an example of code that opens a file for reading:

In [1]: infile = open("foo.txt", "r")

In [2]: print infile.read()
Some Random Lines

Text.

Because open() is built-in, you don't need to import a module. open() takes three parameters: a filename, the mode in which the file should be opened, and a buffer size. Only the first parameter, filename, is mandatory. The most common values for mode are "r" (read mode; this is the default), "w" (write mode), and "a" (append mode). A complementary mode that can be added to the other modes is "b", or binary mode. The third parameter, buffer size, directs how the file is buffered.
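As a rough illustration of the three common modes, the following sketch writes, appends to, and then reads back a scratch file. The file name modes_demo.txt and the temporary directory are assumptions made for the example, and the code is written for a current Python 3 interpreter rather than the Python 2 sessions shown in the text.

```python
import os
import tempfile

# Hypothetical scratch file; the name is only for illustration.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# "w" creates the file, or truncates it if it already exists.
f = open(path, "w")
f.write("first line\n")
f.close()

# "a" keeps the existing contents and adds to the end.
f = open(path, "a")
f.write("second line\n")
f.close()

# "r" is the default mode, so it can be omitted.
f = open(path)
contents = f.read()
f.close()

print(contents)
```

Opening the same path with "w" a second time would discard both lines, which is why "a" is the usual choice for logfiles.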

In the previous example, we specified that we would like to open() the file "foo.txt" in read mode and be able to refer to that new readable file object with the variable infile. Once we have infile, we are free to call the read() method on it, which reads the entire contents of the file.

Creating a file for writing is very similar to the way we created the file for reading. Instead of using an "r" flag, you use a "w" flag:

In [1]: outputfile = open("foo_out.txt", "w")

In [2]: outputfile.write("This is\nSome\nRandom\nOutput Text\n")

In [3]: outputfile.close()

In this example, we specified that we would like to open() the file "foo_out.txt" in write mode and be able to refer to that new writable file object with the variable outputfile. Once we have outputfile, we can write() some text to it and close() the file.

While these are the simple ways of creating files, you probably want to get in the habit of creating files in a more error-tolerant way. It is good practice to wrap your file opens with a try/finally block, especially when you are using write() calls. Here is an example of a writeable file wrapped in a try/finally block:

In [1]: try:
   ...:     f = open('writeable.txt', 'w')
   ...:     f.write('quick line here\n')
   ...: finally:
   ...:     f.close()

This way of writing files causes the close() method to be called when an exception happens somewhere in the try block. In fact, close() is called even when no exception occurs in the try block: finally blocks are executed after try blocks complete, whether an exception is raised or not.
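To watch the finally clause run when something goes wrong, the sketch below forces a failure partway through writing. It is a variation on the example rather than a copy of it: the file is opened before the try block, so that the name f is guaranteed to exist when finally runs, and the failing write(42) call and the temporary directory are assumptions added for the demonstration. It targets a current Python 3 interpreter.

```python
import os
import tempfile

# Hypothetical scratch location for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "writeable.txt")

f = open(path, "w")
error_seen = False
try:
    try:
        f.write("quick line here\n")
        # Deliberate failure: write() accepts a string, not an integer.
        f.write(42)
    finally:
        # Runs whether or not the try block raised.
        f.close()
except TypeError:
    error_seen = True

print(f.closed)
```

The outer try/except exists only so the script can carry on and report the result; the inner try/finally is the part that guarantees the close() call.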

A new idiom in Python 2.5 is the with statement, which lets you use context managers.

A context manager is simply an object with __enter__() and __exit__() methods. When an object is created in the with expression, the context manager's __enter__() method is called. When the with block completes, even if an exception occurs, the context manager's __exit__() is called. File objects have __enter__() and __exit__() methods defined. On __exit__(), the file object's close() method is called. Here is an example of the with statement:

In [1]: from __future__ import with_statement

In [2]: with open('writeable.txt', 'w') as f:
   ...:     f.write('this is a writeable file\n')
   ...:

Even though we didn't call close() on file object f, the context manager closes it after exiting the with block:

In [3]: f
Out[3]: <closed file 'writeable.txt', mode 'w' at 0x1382770>

In [4]: f.write("this won't work")

ValueError                                Traceback (most recent call last)

/Users/jmjones/<ipython console> in <module>()

ValueError: I/O operation on closed file

As we expected, the file object is closed. While it is a good practice to handle possible exceptions and make sure your file objects are closed when you expect them to be, for the sake of simplicity and clarity, we will not do so for all examples.
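The protocol that file objects follow can be seen more directly with a hand-written context manager. The Tracker class below is hypothetical and exists only to record the order in which its methods are called; it is a sketch of the general mechanism, not of how file objects are implemented internally.

```python
class Tracker(object):
    """Hypothetical context manager that records the calls made to it."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # this value is bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False  # a false value lets any exception propagate


tracker = Tracker()
try:
    with tracker as t:
        t.events.append("body")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(tracker.events)
```

Even though the body raises, "exit" is still recorded, which mirrors how a file object gets closed when a with block fails.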

For a complete list of the methods available on file objects, see the File Objects section of Python Library Reference at http://docs.python.org/lib/bltin-file-objects.html.

Reading files

Once you have a readable file object, which you opened with the r flag, there are three common file methods that will prove useful for getting data contained in the file: read(), readline(), and readlines(). read(), not surprisingly, reads data from an open file object and returns the bytes it has read as a string object. read() takes an optional bytes parameter, which specifies the number of bytes to read. If no bytes parameter is given, read() reads to the end of the file. If more bytes are requested than the file contains, read() reads until the end of the file and returns the bytes it has read.

Given the following file:

~/some_random_directory$ cat foo.txt
Some Random Lines

Text.

read() works on a file like this:

In [1]: f = open("foo.txt", "r")

Notice that the newlines are shown as a \n character sequence; that is the standard way to refer to a newline.

And if we only wanted the first 5 bytes of the file, we could do something like this:

In [1]: f = open("foo.txt", "r")
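As a minimal sketch of both forms of read(), the code below creates its own small file and reads from it with and without a size. The file name sample.txt and its contents are assumptions for the example, not the foo.txt used in the text. It is written for Python 3, where a file opened in text mode counts characters rather than bytes; for plain ASCII text the two are the same.

```python
import os
import tempfile

# Hypothetical sample file, created here so the example is self-contained.
text = "Some Random\nLines\nOf\nText.\n"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as out:
    out.write(text)

f = open(path, "r")
first_five = f.read(5)   # at most 5 characters
rest = f.read()          # everything from the current position to the end
at_eof = f.read()        # nothing is left, so this is an empty string
f.close()

f = open(path, "r")
oversized = f.read(1000)  # asking for more than exists returns what is there
f.close()

print(repr(first_five))
print(repr(rest))
```

An empty string from read() is the usual signal that the end of the file has been reached.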

The next method for getting text from a file is the readline() method. The purpose of readline() is to read one line of text at a time from a file. readline() takes one optional parameter: size. Size specifies the maximum number of bytes that readline() will read before returning a string, whether it has reached the end of the line or not. So, in the following example, the program will read the first line of text from foo.txt, and then it will read the first 7 bytes of text from the second line, followed by the remainder of the second line:

In [1]: f = open("foo.txt", "r")
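The same sequence of calls can be sketched against a small file created on the spot. The two-line contents and the name lines.txt are assumptions chosen so that the split point is easy to see; the code targets Python 3.

```python
import os
import tempfile

# Hypothetical two-line file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as out:
    out.write("Some Random\nLines Of Text\n")

f = open(path, "r")
line_one = f.readline()     # the whole first line, newline included
partial = f.readline(7)     # only the first 7 characters of the second line
remainder = f.readline()    # the rest of the second line
at_eof = f.readline()       # empty string once the file is exhausted
f.close()

print(repr(line_one))
print(repr(partial))
print(repr(remainder))
```

Note that a size-limited readline() can return a string with no trailing newline, so code that relies on the newline should check for it.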

The final file method that we will discuss for getting text out of a file is readlines(). readlines() is not a typo, nor is it a cut-and-paste error from the previous example. readlines() reads in all of the lines in a file. Well, that is almost true. readlines() has a sizehint option that specifies the approximate total number of bytes to read in. In the following example, we created a file, biglines.txt, that contains 10,000 lines, each of which contains 80 characters. We then open the file, ask for the first lines of the file totaling about 1,024 bytes, check how many lines and bytes were actually read, and then read the rest of the lines in the file:

In [1]: f = open("biglines.txt", "r")

Out[7]: 801738

Command [3] shows that we read 102 lines and command [4] shows that those lines totaled 8,262 bytes. How is 1,024 the "approximate" number of bytes read if the actual number of bytes read was 8,262? It rounded up to the internal buffer size, which is about 8 KB. The point is that sizehint does not always do what you think it might, so it's something to keep in mind.
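A self-contained way to try this is sketched below. It builds a file shaped like the one described (10,000 lines of 80 characters) in a temporary directory, which is an assumption of the example. The exact number of lines returned for a given size hint depends on the Python version and its buffering, so the sketch reports the counts rather than assuming the figures quoted above. It is written for Python 3.

```python
import os
import tempfile

# Build a hypothetical biglines.txt: 10,000 lines of 80 characters each.
path = os.path.join(tempfile.mkdtemp(), "biglines.txt")
with open(path, "w") as out:
    for _ in range(10000):
        out.write("x" * 80 + "\n")

f = open(path, "r")
first_batch = f.readlines(1024)   # the size hint is approximate
remaining = f.readlines()         # no hint: read everything that is left
f.close()

batch_chars = sum(len(line) for line in first_batch)
print(len(first_batch), batch_chars)
print(len(remaining))
```

Whatever the first call returns, the second call picks up where it stopped, so the two lists together always account for every line in the file.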

Writing files

Sometimes you have to do something with files other than just reading data in from them; sometimes you have to create your own file and write data out to it. There are two common file methods that you will need to know in order to write data to files. The first method, which was demonstrated earlier, is write(). write() takes one parameter: the string to write to the file. Here is an example of data being written to a file using the write() method:

In [1]: f = open("some_writable_file.txt", "w")

In [4]: g = open("some_writable_file.txt", "r")

In command [1], we opened the file with the w mode flag, which means writable. Command [2] writes two lines to the file. In command [4], we are using the variable name g this time for the file object to cut down on confusion, although we could have used f again. And command [5] shows that the data we wrote to the file is the same as what comes out when we read() it again.
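The full round trip can be sketched in a few lines. The two-line text written here and the temporary directory are assumptions for the example, and the code is written for Python 3.

```python
import os
import tempfile

# Hypothetical location for the file used in this sketch.
path = os.path.join(tempfile.mkdtemp(), "some_writable_file.txt")

f = open(path, "w")
f.write("Test\nFile\n")   # one call, two lines of text
f.close()                 # closing flushes the data to disk

g = open(path, "r")
data = g.read()
g.close()

print(repr(data))
```

Closing f before opening g matters: until the file is closed or flushed, some of the written data may still be sitting in a buffer.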

The next common data writing method is writelines(). writelines() takes one mandatory parameter: a sequence that writelines() will write to the open file. The sequence can be any type of iterable object, such as a list, a tuple, the result of a list comprehension (which is a list), or a generator. Here is an example of a generator expression used with writelines() to write data to a file:

In [1]: f = open("writelines outfile.txt", "w")

In [2]: f.writelines("%s\n" % i for i in range(10))

In [3]: f.close()

In [4]: g = open("writelines outfile.txt", "r")

In [5]: g.read()
Out[5]: '0\n1\n2\n3\n4\n5\n6\n7\n8\n9\n'

And here is an example of a generator function being used to write data to a file (this is functionally equivalent to the previous example, but it uses more code):

f = open("writelines_generator_function_outfile", "w")

f.writelines(myRange(10))

g = open("writelines_generator_function_outfile", "r")
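The definition of myRange() does not appear in the lines above. One plausible version is sketched here; its body is an assumption, written so that it yields the same strings as the generator expression in the previous example, and it should not be read as the authors' original code. The temporary directory is likewise an assumption, and the code targets Python 3.

```python
import os
import tempfile


def myRange(r):
    """Hypothetical generator: yield "0\\n", "1\\n", ... up to r - 1."""
    i = 0
    while i < r:
        yield "%s\n" % i
        i += 1


path = os.path.join(tempfile.mkdtemp(), "writelines_generator_function_outfile")

f = open(path, "w")
f.writelines(myRange(10))   # writelines() consumes the generator lazily
f.close()

g = open(path, "r")
result = g.read()
g.close()

print(repr(result))
```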

It is important to note that writelines() does not write a newline (\n) for you; you have to supply the \n in the sequence you pass in to it. It's also important to know you don't have to use it only to write line-based information to your file. Perhaps a better name would have been something like writeiter(). In the previous examples, we happened to write text that had a newline, but there is no reason that we had to.
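A short sketch makes the point about newlines concrete: when the strings carry no \n, writelines() simply writes them back to back. The word list and the file name are assumptions for the example, which is written for Python 3.

```python
import os
import tempfile

# Hypothetical output file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "no_newlines.txt")

f = open(path, "w")
f.writelines(["alpha", "beta", "gamma"])   # no separators are added
f.close()

g = open(path, "r")
joined = g.read()
g.close()

print(repr(joined))
```

If one item per line is wanted, the newline has to be part of each string, for example by passing a generator expression such as (word + "\n" for word in words).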

Additional resources

For more information on file objects, please see Chapter 7 of Learning Python by David Ascher and Mark Lutz (O'Reilly) (also online in Safari at http://safari.oreilly.com/0596002815/lpython2-chp-7-sect-2) or the File Objects section of the Python Library Reference (available online at http://docs.python.org/lib/bltin-file-objects.html).

For more information on generator expressions, please see the "generator expressions" section of the Python Reference Manual (available online at http://docs.python.org/ref/genexpr.html). For more information on the yield statement, see the "yield statement" section of the Python Reference Manual (available online at http://docs.python.org/ref/yield.html).
