Traversing Large Data Files

When I have to read and process large amounts of data, I cannot use the simplistic approach of loading everything into memory first and processing it afterward. And I will most definitely be dealing with large volumes of data here. Your situation may differ, but a busy system can easily generate gigabytes of log data every hour. Obviously, all of that data cannot be loaded into memory at once.

The solution to this problem is to use generators. A generator function lets you produce output (such as the lines read from a file) without loading the whole file into memory. If all you need is to read the file line by line, you don't need to wrap the readline() function at all, because a file object is itself an iterator over its lines; you can simply write:

# Iterating over the file object yields one line at a time,
# so the whole file is never held in memory.
with open('file.txt', 'r') as f:
    for line in f:
        print(line, end='')

However, if you need to manipulate the file data and use the results, it is a good idea to write your own generator function that performs the required calculations and yields the results. For example, you might want to write a generator that searches for a particular string in the file and produces each matching line along with a few lines before and a few lines after it, much as grep -C does. This is where generators come in handy; a sketch of such a generator follows.
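The following is a minimal sketch of that idea, not code from the original text: the function name search_with_context, its parameters, and the deque-based buffering are all my own illustrative choices. The rolling deque keeps only the last few lines in memory, so the file is still processed one line at a time.

import collections

def search_with_context(filename, pattern, context=2):
    """Yield each line containing `pattern`, plus up to
    `context` lines before and after every match."""
    before = collections.deque(maxlen=context)  # rolling buffer of preceding lines
    trailing = 0  # trailing context lines still owed after the last match
    with open(filename, 'r') as f:
        for line in f:
            if pattern in line:
                # Flush the buffered preceding lines, then the match itself.
                for buffered in before:
                    yield buffered
                before.clear()
                yield line
                trailing = context
            elif trailing > 0:
                # Still within the trailing context of the last match.
                yield line
                trailing -= 1
            else:
                before.append(line)

Because the function is a generator, the caller can start consuming matches immediately and stop early without reading the rest of the file. For example (the file name and search string here are only placeholders):

for line in search_with_context('file.txt', 'ERROR'):
    print(line, end='')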
