Chapter Persistence and Databases

Introduction

Recipe 7.1. Serializing Data Using the marshal Module

Recipe 7.2. Serializing Data Using the pickle and cPickle Modules

Recipe 7.3. Using Compression with Pickling

Recipe 7.4. Using the cPickle Module on Classes and Instances

Recipe 7.5. Holding Bound Methods in a Picklable Way

Recipe 7.6. Pickling Code Objects

Recipe 7.7. Mutating Objects with shelve

Recipe 7.8. Using the Berkeley DB Database

Recipe 7.9. Accesssing a MySQL Database

Recipe 7.10. Storing a BLOB in a MySQL Database

Recipe 7.11. Storing a BLOB in a PostgreSQL Database

Recipe 7.12. Storing a BLOB in a SQLite Database

Recipe 7.13. Generating a Dictionary Mapping Field Names to Column Numbers

Recipe 7.14. Using dtuple for Flexible Accessto Query Results

Recipe 7.15. Pretty-Printing the Contents of Database Cursors

Recipe 7.16. Using a Single Parameter-Passing Style Across Various DB API Modules

Recipe 7.17. Using Microsoft Jet via ADO

Recipe 7.18. Accessing a JDBC Database from a Jython Servlet

Recipe 7.19. Using ODBC to Get Excel Data with Jython

Team LiB ^ previous next t

Credit: Aaron Watters, Software Consultant

There are three kinds of people in this world: those who can count and those who can't.

However, there are only two kinds of computer programs: toy programs and programs that interact with some kind of persistent databases. That is to say, most real computer programs must retrieve stored information and record information for future use. These days, this description applies to almost every computer game, which can typically save and restore the state of the game at any time. So when I refer to toy programs, I mean programs written as exercises, or for the fun of programming. Nearly all real programs (such as programs that people get paid to write) have some persistent database storage/retrieval component.

When I was a Fortran programmer in the 1980s, I noticed that although almost every program had to retrieve and store information, they almost always did it using home-grown methods. Furthermore, since the storage and retrieval parts of the program were the least interesting components from the programmer's point of view, these parts of the program were frequently implemented very sloppily and were hideous sources of intractable bugs. This repeated observation convinced me that the study and implementation of database systems sat at the core of programming pragmatics, and that the state of the art as I saw it then required much improvement.

Later, in graduate school, I was delighted to find an impressive and sophisticated body of work relating to the implementation of database systems. The literature of database systems covered issues of concurrency, fault tolerance, distribution, query optimization, database design, and transaction semantics, among others. In typical academic fashion, many of the concepts had been elaborated to the point of absurdity (such as the silly notion of conditional multivalued dependencies), but much of the work was directly related to the practical implementation of reliable and efficient storage and retrieval systems. The starting point for much of this work was E.F. Codd's seminal paper, "A Relational Model of Data for Large Shared Data Banks."!!!

[11 E.F. Codd, "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, 13, no. 6 (1970), pp. 377-87, http://www.acm.org/classics/nov95/toc.html.

Among my fellow graduate students, and even among most of the faculty, the same body of knowledge was either disregarded or regarded with some scorn. Everyone recognized that knowledge of conventional relational technology could be lucrative, but they generally considered such knowledge to be on the same level as knowing how to write (or more importantly, maintain) COBOL programs. This situation was not helped by the fact that the emerging database interface standard, SQL (which is now very well established), looked like an extension of COBOL and bore little obvious relationship to any modern programming language.

More than a decade later, there is little indication that anything will soon overtake SQL-based relational technology for the majority of data-based applications. In fact, relational-database technology seems more pervasive than ever. The largest software vendorsIBM, Microsoft, and Oracleall provide various relational-database implementations as crucial components of their core offerings. Other large software firms, such as SAP and PeopleSoft, essentially provide layers of software built on top of a relational-database core.

Generally, relational databases have been augmented rather than replaced. Enterprise softwareengineering dogma frequently espouses three-tier systems, in which the bottom tier is a carefully designed relational database, the middle tier defines a view of the database as business objects, and the top tier consists of applications or transactions that manipulate the business objects, with effects that ultimately translate to changes in the underlying relational tables.

Microsoft's Open Database Connectivity (ODBC) standard provides a common programming API

for SQL-based relational databases that permits programs to interact with many different database engines with no or few changes. For example, a Python program could be first implemented using Microsoft Jetl^l as a backend database for testing and debugging purposes. Once the program is stable, it can be put into production use, remotely accessing, say, a backend DB2 database on an IBM mainframe residing on another continent, by changing (at most) one line of code.

[2] Microsoft Jet is commonly but erroneously known as the "Microsoft Access database." Access is a product that Microsoft sells for designing and implementing database frontends; Jet is a backend that you may download for free from Microsoft's web site.

Relational databases are not appropriate for all applications. In particular, a computer game or engineering design tool that must save and restore sessions should probably use a more direct method of persisting the logical objects of the program than the flat tabular representation encouraged in relational-database design. However, even in domains such as engineering or scientific information, a hybrid approach that uses some relational methods is often advisable. For example, I have seen a complex relational-database schema for archiving genetic-sequencing informationin which the sequences show up as binary large objects (BLOBs)but a tremendous amount of important ancillary information can fit nicely into relational tables. But as the reader has probably surmised, I fear, I speak as a relational zealot.

Within the Python world there are many ways of providing persistence and database functionality. My personal favorite is Gadfly, http://gadfly.sourceforge.net/, a simple and minimal SQL implementation that works primarily with in-memory databases. It is my favorite for no other reason than because it is mine, and its biggest advantage is that, if it becomes unworkable for you, it is easy to switch over to another, industrial-strength SQL engine. Many Gadfly users have started an application with Gadfly (because it was easy to use) and switched later (because they needed more).

However, many people may prefer to start by using other SQL implementations such as MySQL, Microsoft Access, Oracle, Sybase, Microsoft SQL Server, SQLite, or others that provide the advantages of an ODBC interface (which Gadfly does not do).

Python provides a standard interface for accessing relational databases: the Python DB Application Programming Interface (Py-DBAPI), originally designed by Greg Stein. Each underlying database API requires a wrapper implementation of the Py-DBAPI, and implementations are available for just about all underlying database interfaces, notably Oracle and ODBC.

When the relational approach is overkill, Python provides built-in facilities for storing and retrieving data. At the most basic level, the programmer can manipulate files directly, as covered in Chapter 2. A step up from files, the marshal module allows programs to serialize data structures constructed from simple Python types (not including, e.g., classes or class instances). marshal has the advantage of being able to retrieve large data structures with blinding speed. The pickle and cPickle modules allow general storage of objects, including classes, class instances, and circular structures. cPickle is so named because it is implemented in C and is consequently quite fast, but it remains slower than marshal. For access to structured data in a somewhat human-readable form, it is also worth considering storing and retrieving data in XML format (taking advantage of Python's several XML parsing and generation modules), covered in Chapter 12but this option works best for write once, read many-type applications. Serialized data or XML representations may be stored in SQL databases to create a hybrid approach as well.

While marshal and pickle provide basic serialization and deserialization of structures, the application programmer will frequently desire more functionality, such as transaction support and concurrency control. When the relational model doesn't fit the application, a direct object database implementation such as the Z-Object Database (ZODB) might be appropriatesee http://zope.org/Products/ZODB3.2.

I must conclude with a plea to those who are dismissive of relational-database technology. Remember that it is successful for good reasons, and it might be worth considering. To paraphrase Churchill:

text = """ Indeed, it has been said that democracy is the worst form of government, except for all those others that have been tried from time to time. """ import string for a, b in [("democracy", "SQL"), ("government", "database")]:

text = string.replace(text, a, b) print text

Team LiB

^ previous next ^

Credit: Luther Blissett

Was this article helpful?

0 0

Post a comment