How It Works

This is where document management starts to get interesting and, as always, more complicated. This example saves a single document (using the code from the first example) into the document management system, does some simple searching, and then retrieves the document. There is a lot more detail along the way than in the previous examples, so let's look at everything in turn, starting at the top.

First, notice that the docs list is more complex than anything you've seen so far. In addition to having a field of type 'content' to store the document outside the main object, it also defines an index. The index, as explained previously, is a way of having your cake and eating it, too. It enables you to have a simple, easy-to-search table in MySQL that is always synchronized with the main list storage implemented as a directory containing XML files and attachment documents. Whenever a change is made to this list, the change is executed for both the main list storage and the MySQL table that mirrors it.

Here, the index uses a from attribute on several of its fields to extract information about the main attachment. The attachment field is named "content", so "content.size", for example, refers to the "size" attribute of the final field XML (you can scan down to the output to see the structure of that field; we'll come back to it later). This means that you can build a simple SQL query with a WHERE clause to find particular objects recorded in the repository, such as only those of a particular size, those created or edited before or after particular dates, or those created or edited by particular users. The attachment process always saves all of this information, so you know that it will be available to query.
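As a rough illustration of the kind of query the index makes possible, the sketch below builds a stand-in for the docindex table and filters it with WHERE clauses. It uses Python's built-in sqlite3 module instead of MySQL so that it runs without a server, the column names are assumptions taken from the fields that appear in the example output, and find_keys is a hypothetical helper rather than part of the wftk.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL index table (an assumption
# made so the sketch runs without a database server).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE docindex (
           id INTEGER PRIMARY KEY AUTOINCREMENT,
           title TEXT, created_by TEXT, created_on TEXT, size INTEGER)"""
)
rows = [
    ("Code file simple.py", "me", "2005-03-19 20:16:27", 272),
    ("Meeting notes", "you", "2005-03-21 09:00:00", 4096),
    ("Design sketch", "me", "2005-03-25 14:30:00", 10240),
]
conn.executemany(
    "INSERT INTO docindex (title, created_by, created_on, size) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

def find_keys(where, params=()):
    """Hypothetical helper: keys of all indexed documents matching a WHERE clause."""
    sql = "SELECT id FROM docindex WHERE " + where + " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]

by_me = find_keys("created_by = ?", ("me",))
large = find_keys("size > ?", (1000,))
recent_by_me = find_keys(
    "created_by = ? AND created_on >= ?", ("me", "2005-03-20")
)
print(by_me, large, recent_by_me)
```

Because the dates are stored as text in year-month-day order, a plain string comparison is enough to filter by date in this sketch.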

Note that the storage setup for this simple document management system, although it may encounter issues in scaling, can easily be a test bed for a more industrial-strength version at some later date.

For instance, you could easily replace the back-end with an Oracle database, storing documents in a BLOB field, and continue to use any Python scripts you'd written against the test system with no changes at all. The same applies to whatever workflow you define to be used with this system. Moreover, if you move to a commercial document management system, at most you would have to write an adaptor module to interface the wftk to the new storage system, and continue to use the same scripts and workflow.

The second feature of this new repository definition, and something this book hasn't covered yet, is that it contains a user list. This user list is very simple and obviously isn't built with security in mind; a real system would need stronger security. For demonstration purposes, however, and in other limited circumstances, this can be a valid solution.

This list uses "here" storage, meaning it's a read-only list that is defined and stored right in the repository definition. It defines two users, me and you. You need a user list because the next step registers a user before attaching documents, so that the attaching user is recorded and can be used as a search parameter. Most real environments need this.

Moving along to the SQL definition of the docindex table, note that the primary key of the table has an auto_increment attribute. This is a nifty MySQL feature that assigns a unique key to a record when it's saved. When the key field is defined with the attribute keygen set to "auto", the wftk first saves the record to the index, asks MySQL what key it was assigned, and then modifies the record in the main storage to match.
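The three-step sequence described above (save to the index, ask for the assigned key, update main storage) can be sketched as follows. This is only an illustration under stated assumptions: sqlite3 stands in for MySQL, a dictionary stands in for the directory of XML files, and save_record is a hypothetical helper, not a wftk function.

```python
import sqlite3

# SQLite stands in for the MySQL index (assumption for a runnable sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docindex (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)"
)

main_storage = {}  # stands in for the main list storage on disk

def save_record(record):
    """Hypothetical helper: index first, then store under the assigned key."""
    # 1. Save to the index without a key; the database picks one.
    cur = conn.execute(
        "INSERT INTO docindex (title) VALUES (?)", (record["title"],)
    )
    # 2. Ask the database which key it assigned.
    key = cur.lastrowid
    # 3. Write the record to main storage with the matching key.
    stored = dict(record, id=key)
    main_storage[key] = stored
    conn.commit()
    return stored

first = save_record({"title": "Code file simple.py"})
second = save_record({"title": "Another document"})
print(first["id"], second["id"])
```

Letting the database generate the key means that two records saved at nearly the same time can never be given the same id.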

Now take a look at the code. There are several new features to study here, the first being the call to user_auth to assign a user to the current session. The currently assigned user has little visible effect, but it lets workflow, permissions, and the attachment process note who is taking action. You'll come back to user authorization when you look at more advanced workflow later.

The document object is created and saved in the same way you've created and saved your objects so far, but now you also attach a file to it. Note that attachments are named by field, and that you can attach arbitrary numbers and types of documents to an object. Objects don't all have to be predefined in the list definition.

Because you've already defined an attachment field and have indexed some of its attributes, only attachments with that name will affect the index.

You aren't restricted to attaching the contents of files, though. The attach method can specify an arbitrary string to be written to the attachment. In Python, this gives you a lot of flexibility because you could even attach a pickled object that you wanted to be run on the object when it's retrieved!
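To make the pickling idea concrete, here is a minimal sketch of turning an object into a string suitable for attaching, and of rebuilding it afterward. It deliberately stops short of calling the attach method, and the settings dictionary is purely a made-up example.

```python
import pickle

# A plain dictionary stands in for whatever object you want to attach.
settings = {"retain_days": 4, "owner": "me", "tags": ["code", "sample"]}

# Serialize to a byte string; this is the value you would hand to an
# attachment call in place of the contents of a file.
payload = pickle.dumps(settings)

# Later, after retrieving the attachment, rebuild the original object.
restored = pickle.loads(payload)
print(restored == settings)
```

Unpickling can execute arbitrary code, so only load pickled attachments that come from a repository you trust.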

When the file is attached, things get interesting. Because this list has MySQL behind it, you can use the power of SQL to search on it. The next few lines build a special list object to perform the query on, and then call the query with the text of an SQL WHERE clause. After the query is done, there is a little manipulation of the data, and then you can retrieve the attachment again and print it as a demonstration.

Looking at the output from this example, you first see three versions of the object: as it's built, after it's saved, and after the file is attached. Remember that after it has been saved, it is given an id field. MySQL writes a unique key to this field automatically because of the auto_increment option that was set when the table was defined.

After attachment happens, you can see that a new field, the content field, has been written to the object. This field doesn't store the attachment itself, but rather specifies where the adapter can retrieve it when you do want it. Attached files can be of any size, so loading a large file every time the object is used, whether or not you need it, would slow your programs down a lot.
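The idea of storing a pointer to the attachment and reading the content only on request can be sketched like this. AttachmentRef is a hypothetical class written for illustration, not the wftk's own descriptor, and the file is created in a temporary directory just so the example runs.

```python
import os
import tempfile

class AttachmentRef:
    """Hypothetical descriptor: remembers where the content lives."""

    def __init__(self, directory, location, size):
        self.directory = directory
        self.location = location
        self.size = size
        self.reads = 0  # counts how often the file is actually opened

    def load(self):
        """Read the attachment from disk only when it is asked for."""
        self.reads += 1
        with open(os.path.join(self.directory, self.location), "rb") as f:
            return f.read()

# Demo: write a small attachment file, then refer to it from a record.
storage_dir = tempfile.mkdtemp()
data = b"import wftk\n"
with open(os.path.join(storage_dir, "_att_1_content_.dat"), "wb") as f:
    f.write(data)

record = {
    "title": "Code file simple.py",
    "content": AttachmentRef(storage_dir, "_att_1_content_.dat", len(data)),
}

# Working with the record's metadata does not touch the file...
title = record["title"]
size = record["content"].size
reads_before = record["content"].reads
# ...the content is read only on an explicit request.
content = record["content"].load()
print(title, size, reads_before, record["content"].reads)
```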

The descriptor field for an attachment is decorated with various useful attributes that the wftk records during the attachment process: the times at which the document was created and edited, the users responsible, and the size of the file itself.

This is the data that you abstract out for the MySQL index summary, and you'll see it and use it again later in the output. You can also see it by running queries in your MySQL client, for instance SELECT * FROM docindex.
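If you want to pick these attributes out of a descriptor yourself, Python's standard xml.etree.ElementTree module is enough. The sketch below parses the content field from the example output and collects the attributes that the index mirrors; the choice of which attributes to collect is an assumption based on the fields shown in the query results.

```python
import xml.etree.ElementTree as ET

# The attachment descriptor as it appears in the example output.
descriptor = (
    '<field id="content" type="document" '
    'created_on="2005-03-19 20:16:27" edited_on="2005-03-19 20:16:27" '
    'created_by="me" edited_by="me" size="272" mimetype="" '
    'location="_att_1_content_.dat"/>'
)

field = ET.fromstring(descriptor)

# Pull out the attributes that the index stores as separate columns.
index_columns = ["created_by", "created_on", "edited_by", "edited_on", "size"]
summary = {name: field.get(name) for name in index_columns}
summary["size"] = int(summary["size"])  # XML attribute values are strings

print(summary)
```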

After the file is attached to the object, the results of the query are returned. The query searches for objects created by user "me," so if you run this example code several times, the query retrieves every one of the objects you created. That becomes more useful later, when you are looking for multiple results. Of course, you can easily modify the attachment code to do something else, and the results of this query will change accordingly.

Here is the result of running the script:

C:\projects\simple>python docs.py
BEFORE SAVING:
<field id="title">Code file simple.py</field>
<field id="descr">This script demonstrates wftk reads.</field>
</rec>

AFTER SAVING:
<rec list="docs" key="1">
<field id="title">Code file simple.py</field>
<field id="descr">This script demonstrates wftk reads.</field>

AFTER ATTACHMENT:
<rec list="docs" key="1">
<field id="title">Code file simple.py</field>
<field id="descr">This script demonstrates wftk reads.</field>
<field id="id">1</field>
<field id="content" type="document" created_on="2005-03-19 20:16:27" edited_on="2005-03-19 20:16:27" created_by="me" edited_by="me" size="272" mimetype="" location="_att_1_content_.dat"/>
</rec>

<field id="created_by">me</field>
<field id="created_on">2005-03-19 20:16:27</field>
<field id="edited_by">me</field>
<field id="edited_on">2005-03-19 20:16:27</field>
<field id="size">272</field>
<field id="title">Code file simple.py</field>
<field id="descr">This script demonstrates wftk reads.</field>

import wftk
repos = wftk.repository('site.opm')
<field id="field2">this is a test value</field> </rec>
print "BEFORE SAVING:"
print "AFTER SAVING:"
print e
print
l = repos.list('simple')
print l

The results of the query show first the list of keys returned by the query, and then an example record after it has been returned. Note that these returned records come from MySQL, so they have a different structure from the records actually saved to the main list storage. Specifically, you can see that the attachment-specific fields such as size and created_on have been stored as separate fields in the database and that they remain separate fields here in the XML output.

Finally, the output dumps the content of the attachment, which is just the saved code from the first sample.

There are now a hundred different things you could do to make this script serve a specific, useful purpose in your work. One of those is to manage your stored documents in a document retention framework, so let's look at that.

Try It Out: A Document Retention Framework

Ready to get your feet wet with something really useful? Try putting together a simple document retention manager. You already know nearly everything you need from the preceding examples; all that's needed is to roll it all up into one program. As noted, you shouldn't be terribly worried about the lack of scalability of this test system; you can easily swap out the back-end for applications with more heavy-duty performance requirements.

This example assumes that you've worked through the last one and that you have a few documents already stored in the docs list. If you didn't define a docs list, this example isn't going to work. Of course, even if you did define a docs list, its contents are going to be pretty bland if you haven't modified the example code, but you'll still get a feel for how this works.

1. Using the text editor of your choice, open the file system.defn in the example repository from the previous examples and add a new list for storing the retention rules you'll be defining:

<repository loglevel="6">

<list id="rules" storage="mysql:wftk" table="rules" order="sort">
  <field id="id"/>
  <field id="sort"/>
  <field id="name"/>
  <field id="rule"/>
</list>

Make sure you don't change the rest of the repository definition!

2. Start your MySQL client and add the rule table and a couple of rules, like this:

mysql> create table rules (
    -> id int(11) not null primary key auto_increment,
    -> sort int(11),
    -> name text,
    -> rule text);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into rules (sort, name, rule) values (1, 'files by me', "created_by='me' and to_days(now()) - to_days(edited_on) > 4");
Query OK, 1 row affected (0.01 sec)

mysql> insert into rules (sort, name, rule) values (0, 'files by you', "created_by='you' and to_days(now()) - to_days(edited_on) > 3");
Query OK, 1 row affected (0.01 sec)

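3. Create a new script called trash.py and enter the following code:

import wftk
repos = wftk.repository()

# Query the rules list with an empty WHERE clause to fetch every rule.
rlist = wftk.list (repos, 'rules')
rlist.query ("")

# The loops assume the list object exposes its result keys through keys().
for r in rlist.keys():
    rule = wftk.xmlobj(xml=rlist[r])
    print "Running rule '%s'" % rule['name']

    # Each rule's text is a WHERE clause to run against the docs index.
    docs = wftk.list (repos, 'docs')
    docs.query (rule['rule'])
    for d in docs.keys():
        print " -- Deleting document %s" % d
        doc = repos.get ('docs', d)
        doc.delete()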

4. Run it (naturally, your output will look different from this):

C:\projects\simple>python trash.py
Running rule 'files by you'
Running rule 'files by me'
 -- Deleting document 2
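If you'd like to experiment with the rule-driven approach without the wftk or MySQL installed, the following sketch reproduces the same loop with Python's built-in sqlite3 module. Treat it as an approximation under several assumptions: the table layouts and sample rows are invented for the demonstration, SQLite's julianday() and date() functions take the place of MySQL's to_days(), and a fixed NOW value replaces now() so that the result is repeatable.

```python
import sqlite3

# SQLite stands in for MySQL; tables and rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE docindex (id INTEGER PRIMARY KEY, created_by TEXT, edited_on TEXT);
    CREATE TABLE rules (id INTEGER PRIMARY KEY, sort INTEGER, name TEXT, rule TEXT);
    INSERT INTO docindex VALUES (1, 'me',  '2005-03-24 10:00:00');
    INSERT INTO docindex VALUES (2, 'me',  '2005-03-19 20:16:27');
    INSERT INTO docindex VALUES (3, 'you', '2005-03-23 08:00:00');
    """
)

# Fixed "current" time so the example is reproducible (assumption).
NOW = "2005-03-25 12:00:00"
# Whole days between NOW and edited_on, mimicking to_days(now()) - to_days(...).
age = "julianday(date('%s')) - julianday(date(edited_on))" % NOW

rules = [
    (1, "files by me", "created_by='me' AND %s > 4" % age),
    (0, "files by you", "created_by='you' AND %s > 3" % age),
]
conn.executemany("INSERT INTO rules (sort, name, rule) VALUES (?, ?, ?)", rules)

log = []
deleted = []
rule_rows = conn.execute("SELECT name, rule FROM rules ORDER BY sort").fetchall()
for name, rule in rule_rows:
    log.append("Running rule '%s'" % name)
    # The rule text is used directly as the WHERE clause of the search.
    keys = [row[0] for row in conn.execute("SELECT id FROM docindex WHERE " + rule)]
    for key in keys:
        log.append(" -- Deleting document %s" % key)
        conn.execute("DELETE FROM docindex WHERE id = ?", (key,))
        deleted.append(key)

remaining = [row[0] for row in conn.execute("SELECT id FROM docindex ORDER BY id")]
print("\n".join(log))
```

Keeping the rules in a table rather than in the script means that a retention policy can be changed with a single SQL update, with no code changes. The rule text is executed as SQL, so it should only ever come from a trusted source.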
