XML Example Document

The following example illustrates a typical XML document, in this case a description of a recipe.

    <?xml version="1.0" encoding="iso-8859-1"?>
    <recipe>
      <title>
        Famous Guacamole
      </title>
      <description>
        A southwest favorite!
      </description>
      <ingredients>
        <item num="4"> Large avocados, chopped </item>
        <item num="1"> Tomato, chopped </item>
        <item num="1/2" units="C"> White onion, chopped </item>
        <item num="2" units="tbl"> Fresh squeezed lemon juice </item>
        <item num="1"> Jalapeno pepper, diced </item>
        <item num="1" units="tbl"> Fresh cilantro, minced </item>
        <item num="1" units="tbl"> Garlic, minced </item>
        <item num="3" units="tsp"> Salt </item>
        <item num="12" units="bottles"> Ice-cold beer </item>
      </ingredients>
      <directions>
        Combine all ingredients and hand whisk to desired consistency.
        Serve and enjoy with ice-cold beers.
      </directions>
    </recipe>

The document consists of elements that start and end with tags such as <title>...</title>. Elements are typically nested and organized into a hierarchy, for example, the <item> elements that appear under <ingredients>. Within each document, a single element is the document root. In the example, this is the <recipe> element. Elements optionally have attributes, as shown for the item elements: <item num="4">Large avocados, chopped</item>.

Working with XML documents typically involves all of these basic features. For example, you may want to extract text and attributes from specific element types. To locate elements, you have to navigate through the document hierarchy starting at the root element.

xml.dom.minidom

The xml.dom.minidom module provides basic support for parsing an XML document and storing it in memory as a tree structure according to the conventions of DOM. There are two parsing functions:

parse(file [, parser])

Parses the contents of file and returns a node representing the top of the document tree. file is a filename or an already-open file object. parser is an optional SAX2-compatible parser object that will be used to construct the tree. If omitted, a default parser will be used.

parseString(string [, parser])

The same as parse(), except that the input data is supplied in a string instead of a file.
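As a minimal sketch of the two parsing functions, the following assumes a tiny in-memory document in place of a file on disk. The XML text and the variable names are made up for illustration.

```python
import io
from xml.dom import minidom

xml_text = '<recipe><title>Famous Guacamole</title></recipe>'

# parseString() takes the XML data directly as a string
doc1 = minidom.parseString(xml_text)

# parse() takes a filename or an open file object; a BytesIO
# object stands in for a real file here
doc2 = minidom.parse(io.BytesIO(xml_text.encode('utf-8')))

# Both calls return a Document node at the top of the tree
root1 = doc1.documentElement.tagName
root2 = doc2.documentElement.tagName
print(root1, root2)
```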

Nodes

The document tree returned by the parsing functions consists of a collection of nodes linked together. Each node n has the following attributes which can be used to extract information and navigate through the tree structure:

Node Attribute      Description
n.attributes        Mapping object that holds attribute values (if any).
n.childNodes        A list of all child nodes of n.
n.firstChild        The first child of node n.
n.lastChild         The last child of node n.
n.localName         Local tag name of an element. If a colon appears in the tag (for example, '<foo:bar ...>'), then this only contains the part after the colon.
n.namespaceURI      Namespace associated with n, if any.
n.nextSibling       The node that appears after n in the tree and has the same parent. Is None if n is the last sibling.
n.nodeName          The name of the node. The meaning depends on the node type.
n.nodeType          Integer describing the node type. It is set to one of the following values, which are class variables of the Node class: ATTRIBUTE_NODE, CDATA_SECTION_NODE, COMMENT_NODE, DOCUMENT_FRAGMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, ELEMENT_NODE, ENTITY_NODE, ENTITY_REFERENCE_NODE, NOTATION_NODE, PROCESSING_INSTRUCTION_NODE, or TEXT_NODE.
n.nodeValue         The value of the node. The meaning depends on the node type.
n.parentNode        A reference to the parent node.
n.prefix            Part of a tag name that appears before a colon. For example, the element '<foo:bar ...>' would have a prefix of 'foo'.
n.previousSibling   The node that appears before n in the tree and has the same parent.
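The node attributes can be tried out on a small tree. This is only a sketch: the two-item document is invented, and it is written without whitespace between tags so that no extra Text nodes appear between the elements.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<ingredients>'
    '<item num="4">Avocados</item>'
    '<item num="1">Tomato</item>'
    '</ingredients>'
)
root = doc.documentElement

first = root.firstChild          # first <item> element
second = first.nextSibling       # second <item> element
text_node = first.firstChild     # Text node holding 'Avocados'

names = [child.nodeName for child in root.childNodes]
print(names)
print(first.nodeType == Node.ELEMENT_NODE)
print(first.attributes['num'].value)   # the mapping holds Attr nodes
print(text_node.nodeType == Node.TEXT_NODE, text_node.nodeValue)
print(second.nextSibling is None, first.parentNode is root)
```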

In addition to these attributes, all nodes have the following methods. Typically, these are used to manipulate the tree structure.

n.appendChild(child)

Adds a new child node, child, to n. The new child is added at the end of any other children.

n.cloneNode(deep)

Makes a copy of the node n. If deep is True, all child nodes are also cloned.

n.hasAttributes()

Returns True if the node has any attributes.

n.hasChildNodes()

Returns True if the node has any children.

n.insertBefore(newchild, ichild)

Inserts a new child, newchild, before another child, ichild. ichild must already be a child of n.

n.isSameNode(other)

Returns True if the node other refers to the same DOM node as n.

n.normalize()

Joins adjacent text nodes into a single text node.

n.removeChild(child)

Removes child child from n.

n.replaceChild(newchild, oldchild)

Replaces the child oldchild with newchild. oldchild must already be a child of n.
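The tree-manipulation methods can be sketched as follows. The document is invented, and new nodes are created with Document.createElement(), a standard DOM factory method that is not covered in this section.

```python
from xml.dom import minidom

doc = minidom.parseString('<list><a/><c/></list>')
root = doc.documentElement
a, c = root.childNodes

b = doc.createElement('b')
root.insertBefore(b, c)                    # children are now a, b, c
root.appendChild(doc.createElement('d'))   # added after all other children

copy = root.cloneNode(True)                # deep copy, children included
root.removeChild(a)                        # only the original loses <a/>

original_xml = root.toxml()
copy_xml = copy.toxml()
print(original_xml)
print(copy_xml)
```

Because the clone is made before removeChild() is called, the copy still contains the a element while the original no longer does.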

Although there are many different types of nodes that might appear in a tree, it is most common to work with Document, Element, and Text nodes. Each is briefly described next.

Document Nodes

A Document node d appears at the top of the entire document tree and represents the entire document as a whole. It has the following methods and attributes:

d.documentElement

Contains the root element of the entire document.

d.getElementsByTagName(tagname)

Searches all child nodes and returns a list of elements with a given tag name tagname.

d.getElementsByTagNameNS(namespaceuri, localname)

Searches all child nodes and returns a list of elements with a given namespace URI and local name. The returned list is an object of type NodeList.

Element Nodes

An Element node e represents a single XML element such as '<foo>...</foo>'. To get the text from an element, you need to look for Text nodes as children. The following attributes and methods are defined to get other information:

e.tagName

The tag name of the element. For example, if the element is defined by '<foo ...>', the tag name is 'foo'.

e.getElementsByTagName(tagname)

Returns a list of all children with a given tag name.

e.getElementsByTagNameNS(namespaceuri, localname)

Returns a list of all children with a given tag name in a namespace. namespaceuri and localname are strings that specify the namespace and tag name. If a namespace has been declared using a declaration such as '<foo xmlns:foo="http://www.spam.com/foo">', namespaceuri is set to 'http://www.spam.com/foo'. If searching for a subsequent element '<foo:bar>', localname is set to 'bar'. The returned object is of type NodeList.

e.hasAttribute(name)

Returns True if an element has an attribute with name name.

e.hasAttributeNS(namespaceuri, localname)

Returns True if an element has an attribute named by namespaceuri and localname. The arguments have the same meaning as described for getElementsByTagNameNS().

e.getAttribute(name)

Returns the value of attribute name. The return value is a string. If the attribute doesn't exist, an empty string is returned.

e.getAttributeNS(namespaceuri, localname)

Returns the value of the attribute named by namespaceuri and localname. The return value is a string. An empty string is returned if the attribute does not exist. The arguments are the same as described for getElementsByTagNameNS().
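A short sketch of the Element methods, using an invented document in which one bar element is namespace-qualified and one is not. The prefix s and the attribute size are assumptions made for the example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<foo xmlns:s="http://www.spam.com/foo">'
    '<s:bar size="3">one</s:bar><bar>two</bar>'
    '</foo>'
)
root = doc.documentElement

# Plain lookup compares against the tag name as written in the document
plain = root.getElementsByTagName('bar')
# Namespace-aware lookup compares (namespace URI, local name)
spam = root.getElementsByTagNameNS('http://www.spam.com/foo', 'bar')

elem = spam[0]
print(len(plain), len(spam))
print(elem.tagName, elem.localName)
print(elem.hasAttribute('size'), elem.getAttribute('size'))
print(repr(elem.getAttribute('missing')))   # empty string when absent
```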

Text Nodes

Text nodes are used to represent text data. Text data is stored in the t.data attribute of a Text object t. The text associated with a given document element is always stored in Text nodes that are children of the element.
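Since element text lives in child Text nodes, a small helper is often handy. The function element_text() below is a hypothetical name, not part of minidom, and it only looks at direct children.

```python
from xml.dom import minidom, Node

def element_text(elem):
    """Concatenate the data of all Text nodes directly under elem."""
    return ''.join(child.data for child in elem.childNodes
                   if child.nodeType == Node.TEXT_NODE)

doc = minidom.parseString(
    '<directions>Combine <b>all</b> ingredients.</directions>'
)
text = element_text(doc.documentElement)
print(repr(text))
```

The text inside the nested b element is not included, because it belongs to a Text node one level further down.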

Utility Functions

The following utility methods are defined on nodes. These are not part of the DOM standard, but are provided by Python for general convenience and for debugging.

n.toprettyxml([indent [, newl]])

Creates a nicely formatted string containing the XML represented by node n and its children. indent specifies an indentation string and defaults to a tab ('\t'). newl specifies the newline character and defaults to '\n'.

n.toxml([encoding])

Creates a string containing the XML represented by node n and its children. encoding specifies the encoding (for example, 'utf-8'). If no encoding is given, none is specified in the output text.

n.writexml(writer [, indent [, addindent [, newl]]])

Writes XML to writer. writer can be any object that provides a write() method that is compatible with the file interface. indent specifies the indentation of n. It is a string that is prepended to the start of node n in the output. addindent is a string that specifies the incremental indentation to apply to child nodes of n. newl specifies the newline character.
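The three output methods can be compared side by side. This sketch uses an invented document and an io.StringIO object as the writer; the exact whitespace produced by the pretty-printing methods may differ between Python versions.

```python
import io
from xml.dom import minidom

doc = minidom.parseString('<recipe><title>Guacamole</title></recipe>')
root = doc.documentElement

compact = root.toxml()                              # no added whitespace
pretty = root.toprettyxml(indent='  ', newl='\n')   # indented, one tag per line

# writexml() sends the same kind of output to any object with write()
buf = io.StringIO()
root.writexml(buf, indent='', addindent='  ', newl='\n')

print(compact)
print(pretty)
```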

DOM Example

The following example shows how to use the xml.dom.minidom module to parse and extract information from an XML file:

    from xml.dom import minidom

    doc = minidom.parse("recipe.xml")

    # Locate the single <ingredients> element, then the <item> elements under it
    ingredients = doc.getElementsByTagName("ingredients")[0]
    items = ingredients.getElementsByTagName("item")

    for item in items:
        num = item.getAttribute("num")
        units = item.getAttribute("units")      # '' if the attribute is missing
        text = item.firstChild.data.strip()     # text is held in a child Text node
        quantity = "%s %s" % (num, units)
        print("%-10s %s" % (quantity, text))

Note

The xml.dom.minidom module has many more features for changing the parse tree and working with different kinds of XML node types. More information can be found in the online documentation.

xml.etree.ElementTree

The xml.etree.ElementTree module defines a flexible container object ElementTree for storing and manipulating hierarchical data. Although this object is commonly used in conjunction with XML processing, it is actually quite general-purpose—serving a role that's a cross between a list and dictionary.

ElementTree objects

The following class is used to define a new ElementTree object and represents the top level of a hierarchy.

ElementTree([element [, file]])

Creates a new ElementTree object. element is an instance representing the root node of the tree. This instance supports the element interface described next. file is either a filename or a file-like object from which XML data will be read to populate the tree.

An instance tree of ElementTree has the following methods:

tree._setroot(element)

Sets the root element to element.

tree.find(path)

Finds and returns the first top-level element in the tree whose type matches the given path. path is a string that describes the element type and its location relative to other elements. The following list describes the path syntax:

Path            Description
'tag'           Matches only top-level elements with the given tag, for example, <tag>...</tag>. Does not match elements defined at lower levels. An element of type tag embedded inside another element such as <foo><tag>...</tag></foo> is not matched.
'parent/tag'    Matches an element with tag 'tag' if it's a child of an element with tag 'parent'. As many path name components can be specified as desired.
'*'             Selects all child elements. For example, '*/tag' would match all grandchild elements with a tag name of 'tag'.
'.'             Starts the search with the current node.
'//'            Selects all subelements on all levels beneath an element. For example, './/tag' matches all elements with tag 'tag' at all sublevels.

If you are working with a document involving XML namespaces, the tag strings in a path should have the form '{uri}tag' where uri is a string such as 'http://www.w3.org/TR/html4/'.
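The path forms can be checked against a small tree. The document below is a cut-down, invented version of the recipe, built with fromstring() so that no file is needed.

```python
from xml.etree.ElementTree import ElementTree, fromstring

root = fromstring(
    '<recipe>'
    '<title>Famous Guacamole</title>'
    '<ingredients>'
    '<item num="4">Avocados</item><item num="1">Tomato</item>'
    '</ingredients>'
    '</recipe>'
)
tree = ElementTree(root)

title = tree.find('title')                 # top-level element
nested = tree.find('ingredients/item')     # first parent/tag match
direct_item = tree.find('item')            # None: item is not top-level
grandchildren = tree.findall('*/item')     # items under any child element
anywhere = tree.findall('.//item')         # items at every level

nums = [e.get('num') for e in grandchildren]
print(title.text, nested.text, direct_item)
print(nums, len(anywhere))
```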

tree.findall(path)

Finds all top-level elements in the tree that match the given path and returns them in document order as a list or an iterator.

tree.findtext(path [, default])

Returns the element text for the first top-level element in the tree matching the given path. default is a string to return if no matching element can be found.

tree.getiterator([tag])

Creates an iterator that produces all elements in the tree, in section order, whose tag matches tag. If tag is omitted, then every element in the tree is returned in order.

tree.getroot()

Returns the root element for the tree.

tree.parse(source [, parser])

Parses external XML data and replaces the root element with the result. source is either a filename or file-like object representing XML data. parser is an optional instance of TreeBuilder, which is described later.

tree.write(file [, encoding])

Writes the entire contents of the tree to a file. file is either a filename or a file-like object opened for writing. encoding is the output encoding to use and defaults to the interpreter default encoding if not specified ('utf-8' or 'ascii' in most cases).
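A round trip through write() and parse() might look like this. An io.BytesIO object stands in for a file opened in binary mode, and the document is invented.

```python
import io
from xml.etree.ElementTree import ElementTree, fromstring

tree = ElementTree(fromstring('<recipe><title>Guacamole</title></recipe>'))

# Write the tree with an explicit output encoding
out = io.BytesIO()
tree.write(out, encoding='utf-8')
data = out.getvalue()

# Parsing the output again yields an equivalent tree
roundtrip = ElementTree()
roundtrip.parse(io.BytesIO(data))
title = roundtrip.findtext('title')
print(data)
print(roundtrip.getroot().tag, title)
```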

Creating Elements

The types of elements held in an ElementTree are represented by instances of varying types that are either created internally by parsing a file or with the following construction functions:

Comment([text])

Creates a new comment element. text is a string or byte string containing the element text. This element is mapped to XML comments when parsing or writing output.

Element(tag [, attrib [, **extra]])

Creates a new element. tag is the element name. For example, if you were creating an element '<foo> </foo>', tag would be 'foo'. attrib is a dictionary of element attributes specified as strings or byte strings. Any extra keyword arguments supplied in extra are also used to set element attributes.

fromstring(text)

Creates an element from a fragment of XML text in text—the same as XML() described next.

ProcessingInstruction(target [, text])

Creates a new element corresponding to a processing instruction. target and text are both strings or byte strings. When mapped to XML, this element corresponds to '<?target text?>'.

SubElement(parent, tag [, attrib [, **extra]])

The same as Element(), but it automatically adds the new element as a child of the element in parent.

XML(text)

Creates an element by parsing a fragment of XML code in text. For example, if you set text to '<foo> </foo>', this will create a standard element with a tag of 'foo'.

XMLID(text)

The same as XML(text) except that 'id' attributes are collected and used to build a dictionary mapping ID values to elements. Returns a tuple (elem, idmap) where elem is the new element and idmap is the ID mapping dictionary. For example, XMLID('<foo id="123"><bar id="456">Hello</bar></foo>') returns (<Element foo>, {'123': <Element foo>, '456': <Element bar>}).
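The construction functions can be combined to build a tree by hand. In this sketch the attribute names cuisine and servings are invented, and tostring() (described under the utility functions) is used only to display the result.

```python
from xml.etree.ElementTree import (
    Element, SubElement, Comment, XML, XMLID, tostring
)

# Attributes may come from a dictionary, keyword arguments, or both
recipe = Element('recipe', {'cuisine': 'southwest'}, servings='4')
recipe.append(Comment('built by hand'))

title = SubElement(recipe, 'title')     # created and attached in one step
title.text = 'Famous Guacamole'

# XML() parses a fragment; XMLID() also returns an id-to-element mapping
recipe.append(XML('<item num="4">Avocados</item>'))
elem, idmap = XMLID('<foo id="123"><bar id="456">Hello</bar></foo>')

built = tostring(recipe, encoding='unicode')
print(built)
print(sorted(idmap), idmap['456'].text)
```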

The Element Interface

Although the elements stored in an ElementTree may have varying types, they all support a common interface. If elem is any element, then the following Python operators are defined:

Operator              Description
elem[n]               Returns the nth child element of elem.
elem[n] = newelem     Changes the nth child element of elem to a different element newelem.
del elem[n]           Deletes the nth child element of elem.
len(elem)             Number of child elements of elem.

All elements have the following basic data attributes:

Attribute       Description
elem.tag        String identifying the element type. For example, <foo>...</foo> has a tag of 'foo'.
elem.text       Data associated with the element. Usually a string containing text between the start and ending tags of an XML element.
elem.tail       Additional data stored with the element. For XML, this is usually a string containing whitespace found after the element's end tag but before the next tag starts.
elem.attrib     Dictionary containing the element attributes.
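The operators and data attributes can be seen together on one element. The mixed-content fragment below is invented to make the difference between text and tail visible.

```python
from xml.etree.ElementTree import Element, fromstring

elem = fromstring('<p class="intro">Hello <b>big</b> world<i>!</i></p>')

bold = elem[0]                  # first child element
count = len(elem)               # number of child elements
print(elem.tag, elem.attrib)
print(repr(elem.text))          # text before the first child
print(bold.tag, repr(bold.text), repr(bold.tail))

elem[1] = Element('em')         # replace the second child
del elem[0]                     # delete the first child
remaining = [child.tag for child in elem]
print(remaining)
```

Note that the text following the b element's end tag is stored in bold.tail, not in elem.text.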

Elements support the following methods, some of which emulate methods on dictionaries:

elem.append(subelement)

Appends the element subelement to the list of children.

elem.clear()

Clears all of the data in an element including attributes, text, and children.

elem.find(path)

Finds the first subelement whose type matches path.

elem.findall(path)

Finds all subelements whose type matches path. Returns a list or an iterable with the matching elements in document order.

elem.findtext(path [, default])

Finds the text for the first element whose type matches path. default is a string giving the value to return if there is no match.

elem.get(key [, default])

Gets the value of attribute key. default is a default value to return if the attribute doesn't exist. If XML namespaces are involved, then key will be a string of the form '{uri}key' where uri is a string such as 'http://www.w3.org/TR/html4/'.

elem.getchildren()

Returns all subelements in document order.

elem.getiterator([tag])

Returns an iterator that produces all subelements whose type matches tag.

elem.insert(index, subelement)

Inserts a subelement at position index in the list of children.

elem.items()

Returns all element attributes as a list of (name, value) pairs.

elem.keys()

Returns a list of all of the attribute names.

elem.remove(subelement)

Removes element subelement from the list of children.

elem.set(key, value)

Sets attribute key to value value.
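A brief sketch of the dictionary-like and list-like element methods. The item element and the attribute names brand and optional are invented for the example.

```python
from xml.etree.ElementTree import Element, fromstring

item = fromstring(
    '<item num="2" units="tbl">Lemon juice<note>fresh</note></item>'
)

num = item.get('num')
missing = item.get('brand', 'n/a')     # default for an absent attribute
item.set('optional', 'no')
names = sorted(item.keys())
pairs = dict(item.items())

item.insert(0, Element('prep'))        # becomes the first child
note = item.find('note')
note_text = item.findtext('note')
absent_text = item.findtext('nothing', 'none')
item.remove(note)
children = [child.tag for child in item]
print(num, missing, names)
print(note_text, absent_text, children)
```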

Tree Building

An ElementTree object is easy to create from other tree-like structures. The following object is used for this purpose.

TreeBuilder([element_factory])

A class that builds an ElementTree structure using a series of start(), end(), and data() calls as would be triggered while parsing a file or traversing another tree structure. element_factory is an optional function that is called to create new element instances.

An instance t of TreeBuilder has these methods:

t.close()

Closes the tree builder and returns the top-level element that has been created.

t.data(data)

Adds text data to the current element being processed.

t.end(tag)

Closes the current element being processed and returns the final element object.

t.start(tag, attrs)

Creates a new element. tag is the element name, and attrs is a dictionary with the attribute values.
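The call sequence a parser would generate can also be issued by hand. This is a sketch with invented element names; in the standard library versions I am aware of, close() hands back the root element, which is passed to tostring() here for display.

```python
from xml.etree.ElementTree import TreeBuilder, tostring

builder = TreeBuilder()
builder.start('recipe', {})
builder.start('title', {})
builder.data('Famous Guacamole')
builder.end('title')
builder.start('item', {'num': '4'})
builder.data('Avocados')
builder.end('item')
builder.end('recipe')

root = builder.close()          # top-level element that was built
xml_text = tostring(root, encoding='unicode')
print(xml_text)
```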

Utility Functions

The following utility functions are defined:

dump(elem)

Dumps the element structure of elem to sys.stdout for debugging. The output is usually XML.

iselement(elem)

Checks if elem is a valid element object.

iterparse(source [, events])

Incrementally parses XML from source. source is a filename or a file-like object referring to XML data. events is a list of event types to produce. Possible event types are 'start', 'end', 'start-ns', and 'end-ns'. If omitted, only 'end' events are produced. The value returned by this function is an iterator that produces tuples (event, elem) where event is a string such as 'start' or 'end' and elem is the element being processed. For 'start' events, the element is newly created and initially empty except for attributes. For 'end' events, the element is fully populated and includes all subelements.

parse(source)

Fully parses an XML source into an ElementTree object. source is a filename or file-like object with XML data.

tostring(elem)

Creates an XML string representing elem and all of its subelements.
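The utility functions can be exercised without touching the filesystem. In this sketch an io.BytesIO object plays the role of the XML source and the one-album document is invented.

```python
import io
from xml.etree.ElementTree import iselement, iterparse, parse, tostring

data = b'<music><album><title>A Texas Funeral</title></album></music>'

# iterparse() yields (event, element) pairs while reading the source
events = [(event, elem.tag)
          for event, elem in iterparse(io.BytesIO(data), ['start', 'end'])]

tree = parse(io.BytesIO(data))          # full parse into an ElementTree
album = tree.find('album')
album_xml = tostring(album, encoding='unicode')

print(events)
print(iselement(album), album_xml)
```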

XML Examples

Here is an example of using ElementTree to parse the sample recipe file and print an ingredient list. It is similar to the example shown for DOM.

    from xml.etree.ElementTree import ElementTree

    doc = ElementTree(file="recipe.xml")
    ingredients = doc.find('ingredients')

    for item in ingredients.findall('item'):
        num = item.get('num')
        units = item.get('units', '')
        text = item.text.strip()
        quantity = "%s %s" % (num, units)
        print("%-10s %s" % (quantity, text))

The path syntax of ElementTree makes it easier to simplify certain tasks and to take shortcuts as necessary. For example, here is a different version of the previous code that uses the path syntax to simply extract all <item>...</item> elements.

    from xml.etree.ElementTree import ElementTree

    doc = ElementTree(file="recipe.xml")

    for item in doc.findall(".//item"):
        num = item.get('num')
        units = item.get('units', '')
        text = item.text.strip()
        quantity = "%s %s" % (num, units)
        print("%-10s %s" % (quantity, text))

Consider an XML file 'recipens.xml' that makes use of namespaces:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <recipe xmlns:r="http://www.dabeaz.com/namespaces/recipe">
      <r:title>
        Famous Guacamole
      </r:title>
      <r:description>
        A southwest favorite!
      </r:description>
      <r:ingredients>
        <r:item num="4"> Large avocados, chopped </r:item>
      </r:ingredients>
      <r:directions>
        Combine all ingredients and hand whisk to desired consistency.
        Serve and enjoy with ice-cold beers.
      </r:directions>
    </recipe>

To work with the namespaces, it is usually easiest to use a dictionary that maps the namespace prefix to the associated namespace URI. You then use string formatting operators to fill in the URI as shown here:

    from xml.etree.ElementTree import ElementTree

    doc = ElementTree(file="recipens.xml")

    # Map each namespace prefix to its URI
    ns = {
        'r' : 'http://www.dabeaz.com/namespaces/recipe'
    }

    ingredients = doc.find('{%(r)s}ingredients' % ns)
    for item in ingredients.findall('{%(r)s}item' % ns):
        num = item.get('num')
        units = item.get('units', '')
        text = item.text.strip()
        quantity = "%s %s" % (num, units)
        print("%-10s %s" % (quantity, text))

For small XML files, it is fine to use the ElementTree module to quickly load them into memory so that you can work with them. However, suppose you are working with a huge XML file with a structure such as this:

    <?xml version="1.0" encoding="utf-8"?>
    <music>
      <album>
        <title>A Texas Funeral</title>
        <artist>Jon Wayne</artist>
      </album>
      <album>
        <title>Metaphysical Graffiti</title>
        <artist>The Dead Milkmen</artist>
      </album>
      ... continues for 100000 more albums ...
    </music>

Reading a large XML file into memory tends to consume vast amounts of memory. For example, reading a 10MB XML file may result in an in-memory data structure of more than 100MB. If you're trying to extract information from such files, the easiest way to do it is to use the ElementTree.iterparse() function. Here is an example of iteratively processing <album> nodes in the previous file:

    from xml.etree.ElementTree import iterparse

    iparse = iterparse("music.xml", ['start', 'end'])

    # Find the top-level music element
    for event, elem in iparse:
        if event == 'start' and elem.tag == 'music':
            musicNode = elem
            break

    # Get all albums
    albums = (elem for event, elem in iparse
              if event == 'end' and elem.tag == 'album')

    for album in albums:
        # Do some kind of processing
        musicNode.remove(album)     # Throw away the album when done

The key to using iterparse() effectively is to get rid of data that you're no longer using. The last statement musicNode.remove(album) is throwing away each <album> element after we are done processing it (by removing it from its parent). If you monitor the memory footprint of the previous program, you will find that it stays low even if the input file is massive.

Notes

■ The ElementTree module is by far the easiest and most flexible way of handling simple XML documents in Python. However, it does not provide a lot of bells and whistles. For example, there is no support for validation, nor does it provide any apparent way to handle complex aspects of XML documents such as DTDs. For these things, you'll need to install third-party packages. One such package, lxml.etree (at http://codespeak.net/lxml/), provides an ElementTree API to the popular libxml2 and libxslt libraries and provides full support for XPath, XSLT, and other features.

■ The ElementTree module itself is a third-party package maintained by Fredrik Lundh at http://effbot.org/zone/element-index.htm. At this site you can find versions that are more modern than what is included in the standard library and which offer additional features.

xml.sax

The xml.sax module provides support for parsing XML documents using the SAX2 API.

parse(file, handler [, error_handler])

Parses an XML document, file. file is either the name of a file or an open file object. handler is a content handler object. error_handler is an optional SAX error-handler object that is described further in the online documentation.

parseString(string, handler [, error_handler])

The same as parse() but parses XML data contained in a string instead.

Handler Objects

To perform any processing, you have to supply a content handler object to the parse() or parseString() functions. To define a handler, you define a class that inherits from ContentHandler. An instance c of ContentHandler has the following methods, all of which can be overridden in your handler class as needed:

c.characters(content)

Called by the parser to supply raw character data. content is a string containing the characters.

c.endDocument()

Called by the parser when the end of the document is reached.

c.endElement(name)

Called when the end of element name is reached. For example, if '</foo>' is parsed, this method is called with name set to 'foo'.

c.endElementNS(name, qname)

Called when the end of an element involving an XML namespace is reached. name is a tuple of strings (uri, localname) and qname is the fully qualified name. Usually qname is None unless the SAX namespace-prefixes feature has been enabled. For example, if the element is defined as '<foo:bar xmlns:foo="http://spam.com">', then the name tuple is (u'http://spam.com', u'bar').

c.endPrefixMapping(prefix)

Called when the end of an XML namespace is reached. prefix is the name of the namespace.

c.ignorableWhitespace( whitespace)

Called when ignorable whitespace is encountered in a document. whitespace is a string containing the whitespace.

c.processingInstruction(target, data)

Called when an XML processing instruction enclosed in <? ... ?> is encountered. target is the type of instruction, and data is the instruction data. For example, if the instruction is '<?xml-stylesheet href="mystyle.css" type="text/css"?>', target is set to 'xml-stylesheet' and data is the remainder of the instruction text 'href="mystyle.css" type="text/css"'.

c.setDocumentLocator(locator)

Called by the parser to supply a locator object that can be used for tracking line numbers, columns, and other information. The primary purpose of this method is simply to store the locator someplace so that you can use it later, for instance, if you needed to print an error message. The locator object supplied in locator provides four methods (getColumnNumber(), getLineNumber(), getPublicId(), and getSystemId()) that can be used to get location information.

c.skippedEntity(name)

Called whenever the parser skips an entity. name is the name of the entity that was skipped.

c.startDocument()

Called at the start of a document.

c.startElement(name, attrs)

Called whenever a new XML element is encountered. name is the name of the element, and attrs is an object containing attribute information. For example, if the XML element is '<foo bar="whatever" spam="yes">', name is set to 'foo' and attrs contains information about the bar and spam attributes. The attrs object provides a number of methods for obtaining attribute information:

Method                  Description
attrs.getLength()       Returns the number of attributes
attrs.getNames()        Returns a list of attribute names
attrs.getType(name)     Gets the type of attribute name
attrs.getValue(name)    Gets the value of attribute name

c.startElementNS(name, qname, attrs)

Called when a new XML element is encountered and XML namespaces are being used. name is a tuple (uri, localname) and qname is a fully qualified element name (normally set to None unless the SAX2 namespace-prefixes feature has been enabled). attrs is an object containing attribute information. For example, if the XML element is '<foo:bar xmlns:foo="http://spam.com" blah="whatever">', then name is (u'http://spam.com', u'bar'), qname is None, and attrs contains information about the attribute blah. The attrs object has the same methods as used when accessing attributes in the startElement() method shown earlier. In addition, the following methods are added to deal with namespaces:

Method                          Description
attrs.getValueByQName(qname)    Returns value for qualified name.
attrs.getNameByQName(qname)     Returns (namespace, localname) tuple for a name.
attrs.getQNameByName(name)      Returns qualified name for name specified as a tuple (namespace, localname).
attrs.getQNames()               Returns qualified names of all attributes.

c.startPrefixMapping(prefix, uri)

Called at the start of an XML namespace declaration. For example, if an element is defined as '<foo:bar xmlns:foo="http://spam.com">', then prefix is set to 'foo' and uri is set to 'http://spam.com'.
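The namespace-related callbacks only fire when namespace processing is switched on. The following sketch assumes a parser obtained from xml.sax.make_parser() with the feature_namespaces feature enabled; the class name NamespaceLogger and the sample document are invented.

```python
import io
from xml.sax import ContentHandler, make_parser
from xml.sax.handler import feature_namespaces

class NamespaceLogger(ContentHandler):
    """Records prefix mappings and namespace-qualified element names."""
    def __init__(self):
        super().__init__()
        self.mappings = []
        self.elements = []

    def setDocumentLocator(self, locator):
        self.locator = locator          # kept for later position queries

    def startPrefixMapping(self, prefix, uri):
        self.mappings.append((prefix, uri))

    def startElementNS(self, name, qname, attrs):
        # name is a (uri, localname) tuple
        self.elements.append((name, self.locator.getLineNumber()))

parser = make_parser()
parser.setFeature(feature_namespaces, True)     # enable the *NS callbacks
handler = NamespaceLogger()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(
    b'<foo:bar xmlns:foo="http://spam.com" blah="whatever">\n'
    b'<plain/>\n'
    b'</foo:bar>'
))
print(handler.mappings)
print(handler.elements)
```

An element without a namespace is reported with None as the URI part of its name tuple.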

Example

The following example illustrates a SAX-based parser, by printing out the ingredient list from the recipe file shown earlier. This should be compared with the example in the xml.dom.minidom section.

    from xml.sax import ContentHandler, parse

    class RecipeHandler(ContentHandler):
        def startDocument(self):
            self.initem = False

        def startElement(self, name, attrs):
            if name == 'item':
                self.num = attrs.get('num', '1')
                self.units = attrs.get('units', 'none')
                self.text = []
                self.initem = True

        def endElement(self, name):
            if name == 'item':
                text = "".join(self.text)
                if self.units == 'none':
                    self.units = ""
                unitstr = "%s %s" % (self.num, self.units)
                print("%-10s %s" % (unitstr, text.strip()))
                self.initem = False

        def characters(self, data):
            if self.initem:
                self.text.append(data)

    parse("recipe.xml", RecipeHandler())

Notes

The xml.sax module has many more features for working with different kinds of XML data and creating custom parsers. For example, there are handler objects that can be defined to parse DTD data and other parts of the document. More information can be found in the online documentation.

xml.sax.saxutils

The xml.sax.saxutils module defines some utility functions and objects that are often used with SAX parsers, but are also generally useful elsewhere.

escape(data [, entities])

Given a string, data, this function replaces certain characters with escape sequences. For example, '<' gets replaced by '&lt;'. entities is an optional dictionary that maps characters to the escape sequences. For example, setting entities to { u'\xf1' : '&ntilde;' } would replace occurrences of ñ with '&ntilde;'.

unescape(data [, entities])

Unescapes special escape sequences that appear in data. For instance, '&lt;' is replaced by '<'. entities is an optional dictionary mapping entities to unescaped character values. entities is the inverse of the dictionary used with escape(), for example, { '&ntilde;' : u'\xf1' }.

quoteattr(data [, entities])

Escapes the string data, but performs additional processing that allows the result value to be used as an XML attribute value. The return value can be printed directly as an attribute value, for example, print("<element attr=%s>" % quoteattr(somevalue)). entities is a dictionary compatible for use with the escape() function.

XMLGenerator([out [, encoding]])

A ContentHandler object that merely echoes parsed XML data back to the output stream as an XML document. This re-creates the original XML document. out is the output document and defaults to sys.stdout. encoding is the character encoding to use and defaults to 'iso-8859-1'. This can be useful if you're trying to debug your parsing code and use a handler that is known to work.
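The helpers can be tried on a few short strings. This is a sketch with invented sample data; XMLGenerator is given an io.StringIO object as its output stream so that the echoed document can be inspected.

```python
import io
from xml.sax import parseString
from xml.sax.saxutils import XMLGenerator, escape, quoteattr, unescape

escaped = escape('a < b & c')
custom = escape('ma\xf1ana', {'\xf1': '&ntilde;'})   # extra entity mapping
restored = unescape('a &lt; b &amp; c')
attr = quoteattr('say "hi"')        # result already includes the quotes

# Echo a parsed document back out through XMLGenerator
out = io.StringIO()
parseString(b'<item num="4">Avocados</item>', XMLGenerator(out, 'utf-8'))
echoed = out.getvalue()

print(escaped, custom, restored, attr)
print(echoed)
```

Because the value passed to quoteattr() contains double quotes, the function wraps the result in single quotes instead.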
