The ElementTree Interface

Python's ElementTree module provides a convenient way to access data stored in XML files. ElementTree is part of Python's standard library (since Python 2.5), and is also provided as part of NLTK in case you are using Python 2.4.

We will illustrate the use of ElementTree using a collection of Shakespeare plays that have been formatted using XML. Let's load the XML file and inspect the raw data, first at the top of the file, where we see some XML headers and the name of a schema called play.dtd, followed by the root element PLAY. We pick it up again at the start of Act 1. (Some blank lines have been omitted from the output.)

>>> import nltk
>>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
>>> raw = open(merchant_file).read()
>>> print raw[0:168]
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="shakes.css"?>
<!-- <!DOCTYPE PLAY SYSTEM "play.dtd"> -->
<PLAY>
<TITLE>The Merchant of Venice</TITLE>
>>> print raw[1850:2075]
<TITLE>ACT I</TITLE>
<SCENE><TITLE>SCENE I. Venice. A street.</TITLE>
<STAGEDIR>Enter ANTONIO, SALARINO, and SALANIO</STAGEDIR>
<SPEECH>
<SPEAKER>ANTONIO</SPEAKER>
<LINE>In sooth, I know not why I am so sad:</LINE>

We have just accessed the XML data as a string. As we can see, the string at the start of Act 1 contains XML tags for title, scene, stage directions, and so forth.

The next step is to process the file contents as structured XML data, using ElementTree. We are processing a file (a multiline string) and building a tree, so it's not surprising that the method name is parse. The variable merchant contains an XML element PLAY. This element has internal structure; we can use an index to get its first child, a TITLE element. We can also see the text content of this element, the title of the play. To get a list of all the child elements, we use the getchildren() method.

>>> from nltk.etree.ElementTree import ElementTree
>>> merchant = ElementTree().parse(merchant_file)
>>> merchant
<Element PLAY at 22fa800>
>>> merchant[0]
<Element TITLE at 22fa828>
>>> merchant[0].text
'The Merchant of Venice'
>>> merchant.getchildren()
[<Element TITLE at 22fa828>, <Element PERSONAE at 22fa7b0>, <Element SCNDESCR at 2300170>,
<Element PLAYSUBT at 2300198>, <Element ACT at 23001e8>, <Element ACT at 234ec88>,
<Element ACT at 23c87d8>, <Element ACT at 2439198>, <Element ACT at 24923c8>]
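The same steps can be tried without the corpus file. The following is a minimal sketch, assuming Python 3 and the standard library's xml.etree.ElementTree rather than the nltk.etree copy used above. The SAMPLE string is a made-up miniature of the play's markup, not the real merchant.xml, and list(element) stands in for getchildren(), which was removed from the standard library in Python 3.9.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the play's markup (not the real corpus file).
SAMPLE = """<PLAY>
<TITLE>The Merchant of Venice</TITLE>
<PERSONAE><TITLE>Dramatis Personae</TITLE></PERSONAE>
<ACT><TITLE>ACT I</TITLE></ACT>
<ACT><TITLE>ACT II</TITLE></ACT>
</PLAY>"""

# fromstring() parses a string and returns the root element directly.
play = ET.fromstring(SAMPLE)

print(play.tag)      # tag name of the root element
print(play[0].tag)   # first child, reached by index
print(play[0].text)  # text content of that child

# Iterating over an element yields its direct children, so this
# list plays the role of getchildren() in the session above.
child_tags = [child.tag for child in play]
print(child_tags)
```

Whitespace between elements is stored as text, not as child nodes, so indexing counts only the elements themselves.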

The play consists of a title, the personae, a scene description, a subtitle, and five acts. Each act has a title and some scenes, and each scene consists of speeches which are made up of lines, a structure with four levels of nesting. Let's dig down into Act IV:

<Element SCENE at 224cf80>
'SCENE I. Venice. A court of justice.'
<Element SPEECH at 226ee40>
<Element SPEAKER at 226ee90>
'PORTIA'
<Element LINE at 226eee0>
"The quality of mercy is not strain'd,"

Your Turn: Repeat some of the methods just shown, for one of the other Shakespeare plays included in the corpus, such as Romeo and Juliet or Macbeth. For a list, see nltk.corpus.shakespeare.fileids().
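Digging down level by level amounts to chaining index operations. Here is a rough sketch of the idea, again assuming Python 3 and the standard library's xml.etree.ElementTree. The SAMPLE fragment is hypothetical: it only mimics the ACT, SCENE, SPEECH, LINE nesting, so the index values are those of this toy tree and not of the real play.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with the same nesting as the play.
SAMPLE = """<PLAY>
<TITLE>The Merchant of Venice</TITLE>
<ACT><TITLE>ACT IV</TITLE>
<SCENE><TITLE>SCENE I. Venice. A court of justice.</TITLE>
<SPEECH><SPEAKER>PORTIA</SPEAKER>
<LINE>The quality of mercy is not strain'd,</LINE>
</SPEECH>
</SCENE>
</ACT>
</PLAY>"""

play = ET.fromstring(SAMPLE)

act = play[-1]     # last child of PLAY; the only ACT in this sample
scene = act[1]     # act[0] is the act's TITLE, so the first SCENE is act[1]
speech = scene[1]  # scene[0] is the scene's TITLE
speaker, first_line = speech[0], speech[1]

print(act[0].text)
print(scene[0].text)
print(speaker.text)
print(first_line.text)
```

Each intermediate result is itself an element, which is why an expression such as play[-1][1][1][0].text works in a single step.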

Although we can access the entire tree this way, it is more convenient to search for subelements with particular names. Recall that the elements at the top level have several types. We can iterate over just the types we are interested in (such as the acts), using merchant.findall('ACT'). Here's an example of doing such tag-specific searches at every level of nesting:

>>> for i, act in enumerate(merchant.findall('ACT')):
...     for j, scene in enumerate(act.findall('SCENE')):
...         for k, speech in enumerate(scene.findall('SPEECH')):
...             for line in speech.findall('LINE'):
...                 # str() copes with a LINE whose text is None
...                 if 'music' in str(line.text):
...                     print "Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text)
Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
Act 3 Scene 2 Speech 9: Fading in music: that the comparison
Act 3 Scene 2 Speech 9: And what is music then? Then music is
Act 5 Scene 1 Speech 23: And bring your music forth into the air.
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
Act 5 Scene 1 Speech 23: And draw her home with music.
Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
Act 5 Scene 1 Speech 25: The man that hath no music in himself,
Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
Act 5 Scene 1 Speech 32: No better a musician than the wren.
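The tag-specific search at every level can be wrapped in a small function. What follows is a sketch under stated assumptions, not the book's own code: it uses Python 3 with the standard library's xml.etree.ElementTree, the SAMPLE fragment is a made-up stand-in for the play, and lines_containing is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real play has five acts and many more speeches.
SAMPLE = """<PLAY>
<ACT><TITLE>ACT I</TITLE>
<SCENE><TITLE>SCENE I</TITLE>
<SPEECH><SPEAKER>LORENZO</SPEAKER>
<LINE>And bring your music forth into the air.</LINE>
<LINE>How sweet the moonlight sleeps upon this bank!</LINE>
</SPEECH>
<SPEECH><SPEAKER>JESSICA</SPEAKER>
<LINE>I am never merry when I hear sweet music.</LINE>
</SPEECH>
</SCENE>
</ACT>
</PLAY>"""

play = ET.fromstring(SAMPLE)

def lines_containing(root, word):
    """Return (act, scene, speech, text) for each LINE whose text contains word."""
    hits = []
    # start=1 gives human-style numbering, like the i+1, j+1, k+1 above.
    for i, act in enumerate(root.findall('ACT'), start=1):
        for j, scene in enumerate(act.findall('SCENE'), start=1):
            for k, speech in enumerate(scene.findall('SPEECH'), start=1):
                for line in speech.findall('LINE'):
                    # line.text is None for a LINE with no leading text.
                    if word in (line.text or ''):
                        hits.append((i, j, k, line.text))
    return hits

hits = lines_containing(play, 'music')
for i, j, k, text in hits:
    print("Act %d Scene %d Speech %d: %s" % (i, j, k, text))
```

Because the test is a plain substring match, a word such as "musician" also counts as a hit, which matches the last line of the output shown earlier.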

Instead of navigating each step of the way down the hierarchy, we can search for particular embedded elements. For example, let's examine the sequence of speakers. We can use a frequency distribution to see who has the most to say:

>>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
>>> speaker_freq = nltk.FreqDist(speaker_seq)
>>> top5 = speaker_freq.keys()[:5]
>>> top5
['PORTIA', 'SHYLOCK', 'BASSANIO', 'GRATIANO', 'ANTONIO']
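The path-based search and the frequency count can be sketched with the standard library alone. This example assumes Python 3 and swaps in collections.Counter where the session above uses nltk.FreqDist. The SAMPLE fragment and its sequence of speakers are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment: six speeches, speakers only.
SAMPLE = """<PLAY>
<ACT><SCENE>
<SPEECH><SPEAKER>PORTIA</SPEAKER></SPEECH>
<SPEECH><SPEAKER>SHYLOCK</SPEAKER></SPEECH>
<SPEECH><SPEAKER>PORTIA</SPEAKER></SPEECH>
</SCENE></ACT>
<ACT><SCENE>
<SPEECH><SPEAKER>BASSANIO</SPEAKER></SPEECH>
<SPEECH><SPEAKER>PORTIA</SPEAKER></SPEECH>
<SPEECH><SPEAKER>SHYLOCK</SPEAKER></SPEECH>
</SCENE></ACT>
</PLAY>"""

play = ET.fromstring(SAMPLE)

# One path expression replaces three nested loops; results come back
# in document order, so this is the sequence of speakers.
speaker_seq = [s.text for s in play.findall('ACT/SCENE/SPEECH/SPEAKER')]

# Counter stands in for nltk.FreqDist; most_common(n) lists the n most
# frequent items with their counts, highest count first.
speaker_freq = Counter(speaker_seq)
top2 = [name for name, count in speaker_freq.most_common(2)]
print(top2)
```

The path is relative to the element it is called on, so it starts at ACT rather than at the root PLAY.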

We can also look for patterns in who follows whom in the dialogues. Since there are 23 speakers, we need to reduce the "vocabulary" to a manageable size first, using the method described in Section 5.3.

>>> mapping = nltk.defaultdict(lambda: 'OTH')
>>> for s in top5:
...     mapping[s] = s[:4]
>>> speaker_seq2 = [mapping[s] for s in speaker_seq]
>>> cfd = nltk.ConditionalFreqDist(nltk.ibigrams(speaker_seq2))
>>> cfd.tabulate()
     ANTO BASS GRAT  OTH PORT SHYL
ANTO    0   11    4   11    9   12
BASS   10    0   11   10   26   16
GRAT    6    8    0   19    9    5
 OTH    8   16   18  153   52   25
PORT    7   23   13   53    0   21
SHYL   15   15    2   26   21    0

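Ignoring the entry of 153 for exchanges between people other than the top five, the largest values (53 and 52) pair Portia with OTH, which stands for all the other speakers lumped together rather than for a single character. Among the top five themselves, the largest values (26 and 23) suggest that Bassanio and Portia have the most significant interactions.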
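The table of who follows whom can be built from adjacent pairs in the speaker sequence. The sketch below assumes Python 3 and uses collections.Counter over pairs instead of nltk.ConditionalFreqDist. The speaker sequence and the choice of two top speakers are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical speaker sequence and "top" speakers.
speaker_seq = ['PORTIA', 'BASSANIO', 'PORTIA', 'NERISSA',
               'BASSANIO', 'PORTIA', 'SHYLOCK']
top = ['PORTIA', 'BASSANIO']

# Top speakers keep a four-letter label; everyone else becomes 'OTH'.
mapping = defaultdict(lambda: 'OTH')
for s in top:
    mapping[s] = s[:4]

speaker_seq2 = [mapping[s] for s in speaker_seq]

# zip() pairs each speaker with the one who speaks next.
transitions = Counter(zip(speaker_seq2, speaker_seq2[1:]))

# Rows are the current speaker, columns the speaker who follows.
labels = sorted(set(speaker_seq2))
print('     ' + ' '.join('%4s' % col for col in labels))
for row in labels:
    print('%4s ' % row + ' '.join('%4d' % transitions[(row, col)] for col in labels))
```

A Counter returns 0 for pairs that never occur, so the table needs no special handling for empty cells.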
