Playlist Data Parsing

Hand- In the previous section's second subsection we created a handcrafted regex- PLY crafted based parser for .m3u files. In this subsection we will create a parser to do the m3u m u same thing, but this time using the PyParsing module. An extract from a .m3u parser parser file is shown in Figure 14.6 (523 •<), and the BNF is shown in Figure 14.7 ^557

As we did when reviewing the previous subsection's .pls parser, we will review the .m3u parser in three parts: first the creation of the parser, then the helper function, and finally the call to the parser. Just as with the .pls parser, we are ignoring the parser's return value and instead populating our data structure as the parsing progresses. (In the following two subsections we will create parsers whose return values are used.)

title = restOfLine("title") filename = restOfLine("filename")

seconds = Combine(Optional("-") + Word(nums)).setParseAction(

lambda tokens: int(tokens[0]))("seconds") info = Suppress("#EXTINF:") + seconds + Suppress(",") + title entry = info + LineEnd() + filename + LineEnd() entry.setParseAction(add_song) parser = Suppress("#EXTM3U") + OneOrMore(entry)

Song We begin by creating an empty list that will hold the Song named tuples.

named tuple Although the BNF is quite simple, some of the parser elements are more com-

523 plex than those we have seen so far. Notice also that we create the parser elements in reverse order to the order used in the BNF. This is because in Python we can only refer to things that already exist, so for example, we cannot create a parser element for an ENTRY before we have created one for an INFO since the former refers to the latter.

The title and filename parser elements are ones that match every character from the parse position where they are tried until the end of the line. This means that they can match any characters, including whitespace—but not including newline which is where they stop. We also give these parser elements names, for example, "title"—this allows us to conveniently access them by name as an attribute of the tokens object that is given to parse action functions.

The seconds parser element matches an optional minus sign followed by digits; (nums is a predefined PyParsing string that contains the digits). We use Combine() to ensure that the sign (if present) and digits are returned as a single string. (It is possible to specify a separator for Combine(), but there is no need in this case, since the default of an empty string is exactly what we want.) The parse action is so simple that we have used a lambda. The Combine() ensures that there is always precisely one token in the tokens tuple, and we use int() to convert this to an integer. If a parse action returns a value, that value becomes the value associated with the token rather than the text that was matched. We have also given a name to the token for convenience of access later on.

The info parse action consists of the literal string that indicates an entry, followed by the seconds, followed by a comma, followed by the title—and all this is defined very simply and naturally in a way that matches the BNF. Notice also that we use Suppress() for the literal string andfor the comma since although both are essential for the grammar, they are of no interest to us in terms of the data itself.

The entry parser element is very easy to define: simply an info followed by a newline, then a filename followed by a newline—the LineEnd() is a predefined PyParsing parser element to match a newline. And since we are populating our list of songs as we parse rather than at the end, we give the entry parser element a parse action that will be called whenever an ENTRY is matched.

The parser itself is a parser element that matches the literal string that indicates a .m3u file, followed by one or more ENTRYs.

def add_song(tokens):

songs.append(Song(tokens.title, tokens.seconds, tokens.filename))

Map- The add_song() function is simple, especially since we named the parser ele-ping ments we are interested in and are therefore able to access them as attributes iunngpack- of the tokens object. And of course, we could have written the function even more compactly by converting the tokens to a dictionary and using mapping unpacking—for example, songs.append(Song(**tokens.asDict())).

try:

parser.parseFile(fh) except ParseException as err:

print("parse error: {0}".format(err)) return [] return songs

The code for calling ParserElement.parseFile() is almost identical to the code we used for the .pls parser, although in this case instead of passing a filename we opened a file in text mode and passed in the io.TextIOWrapper returned by the built-in open() function as the fh ("file handle") variable.

We have now finished reviewing two simple PyParsing parsers, and seen many of the most commonly used parts of the PyParsing API. In the following two subsections we will look at more complex parsers, both of which are recursive, that is, they have nonterminals whose definition includes themselves, and in ing 179 <

the final example we will also see how to handle operators and their precedences and associativities.

Parsing the Blocks Domain-Specific Language ||

Hand- In the previous section's third subsection we created a recursive descent parser PLY

crafted for .blk files. In this subsection we will create a PyParsing implementation of a blocks blocks blocks parser that should be easier to understand and be more maintainable. parser parser > 559

525•< Two example .blk files are shown in Figures 14.8 (525 •<) and 14.10 (526 •<). The BNF for the blocks format is shown in Figure 14.12 (527 <).

We will look at the creation of the parser elements in two parts, then we will look at the helper function, and then we will see how the parser is called. And at the end we will see how the parser's results are transformed into a root block with child blocks (which themselves may contain child blocks, etc.), that is our required output.

left_bracket, right_bracket = map(Suppress, "[]") new_rows = Word("/")("new_rows").setParseAction(

lambda tokens: len(tokens.new_rows)) name = CharsNotIn("[]/\n")("name").setParseAction(

lambda tokens: tokens.name.strip()) color = (Word("#", hexnums, exact=7) |

Word(alphas, alphanums))("color") empty_node = (left_bracket + right_bracket).setParseAction( lambda: EmptyBlock)

As always with PyParsing parsers, we create parser elements to match the BNF from last to first so that for every parser element we create that depends on one or more other parser elements, the elements it depends on already exist.

The brackets are an important part of the BNF, but are of no interest to us for the results, so we create suitable Suppress() parser elements for them.

For the new_rows parser element it might be tempting to use Literal("/")—but that must match the given text exactly whereas we want to match as many /s as are present. Having created the new_rows parser element, we give a name to its results and add a parsing action that replaces the string of one or more /s with an integer count of how many /s there were. Notice also that because we gave a name to the result, we can access the result (i.e., the matched text), by using the name as an attribute of the tokens object in the lambda.

The name parser element is slightly different from that specified in the BNF in that we have chosen to disallow not only brackets and forward slashes, but also newlines. Again, we give the result a name. We also set a parse action, this time to strip whitespace since whitespace (apart from newlines) is allowed as part of a name, yet we don't want any leading or trailing whitespace.

For the color parser element we have specified that the first character must be a # followed by exactly six hexadecimal digits (seven characters in all), or a sequence of alphanumeric characters with the first character alphabetic.

We have chosen to handle empty nodes specially. We define an empty node as a left bracket followed by a right bracket, and replace the brackets with the value EmptyBlock which earlier in the file is defined as EmptyBlock = 0. This means that in the parser's results list we represent empty blocks with 0, and as noted earlier, we represent new rows by an integer row count (which will always be > 0).

nodes = Forward()

node_data = Optional(color + Suppress(":")) + Optional(name)

node_data.setParseAction(add_block)

node = left_bracket - node_data + nodes + right_bracket nodes << Group(ZeroOrMore(Optional(new_rows) +

OneOrMore(node | empty_node)))

We define nodes to be a Forward() parser element, since we need to use it before we specify what it matches. We have also introduced a new parser element that isn't in the BNF, node_data, which matches the optional color and optional name. We give this parser element a parse action that will create a new Block, so each time a node_data is encountered a Block will be added to the parser's results list.

The node parser element is defined very naturally as a direct translation of the BNF. Notice that both the node_data and nodes parser elements are optional (the former consisting of two optional elements, the latter quantified by zero or more), so empty nodes are correctly allowed.

Finally, we can define the nodes parser element. Since it was originally created as a Forward() we must append parser elements to it using <<. Here we have set nodes to be zero or more of an optional new row and one or more nodes. Notice that we put node before empty_node—since PyParsing matches left to right we normally order parser elements that have common prefixes from longest to shortest matching.

We have also grouped the nodes parser element's results using Group()—this ensures that each nodes is created as a list in its own right. This means that a node that contains nodes will be represented by a Block for the node, and by a list for the contained nodes—and which in turn may contain Blocks, or integers for empty nodes or new rows, and so on. It is because of this recursive structure that we had to create nodes as a Forward(), and also why we must use the << operator (which in PyParsing is used to append), to add the Group() parser element and the elements it contains to the nodes element.

One important but subtle point to note is that we used the - operator rather than the + operator in the definition of the node parser element. We could just as easily have used +, since both + (ParserElement._add_()) and - (ParserElement._sub_()) do the same job—they return a parser element that represents the concatenation of the two parser elements that are the operator's operands.

The reason we chose to use - rather than + is due to a subtle but important difference between them. The - operator will stop parsing and raise a Parse-SyntaxException as soon as an error is encountered, something that the + operator doesn't do. If we had used + all errors would have a line number of 1 and a column of 1; but by using -, any errors have the correct line and column numbers. In general, using + is the right approach, but if our tests show that we are getting incorrect error locations, then we can start to change +s into -s as we have done here—and in this case only a single change was necessary.

def add_block(tokens):

return Block.Block(tokens.name, tokens.color if tokens.color else "white")

Whenever a node_data is parsed instead of the text being returned and added to the parser's results list, we create and return a Block. We also always set the color to white unless a color is explicitly specified.

In the previous examples we parsed a file and an open file handle (an opened io.TextIOWrapper); here we will parse a string. It makes no difference to Py-Parsing whether we give it a string or a file, so long as we use ParserElement. parseFile() or ParserElement.parseString() as appropriate. In fact, PyParsing offers other parsing methods, including ParserElement.scanString() which searches a string for matches, and ParserElement.transformString() which returns a copy of the string it is given, but with matched texts transformed into new texts by returning new text from parse actions.

results = nodes.parseString(text, parseAll=True) assert len(results) == 1 items = results.asList()[0] populate_children(items, stack) except (ParseException, ParseSyntaxException) as err: raise ValueError("Error {{0}}: syntax error, line " "{0}".format(err.lineno))

return stack[0]

This is the first PyParsing parser where we have used the parser's results rather than created the data structures ourselves during the parsing process. We expect the results to be returned as a list containing a single ParseResults object. We convert this object into a standard Python list, so now we have a list containing a single item—a list of our results—which we assign to the items variable, and that we then further process via the populate_children() call.

Before discussing the handling of the results, we will briefly mention the error handling. If the parser fails it will raise an exception. We don't want PyParsing's exceptions to leak out to clients since we may choose to change the parser generator later on. So, if an exception occurs, we catch it and then raise our own exception (a ValueError) with the relevant details.

The In the case of a successful parse of the hierarchy.blk example, the items list hierar- looks like this (with occurrences of <Block.Block object at 0x8f52acd> and chy.biA similar, replaced with Block for clarity):

525< [0, Block, [], 2, 0, Block, [], 2, Block, [], 0, Block, []]

Whenever we parsed an empty block we returned 0 to the parser's results list; whenever we parsed new rows we returned the number of rows; and whenever we encountered a node_data, we created a Block to represent it. In the case of Blocks they always have an empty child list (i.e., the children attribute is set to [] ), since at this point we don't know if the block will have children or not.

So here the outer list represents the root block, the 0s represent empty blocks, the other integers (all 2s in this case) represent new rows, and the [] s are empty child lists since none of the hierarchy.blk file's blocks contain other blocks.

The The messagebox.blk example's items list (pretty printed to reveal its structure, menage- and again using Block for clarity) is:

box.blk file [Block,

526 [Block,

Here we can see that the outer list (representing the root block) contains a block that has a child list of one block that contains its own child list, and where these children are blocks (with their own empty child lists), new rows (2 and 1), and empty blocks (0s).

One problem with the list results representation is that every Block's children list is empty—each block's children are in a list that follows the block in the parser's results list. We need to convert this structure into a single root block with child blocks. To this end we have created a stack—a list containing a single root Block. We then call the populate_children() function that takes the list of items returned by the parser and a list with a root block, and populates the root block's children (and their children, and so on, as appropriate) with the items.

The populate_children() function is quite short, but also rather subtle.

def populate_children(items, stack): for item in items:

if isinstance(item, Block.Block):

stack[-1].children.append(item) elif isinstance(item, list) and item: stack.append(stack[-1].children[-1]) populate_children(item, stack) stack.pop() elif isinstance(item, int): if item == EmptyBlock:

stack[-1].children.append(Block.get_empty_block()) else:

for x in range(item):

stack[-1].children.append(Block.get_new_row())

We iterate over every item in the results list. If the item is a Block we append it to the stack's last (top) Block's child list. (Recall that the stack is initialized with a single root Block item.) If the item is a nonempty list, then it is a child list that belongs to the previous block. So we append the previous block (i.e., the top Block's last child) to the stack to make it the top of the stack, and then recursively call populate_children() on the list item and the stack. This ensures that the list item (i.e., its child items) is appended to the correct item's child list. Once the recursive call is finished, we pop the top of the stack, ready for the next item.

If the item is an integer then it is either an empty block (0, i.e., EmptyBlock) or a count of new rows. If it is an empty block we append an empty block to the stack's top Block's list of children. If the item is a new row count, we append that number of new rows to the stack's top Block's list of children.

If the item is an empty list this signifies an empty child list and we do nothing, since by default all Blocks are initialized to have an empty child list.

At the end the stack's top item is still the root Block, but now it has children (which may have their own children, and so on). For the hierarchy.blk example, the populate_children() function produces the structure illustrated in Figure 14.13 (528 •<). And for the messagebox.blk example, the function produces the structure illustrated in Figure 14.14 (528 •<).

The conversion into an SVG file using the BlockOutput.save_blocks_as_svg() function is the same for all the blocks parsers, since they all produce the same root block and children structures.

Parsing First-Order Logic

In this last PyParsing subsection we will create a parser for a DSL for express- PLY

ing formulas in first-order logic. This has the most complex BNF of all the examples in the chapter, and the implementation requires us to handle operators, including their precedences and associativities, something we have not needed to do so far. There is no handcrafted version of this parser—once we have reached this level of complexity it is better to use a parser generator. But in addition to the PyParsing version shown here, in the following section's last subsection there is an equivalent PLY parser for comparison.

Here are a few examples of the kind of first-order logical formulas that we want to be able to parse:

~ true | true & true -> forall x: exists y: true

(forall x: exists y: true) -> true & ~ true -> true true & forall x: x = x true & (forall x: x = x)

We have opted to use ASCII characters rather than the proper logical operator symbols, to avoid any distraction from the parser itself. So, we have used forall for V, exists for 3, -> for ^ (implies), | for v (logical or), & for a (logical and), and ~ for — (logical not). Since Python strings are Unicode it would be easy to use the real symbols—or we could adapt the parser to accept both the ASCII forms shown here and the real symbols.

In the formulas shown here, the parentheses make a difference in the last two formulas—so those formulas are different—but not for the two above them (those starting with true), which are the same despite the parentheses. Naturally, the parser must get these details right.

One surprising aspect of first-order logic is that not (~) has a lower precedence than equals (=), so ~ a = b is actually ~ (a = b). This is why logicians usually put a space after ~.

A BNF for our first-order logic DSL is given in Figure 14.15. For the sake of clarity the BNF does not include any explicit mention of whitespace (no \n or \s* elements), but we will assume that whitespace is allowed between all terminals and nonterminals.

Although our subset of BNF syntax has no provision for expressing precedence or associativity, we have added comments to indicate associativities for the first-order logic parser

Formulas

FORMULA ::= ('forall' | 'exists') SYMBOL ':' FORMULA

| FORMULA '->' FORMULA # right associative | FORMULA '|' FORMULA # left associative

| FORMULA '&' FORMULA # left associative

| '~' FORMULA | '(' FORMULA ')' | TERM ' = ' TERM | 'true' | 'false'

TERM ::= SYMBOL | SYMBOL '(' TERMLIST ')' TERMLIST ::= TERM | TERM ',' TERM LIST SYMBOL ::= [a-zA-Z]\w*

Figure 14.15 ABNF for first-order logic binary operators. As for precedence, the order is from lowest to highest in the order shown in the BNF for the first few alternatives; that is, forall and exists have the lowest precedence, then ->, then |, then &. And the remaining alternatives all have higher precedence than those mentioned here.

Before looking at the parser itself, we will look at the import and the line that follows it since they are different than before.

from pyparsing_py3 import (alphanums, alphas, delimitedList, Forward, Group, Keyword, Literal, opAssoc, operatorPrecedence, ParserElement, ParseException, ParseSyntaxException, Suppress, Word)

ParserElement.enablePackrat()

The import brings in some things we haven't seen before and that we will cov-Memoiz- er when we encounter them in the parser. The enablePackrat() call is used to ing switch on an optimization (based on memoizing) that can produce a consider-35K able speedup when parsing deep operator hierarchies.* If we do this at all it is best to do it immediately after importing the pyparsing_py3 module—and before creating any parser elements.

Although the parser is short, we will review it in three parts for ease of explanation, and then we will see how it is called. We don't have any parser actions since all we want to do is to get an AST (Abstract Syntax Tree)—a list representing what we have parsed—that we can post-process later on if we wish.

left_parenthesis, right_parenthesis, colon = map(Suppress, "():") forall = Keyword("forall")

* For more on packrat parsing, see Bryan Ford's master's thesis at pdos.csail.mit.edu/~baford/ packrat/.

exists = Keyword("exists") implies = Literal("->") or_ = Literal("|") and_ = Literal("&") not_ = Literal("~") equals = Literal("=")

boolean = Keyword("false") | Keyword("true") symbol = Word(alphas, alphanums)

All the parser elements created here are straightforward, although we had to add underscores to the end of a few names to avoid conflicts with Python keywords. If we wanted to give users the choice of using ASCII or the proper Unicode symbols, we could change some of the definitions. For example:

forall = Keyword("forall") | Literal("V")

If we are using a non-Unicode editor we could use the appropriate escaped Unicode code point, such as Literal("\u2200"), instead of the symbol.

term << (Group(symbol + Group(left_parenthesis +

delimitedList(term) + right_parenthesis)) | symbol)

A term is defined in terms of itself, which is why we begin by creating it as a Forward(). And rather than using a straight translation of the BNF we use one of PyParsing's coding patterns. Recall that the delimitedList() function returns a parser element that can match a list of one or more occurrences of the given parser element, separated by commas (or by something else if we explicitly specify the separator). So here we have defined the term parser element as being either a symbol followed by a comma-separated list of terms or a symbol—and since both start with the same parser element we must put the one with the longest potential match first.

formula = Forward()

forall_expression = Group(forall + symbol + colon + formula) exists_expression = Group(exists + symbol + colon + formula) operand = forall_expression | exists_expression | boolean | term formula << operatorPrecedence(operand, [

(equals, 2, opAssoc.LEFT), (not_, 1, opAssoc.RIGHT), (and_, 2, opAssoc.LEFT), (or_, 2, opAssoc.LEFT), (implies, 2, opAssoc.RIGHT)])

Although the formula looks quite complicated in the BNF, it isn't so bad in PyParsing syntax. First we define formula as a Forward() since it is defined in terms of itself. The forall_expression and exists_expression parser elements are straightforward to define; we've just used Group() to make them sublists within the results list to keep their components together and at the same time distinct as a unit.

The operatorPrecedence() function (which really ought to have been called something like createOperators()) creates a parser element that matches one or more unary, binary, and ternary operators. Before calling it, we first specify what our operands are—in this case a forall_expression or an ex-ists_expression or a boolean or a term. The operatorPrecedence() function takes a parser element that matches valid operands, and then a list of parser elements that must be treated as operators, along with their arities (how many operands they take), and their associativities. The resultant parser element (in this case, formula) will match the specified operators and their operands.

Each operator is specified as a three- or four-item tuple. The first item is the operator's parser element, the second is the operator's arity as an integer (1 for a unary operator, 2 for a binary operator, and 3 for a ternary operator), the third is the associativity, and the fourth is an optional parse action.

PyParsing infers the operators' order of precedence from their relative positions in the list given to the operatorPrecedence() function, with the first operator having the highest precedence and the last the lowest, so the order of the items in the list we pass is important. In this example, = has the highest precedence (and has no associativity, so we have made it left-associative), and -> has the lowest precedence and is right-associative.

This completes the parser, so we can now look at how it is called. try:

result = formula.parseString(text, parseAll=True) assert len(result) == 1 return result[0].asList() except (ParseException, ParseSyntaxException) as err: print("Syntax error:\n{0.line}\n{1}~".format(err, " " * (err.column - 1)))

This code is similar to what we used for the blocks example in the previous subsection, only here we have tried to give more sophisticated error handling. In particular, if an error occurs we print the line that had the error and on the line below it we print spaces followed by a caret (~) to indicate where the error was detected. For example, if we parse the invalid formula, forallx: = x & true, we will get:

In this case the error location is slightly off—the error is that = x should have the form y = x, but it is still pretty good.

In the case of a successful parse we get a list of ParseResults which has a single result—as before we convert this to a Python list.

Earlier we saw some example formulas; now we will look at some of them again, this time with the result lists produced by the parser, pretty printed to help reveal their structure.

We mentioned before that the ~ operator has a lower precedence than the = operator—so let's see if this is handled correctly by the parser.

Here we get exactly the same results for both formulas, which demonstrates that = has higher precedence than ~. Of course, we would need to write several more test formulas to check all the cases, but this at least looks promising.

Two of the formulas that we saw earlier were forall x: x = x & true and (forall x: x = x) & true, and we pointed out that although the only difference between them is the parentheses, this is sufficient to make them different formulas. Here are the lists the parser produces for them:

The parser is clearly able to distinguish between these two formulas, and creates quite different parse trees (nested lists). Without the parentheses, forall's formula is everything right of the colon, but with the parentheses, forall's scope is limited to within the parentheses.

But what about the two formulas that again are different only in that one has parentheses, but where the parentheses don't matter, so that the formulas are actually the same? These two formulas are true & forall x: x = x and true & (forall x: x = x), and fortunately, when parsed they both produce exactly the same list: [

The parentheses don't matter here because only one valid parse is possible.

We have now completed the PyParsing first-order logic parser, and in fact, all of the book's PyParsing examples. If PyParsing is of interest, the PyParsing web site (pyparsing.wikispaces.com) has many other examples and extensive documentation, and there is also an active Wiki and mailing list.

In the next section we will look at the same examples as we covered in this section, but this time using the PLY parser which works in a very different way from PyParsing.

0 0

Post a comment