Quick Introduction to PyParsing

PyParsing makes no real distinction between lexing and parsing. Instead, it provides functions and classes to create parser elements, one element for each thing to be matched. Some parser elements are predefined by PyParsing; others can be created by calling PyParsing functions or by instantiating PyParsing classes. Parser elements can also be created by combining other parser elements: for example, concatenating them with + to form a sequence of parser elements, or or-ing them with | to form a set of parser element alternatives. Ultimately, a PyParsing parser is simply a collection of parser elements (which themselves may be made up of parser elements, and so on), composed together.

If we want to process what we parse, we can process the results that PyParsing returns, or we can add parse actions (code snippets) to particular parser elements, or some combination of both.
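As a minimal sketch of both approaches, the following assumes the module is importable as pyparsing (rather than the pyparsing_py3 name used in this text); the element names integer and numbers are purely illustrative. A parse action converts each matched number to an int, and the returned results are then processed with ordinary Python code.

```python
from pyparsing import OneOrMore, Word, nums

# A parse action attached to one element: convert the matched text to int.
integer = Word(nums)
integer.setParseAction(lambda tokens: int(tokens[0]))

numbers = OneOrMore(integer)

# Process the results that the parser returns.
values = numbers.parseString("10 20 30").asList()
total = sum(values)
print(values, total)
```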

PyParsing provides a wide range of parser elements, of which we will briefly describe some of the most commonly used. The Literal() parser element matches the literal text it is given, and CaselessLiteral() does the same thing but ignores case. If we are not interested in some part of the grammar we can use Suppress(); this matches the literal text (or parser element) it is given, but does not add it to the results.
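The three elements just described can be tried out in a few lines. This is only a sketch: it assumes the module is importable as pyparsing, and the sample strings are made up for illustration.

```python
from pyparsing import CaselessLiteral, Literal, Suppress

# Literal matches exactly the text it is given.
literal_result = Literal("hello").parseString("hello").asList()

# CaselessLiteral ignores case; the result uses the spelling given in the
# definition, not the spelling found in the input.
caseless_result = CaselessLiteral("hello").parseString("HeLLo").asList()

# Suppress matches the comma but leaves it out of the results.
pair = Literal("a") + Suppress(",") + Literal("b")
pair_result = pair.parseString("a,b").asList()

print(literal_result, caseless_result, pair_result)
```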

The Keyword() element is almost the same as Literal() except that it must be followed by a nonkeyword character; this prevents a match where a keyword is a prefix of something else. For example, given the data text "filename", Literal("file") will match the file part of filename, with the name part left for the next parser element to match, but Keyword("file") won't match at all.
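The difference is easy to check. The sketch below assumes the module is importable as pyparsing; the variable names are illustrative.

```python
from pyparsing import Keyword, Literal, ParseException

# Literal matches the "file" prefix and leaves "name" unconsumed.
literal_result = Literal("file").parseString("filename").asList()

# Keyword refuses to match because "file" is followed by a keyword character.
try:
    Keyword("file").parseString("filename")
    keyword_matched = True
except ParseException:
    keyword_matched = False

# Keyword does match when the word stands on its own.
standalone_result = Keyword("file").parseString("file name").asList()

print(literal_result, keyword_matched, standalone_result)
```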

Another important parser element is Word(). This element is given a string that it treats as a set of characters, and will match any sequence of any of the given characters. For example, given the data text "abacus", Word("abc") will match abac, stopping at the u because u is not in the given set. If the Word() element is given two strings, the first is taken to contain those characters that are valid for the first character of the match and the second to contain those characters that are valid for the remaining characters. This is typically used to match identifiers: for example, Word(alphas, alphanums) matches text that starts with an alphabetic character and that is followed by zero or more alphanumeric characters. (Both alphas and alphanums are predefined strings of characters provided by the PyParsing module.)
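Here is a short sketch of both forms of Word(), assuming the module is importable as pyparsing. The name identifier and the sample inputs are illustrative only.

```python
from pyparsing import ParseException, Word, alphanums, alphas

# One string: match a run of characters drawn from the set {a, b, c}.
abc_result = Word("abc").parseString("abacus").asList()

# Two strings: the first character must be alphabetic, the rest alphanumeric.
identifier = Word(alphas, alphanums)
identifier_result = identifier.parseString("x2y z").asList()

# Text that starts with a digit is not a valid identifier.
try:
    identifier.parseString("2x")
    digit_start_matched = True
except ParseException:
    digit_start_matched = False

print(abc_result, identifier_result, digit_start_matched)
```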

A less frequently used alternative to Word() is CharsNotIn(). This element is given a string that it treats as a set of characters, and will match all the characters from the current parse position onward until it reaches a character from the given set of characters. It does not skip whitespace and it will fail if the current parse character is in the given set, that is, if there are no characters to accumulate. Two other alternatives to Word() are also used. One is SkipTo(); this is similar to CharsNotIn() except that it skips whitespace and it succeeds even if it accumulates nothing (an empty string). The other is Regex(), which is used to specify a regex to match.
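The following sketch contrasts the three alternatives on comma-separated text. It assumes the module is importable as pyparsing; the helper succeeds() is a hypothetical convenience written for this illustration and is not part of PyParsing.

```python
from pyparsing import CharsNotIn, Literal, ParseException, Regex, SkipTo


def succeeds(element, text):
    """Return True if element matches at the start of text."""
    try:
        element.parseString(text)
        return True
    except ParseException:
        return False


# CharsNotIn accumulates everything up to (not including) the comma.
field = CharsNotIn(",")
field_result = field.parseString("alpha beta,gamma").asList()
# It fails when there is nothing to accumulate.
empty_field_ok = succeeds(field, ",gamma")

# SkipTo also stops at the comma, but accepts an empty match.
skip = SkipTo(Literal(","))
skip_result = skip.parseString("alpha beta,gamma").asList()
empty_skip_ok = succeeds(skip, ",gamma")

# Regex matches whatever the regular expression matches.
date_result = Regex(r"\d{4}-\d{2}").parseString("2024-05 rest").asList()

print(field_result, empty_field_ok, skip_result, empty_skip_ok, date_result)
```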

PyParsing also has various predefined parser elements, including restOfLine that matches any characters from the point the parser has reached until the end of the line, pythonStyleComment which matches a Python-style comment, quotedString that matches a string that's enclosed in single or double quotes (with the start and end quotes matching), and many others.
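A brief sketch of three of these predefined elements follows, assuming the module is importable as pyparsing. The sample texts and the name header are made up for illustration.

```python
from pyparsing import (Suppress, Word, alphas, pythonStyleComment,
                       quotedString, restOfLine)

# quotedString keeps the surrounding quotes in its result.
quoted_result = quotedString.parseString('"hi there" tail').asList()

# restOfLine takes everything up to the end of the current line.
header = Word(alphas) + Suppress(":") + restOfLine
tokens = header.parseString("name: Ada Lovelace\nnext line").asList()
owner = tokens[1].strip()

# pythonStyleComment matches from the # to the end of the line.
comment_result = pythonStyleComment.parseString("# a note\ncode").asList()

print(quoted_result, tokens[0], owner, comment_result)
```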

There are also many helper functions provided to cater for common cases. For example, the delimitedList() function returns a parser element that matches a list of items with a given delimiter, and makeHTMLTags() returns a pair of parser elements to match a given HTML tag's start and end, and for the start also matches any attributes the tag may have.

Parser elements can be quantified in a similar way to regexes, using Optional(), ZeroOrMore(), OneOrMore(), and some others. If no quantifier is specified, the quantity defaults to 1. Elements can be grouped using Group() and combined using Combine(); we'll see what these do further on.
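As a rough preview, the sketch below shows the quantifiers together with Group() and Combine(). It assumes the module is importable as pyparsing, and all the element names are illustrative.

```python
from pyparsing import (Combine, Group, Literal, OneOrMore, Optional, Word,
                       ZeroOrMore, alphas, nums)

# Optional: the minus sign may or may not be present.
# Combine: join the adjacent matched pieces into a single string.
integer = Combine(Optional(Literal("-")) + Word(nums))
integer_result = integer.parseString("-42").asList()

# OneOrMore needs at least one match; ZeroOrMore is happy with none.
words_result = OneOrMore(Word(alphas)).parseString("one two three").asList()
no_numbers_result = ZeroOrMore(Word(nums)).parseString("abc").asList()

# Group: keep the matched words together in a nested list.
grouped = Group(OneOrMore(Word(alphas))) + Word(nums)
grouped_result = grouped.parseString("a b 7").asList()

print(integer_result, words_result, no_numbers_result, grouped_result)
```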

Once we have specified all of our individual parser elements and their quantities, we can start to combine them to make a parser. We can specify parser elements that must follow each other in sequence by creating a new parser element that concatenates two or more existing parser elements together—for example, if we have parser elements key and value we can create a key_value parser element by writing key_value = key + Suppress("=") + value. We can specify parser elements that can match any one of two or more alternatives by creating a new parser element that ors two or more existing parser elements together—for example, if we have parser elements true and false we can create a boolean parser element by writing boolean = true | false.

Notice that for the key_value parser element we did not need to say anything about whitespace around the =. By default, PyParsing will accept any amount of whitespace (including none) between parser elements, so for example, PyParsing treats the BNF definition KEY '=' VALUE as if it were written \s* KEY \s* '=' \s* VALUE \s*. (This default behavior can be switched off, of course.)
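The key_value and boolean elements described above can be sketched as follows, assuming the module is importable as pyparsing. The definitions of key, value, true, and false are illustrative guesses, since the text does not define them.

```python
from pyparsing import Keyword, Suppress, Word, alphanums, alphas

# Hypothetical definitions for the pieces the text takes as given.
key = Word(alphas, alphanums)
value = Word(alphanums)
true = Keyword("true")
false = Keyword("false")

# Sequence: key, then "=" (matched but not kept), then value.
key_value = key + Suppress("=") + value

# Alternatives: either true or false.
boolean = true | false

# Whitespace around the "=" is accepted without being mentioned.
tight = key_value.parseString("size=10").asList()
loose = key_value.parseString("  size   =   10  ").asList()
flag = boolean.parseString("false").asList()

print(tight, loose, flag)
```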

Note that here and in the subsections that follow, we import each PyParsing name that we need individually. For example:

from pyparsing_py3 import (alphanums, alphas, CharsNotIn, Forward, Group, hexnums, OneOrMore, Optional, ParseException, ParseSyntaxException, Suppress, Word, ZeroOrMore)

This avoids using the import * syntax, which can pollute our namespace with unwanted names, but at the same time affords us the convenience of writing alphanums and Word() rather than pyparsing_py3.alphanums and pyparsing_py3.Word(), and so on.

Before we finish this quick introduction to PyParsing and look at the examples in the following subsections, it is worth noting a couple of important ideas relating to how we translate a BNF into a PyParsing parser.

PyParsing has many predefined elements that can match common constructs. We should always use these elements wherever possible to ensure the best possible performance. Also, translating BNFs directly into PyParsing syntax is not always the right approach. PyParsing has certain idiomatic ways of handling particular BNF constructs, and we should always follow these to ensure that our parser runs efficiently. Here we'll very briefly review a few of the predefined elements and idioms.

One common BNF definition is where we have an optional item. For example:

OPTIONAL_ITEM ::= ITEM | EMPTY

If we translated this directly into PyParsing we would write:

optional_item = item | Empty() # WRONG!

This assumes that item is some parser element defined earlier. The Empty() class provides a parser element that can match nothing. Although syntactically correct, this goes against the grain of how PyParsing works. The correct PyParsing idiom is much simpler and involves using a predefined element:

optional_item = Optional(item)

Some BNF statements involve defining an item in terms of itself. For example, to represent a list of variables (perhaps the arguments to a function), we might have the BNF:

VARLIST ::= VARIABLE | VARIABLE ',' VARLIST

At first sight we might be tempted to translate this directly into PyParsing syntax:

variable = Word(alphas, alphanums)

var_list = variable | variable + Suppress(",") + var_list # WRONG!

The problem seems to be simply a matter of Python syntax—we can't refer to var_list before we have defined it. PyParsing offers a solution to this: We can create an "empty" parser element using Forward(), and then later on we can append parse elements—including itself—to it. So now we can try again.

var_list = Forward()

var_list << (variable | variable + Suppress(",") + var_list) # WRONG!

This second version is syntactically valid, but again, it goes against the grain of how PyParsing works—and as part of a larger parser its use could lead to a parser that is very slow, or that simply doesn't work. (Note that we must use parentheses to ensure that the whole right-hand expression is appended and not just the first part because << has a higher precedence level than |, that is, it binds more tightly than |.) Although its use is not appropriate here, the Forward() class is very useful in other contexts, and we will use it in a couple of the examples in the following subsections.
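To give a flavor of a context where Forward() is appropriate, here is a sketch of a genuinely nested structure: parenthesized lists that may contain further parenthesized lists. It assumes the module is importable as pyparsing; the grammar and the names item and nested are hypothetical and are not taken from the examples that follow.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, alphanums

# Declare item first so that nested can refer to it before it is defined.
item = Forward()

# A nested list is zero or more items inside parentheses.
nested = Group(Suppress("(") + ZeroOrMore(item) + Suppress(")"))

# Now fill in item: a plain word or another nested list.
# The parentheses matter because << binds more tightly than |.
item << (Word(alphanums) | nested)

nested_result = nested.parseString("(a (b c) d)").asList()
print(nested_result)
```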

Instead of using Forward() in situations like this, there are alternative coding patterns that go with the PyParsing grain. Here is the simplest and most literal version:

var_list = variable + ZeroOrMore(Suppress(",") + variable)

This pattern is ideal for handling binary operators, for example:

plus_expression = operand + ZeroOrMore(Suppress("+") + operand)
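Both patterns can be exercised with a few lines. This sketch assumes the module is importable as pyparsing, and it assumes operand is a simple run of digits, which the text does not specify.

```python
from pyparsing import Suppress, Word, ZeroOrMore, alphanums, alphas, nums

variable = Word(alphas, alphanums)
var_list = variable + ZeroOrMore(Suppress(",") + variable)
var_result = var_list.parseString("x, y1, z").asList()

# Assumed definition: an operand is an unsigned integer.
operand = Word(nums)
plus_expression = operand + ZeroOrMore(Suppress("+") + operand)
operands = plus_expression.parseString("1 + 2 + 3").asList()

# With the + signs suppressed, the operands can simply be summed.
total = sum(int(text) for text in operands)

print(var_result, operands, total)
```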

Both of these kinds of usage are so common that PyParsing offers convenience functions that provide suitable parser elements. We will look at the operatorPrecedence() function that is used to create parser elements for unary, binary, and ternary operators in the example in the last of the following subsections. For delimited lists, the convenience function to use is delimitedList(), which we will show now, and which we will use in an example in the following subsections:

var_list = delimitedList(variable)

The delimitedList() function takes a parser element and an optional delimiter—we didn't need to specify the delimiter in this case because the default is comma, the delimiter we happen to be using.
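A short sketch of delimitedList() with the default and with an explicit delimiter follows. It assumes the module is importable as pyparsing; the semicolon-separated input is made up for illustration.

```python
from pyparsing import Word, alphanums, alphas, delimitedList

variable = Word(alphas, alphanums)

# Default delimiter: comma. The delimiters are left out of the results.
comma_result = delimitedList(variable).parseString("x, y1, z").asList()

# An explicit delimiter is passed as the second argument.
semicolon_result = delimitedList(variable, ";").parseString("a; b; c").asList()

print(comma_result, semicolon_result)
```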

So far the discussion has been fairly abstract. In the following four subsections we will create four parsers, each of increasing sophistication, that demonstrate how to make the best use of the PyParsing module. The first three parsers are PyParsing versions of the handcrafted parsers we created in the previous section; the fourth parser is new and much more complex, and is shown in this section, and in lex/yacc form in the following section.
