Ranges and Closures

The T9 system is used for entering text on mobile phones (see Figure 3-5). Two or more words that are entered with the same sequence of keystrokes are known as textonyms. For example, both hole and golf are entered by pressing the sequence 4653. What other words could be produced with the same sequence? Here we use the regular expression «^[ghi][mno][jlk][def]$»:

>>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
['gold', 'golf', 'hold', 'hole']

The first part of the expression, «^[ghi]», matches the start of a word followed by g, h, or i. The next part of the expression, «[mno]», constrains the second character to be m, n, or o. The third and fourth characters are also constrained. Only four words satisfy all these constraints. Note that the order of characters inside the square brackets is not significant, so we could have written «^[hig][nom][ljk][fed]$» and matched the same words.
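As a minimal sketch of the same idea, the pattern for any key sequence can be assembled from the keypad layout. The names T9_KEYS, t9_pattern, and sample_words are hypothetical, introduced only for illustration, and the short word list stands in for the full wordlist used above.

```python
import re

# Letter groups on a standard phone keypad (digit -> letters).
T9_KEYS = {
    '2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
    '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
}

def t9_pattern(digits):
    """Build an anchored regular expression for a T9 key sequence."""
    return '^' + ''.join('[' + T9_KEYS[d] + ']' for d in digits) + '$'

# A tiny hypothetical sample, not the real wordlist.
sample_words = ['gold', 'golf', 'hold', 'hole', 'holes', 'good', 'ink']

pattern = t9_pattern('4653')
matches = [w for w in sample_words if re.search(pattern, w)]
print(pattern)
print(matches)
```

On this sample only gold, golf, hold, and hole survive: holes is too long for the anchored pattern, and the third letter of good is not on key 5.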

Your Turn: Look for some "finger-twisters," by searching for words that use only part of the number-pad. For example «^[ghijklmno]+$», or more concisely, «^[g-o]+$», will match words that only use keys 4, 5, 6 in the center row, and «^[a-fj-o]+$» will match words that use keys 2, 3, 5, 6 in the top-right corner. What do - and + mean?

Let's explore the + symbol a bit further. Notice that it can be applied to individual letters, or to bracketed sets of letters:

>>> chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
>>> [w for w in chat_words if re.search('^m+i+n+e+$', w)]
['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', ...]
>>> [w for w in chat_words if re.search('^[ha]+$', w)]
['a', 'aaaaaaaaaaaaaaaaa', 'aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh',
'ahhahahaha', 'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'h', 'ha', 'haaa',
'hah', 'haha', 'hahaaa', 'hahah', 'hahaha', 'hahahaa', 'hahahah', 'hahahaha', ...]

It should be clear that + simply means "one or more instances of the preceding item," which could be an individual character like m, a set like [fed], or a range like [d-f]. Now let's replace + with *, which means "zero or more instances of the preceding item." The regular expression «^m*i*n*e*$» will match everything that we found using «^m+i+n+e+$», but also words where some of the letters don't appear at all, e.g., me, min, and mmmmm. Note that the + and * symbols are sometimes referred to as Kleene closures, or simply closures.
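The difference between the two closures can be checked on a handful of strings. This is a small sketch using a hypothetical sample list rather than the chat corpus.

```python
import re

# Hypothetical sample; the last entry is the empty string.
sample = ['mine', 'miiinnne', 'me', 'min', 'mmmmm', 'mien', 'nine', '']

# + requires at least one of each letter, in order.
plus_matches = [w for w in sample if re.search('^m+i+n+e+$', w)]

# * allows any of the letters to be absent, but still fixes their order.
star_matches = [w for w in sample if re.search('^m*i*n*e*$', w)]

print(plus_matches)
print(star_matches)
```

With +, only mine and miiinnne match. With *, the list grows to include me, min, mmmmm, and even the empty string, since every item is allowed to occur zero times. Neither pattern accepts mien or nine, because the letters are out of order.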

The ^ operator has another function when it appears as the first character inside square brackets. For example, «[^aeiouAEIOU]» matches any character other than a vowel. We can search the NPS Chat Corpus for words that are made up entirely of non-vowel characters using «^[^aeiouAEIOU]+$» to find items like these: :):):), grrr, cyb3r, and zzzzzzzz. Notice this includes non-alphabetic characters.
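The two roles of ^ can be seen side by side in a short sketch. The sample tokens below are hypothetical, chosen to mirror the items mentioned above.

```python
import re

# Hypothetical tokens standing in for the chat corpus vocabulary.
sample = [':):):)', 'grrr', 'cyb3r', 'zzzzzzzz', 'hello', 'Idea', 'why']

# The first ^ anchors the match at the start of the string;
# the second ^, placed first inside the brackets, negates the set.
no_vowels = [w for w in sample if re.search('^[^aeiouAEIOU]+$', w)]
print(no_vowels)

# When ^ is not the first character inside the brackets,
# it is an ordinary member of the set.
caret_is_literal = re.search('[a^]', '^') is not None
print(caret_is_literal)
```

Here hello and Idea are rejected because they contain vowels, while why is kept because y is not in the excluded set.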

Here are some more examples of regular expressions being used to find tokens that match a particular pattern, illustrating the use of some new symbols: \, {}, (), and |.

>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> [w for w in wsj if re.search('^[0-9]+\.[0-9]+$', w)]
['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5',
'0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99',
'1.01', '1.1', '1.125', '1.14', '1.1650', '1.17', '1.18', '1.19', '1.2', ...]
>>> [w for w in wsj if re.search('^[A-Z]+\$$', w)]
['C$', 'US$']
>>> [w for w in wsj if re.search('^[0-9]{4}$', w)]
['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', ...]
>>> [w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', ...]
>>> [w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting',
'savings-and-loan']
>>> [w for w in wsj if re.search('(ed|ing)$', w)]
['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', ...]

Your Turn: Study the previous examples and try to work out what the \, {}, (), and | notations mean before you read on.

You probably worked out that a backslash means that the following character is deprived of its special powers and must literally match a specific character in the word. Thus, while . is special, \. only matches a period. The braced expressions, like {3,5}, specify the number of repeats of the previous item. The pipe character indicates a choice between the material on its left or its right. Parentheses indicate the scope of an operator, and they can be used together with the pipe (or disjunction) symbol like this: «w(i|e|ai|oo)t», matching wit, wet, wait, and woot. It is instructive to see what happens when you omit the parentheses from the last expression in the example, and search for «ed|ing$».
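The effect of parentheses on the scope of the pipe, and of braces on repetition, can be sketched on a few hypothetical tokens. The sample list is an assumption made for illustration, and the first pattern adds ^ and $ anchors that the expression in the text does not have.

```python
import re

# Hypothetical tokens chosen to separate the three patterns.
sample = ['wit', 'wet', 'wait', 'woot', 'what', 'edit', 'walked', 'sing', 'Fred']

# Parentheses limit the disjunction to the middle of the word.
grouped = [w for w in sample if re.search('^w(i|e|ai|oo)t$', w)]

# With parentheses, $ applies to both alternatives.
with_parens = [w for w in sample if re.search('(ed|ing)$', w)]

# Without them, the pattern reads as "ed anywhere" or "ing at the end".
without_parens = [w for w in sample if re.search('ed|ing$', w)]

# Braces: exactly four digits.
four_digits = [w for w in ['1614', '161', '16145', '19a3']
               if re.search('^[0-9]{4}$', w)]

print(grouped)
print(with_parens)
print(without_parens)
print(four_digits)
```

The token edit appears only in the result for the pattern without parentheses, because there the $ anchor binds to ing alone and ed may match anywhere in the word.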

The metacharacters we have seen are summarized in Table 3-3.

Table 3-3. Basic regular expression metacharacters, including wildcards, ranges, and closures

Operator    Behavior
.           Wildcard, matches any character
^abc        Matches some pattern abc at the start of a string
abc$        Matches some pattern abc at the end of a string
[abc]       Matches one of a set of characters
[A-Z0-9]    Matches one of a range of characters
ed|ing|s    Matches one of the specified strings (disjunction)
*           Zero or more of previous item, e.g., a*, [a-z]* (also known as Kleene Closure)
+           One or more of previous item, e.g., a+, [a-z]+
?           Zero or one of the previous item (i.e., optional), e.g., a?, [a-z]?
{n}         Exactly n repeats where n is a non-negative integer
{n,}        At least n repeats
{,n}        No more than n repeats
{m,n}       At least m and no more than n repeats
a(b|c)+     Parentheses that indicate the scope of the operators
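Two rows of the table, ? and a(b|c)+, have not been illustrated so far. The following sketch tries them on hypothetical strings. The pattern «^colou?r$» and both sample lists are illustrative assumptions, not taken from a corpus.

```python
import re

# ? makes the preceding item optional: zero or one occurrence of u.
spellings = ['color', 'colour', 'colouur', 'colr']
optional_u = [w for w in spellings if re.search('^colou?r$', w)]

# + applies to the whole parenthesized group: a, then one or more of b or c.
strings = ['ab', 'abcb', 'a', 'abd', 'ac']
group_plus = [w for w in strings if re.search('^a(b|c)+$', w)]

print(optional_u)
print(group_plus)
```

The string a on its own is rejected by the second pattern because + demands at least one b or c after it.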

To the Python interpreter, a regular expression is just like any other string. If the string contains a backslash followed by particular characters, it will interpret these specially. For example, \b would be interpreted as the backspace character. In general, when using regular expressions containing backslash, we should instruct the interpreter not to look inside the string at all, but simply to pass it directly to the re library for processing. We do this by prefixing the string with the letter r, to indicate that it is a raw string. For example, the raw string r'\band\b' contains two \b symbols that are interpreted by the re library as matching word boundaries instead of backspace characters. If you get into the habit of using r'...' for regular expressions (as we will do from now on), you will avoid having to think about these complications.
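A brief sketch makes the difference concrete. The variable names and the test sentence are hypothetical.

```python
import re

plain = '\band\b'   # Python turns each \b into a single backspace character
raw = r'\band\b'    # the backslashes survive and reach the re library

# The plain string has 5 characters, the raw string has 7.
lengths = (len(plain), len(raw))

text = 'sand and band'

# The raw pattern matches "and" only where it stands as a whole word.
whole_words = re.findall(raw, text)

# The plain pattern looks for literal backspace characters, so nothing matches.
no_matches = re.findall(plain, text)

print(lengths)
print(whole_words)
print(no_matches)
```

The raw pattern finds a single match, the standalone word and, while the occurrences inside sand and band are skipped because they do not begin at a word boundary.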
