Regular Expressions for Detecting Word Patterns

Many linguistic processing tasks involve pattern matching. For example, we can find words ending with ed using endswith('ed'). We saw a variety of such "word tests" in Table 1-4. Regular expressions give us a more powerful and flexible method for describing the character patterns we are interested in.

There are many other published introductions to regular expressions, organized around the syntax of regular expressions and applied to searching text files. Instead of doing this again, we focus on the use of regular expressions at different stages of linguistic processing. As usual, we'll adopt a problem-based approach and present new features only as they are needed to solve practical problems. In our discussion we will mark regular expressions using chevrons like this: «patt».

Figure 3-4. Unicode and IDLE: UTF-8 encoded string literals in the IDLE editor; this requires that an appropriate font is set in IDLE's preferences; here we have chosen Courier CE.

To use regular expressions in Python, we need to import the re library using: import re. We also need a list of words to search; we'll use the Words Corpus again (Section 2.4). We will preprocess it to remove any proper names.

>>> import re

>>> import nltk

>>> wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
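
The effect of the islower() filter can be checked on a small hand-made list, without loading the corpus. This is a minimal sketch: the sample words and the names sample and lowercase_only are made up for illustration and are not drawn from the Words Corpus.

```python
# Hypothetical sample; the real Words Corpus is far larger.
sample = ['Aaron', 'abandon', 'NASA', 'abbey', 'Zulu']

# str.islower() is True only when every cased character is lowercase,
# so capitalized entries (typically proper names) are dropped.
lowercase_only = [w for w in sample if w.islower()]
print(lowercase_only)
```

Filtering on case is only a rough proxy for "not a proper name", but it is cheap and needs no extra resources.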

Let's find words ending with ed using the regular expression «ed$». We will use the re.search(p, s) function to check whether the pattern p can be found somewhere inside the string s. We need to specify the characters of interest and add the dollar sign, which has a special meaning in regular expressions: it matches the end of the string, which here is the end of the word:

>>> [w for w in wordlist if re.search('ed$', w)]

['abaissed', 'abandoned', 'abased', 'abashed', 'abatised', 'abed', 'aborted', ...]
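
For this particular pattern, the regular expression and the endswith() string method should agree. The sketch below checks that on a few made-up words, and also shows what changes when the dollar sign is left out. The list sample and the three result names are illustrative only.

```python
import re

# Hypothetical sample words, chosen for illustration only.
sample = ['abandoned', 'editor', 'bed', 'redo', 'walk']

# Anchored pattern: 'ed' must sit at the end of the string.
with_regex = [w for w in sample if re.search('ed$', w)]

# The equivalent string-method test.
with_endswith = [w for w in sample if w.endswith('ed')]

# Unanchored pattern: 'ed' may occur anywhere in the string.
anywhere = [w for w in sample if re.search('ed', w)]

print(with_regex)
print(with_endswith)
print(anywhere)
```

Without the anchor, words such as editor and redo are also matched, because re.search() looks for the pattern at any position in the string.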

The . wildcard symbol matches any single character. Suppose we have room in a crossword puzzle for an eight-letter word, with j as its third letter and t as its sixth letter. In place of each blank cell we use a period, and we anchor the pattern with a caret, which matches the start of the string, just as the dollar sign matches the end:

>>> [w for w in wordlist if re.search('^..j..t..$', w)]

['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', ...]
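
To see why the pattern «^..j..t..$» is anchored at both ends, the sketch below runs it on a handful of made-up candidates, with and without the caret and dollar sign. The word list and the names anchored and unanchored are illustrative, not taken from the corpus.

```python
import re

# Hypothetical candidates, chosen for illustration only.
sample = ['abjectly', 'majestic', 'dejectedly', 'unmajestic', 'object']

# Anchored at both ends: exactly eight characters, j third, t sixth.
anchored = [w for w in sample if re.search('^..j..t..$', w)]

# No anchors: the same eight-character shape may occur inside a longer word.
unanchored = [w for w in sample if re.search('..j..t..', w)]

print(anchored)
print(unanchored)
```

The unanchored version also accepts longer words that merely contain the eight-character shape, which would not fit the crossword grid.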
