Simple Approaches to Tokenization

Important: Remember to prefix regular expressions with the letter r (meaning "raw"), which instructs the Python interpreter to treat the string literally, rather than processing any backslashed characters it contains.
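A minimal sketch of why the prefix matters, using only the standard re module. The sample strings are made up for illustration. Without the r prefix, Python converts \b into a single backspace character before the regular expression engine sees the pattern, so the word-boundary meaning is lost.

```python
import re

plain = '\bcat\b'     # each \b becomes one backspace character
raw_pat = r'\bcat\b'  # backslashes are kept, so re sees \b as a word boundary

print(len(plain), len(raw_pat))           # the two strings differ in length
print(re.findall(plain, 'a cat sat'))     # looks for backspace + cat + backspace
print(re.findall(raw_pat, 'a cat sat'))   # looks for the whole word cat
```

Only the raw version finds the word, because the sample text contains no backspace characters.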

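The character class \w matches word characters (letters, digits, and underscore); its complement, \W, matches all characters other than letters, digits, or underscore. We can use \W in a simple regular expression to split the input on anything other than a word character:

>>> re.split(r'\W+', raw)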

['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered',

Observe that this gives us empty strings at the start and the end (to understand why, try doing 'xx'.split('x')). With re.findall(r'\w+', raw), we get the same tokens, but without the empty strings, using a pattern that matches the words instead of the spaces.

Now that we're matching the words, we're in a position to extend the regular expression to cover a wider range of cases. The regular expression «\w+|\S\w*» will first try to match any sequence of word characters. If no match is found, it will try to match any non-whitespace character (\S is the complement of \s) followed by further word characters. This means that punctuation is grouped with any following letters (e.g., 's) but that sequences of two or more punctuation characters are separated.

>>> re.findall(r'\w+|\S\w*', raw)

["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']
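The three behaviours described above can be compared side by side on a short string. This is a sketch only: the variable sample is a hypothetical stand-in for raw, chosen so that it starts and ends with punctuation.

```python
import re

# Hypothetical sample text; it begins and ends with non-word characters.
sample = "'Maybe it's hot-tempered,'..."

# A separator at either edge of the string leaves an empty string there.
print('xx'.split('x'))

split_tokens = re.split(r'\W+', sample)          # split on the separators
word_tokens = re.findall(r'\w+', sample)         # match the words instead
punct_tokens = re.findall(r'\w+|\S\w*', sample)  # also keep punctuation

print(split_tokens)
print(word_tokens)
print(punct_tokens)
```

In the last result each punctuation character is attached to any letters that follow it, and a run of punctuation such as the ellipsis comes out as separate one-character tokens.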

Let's generalize the \w+ in the preceding expression to permit word-internal hyphens and apostrophes: «\w+([-']\w+)*». This expression means \w+ followed by zero or more instances of [-']\w+; it would match hot-tempered and it's. (We need to include ?: in this expression for reasons discussed earlier.) We'll also add a pattern to match quote characters so these are kept separate from the text they enclose.

>>> print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']

The expression in this example also included «[-.(]+», which causes the double hyphen, ellipsis, and open parenthesis to be tokenized separately.
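The role of ?: and of the extra alternatives can be checked on a small input. The following is a sketch under the assumption that re.findall is the only function involved; sample is a hypothetical string, not the text used above.

```python
import re

# Hypothetical sample containing an apostrophe, a double hyphen,
# a hyphenated word, quote characters, and an ellipsis.
sample = "'I won't--Maybe it's hot-tempered,'..."

# With a capturing group, findall returns the group's contents
# (its last repetition, or '' if it never matched), not the token.
captured = re.findall(r"\w+([-']\w+)*", sample)

# With the non-capturing (?:...) form, findall returns whole tokens.
tokens = re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", sample)

print(captured)
print(tokens)
```

The order of the alternatives matters: the word pattern is tried first, then the lone quote, then runs of hyphens, periods, and open parentheses, with \S\w* as the fallback for anything else that is not whitespace.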

Table 3-4 lists the regular expression character class symbols we have seen in this section, in addition to some other useful symbols.

Table 3-4. Regular expression symbols

Symbol  Function
\b      Word boundary (zero width)
\d      Any decimal digit (equivalent to [0-9])
\D      Any non-digit character (equivalent to [^0-9])
\s      Any whitespace character (equivalent to [ \t\n\r\f\v])
\S      Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w      Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W      Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t      The tab character
\n      The newline character
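A few of the symbols in the table can be tried out directly. This is an illustrative sketch; the sample strings are invented for the purpose.

```python
import re

text = "Room 42\tis\nopen"   # hypothetical sample containing a tab and a newline

digits = re.findall(r'\d+', text)        # runs of decimal digits
spaces = re.findall(r'\s', text)         # each whitespace character
chunks = re.findall(r'\S+', text)        # runs of non-whitespace characters
non_word = re.findall(r'\W', "a_b-c d")  # underscore counts as a word character

# \b is zero width: it anchors the match without consuming any characters.
loose = re.findall(r'is', "this is it")
bounded = re.findall(r'\bis\b', "this is it")

print(digits, spaces, chunks, non_word)
print(loose, bounded)
```

The bracketed equivalents in the table describe ASCII text. In Python 3, \d, \s, and \w applied to ordinary strings also match Unicode digits, whitespace, and letters unless the re.ASCII flag is given.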
