NLTK's Regular Expression Tokenizer

The function nltk.regexp_tokenize() is similar to re.findall() (as we've been using it for tokenization). However, nltk.regexp_tokenize() is more efficient for this task, and avoids the need for special treatment of parentheses. For readability we break up the regular expression over several lines and add a comment about each line. The special (?x) "verbose flag" tells Python to strip out the embedded whitespace and comments.

>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x)       # set flag to allow verbose regexps
...     (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*         # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?   # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.               # ellipsis
...   | [][.,;"'?():_`-]     # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']

When using the verbose flag, you can no longer use ' ' to match a space character; use \s instead. The regexp_tokenize() function has an optional gaps parameter. When set to True, the regular expression specifies the gaps between tokens, as with re.split().
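
For example, splitting on runs of whitespace (a minimal sketch):

>>> nltk.regexp_tokenize('That poster costs $12.40', r'\s+', gaps=True)
['That', 'poster', 'costs', '$12.40']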

We can evaluate a tokenizer by comparing the resulting tokens with a wordlist, and then report any tokens that don't appear in the wordlist, using set(tokens).difference(wordlist). You'll probably want to lowercase all the tokens first.
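
For instance, a minimal sketch using NLTK's Words Corpus as the reference wordlist (the leftover tokens will vary with the wordlist you choose):

>>> from nltk.corpus import words
>>> wordlist = set(words.words())
>>> tokens = [t.lower() for t in nltk.regexp_tokenize(text, pattern)]
>>> unknown = set(tokens).difference(wordlist)

Here unknown would include non-dictionary tokens such as '$12.40' and 'poster-print'.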
