Extracting Word Pieces

The re.findall() ("find all") method finds all (non-overlapping) matches of the given regular expression. Let's find all the vowels in a word, then count them:

>>> word = 'supercalifragilisticexpialidocious' >>> re.findall(r'[aeiou]', word)

['u', 'e', 'a', 'i', 'a', 'i', 'i', 'i', 'e', 'i', 'a', 'i', 'o', 'i', 'o', 'u'] >>> len(re.findall(r'[aeiou]', word)) 16

Let's look for all sequences of two or more vowels in some text, and determine their relative frequency:

>>> wsj = sorted(set(nltk.corpus.treebank.words())) >>> fd = nltk.FreqDist(vs for word in wsj

[('io', 549), ('ea', 476), ('ie', 331), ('ou', 329), ('ai', 261), ('ia', 253), ('ee', 217), ('oo', 174), ('ua', 109), ('au', 106), ('ue', 105), ('ui', 95), ('ei', 86), ('oi', 65), ('oa', 59), ('eo', 39), ('iou', 27), ('eu', 18), ...]

Your Turn: In the W3C Date Time Format, dates are represented like this: 2009-12-31. Replace the ? in the following Python code with a regular expression, in order to convert the string '2009-12-31' to a list of integers [2009, 12, 31]:

Was this article helpful?

0 0

Post a comment