Repetition, special characters, and grouping

The most common aspects of REs involve the use of special characters, multiple occurrences of RE patterns, and using parentheses to group and extract submatch patterns. One particular RE we looked at matched simple e-mail addresses ("\w+@\w+\.com"). Perhaps we want to match more e-mail addresses than this RE allows. To support an additional hostname in front of the domain, i.e., "www.xxx.com" as opposed to accepting only "xxx.com" as the entire domain, we have to modify our existing RE. To indicate that the hostname is optional, we create a pattern that matches the hostname (followed by a dot), apply the ? operator to indicate zero or one copy of this pattern, and insert the optional RE into our previous RE as follows: "\w+@(\w+\.)?\w+\.com". As you can see from the examples below, either one or two names are now accepted in front of the ".com".

>>> patt = r'\w+@(\w+\.)?\w+\.com'
>>> re.match(patt, 'nobody@xxx.com').group()
'nobody@xxx.com'
>>> re.match(patt, 'nobody@www.xxx.com').group()
'nobody@www.xxx.com'
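One consequence of making the hostname group optional may be worth a quick illustration: when the group takes no part in the match, its subgroup is reported as None. The following is only a minimal sketch; the helper name hostname_part and the sample addresses are made up for illustration and are not part of the re module.

```python
import re

# Pattern from the text: an optional "hostname." group before the domain.
PATT = r'\w+@(\w+\.)?\w+\.com'

def hostname_part(address):
    """Return the optional hostname (with its trailing dot), or None if absent.

    hostname_part is a hypothetical helper used only for this illustration.
    """
    m = re.match(PATT, address)
    if m is None:
        raise ValueError('not a matching address: %r' % address)
    return m.group(1)

print(hostname_part('nobody@xxx.com'))      # optional group did not participate
print(hostname_part('nobody@www.xxx.com'))  # optional group matched the hostname
```

Code that uses such a subgroup should therefore check for None before treating the result as a string.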

Furthermore, we can even extend our example to allow any number of intermediate subdomain names with the following pattern: "\w+@(\w+\.)*\w+\.com":

>>> patt = r'\w+@(\w+\.)*\w+\.com'
>>> re.match(patt, 'nobody@www.xxx.yyy.zzz.com').group()
'nobody@www.xxx.yyy.zzz.com'
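A detail that is easy to overlook with a repeated group: group(1) holds only the text from the last repetition that took part in the match, not every subdomain. The sketch below illustrates this with a made-up address and shows one possible way (splitting the matched text, an approach chosen here for illustration) to recover all the names.

```python
import re

# Pattern from the text: any number of "name." groups before the domain.
PATT = r'\w+@(\w+\.)*\w+\.com'

# Sample address is illustrative only.
m = re.match(PATT, 'nobody@www.xxx.yyy.zzz.com')
whole = m.group()        # the entire matched address
last_repeat = m.group(1) # only the final repetition of the group

# To see every name, split the matched domain instead of relying on group(1).
labels = whole.split('@', 1)[1].split('.')

print(whole)
print(last_repeat)
print(labels)
```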

However, we must add the disclaimer that "\w" (alphanumeric characters and the underscore) does not cover all the characters that may make up e-mail addresses. The above RE patterns would not match a domain such as "xxx-yyy.com" or other domains containing "\W" characters.
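To make the limitation concrete, the sketch below compares the pattern above with a hypothetical variation that swaps "\w" for the character class "[\w-]" in the domain part so that hyphens are accepted. The names STRICT_PATT and LOOSE_PATT are invented for this example, and the looser pattern is still far from a complete e-mail validator.

```python
import re

# Pattern from the text: domain names limited to \w characters.
STRICT_PATT = r'\w+@(\w+\.)*\w+\.com'

# Hypothetical variation (an assumption, not from the text):
# also allow hyphens inside each domain name.
LOOSE_PATT = r'\w+@([\w-]+\.)*[\w-]+\.com'

address = 'nobody@xxx-yyy.com'  # illustrative address

strict = re.match(STRICT_PATT, address)
loose = re.match(LOOSE_PATT, address)

print(strict)          # no match: the hyphen is a \W character
print(loose.group())   # the looser pattern accepts the hyphenated domain
```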

Earlier, we discussed the merits of using parentheses to match and save subgroups for further processing rather than coding a separate routine to manually parse a string after an RE match had been determined. In particular, we discussed a simple RE pattern of an alphanumeric string and a number separated by a hyphen, "\w+-\d+", and how adding subgrouping to form a new RE, "(\w+)-(\d+)", would do the job. Here is how the original RE works:

>>> m = re.match(r'\w\w\w-\d\d\d', 'abc-123')
>>> if m is not None: m.group()
...
'abc-123'
>>> m = re.match(r'\w\w\w-\d\d\d', 'abc-xyz')
>>> if m is not None: m.group()
...
>>>

In the above code, we created an RE to recognize three alphanumeric characters followed by a hyphen and three digits. Testing this RE on "abc-123" yields a match, while "abc-xyz" fails. We will now modify our RE as discussed before so that we can extract the alphanumeric string and the number. Note how we can now use the group() method to access individual subgroups or the groups() method to obtain a tuple of all the subgroups matched:

>>> m = re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123')
>>> m.group()                       # entire match
'abc-123'
>>> m.group(1)                      # subgroup 1
'abc'
>>> m.group(2)                      # subgroup 2
'123'
>>> m.groups()                      # all subgroups
('abc', '123')

As you can see, group() is used in the normal way to show the entire match, but can also be used to grab individual subgroup matches. We can also use the groups() method to obtain a tuple of all the substring matches.
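As a rough illustration of why saved subgroups remove the need for a separate parsing routine, the sketch below wraps the "(\w+)-(\d+)" pattern in a small function. The function name split_code is hypothetical, and converting the digits to an integer is just one possible design choice.

```python
import re

def split_code(text):
    """Split an 'abc-123' style string into (name, number).

    split_code is a hypothetical helper; it returns None when the
    start of the string does not match the pattern.
    """
    m = re.match(r'(\w+)-(\d+)', text)
    if m is None:
        return None
    name, number = m.groups()   # tuple of all subgroups, in order
    return name, int(number)

print(split_code('abc-123'))
print(split_code('abc-xyz'))
```

Unpacking the tuple returned by groups() keeps the extraction to a single line once the match has succeeded.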

Here is a simpler example showing different group permutations, which will hopefully make things even more clear:

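>>> m = re.match('ab', 'ab')        # no subgroups
>>> m.group()                       # entire match
'ab'
>>> m.groups()                      # all subgroups
()
>>>
>>> m = re.match('(ab)', 'ab')      # one subgroup
>>> m.group()                       # entire match
'ab'
>>> m.group(1)                      # subgroup 1
'ab'
>>> m.groups()                      # all subgroups
('ab',)
>>>
>>> m = re.match('(a)(b)', 'ab')    # two subgroups
>>> m.group()                       # entire match
'ab'
>>> m.group(1)                      # subgroup 1
'a'
>>> m.group(2)                      # subgroup 2
'b'
>>> m.groups()                      # all subgroups
('a', 'b')
>>>
>>> m = re.match('(a(b))', 'ab')    # two subgroups
>>> m.group()                       # entire match
'ab'
>>> m.group(1)                      # subgroup 1
'ab'
>>> m.group(2)                      # subgroup 2
'b'
>>> m.groups()                      # all subgroups
('ab', 'b')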
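When groups are nested, subgroups are numbered by the position of their opening parenthesis, counting from the left, so an enclosing group receives a lower number than the groups inside it. The following small sketch uses an illustrative pattern and string that were chosen for this example.

```python
import re

# Illustrative pattern: an outer group wrapping two inner groups.
# Numbering follows the opening parentheses from left to right:
#   group 1 -> ((\w+)-(\d+))   the outer group
#   group 2 -> (\w+)
#   group 3 -> (\d+)
m = re.match(r'((\w+)-(\d+))', 'abc-123')

all_groups = m.groups()   # every subgroup, in numbering order
group_count = m.re.groups # number of groups defined in the pattern

print(all_groups)
print(group_count)
```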
