Advanced Regular Expressions

Capturing Groups

Sometimes we want to do more than print or return a match: we want to pull out parts of the result and reuse or rearrange them. That is where capture groups come into play.

import re

def rearrange_name(name):
    # Use () to define capture groups.
    result = re.search(r"^(\w*), (\w* [A-Z]\.)$", name)
    if result is None:
        return name
    return "{} {}".format(result[2], result[1])


print(rearrange_name("Lovelace, Ada B."))
# Ada B. Lovelace

The code above searches for a last name, followed by a comma and then a first name with a middle initial, and returns them as a single string: first name, initial, then last name. We achieve this by indexing the match object. Index 0 holds the entire match, while indexes 1 and 2 hold the text captured by the first and second pairs of () in the search pattern. Pretty neat!
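To make the indexing concrete, here is a small sketch that inspects the match object for the same pattern used in rearrange_name. The variable names are illustrative only.

```python
import re

# Same pattern as rearrange_name above, written as a raw string.
pattern = r"^(\w*), (\w* [A-Z]\.)$"
match = re.search(pattern, "Lovelace, Ada B.")

whole = match[0]         # the entire matched text
last = match[1]          # text captured by the first ()
first = match[2]         # text captured by the second ()
groups = match.groups()  # tuple holding only the capture groups

print(whole)   # Lovelace, Ada B.
print(last)    # Lovelace
print(first)   # Ada B.
print(groups)  # ('Lovelace', 'Ada B.')
```

Note that match.groups() leaves out the full match, so its first element corresponds to match[1], not match[0].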

Repetition Modifiers

While *, +, and ? describe common kinds of repetition, we can also request a specific number of repetitions of the preceding part of the pattern using curly braces, {}.

import re
# Search for a sequence of five letters, upper or lower case.
result = re.search(r"[a-zA-Z]{5}", "a whale")
# Matches "whale"

result = re.search(r"[a-zA-Z]{5}", "a great white whale")
# Matches "great", because search() stops at the first match

# To find all occurrences we use the findall function
result = re.findall(r"[a-zA-Z]{5}", "a great white whale appeared")
# Returns ['great', 'white', 'whale', 'appea']

# To limit matches to words of exactly five letters we include word boundaries
result = re.findall(r"\b[a-zA-Z]{5}\b", "a great white whale appeared")
# Returns ['great', 'white', 'whale']

# To allow a range for the number of repetitions, use {min,max}
result = re.findall(r"\w{5,10}", "a big shot lawyer flustered the client")
# Returns ['lawyer', 'flustered', 'client']

# The upper limit can be left open: {5,} means five or more repetitions
result = re.findall(r"\w{5,}", "a big shot lawyer flustered the client")
# Returns ['lawyer', 'flustered', 'client']

# The lower limit can be left out: {,5} means zero to five repetitions
result = re.findall(r"s\w{,5}\b", "a smaller shirt size was available")
# Returns ['shirt', 'size', 's']
# (the letter s followed by 0 to 5 word characters and a word boundary)
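The four curly-brace forms used above can be compared side by side. The following is a minimal sketch, not part of the original examples: it uses re.fullmatch() so that the whole string has to satisfy the repetition bounds, and the test strings are made up for illustration.

```python
import re

# Each pattern is paired with three made-up candidate strings.
cases = {
    r"a{3}": ["aa", "aaa", "aaaa"],    # exactly 3 repetitions
    r"a{2,}": ["a", "aa", "aaaaa"],    # 2 or more repetitions
    r"a{,2}": ["", "aa", "aaa"],       # 0 to 2 repetitions
    r"a{2,3}": ["a", "aaa", "aaaa"],   # 2 to 3 repetitions
}

# fullmatch() succeeds only if the entire string matches the pattern.
results = {
    pattern: [bool(re.fullmatch(pattern, text)) for text in texts]
    for pattern, texts in cases.items()
}

for pattern, flags in results.items():
    print(pattern, flags)
# a{3} [False, True, False]
# a{2,} [False, True, True]
# a{,2} [True, True, False]
# a{2,3} [False, True, False]
```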

Here’s a nice example that extracts a PID, and the uppercase message that follows it, from log lines.

import re

def extract_pid(log_line):
    # [A-Z]+ only accepts an uppercase message after the PID.
    regex = r"\[(\d+)\]: ([A-Z]+)"
    result = re.search(regex, log_line)
    # Return None instead of raising an error when there is no match
    if result is None:
        return None
    return "{} ({})".format(result[1], result[2])


print(extract_pid("July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"))
# 12345 (ERROR)

print(extract_pid("99 elephants in a [cage]"))
# None

print(extract_pid("A string that also has numbers [34567] but no uppercase message"))
# None

print(extract_pid("July 31 08:08:08 mycomputer new_process[67890]: RUNNING Performing backup"))
# 67890 (RUNNING)

Splitting

The re module provides its own split() and sub() functions, which work like the split() and replace() string methods but accept a regular expression as the pattern.

# Split the text into sentences.
re.split(r"[.?!]", "Hello! There are short phrases. Aren't there?")
# Returns ['Hello', ' There are short phrases', " Aren't there", '']

To also return the characters we split on, we can wrap the pattern in () to create a capture group.

# Split the text into sentences and keep the delimiters.
re.split(r"([.?!])", "Hello! There are short phrases. Aren't there?")
# Returns ['Hello', '!', ' There are short phrases', '.', " Aren't there", '?', '']
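Because the captured delimiters alternate with the text chunks, the two can be stitched back together. The sketch below is one possible way to rebuild full sentences from that list. The pairing with zip() and the strip() call are illustrative choices, not something re.split() does for us.

```python
import re

text = "Hello! There are short phrases. Aren't there?"

# The capture group keeps each delimiter as its own list item.
parts = re.split(r"([.?!])", text)

# Text chunks sit at even indexes, delimiters at odd indexes.
# zip() pairs each chunk with the delimiter that follows it and
# drops the trailing empty string, which has no delimiter.
sentences = [
    (chunk + mark).strip()
    for chunk, mark in zip(parts[0::2], parts[1::2])
]

print(sentences)
# ['Hello!', 'There are short phrases.', "Aren't there?"]
```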

Substitution

To redact email addresses in a text we could use the following:

re.sub(r"[\w.%+-]+@[\w.-]+", "[REDACTED]", "Email received from email@example.com")

# Returns 'Email received from [REDACTED]'

With re.sub() we can use a regex both to search and to build the replacement, because the replacement string can refer back to capture groups. Recall the capture groups example:

import re

def rearrange_name(name):
    # Use () to define capture groups.
    result = re.search(r"^(\w*), (\w* [A-Z]\.)$", name)
    if result is None:
        return name
    return "{} {}".format(result[2], result[1])


print(rearrange_name("Lovelace, Ada B."))
# Ada B. Lovelace

# The same kind of rearrangement in a single re.sub() call
re.sub(r"^([\w .-]*), ([\w .-]*)$", r"\2 \1", "Lovelace, Ada")
# Returns "Ada Lovelace"

In the replacement string, \2 and \1 are backreferences: they insert the text captured by the second and first capture groups of the search pattern.
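As a further illustration of backreferences, the sketch below reorders the parts of a date and shows the \g<1> form, which avoids ambiguity when a digit follows the group number. The input strings are made up for this example.

```python
import re

# Hypothetical input: a year-month-day date rewritten as day/month/year.
iso_date = "2024-07-31"
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", iso_date)
print(swapped)  # 31/07/2024

# \g<1> is the unambiguous spelling of \1. It is needed here because
# r"\10" would be read as a reference to group 10, not group 1 plus "0".
padded = re.sub(r"(\d+)", r"\g<1>0", "version 7")
print(padded)  # version 70
```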