Regular Expressions

A powerful tool for my IT toolbox.

A regular expression (regex) is a search query for text, written as a string. It lets us search a text for strings that match a specific pattern.

Why use regex?

They are both powerful and flexible, and they offer a less brittle way to search for patterns in strings.

Grep

It prints out any line that matches the query we pass it.

example:

grep thon /usr/share/dict/words

The search is case sensitive. The -i flag sets the search as case insensitive.

grep -i python /usr/share/dict/words

Regular Expressions Reserved Characters

The dot . is a wildcard. It can represent any single character.

grep l.rts /usr/share/dict/words

# Returns:	alerts 	blurts 	flirts

The circumflex ^ anchors the pattern to the beginning of a line of text, not to the beginning of individual words.

grep ^fruit /usr/share/dict/words

Returns a list of words starting with fruit.

The dollar sign $ denotes the end of a line.

grep cat$ /usr/share/dict/words

Returns a list of words ending with cat.

Note: Both ^ and $ match the beginning and end of the line we are searching and not individual words in the line.

Regex in Python

import re

result = re.search(r"aza", "plaza")

The re.search function searches for the “aza” string in the word “plaza”. The “r” prefix indicates that the search pattern is a raw string. It’s a good idea to always use raw strings for regular expressions in Python.

The result is

<re.Match object; span=(2, 5), match='aza'>

If there is no match:

import re

result = re.search(r"aza", "maze")

The result is

None
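Because re.search returns either a match object or None, the result is usually checked before it is used. The following is a minimal sketch of that check; the helper name describe_match is only illustrative and is not part of the re module.

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    result = re.search(pattern, text)
    if result is None:
        # No match: re.search returned None.
        return "no match"
    # group() is the matched text; span() is its (start, end) position.
    return f"{result.group()} at {result.span()}"

print(describe_match(r"aza", "plaza"))  # aza at (2, 5)
print(describe_match(r"aza", "maze"))   # no match
```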

Using the special characters from earlier:

print( re.search(r"p.ng", "penguin") )
print( re.search(r"p.ng", "Penguin", re.IGNORECASE ))

The second option above is the equivalent to passing the -i flag to grep.
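To make the comparison with grep concrete, here is a rough sketch of a grep-like filter in Python. The function name grep_lines, its ignore_case parameter and the sample word list are assumptions made for illustration; real grep reads files and has many more options.

```python
import re

def grep_lines(pattern, lines, ignore_case=False):
    """Return the lines that contain a match for pattern, like a tiny grep."""
    # re.IGNORECASE plays the role of grep's -i flag.
    flags = re.IGNORECASE if ignore_case else 0
    return [line for line in lines if re.search(pattern, line, flags)]

words = ["Python", "python", "marathon", "Java"]
print(grep_lines(r"python", words))                    # ['python']
print(grep_lines(r"python", words, ignore_case=True))  # ['Python', 'python']
```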

Wildcards and Character Classes

Using the square brackets, [] we can define a character class.

print( re.search( r"[Pp]ython", "Python") )

This searches for either “Python” or “python”.

We can also use a dash, -, for a range of characters. For example, [a-z] matches any lowercase letter.

print( re.search( r"[a-z]way", "The end of the highway"))

We can define other ranges: [A-Z] for all uppercase letters, or [0-9] for all digits. We can combine as many ranges and symbols as we want.

print( re.search( r"cloud[a-zA-Z0-9]", "cloud9"))

To exclude a character range in the search pattern we use the circumflex (^) inside the square brackets.

print( re.search( r"[^a-zA-Z]", "This is a text with spaces."))

This searches for any character that is not a letter. The match is the first space, at index 4.

We can also add the space to the list of characters we don’t want to match.

print( re.search( r"[^a-zA-Z ]", "This is a text with spaces."))

The above search matches the full stop (.) at the end of the string.
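The positions of these two matches can be read from the match objects with span(). A small sketch, with variable names chosen only for this example:

```python
import re

text = "This is a text with spaces."

# First character that is not a letter.
first_non_letter = re.search(r"[^a-zA-Z]", text)
# First character that is neither a letter nor a space.
first_non_letter_or_space = re.search(r"[^a-zA-Z ]", text)

print(first_non_letter.span())           # (4, 5), the first space
print(first_non_letter_or_space.span())  # (26, 27), the full stop
```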

To match one symbol or another, we use the pipe symbol, |.

In each of the searches below we get a match.

print( re.search( r"cat|dog", "There once was a cat."))

print( re.search( r"cat|dog", "I like dogs."))

print( re.search( r"cat|dog", "I like both cats and dogs."))

In this case, we only get the first match in the third search.

If we want to get all matches, we need to use the re.findall() function.

print( re.findall( r"cat|dog", "I like both cats and dogs."))

The result is a list of the strings found.
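Putting the two functions side by side on the same sentence shows the difference. This is only a small illustration; the variable names are arbitrary.

```python
import re

text = "I like both cats and dogs."

first = re.search(r"cat|dog", text)        # stops at the first match
everything = re.findall(r"cat|dog", text)  # collects every match

print(first.group())  # cat
print(everything)     # ['cat', 'dog']
```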

Repetition Qualifiers

The simplest repetition qualifier is the star, *. Combined with the dot as .*, it matches any character repeated as many times as possible, including zero times.

print( re.search( r"Py.*n", "Pygmalion"))
# returns -> Pygmalion

The above search looks for any string starting with “Py”, followed by any number of any characters, with “n” as the last character.

print( re.search( r"Py.*n", "Python Programming"))
# returns -> Python Programmin, because the * takes as many characters as possible and the match ends at the last n.

This is a demonstration of the repetition modifier being greedy.

To only match characters, we need to use the character class.

print( re.search( r"Py[a-z]*n", "Python Programming"))
# returns -> Python, because [a-z] only matches lowercase letters, so the match cannot cross the space.
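The greedy and the restricted pattern can be compared directly by printing the matched text with group(). A short sketch, with variable names chosen for this example:

```python
import re

text = "Python Programming"

# .* is greedy: it runs on to the last "n" in the string.
greedy = re.search(r"Py.*n", text)
# [a-z]* cannot cross the space, so the match stays inside the first word.
restricted = re.search(r"Py[a-z]*n", text)

print(greedy.group())      # Python Programmin
print(restricted.group())  # Python
```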

In its default mode, grep only treats the bare * as a repetition qualifier.

Python and egrep (or grep -E) include others.

Where * matches zero or more occurrences of whatever comes before it, the + repetition qualifier matches one or more.

print( re.search( r"Py[a-z]+n", "Python Programming"))
# returns -> Python

print( re.search( r"Py[a-z]+n", "Pyn"))

# returns -> None, as there must be one or more a-z characters before the `n` at the end of the result.

Let’s try o+l+

print( re.search( r"o+l+", "goldfish"))
# returns -> 'ol'
print( re.search( r"o+l+", "woolly"))
# returns -> 'ooll'
print( re.search( r"o+l+", "boil"))
# returns -> None,  although the string had an `o` and an 'l', it had another character between them.

To find words with at least one repetition (minimum two occurrences) of the letter A or a,

print( re.search( r"[Aa].*[Aa]", "pineapple"))
# returns -> None

print( re.search( r"[Aa].*[Aa]", "banana"))
# returns -> 'anana' (greedy!)

The question mark, ? repetition modifier matches either zero or one occurrence of the character before it.

print( re.search( r"p?each", "To each their own"))
# returns -> each
# The 'p' was marked as optional by using the '?'.

And when the p is present.

print( re.search( r"p?each", "Do I dare to eat a peach?"))
# returns -> peach
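The three qualifiers can be compared side by side on the same inputs. This is a minimal sketch; the helper matching and the sample strings are made up for the comparison, and the patterns are anchored with ^ and $ so the whole string has to match.

```python
import re

def matching(pattern, samples):
    """Return the samples that the pattern matches."""
    return [s for s in samples if re.search(pattern, s)]

samples = ["ac", "abc", "abbc"]

print(matching(r"^ab*c$", samples))  # ['ac', 'abc', 'abbc']  zero or more b
print(matching(r"^ab+c$", samples))  # ['abc', 'abbc']        one or more b
print(matching(r"^ab?c$", samples))  # ['ac', 'abc']          zero or one b
```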

Escape Characters

In case of regex, the escape character is the backslash, \.

In this case we want to find the string .com.

print( re.search( r".com", "Welcome"))
# returns -> lcom
# Not correct

print( re.search( r"\.com", "Welcome"))
# returns -> None
# Correct

Note: When we see a \ it could be escaping a special character OR it could be part of a special string character such as \n. Using raw strings avoids some of these problems, because the backslash is not interpreted when Python builds the string; it is only interpreted when the regular expression is parsed.
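The difference between a normal string and a raw string is easy to see by checking their lengths. The sketch below also shows the escaped dot finding a literal .com; the string "mydomain.com" is just a made-up example.

```python
import re

print(len("\n"))   # 1: a single newline character
print(len(r"\n"))  # 2: a backslash followed by the letter n

# The escaped dot only matches a literal full stop.
print(re.search(r"\.com", "Welcome"))               # None
print(re.search(r"\.com", "mydomain.com").group())  # .com
```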

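The \w is a shorthand character class that matches any alphanumeric character (including underscores). For ASCII text it is equivalent to the character class “[A-Za-z0-9_]”; in Python 3 it also matches Unicode letters and digits unless the re.ASCII flag is set.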

print( re.search( r"\w*", "Welcome to the party"))
# returns -> Welcome

print( re.search( r"\w*", "Welcome_to_the_party"))
# returns -> Welcome_to_the_party

The \d is a shorthand character class that matches any digit. For ASCII text it is equivalent to the character class [0-9].

The \s is a shorthand character class that matches any whitespace character. For ASCII text it is equivalent to the character class [ \t\r\n\f\v].
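A quick way to get a feel for the shorthand classes is to run them through re.findall on one string. The sample text below is invented for illustration. The last two lines use the re.ASCII flag, which is not covered in these notes, to show that \w is wider than [A-Za-z0-9_] by default in Python 3.

```python
import re

text = "Order_42 shipped\ton 2024-01-05"

print(re.findall(r"\w+", text))  # ['Order_42', 'shipped', 'on', '2024', '01', '05']
print(re.findall(r"\d+", text))  # ['42', '2024', '01', '05']
print(re.findall(r"\s", text))   # [' ', '\t', ' ']

# By default \w also matches non-ASCII letters; re.ASCII restricts it.
print(re.search(r"\w", "é") is not None)        # True
print(re.search(r"\w", "é", re.ASCII) is None)  # True
```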

In regular expressions, \b is a special sequence that matches a word boundary. A word boundary is the position between a word character (as defined by \w) and a non-word character, or between a word character and the start or end of the string.

Note that \b marks the start or end of a word, not the start or end of a line. To anchor a match to the start or end of a line, use the special characters ^ and $, respectively.
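The following sketch contrasts a plain search, a search with word boundaries, and a search anchored with ^. The sample string is made up so that "cat" appears both as a whole word and inside a longer word.

```python
import re

text = "cat concatenate cat"

print(re.findall(r"cat", text))      # ['cat', 'cat', 'cat'], includes the one inside "concatenate"
print(re.findall(r"\bcat\b", text))  # ['cat', 'cat'], whole words only
print(re.findall(r"^cat", text))     # ['cat'], only at the start of the string
```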

Note: A useful site to check regex is www.regex101.com.

Regular expressions in Action

Let’s say I want to find the names of countries that start and finish with a.

print( re.search( r"A.*a", "Argentina"))
# Returns match Argentina

print( re.search( r"A.*a", "Azerbaijan"))
# returns a match Azerbaija, not what we want.

# To make our search stricter

print( re.search( r"^A.*a$", "Azerbaijan"))
# returns None
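The anchored pattern can be wrapped in a small helper and applied to a list. This is a sketch only; the function name starts_and_ends_with_a and the list of countries are chosen for illustration.

```python
import re

def starts_and_ends_with_a(name):
    """True if name starts with an uppercase A and ends with a lowercase a."""
    return re.search(r"^A.*a$", name) is not None

countries = ["Argentina", "Azerbaijan", "Australia", "Angola", "Brazil"]
print([c for c in countries if starts_and_ends_with_a(c)])
# ['Argentina', 'Australia', 'Angola']
```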

To build a pattern to search for valid variable names:

# This will be a search pattern for a complete line, so it is anchored with ^ and $.
# The first character must be a letter or an underscore: ^[a-zA-Z_]
# The remainder (to the end) of the string must contain only letters, digits or underscores, so no spaces: [a-zA-Z0-9_]*$
pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$"

The pattern above can be used to check for valid variable names.

print( re.search( pattern, "_this_is_a_valid_variable_name"))
# Returns a match

print( re.search( pattern, "this is not a valid variable name"))
# returns None

print( re.search( pattern, "2This_is_not_valid"))
# returns None
# The first character has to be a letter or an underscore, and not a number.
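As a rough sketch, the check can be turned into a function that returns True or False. It assumes that underscores are allowed after the first character, which the first test string above requires. The names NAME_PATTERN and is_valid_name are hypothetical.

```python
import re

# Assumption: letters, digits and underscores are allowed after the first character.
NAME_PATTERN = r"^[a-zA-Z_][a-zA-Z0-9_]*$"

def is_valid_name(candidate):
    """True if candidate looks like a valid variable name under the pattern above."""
    return re.search(NAME_PATTERN, candidate) is not None

for candidate in ["_this_is_valid", "has spaces", "2fast", "my_variable1"]:
    print(candidate, is_valid_name(candidate))
```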

What about checking whether a string is a complete sentence? Here a sentence starts with an uppercase letter, is followed by one or more lowercase letters or spaces, and ends with a period, question mark, or exclamation point.

result = re.search(r"^[A-Z][a-z ]+[.?!]$", text)
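The line above needs a text variable to run. One way to try it out is to wrap it in a function; the name check_sentence and the sample strings are assumptions made for this sketch.

```python
import re

def check_sentence(text):
    """True if text matches the simple sentence pattern described above."""
    result = re.search(r"^[A-Z][a-z ]+[.?!]$", text)
    return result is not None

print(check_sentence("Is this is a sentence?"))  # True
print(check_sentence("is this is a sentence?"))  # False, no leading capital
print(check_sentence("Hello"))                   # False, no closing punctuation
print(check_sentence("1-2-3-GO!"))               # False, starts with a digit
print(check_sentence("A star is born."))         # True
```

The pattern is deliberately simple: a sentence containing commas, digits or a second uppercase letter is rejected.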

More to follow. These are my notes from the first module of week three of the Coursera Using Python to interact with the Operating System course.