Regular Expressions¶

Stats 507, Fall 2021

James Henderson, PhD
October 21, 2021

Overview¶

  • Regular Expressions
  • Examples and Concepts
  • Regex Crossword
  • Takeaways

Regular Expressions¶

  • Regular expressions are a way to describe patterns in strings.
  • Patterns may be abstract.
  • Common regex vocabulary ...
  • ... but details differ between implementations and standards.

Imports¶

  • Here are the imports we will use in these slides.
  • re is the regular expression module in Python's standard library.
In [ ]:
import numpy as np
import pandas as pd
import re

Example¶

  • The file fruits.txt is a list of fruits distributed with R's stringr library.
In [ ]:
fruits_df = pd.read_csv('./fruits.txt')
fruits = list(fruits_df['fruit'].values)
fruits_df.head()

Pandas¶

  • Pandas has several vectorized string functions that understand regular expressions:
    • contains,
    • match, fullmatch,
    • count,
    • findall,
    • replace,
    • extract,
    • split.
In [ ]:
fruits_df[fruits_df['fruit'].str.match('^a')]
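For orientation, most of these methods have a close counterpart in the re module. The sketch below uses only re on a small hypothetical list (sample is made up for illustration and is not read from fruits.txt). The pairing of each pandas method with an re function is a rough correspondence, not a statement about pandas internals.

```python
import re

# Hypothetical sample data, not read from fruits.txt.
sample = ['apple', 'banana', 'blood orange']

# contains ~ re.search: is the pattern found anywhere in the string?
has_an = [re.search('an', s) is not None for s in sample]

# match ~ re.match: is the pattern found at the start of the string?
starts_b = [re.match('b', s) is not None for s in sample]

# fullmatch ~ re.fullmatch: does the pattern cover the whole string?
only_letters = [re.fullmatch('[a-z]+', s) is not None for s in sample]

# count / findall ~ re.findall: all non-overlapping matches.
n_counts = [len(re.findall('n', s)) for s in sample]

# replace ~ re.sub, split ~ re.split.
dashed = [re.sub(' ', '-', s) for s in sample]
pieces = [re.split(' ', s) for s in sample]
```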

Search / Contains¶

  • str.contains() returns a Boolean series indicating, for each entry of a string series, whether the pattern is found.
  • It is based on re.search().
  • Find all two-word fruits by searching for a space.
In [ ]:
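# As a list of bools: [re.search(' ', fruit) is not None for fruit in fruits]
# re.search() returns a match object (truthy) or None (falsy),
# so the result can be used directly in an if statement.
two_word_fruits = []
for fruit in fruits:
    if re.search(' ', fruit):
        two_word_fruits.append(fruit)
two_word_fruits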
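The truthiness test in the loop works because re.search() returns either a match object or None. A minimal sketch of what the match object carries, using two literal strings chosen only for illustration:

```python
import re

m = re.search(' ', 'blood orange')   # pattern found: a match object
no_match = re.search(' ', 'apple')   # pattern absent: None

found_at = m.span()   # (start, end) indices of the first match
matched = m.group()   # the matched text itself
```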

Search / Contains¶

  • Find all two-word fruits by searching for a space.
  • Let's use this method to explore regex concepts.
In [ ]:
fruits_df[fruits_df['fruit'].str.contains(' ')]

Regex Concepts - Simple search¶

  • Find all fruits with an "a" anywhere in the word.
In [ ]:
fruits_df[fruits_df['fruit'].str.contains('a')]

Regex Concepts - Anchors¶

  • A caret ^ indicates the match must come at the beginning of the string.
  • Find all fruits beginning with an "a".
  • This is known as an anchor.
In [ ]:
fruits_df[fruits_df['fruit'].str.contains('^a')]

Regex Concepts - Anchors¶

  • A dollar sign $ indicates the match must come at the end of the string.
  • Find all fruits ending with an "a".
  • This is also known as an anchor.
In [ ]:
fruits_df[fruits_df['fruit'].str.contains('a$')]
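The same two anchors can be tried with re.search() directly. This is a small sketch on a hypothetical list (sample is an assumption, not the contents of fruits.txt) that contrasts an anchored search with an unanchored one.

```python
import re

# Hypothetical sample data for illustration.
sample = ['apple', 'banana', 'avocado', 'guava']

starts_with_a = [s for s in sample if re.search('^a', s)]  # anchored at the start
ends_with_a = [s for s in sample if re.search('a$', s)]    # anchored at the end
anywhere_a = [s for s in sample if re.search('a', s)]      # no anchor
```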

Anchors in Pandas¶

  • Pandas also has vectorized .startswith() and .endswith() methods.
  • Find all fruits starting or ending with an "a".
In [ ]:
fruits_df[
  np.logical_or(
    fruits_df['fruit'].str.startswith('a'),
    fruits_df['fruit'].str.endswith('a')
  )
]

Regex Concepts - Or¶

  • A bar | can be used as an or operator in regular expressions.
  • Find all fruits starting or ending with an "a".
In [ ]:
fruits_df[fruits_df['fruit'].str.contains('^a|a$')]
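One detail worth seeing in action is that | applies to everything on either side of it, anchors included. The sketch below, on a hypothetical list, shows how parentheses can limit what the bar separates; the names sample, loose, and grouped are illustrative only.

```python
import re

# Hypothetical sample data for illustration.
sample = ['apple', 'banana', 'cranberry', 'date']

# Starts with "a" OR ends with "a".
a_either_end = [s for s in sample if re.search('^a|a$', s)]

# '^a|b' reads as (^a) or (b anywhere), so "cranberry" also matches.
loose = [s for s in sample if re.search('^a|b', s)]

# Parentheses limit the alternation: starts with "a" or starts with "b".
grouped = [s for s in sample if re.search('^(a|b)', s)]
```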

Regex Concepts - Bracket Expressions¶

  • Multiple acceptable matches can be collected into a bracket expression.
  • Find all fruits starting with a vowel.
In [ ]:
fruits_df[fruits_df['fruit'].str.contains('^[aeiou]')]

Regex Concepts - Bracket Expressions¶

  • Inside a bracket expression, a leading caret ^ means to match any character except those listed.
  • Find all fruits ending with a consonant other than n, r, or t.
In [ ]:
fruits_df[fruits_df['fruit'].str.contains('[^aeiounrt]$')]

Regex Concepts - Ranges¶

  • Bracket expressions understand the following ranges:
    • [a-z] - lowercase letters
    • [A-Z] - uppercase letters
    • [0-9] - digits
  • These can be used together, e.g. [A-Za-z0-9].
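A short sketch of the three ranges, alone and combined. The list tokens is hypothetical, and the + quantifier (one or more) is used here only so that re.fullmatch() can test a whole string.

```python
import re

# Hypothetical strings chosen to exercise each range.
tokens = ['kiwi', 'Kiwi', 'kiwi2', 'KIWI', 'ki-wi']

all_lower = [t for t in tokens if re.fullmatch('[a-z]+', t)]
has_upper = [t for t in tokens if re.search('[A-Z]', t)]
has_digit = [t for t in tokens if re.search('[0-9]', t)]

# Ranges combined in one bracket expression: letters and digits only.
alnum_only = [t for t in tokens if re.fullmatch('[A-Za-z0-9]+', t)]
```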

Regex Concepts - Quantifiers¶

  • Numbers in braces {} specify an exact number (or a range) of repetitions of the preceding element.
  • Find all fruits ending with two consecutive consonants other than n, r or t.
In [ ]:
fruits_df[fruits_df['fruit'].str.contains('[^aeiounrt]{2}$')]
#fruits_df[fruits_df['fruit'].str.contains('[^aeiour]{2,3}$')]  # range form: no space inside the braces
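Brace quantifiers come in a few forms: {m}, {m,n}, and {m,}. The sketch below tries each on a hypothetical list of strings, and also shows that Python's re does not read braces containing a space as a quantifier.

```python
import re

# Hypothetical strings: "b" followed by one to four "e"s.
words = ['be', 'bee', 'beee', 'beeee']

exactly_two = [w for w in words if re.fullmatch('be{2}', w)]
two_to_three = [w for w in words if re.fullmatch('be{2,3}', w)]
two_or_more = [w for w in words if re.fullmatch('be{2,}', w)]

# With a space inside the braces, re treats the braces as literal text,
# so none of the words match.
spaced = [w for w in words if re.fullmatch('be{2, 3}', w)]
literal_hit = re.fullmatch('be{2, 3}', 'be{2, 3}') is not None
```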

Regex Concepts - Quantifiers¶

  • How would we find all fruits with two consecutive vowels?
In [ ]:
#fruits_df[fruits_df['fruit'].str.contains('')]

Regex Concepts - Wild Card and Quantifiers¶

  • The quantifier * indicates 0 or more matches, ? indicates 0 or 1 matches, and + indicates one or more matches.
  • A dot (or period) . can be used to match any single character.
  • These are often used together, e.g. .* matches any run of characters
    (including an empty one) that contains no newline (\n) character.
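A minimal sketch of each quantifier and the dot, using re.fullmatch() so that the whole string has to fit the pattern. All strings here are made up for illustration.

```python
import re

# ? : zero or one "u"
q = [bool(re.fullmatch('colou?r', s)) for s in ['color', 'colour', 'colouur']]

# * : zero or more "b";  + : one or more "b"
star = [bool(re.fullmatch('ab*', s)) for s in ['a', 'ab', 'abbb']]
plus = [bool(re.fullmatch('ab+', s)) for s in ['a', 'ab', 'abbb']]

# . : exactly one character, but not a newline
dot = [bool(re.fullmatch('b.t', s)) for s in ['bat', 'b t', 'b\nt', 'boot']]

# .* : runs up to, but not across, a newline
first_line = re.match('.*', 'first line\nsecond line').group()
```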

Regex Concepts - Wild Card Example¶

  • Find all fruits with two pairs of consecutive vowels separated by any single character (the wild card also matches non-consonants).
In [ ]:
rgx0 = '[aeiou]{2}.[aeiou]{2}'
fruits_df[fruits_df['fruit'].str.contains(rgx0)]
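Because . matches any character, a pattern that should allow only a consonant in the middle can use a bracket expression instead. This is a sketch on a hypothetical word list; the names loose and strict, and the choice to count y as a consonant, are assumptions for illustration.

```python
import re

# Hypothetical words, not taken from fruits.txt.
sample = ['aerie', 'queueing', 'banana', 'ee ee']

loose = '[aeiou]{2}.[aeiou]{2}'
# Lowercase consonants written as ranges that skip a, e, i, o, u.
strict = '[aeiou]{2}[b-df-hj-np-tv-z][aeiou]{2}'

loose_hits = [s for s in sample if re.search(loose, s)]
strict_hits = [s for s in sample if re.search(strict, s)]
```

Here "queueing" and "ee ee" pass the wild-card pattern because the middle character is a vowel or a space, while only "aerie" has a consonant between the vowel pairs.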

Regex Concepts - Wild Card with Quantifier Example¶

  • Find all fruits with two pairs of consecutive vowels separated by one or more characters of any kind.
In [ ]:
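# Define the pattern outside the branches so both can use it.
rgx1 = '[aeiou]{2}.+[aeiou]{2}'
first = True
if first:
    # All fruits matching rgx1.
    result = fruits_df[fruits_df['fruit'].str.contains(rgx1)]
else:
    # Fruits that match rgx1 but not rgx0 (defined in the previous cell).
    result = fruits_df[
      np.logical_and(
        fruits_df['fruit'].str.contains(rgx1),
        ~fruits_df['fruit'].str.contains(rgx0)
      )
    ]
result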

Escape sequences¶

  • Characters with special meanings, like ., can be escaped with a backslash \, e.g. \. matches a literal period.
  • Many can instead be placed in brackets, e.g. [.].
In [ ]:
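fruits.append('507@umich.edu')
print(fruits[-1])

for f in fruits:
    # Raw string, so the backslash reaches the regex engine unchanged.
    if re.search(r'\.', f):
        print(f)
    # A dot inside brackets is also a literal dot.
    if re.search('[.]', f):
        print('[' + f + ']')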

Escape sequences¶

  • Because \ is used as an escape character, a literal backslash \ needs to be escaped.
  • Literal backslashes commonly appear in file paths on Windows.
In [ ]:
fruits.append(r'C:\path\file.txt')
fruits[-1]

Escape sequences¶

To avoid unwanted escaping with \ in a regular expression, use raw string literals like r'C:\x' instead of the equivalent 'C:\\x'.

-- Wes McKinney

In [ ]:
for f in fruits:
    if re.search(r'\\', f):
        print(f)
    if re.search('\\\\', f):
        print('ugh!')
        print(f)
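The two patterns in the loop are the same string written two ways. A short sketch confirming this and using the pattern on a path; the variable names are illustrative only.

```python
import re

path = r'C:\path\file.txt'

# Both spellings produce the same two-character pattern: \\
raw_pattern = r'\\'
escaped_pattern = '\\\\'
same = raw_pattern == escaped_pattern

# The regex engine reads \\ as one literal backslash.
n_backslashes = len(re.findall(raw_pattern, path))

# Replace each backslash with a forward slash.
forward = re.sub(raw_pattern, '/', path)
```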

Character Classes¶

  • Various escape sequences can be used to represent specific classes of characters.
    • word characters: \w, roughly [a-zA-Z0-9_],
    • non-word characters: \W,
    • digits: \d = [0-9],
    • non-digits: \D,
    • whitespace: \s,
    • non-whitespace: \S.
In [ ]:
fruits_df[fruits_df['fruit'].str.contains(r'\s')]
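A quick sketch of these classes with re.findall() on a made-up string (text is hypothetical). Raw strings keep the backslashes intact, and the + quantifier groups consecutive characters of the same class.

```python
import re

# Hypothetical text for illustration.
text = 'lot_7: 3 kiwis, 12 figs'

digits = re.findall(r'\d+', text)       # runs of digits
words = re.findall(r'\w+', text)        # runs of word characters (underscore included)
n_spaces = len(re.findall(r'\s', text)) # individual whitespace characters
non_digits = re.findall(r'\D+', '12 figs')
```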

Groups¶

  • Use parentheses to create groups.
  • Groups can be referred back to using an escaped integer, e.g. \1 for the first group.
  • Let's find all fruits with:
    • a double letter
    • a double letter other than "r"
    • a double letter at the end of the word.
In [ ]:
# Raw strings let us write \1 with a single backslash.
fruits_df[fruits_df['fruit'].str.contains(r'(.)\1')]
#fruits_df[fruits_df['fruit'].str.contains(r'([^r])\1')]
#fruits_df[fruits_df['fruit'].str.contains(r'(.)\1$')]
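The three patterns can also be tried with re alone. In this sketch the list sample is hypothetical, and group(1) is used to show which character the group captured.

```python
import re

# Hypothetical sample data for illustration.
sample = ['apple', 'cherry', 'banana', 'lychee']

double = [s for s in sample if re.search(r'(.)\1', s)]      # any double letter
not_r = [s for s in sample if re.search(r'([^r])\1', s)]    # double letter other than "r"
at_end = [s for s in sample if re.search(r'(.)\1$', s)]     # double letter at the end

# group(1) returns the text captured by the first pair of parentheses.
which = re.search(r'(.)\1', 'cherry').group(1)
```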

Regex Crosswords¶

  • Let's practice regular expression concepts by solving the intermediate puzzles from https://regexcrossword.com.

Takeaways¶

  • Regular expressions are used to describe patterns in strings.
  • Use these patterns to search, find and replace, extract or otherwise work with strings.
  • Use regular expressions whenever you can.