Strings and Regular Expressions

Reading
Working with Strings in R
- Files as templates
- String operations
Regular Expressions
- Regex Concepts
Exercise - regex crosswords
Resources

Reading

The stringr readme.
“Strings” (Chapter 14) from Hadley Wickham’s R for Data Science.
An interactive tutorial on regular expressions.

Working with Strings in R

In R, you create strings of type character using either single or double quotes. There is no difference (in R) between the two.

string1 = "This is a string."
string2 = 'This is a string.'
all.equal(string1, string2)

## [1] TRUE

typeof(string1)

## [1] "character"

This is not the case in all languages. For instance, consider the following example from bash.

#!/bin/bash
## A short script to illustrate single vs double quotes

# Specify filename and extension
FILE=my_file
EXT=.txt

# Double quotes allow for expansion
echo "Double quotes: $FILE$EXT"

# Single quotes create a literal string
echo 'Single quotes: $FILE$EXT'

Double quotes: my_file.txt
Single quotes: $FILE$EXT

Returning to R, you can mix single and double quotes when you want to include one or the other within your string.

string_single = "These are sometimes called 'scare quotes'."
print(string_single)

## [1] "These are sometimes called 'scare quotes'."

string_double = 'Quoth the Raven, "Nevermore."'
print(string_double)

## [1] "Quoth the Raven, \"Nevermore.\""

cat(string_double,'\n')

## Quoth the Raven, "Nevermore."

You can also include quotes within a string by escaping them:

string_double = "Quoth the Raven, \"Nevermore.\""
print(string_double)

## [1] "Quoth the Raven, \"Nevermore.\""

cat(string_double,'\n')

## Quoth the Raven, "Nevermore."

Observe the difference between print() and cat() in terms of how the escaped characters are handled. Be aware also that because backslash plays this special role as an escape character, it itself needs to be escaped:

backslash = "This is a backslash '\\', this is not '\'."
writeLines(backslash)

## This is a backslash '\', this is not ''.
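One way to check how many characters an escaped string really holds is nchar(). The short sketch below uses only base R; the variable names are illustrative and not part of the notes above.

```r
# The literal "\\" is typed with two backslashes but stores one character.
one_backslash = "\\"
n_chars = nchar(one_backslash)    # 1

# print() shows the escaped form; writeLines() shows the stored text.
print(one_backslash)              # [1] "\\"
writeLines(one_backslash)         # \

# The same holds for an escaped double quote.
n_quote = nchar("\"")             # 1
```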

Files as templates

Similar to cat() is the function writeLines() used above. The latter is better suited to writing to a file and has the advantage of writing each element of a vector on its own line. Below is an example.

some_file = c('Line 1', 'Line 2')
writeLines(some_file)

## Line 1
## Line 2

## compare to cat
cat(some_file)

## Line 1 Line 2

You should also make note of readLines() for reading the text in a file. It is helpful to realize that readLines() and writeLines() are inverse to one another.
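As a minimal sketch of that round trip, the snippet below writes a character vector to a temporary file and reads it back. The file is created with tempfile(), so nothing in your working directory is touched; the variable names are illustrative.

```r
lines_out = c('Line 1', 'Line 2')

# Write one element per line to a temporary file.
tmp = tempfile(fileext = '.txt')
writeLines(lines_out, con = tmp)

# Read the file back: one line per element.
lines_in = readLines(tmp)
round_trip_ok = identical(lines_in, lines_out)   # TRUE

unlink(tmp)  # clean up
```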

The two can be used in conjunction with a template and a find-and-replace function (see below) to create a series of related .R or other files for automating a series of related tasks.

template = readLines('./template.sh')
writeLines(template)

## #!/bin/bash
## ## A short script to illustrate single vs double quotes
## 
## # Specify filename and extension
## FILE=my_file
## EXT=.txt
## 
## # Double quotes allow for expansion
## echo "Double quotes: $FILE$EXT"
## 
## # Single quotes create a literal string
## echo 'Single quotes: $FILE$EXT'

for ( i in 1:3 ) {
  new_file = sprintf('./template-%i.sh', i)
  writeLines(
    stringr::str_replace_all(template, 'my_file', paste(i) ), 
    con = new_file 
  )
  cat('Wrote ', new_file, '.\n', sep = '')
}

## Wrote ./template-1.sh.
## Wrote ./template-2.sh.
## Wrote ./template-3.sh.
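The same idea can be sketched without an existing template file. In the version below the template lives in memory, `{ID}` is a made-up placeholder, and the job-*.sh file names are hypothetical; base gsub() with fixed = TRUE does the replacement.

```r
# Hypothetical in-memory template; {ID} is an invented placeholder.
template = c('#!/bin/bash', 'FILE=run_{ID}', 'echo "Processing $FILE"')
out_dir = tempdir()

for ( i in 1:3 ) {
  new_file = file.path(out_dir, sprintf('job-%i.sh', i))
  # fixed = TRUE treats '{ID}' as literal text, not a regular expression.
  filled = gsub('{ID}', as.character(i), template, fixed = TRUE)
  writeLines(filled, con = new_file)
}

# Inspect the second line of one of the generated files.
second_line = readLines(file.path(out_dir, 'job-2.sh'))[2]   # "FILE=run_2"
```

Using a distinctive placeholder makes it less likely that the replacement touches text you did not intend to change.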

String operations

The table below collects some common string operations from base R and their parallels in the stringr package. I say parallels and not equivalents because they do not always behave in the same way. If you are new to string manipulation in R, I suggest you use stringr for these operations. However, you should be aware of the common base functions as you may encounter them in code written by others (including me).

operation	base	stringr
join	`paste`	`str_c`
substring	`substr`	`str_sub`
split	`strsplit`	`str_split`
search	`grep`, `grepl`	`str_which`,`str_detect`

concatenating strings

The functions paste and stringr::str_c are both used to join strings together.

Observe the difference between the sep and collapse arguments in paste.

length(LETTERS)

## [1] 26

paste(LETTERS, collapse = "")

## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

paste(1:26, LETTERS, sep = ': ')

##  [1] "1: A"  "2: B"  "3: C"  "4: D"  "5: E"  "6: F"  "7: G"  "8: H"  "9: I" 
## [10] "10: J" "11: K" "12: L" "13: M" "14: N" "15: O" "16: P" "17: Q" "18: R"
## [19] "19: S" "20: T" "21: U" "22: V" "23: W" "24: X" "25: Y" "26: Z"

paste(1:26, LETTERS, sep = ': ', collapse = '\n ')

## [1] "1: A\n 2: B\n 3: C\n 4: D\n 5: E\n 6: F\n 7: G\n 8: H\n 9: I\n 10: J\n 11: K\n 12: L\n 13: M\n 14: N\n 15: O\n 16: P\n 17: Q\n 18: R\n 19: S\n 20: T\n 21: U\n 22: V\n 23: W\n 24: X\n 25: Y\n 26: Z"

Below we see that str_c behaves similarly.

library(stringr) # stringr is in the tidyverse
all.equal(str_c(LETTERS, collapse = ""), paste(LETTERS, collapse = "") )

## [1] TRUE

all.equal(str_c(1:26, LETTERS, sep = ': '), paste(1:26, LETTERS, sep = ': ') )

## [1] TRUE

all.equal(str_c(1:26, LETTERS, sep = ': ', collapse = '\n '), 
           paste(1:26, LETTERS, sep = ': ', collapse = '\n ') 
)

## [1] TRUE

However, these functions differ in the treatment of missing values (NA).

paste(1:3, c(1, NA, 3), sep = ':', collapse = ', ')

## [1] "1:1, 2:NA, 3:3"

str_c(1:3, c(1, NA, 3), sep = ':', collapse = ', ')

## [1] NA

str_c(1:3, str_replace_na(c(1, NA, 3)), sep = ":", collapse = ', ')

## [1] "1:1, 2:NA, 3:3"

length

Recall that length returns the length of a vector. To get the length of a string use nchar or str_length.

length(paste(LETTERS, collapse = "") )

## [1] 1

nchar(paste(LETTERS, collapse = "") )

## [1] 26

str_length(paste(LETTERS, collapse = "") )

## [1] 26
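Both nchar and str_length are vectorized, so they return one count per element. A small base R sketch with an illustrative vector:

```r
words = c('apple', 'kiwi fruit', '')

n_words = length(words)   # 3: number of elements in the vector
n_chars = nchar(words)    # 5 10 0: characters in each element (spaces count)
```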

substrings

The following functions extract sub-strings at given positions.

substr('Strings',  3, 7)

## [1] "rings"

str_sub('Strings', 1, 6)

## [1] "String"

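The function stringr::str_sub supports negative indexing, where negative positions count backward from the end of the string. The base function substr does not and returns an empty string here.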

sprintf('base: %s, stringr: %s', 
        substr('Strings', -5, -1), 
        str_sub('Strings', -5, -1)
)

## [1] "base: , stringr: rings"
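If you need the last few characters in base R, one option is to compute positions from nchar(). This is a minimal sketch, not the only approach:

```r
x = 'Strings'
n = nchar(x)

# The last five characters run from position n - 4 to position n.
last_five = substr(x, n - 4, n)   # "rings"
```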

finding matches

The example below uses the vector fruit from the stringr package.

The base function grep returns the indices of all strings within a vector that contain the requested pattern. The grepl function behaves in the same way but returns a logical vector of the same length as the input x.

head(fruit)

## [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
## [6] "bilberry"

grep('fruit', fruit)

## [1] 12 26 35 39 42 57 75 79

which(grepl('fruit', fruit) )

## [1] 12 26 35 39 42 57 75 79

head(grepl('fruit', fruit) )

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

grepl('fruit', fruit)[grep('fruit', fruit)]

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

These functions are vectorized over the input but not the pattern.

grep(c('fruit', 'berry'), fruit)

## Warning in grep(c("fruit", "berry"), fruit): argument 'pattern' has length > 1
## and only the first element will be used

## [1] 12 26 35 39 42 57 75 79

sapply(c('fruit', 'berry'), grep, x = fruit)

## $fruit
## [1] 12 26 35 39 42 57 75 79
## 
## $berry
##  [1]  6  7 10 11 19 21 29 32 33 38 50 70 73 76

The match function is vectorized over the input, but returns only the first match and requires exact matching.

match('berry', fruit)

## [1] NA

match(c('apple', 'pear'), c(fruit,fruit) )

## [1]  1 59

The corresponding stringr functions are vectorized over both pattern and input, but the vectorization works by recycling: a shorter pattern vector is repeated along the input and matched element by element, so be careful. Pay attention also to the order in which the string and pattern are supplied, as it is the reverse of the base functions.

ind_fruit = which(str_detect(fruit, 'fruit') )
ind_berry = which(str_detect(fruit, 'berry') )

ind_either = which(str_detect(fruit, c('fruit','berry') ) )
setdiff(union(ind_fruit, ind_berry), ind_either )

##  [1] 12 26 42  7 11 19 21 29 33 73

# Below we demonstrate the recycling pattern
ind_odd = seq(1, length(fruit), 2)
ind_even = seq(2, length(fruit), 2)

odd_fruit = ind_odd[ str_detect(fruit[ind_odd], 'fruit') ]
even_berry = ind_even[ str_detect(fruit[ind_even], 'berry') ]
setdiff(union(odd_fruit, even_berry), ind_either )

## numeric(0)

The vectorization in this case doesn’t help us to avoid the sapply pattern we used with grep.

sapply(c('fruit', 'berry'), function(x) which(str_detect(fruit, x) ) )

## $fruit
## [1] 12 26 35 39 42 57 75 79
## 
## $berry
##  [1]  6  7 10 11 19 21 29 32 33 38 50 70 73 76
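When the goal is to find strings matching either pattern, two common approaches are to join the patterns with the alternation operator | or to test each pattern separately and combine the results. The sketch below uses base R and a small illustrative vector rather than fruit.

```r
x = c('apple', 'blueberry', 'grapefruit', 'star fruit', 'pear')
patterns = c('fruit', 'berry')

# Build the alternation 'fruit|berry' and search once.
either_regex = paste(patterns, collapse = '|')
ind_regex = which(grepl(either_regex, x))          # 2 3 4

# Or test each pattern separately and combine the columns with a logical OR.
hits = sapply(patterns, grepl, x = x)              # 5 x 2 logical matrix
ind_or = unname(which(rowSums(hits) > 0))          # 2 3 4
```

The alternation approach assumes the patterns contain no characters that change meaning when joined with |.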

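Like str_detect, str_locate recycles the pattern along the input rather than applying an “OR” operator. The comparison below returns an empty result only because union and setdiff treat the matrices returned by str_locate as plain vectors of start and end positions, so it checks that the same position values occur, not that each string was tested against both patterns. To match either pattern, use the alternation 'fruit|berry'.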

ind_fruit = str_locate(fruit, 'fruit')
ind_berry = str_locate(fruit, 'berry')

ind_either = str_locate(fruit, c('fruit','berry'))
setdiff(union(ind_fruit, ind_berry), ind_either )

## integer(0)
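To see what str_locate returns, it helps to look at a small input. The sketch below assumes stringr is installed and uses an illustrative vector; it searches for either word in a single call by way of the alternation operator.

```r
library(stringr)

x = c('grapefruit', 'blueberry', 'pear')

# One row per input string; columns give the start and end of the first match.
loc = str_locate(x, 'fruit|berry')
loc
##      start end
## [1,]     6  10
## [2,]     5   9
## [3,]    NA  NA
```

Strings with no match get a row of NA values, so the result always has as many rows as the input.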

find and replace

For find and replace operations, you can use one of str_replace or str_replace_all. The former matches only the first instance of pattern. Similar base functions are sub and gsub.

# abc ... 
letter_vec = paste(letters, collapse = '')

## replace all instances
str_replace_all(letter_vec, '[aeiou]', 'X')

## [1] "XbcdXfghXjklmnXpqrstXvwxyz"

## replace only the first instance
str_replace(letter_vec, '[aeiou]', 'X')

## [1] "Xbcdefghijklmnopqrstuvwxyz"
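For comparison, here is the same pair of replacements using the base functions. Note that in sub and gsub the pattern and replacement come first and the string comes last.

```r
letter_vec = paste(letters, collapse = '')

# gsub replaces every match; sub replaces only the first.
all_vowels  = gsub('[aeiou]', 'X', letter_vec)   # "XbcdXfghXjklmnXpqrstXvwxyz"
first_vowel = sub('[aeiou]', 'X', letter_vec)    # "Xbcdefghijklmnopqrstuvwxyz"
```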

You can also extract or replace a substring by its position in a string using str_sub.

# extract by location; start and end are vectorized
str_sub(letter_vec, start = 1:3, end = 2:4)

## [1] "ab" "bc" "cd"

str_sub(letter_vec, start = -3, end = -1) = 'XXX'
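Base R offers a similar replacement form of substr. A minimal sketch follows, computing the positions from nchar() because substr does not accept negative indices. Unlike str_sub, the base version never changes the length of the string.

```r
letter_vec = paste(letters, collapse = '')
n = nchar(letter_vec)

# Overwrite the last three characters in place.
substr(letter_vec, n - 2, n) = 'XXX'
letter_vec   # "abcdefghijklmnopqrstuvwXXX"
```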

splitting strings

The base function strsplit can be used to split a string into pieces based on a pattern. The example below finds all two-word fruit names from fruit.

fruit_list = strsplit(fruit,' ')
two_ind = which(sapply(fruit_list, length)==2)
fruit_two = lapply(fruit_list[two_ind], paste, collapse=' ')
unlist(fruit_two)

##  [1] "bell pepper"       "blood orange"      "canary melon"     
##  [4] "chili pepper"      "goji berry"        "kiwi fruit"       
##  [7] "purple mangosteen" "rock melon"        "salal berry"      
## [10] "star fruit"        "ugli fruit"

The str_split function behaves similarly for this simple case.

all.equal(fruit_list, str_split(fruit, ' '))

## [1] TRUE

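When given a vector of patterns, the functions strsplit and str_split behave differently. With a single input string, strsplit uses only the first pattern, while str_split recycles the string and returns one split per pattern.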

string = '1;2,3'
strsplit(string, c(';', ','))

## [[1]]
## [1] "1"   "2,3"

str_split(string, c(';', ','))

## [[1]]
## [1] "1"   "2,3"
## 
## [[2]]
## [1] "1;2" "3"

# Use a regular expression to split on either character. 
str_split(string,';|,')

## [[1]]
## [1] "1" "2" "3"
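The base function strsplit also accepts a regular expression, so the same split on either character can be sketched without stringr. Here a bracket expression is used in place of the alternation; the inputs are illustrative.

```r
# Split one string on either ';' or ','.
parts = strsplit('1;2,3', '[;,]')[[1]]        # "1" "2" "3"

# With several strings, the result is a list with one element per string.
many = strsplit(c('1;2,3', '4,5'), '[;,]')
n_parts = lengths(many)                       # 3 2
```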

Regular Expressions

Regular expressions (“regexp” or “regex”) are a compact way to describe patterns in strings. There is a common regexp vocabulary, though some details differ between implementations and standards. The basic idea is illustrated in the examples below using the fruit data from the stringr package.

## find all two word fruits by searching for a space
fruit[grep(" ", fruit)]

##  [1] "bell pepper"       "blood orange"      "canary melon"     
##  [4] "chili pepper"      "goji berry"        "kiwi fruit"       
##  [7] "purple mangosteen" "rock melon"        "salal berry"      
## [10] "star fruit"        "ugli fruit"

## find all fruits with an 'a' anywhere in the word
fruit[grep("a", fruit)]

##  [1] "apple"             "apricot"           "avocado"          
##  [4] "banana"            "blackberry"        "blackcurrant"     
##  [7] "blood orange"      "breadfruit"        "canary melon"     
## [10] "cantaloupe"        "cherimoya"         "cranberry"        
## [13] "currant"           "damson"            "date"             
## [16] "dragonfruit"       "durian"            "eggplant"         
## [19] "feijoa"            "grape"             "grapefruit"       
## [22] "guava"             "jackfruit"         "jambul"           
## [25] "kumquat"           "loquat"            "mandarine"        
## [28] "mango"             "nectarine"         "orange"           
## [31] "pamelo"            "papaya"            "passionfruit"     
## [34] "peach"             "pear"              "physalis"         
## [37] "pineapple"         "pomegranate"       "purple mangosteen"
## [40] "raisin"            "rambutan"          "raspberry"        
## [43] "redcurrant"        "salal berry"       "satsuma"          
## [46] "star fruit"        "strawberry"        "tamarillo"        
## [49] "tangerine"         "watermelon"

## find all fruits starting with 'a'
fruit[grep("^a", fruit)]

## [1] "apple"   "apricot" "avocado"

## find all fruits ending with 'a'
fruit[grep("a$", fruit)]

## [1] "banana"    "cherimoya" "feijoa"    "guava"     "papaya"    "satsuma"

## find all fruits starting with a vowel
fruit[grep("^[aeiou]", fruit)]

## [1] "apple"      "apricot"    "avocado"    "eggplant"   "elderberry"
## [6] "olive"      "orange"     "ugli fruit"

## find all fruits with two consecutive vowels
fruit[grep("[aeiou]{2}", fruit)]

##  [1] "blood orange"      "blueberry"         "breadfruit"       
##  [4] "cantaloupe"        "cloudberry"        "dragonfruit"      
##  [7] "durian"            "feijoa"            "gooseberry"       
## [10] "grapefruit"        "guava"             "jackfruit"        
## [13] "kiwi fruit"        "kumquat"           "loquat"           
## [16] "lychee"            "passionfruit"      "peach"            
## [19] "pear"              "pineapple"         "purple mangosteen"
## [22] "quince"            "raisin"            "star fruit"       
## [25] "ugli fruit"

## find all fruits ending with two consecutive consonants other than r
fruit[grep("[^aeiour]{2}$", fruit)]

## [1] "blackcurrant" "currant"      "eggplant"     "peach"        "redcurrant"

In the examples above, we return all strings matching a simple pattern.

We can anchor the pattern to the beginning (^a) or end (a$) of a string. We can provide multiple options for a single character within brackets []. A ^ as the first character inside brackets negates the set, which is a different use of ^ than the anchor. Curly braces ask for a specific number of repetitions {n} or a range {min,max}.
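A small sketch of anchors, negated brackets, and a repetition range on a toy vector may help. The vector is illustrative; only base grepl is used.

```r
x = c('a', 'aa', 'aaa', 'aaaa', 'baa')

# Two or three a's and nothing else: both anchors plus a {min,max} range.
two_or_three = grepl('^a{2,3}$', x)   # FALSE TRUE TRUE FALSE FALSE

# First character is anything other than 'a': ^ as anchor, then ^ as negation.
starts_not_a = grepl('^[^a]', x)      # FALSE FALSE FALSE FALSE TRUE
```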

In the example below we use . to match any (single) character. This behaves much like ? in a Linux file name. We can ask for multiple matches by appending * if we want 0 or more matches and + if we want at least 1 match.

## find all fruits with two consecutive vowels twice, separated by any single
## character
fruit[grep("[aeiou]{2}.[aeiou]{2}", fruit)]

## [1] "feijoa"

## find all fruits with two consecutive vowels twice, separated by one or
## more characters
fruit[grep("[aeiou]{2}.+[aeiou]{2}", fruit)]

## [1] "breadfruit"   "feijoa"       "passionfruit"

## find all fruits with exactly three consecutive consonants in the middle of
## two vowels
fruit[grep("[aeiou][^aeiou ]{3}[aeiou]", fruit)]

##  [1] "apple"             "blackberry"        "blackcurrant"     
##  [4] "breadfruit"        "dragonfruit"       "huckleberry"      
##  [7] "passionfruit"      "pineapple"         "purple mangosteen"
## [10] "raspberry"

#str_view(fruit, "[aeiou][^aeiou ]{3}[aeiou]")
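The difference between ., * and + is easiest to see on short strings. The following sketch uses an illustrative vector and anchors each pattern so the whole string must match.

```r
x = c('ac', 'abc', 'abbc', 'a-c')

# . matches exactly one character of any kind.
any_one   = grepl('^a.c$', x)    # FALSE TRUE FALSE TRUE

# b* allows zero or more b's; b+ requires at least one.
zero_more = grepl('^ab*c$', x)   # TRUE TRUE TRUE FALSE
one_more  = grepl('^ab+c$', x)   # FALSE TRUE TRUE FALSE
```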

To match an actual period (or other meta-character) we need to escape it with a backslash. Thus, we write the regular expression \. as the R string '\\.'.

c(fruit, "umich.edu")[grep('\\.', c(fruit, "umich.edu"))]

## [1] "umich.edu"

The double backslash is needed because the regular expression itself is passed as a string, and R strings also use the backslash as an escape character. The same applies when building Windows file paths as strings. In languages with regex literals or raw strings, a single backslash is generally enough.
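There are a few ways to match a literal period in R besides the double backslash. The sketch below compares them on an illustrative vector; the raw string form in the last line assumes R 4.0.0 or later.

```r
x = c('umich.edu', 'umich_edu')

unescaped = grepl('.', x)                  # TRUE TRUE: . matches any character
escaped   = grepl('\\.', x)                # TRUE FALSE
bracketed = grepl('[.]', x)                # TRUE FALSE: . is literal inside []
literal   = grepl('.', x, fixed = TRUE)    # TRUE FALSE: no regex interpretation

# Raw strings (R >= 4.0.0) avoid doubling the backslash.
raw       = grepl(r"(\.)", x)              # TRUE FALSE
```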

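Matched values can be grouped using parentheses () and referred back to, in the order the groups appear, using a back reference such as \\1. In the examples below, (.)\\1 matches any character immediately followed by a copy of itself.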

## find all fruits with a doubled character (the same character twice in a row)
fruit[grep("(.)\\1", fruit)]

##  [1] "apple"             "bell pepper"       "bilberry"         
##  [4] "blackberry"        "blackcurrant"      "blood orange"     
##  [7] "blueberry"         "boysenberry"       "cherry"           
## [10] "chili pepper"      "cloudberry"        "cranberry"        
## [13] "currant"           "eggplant"          "elderberry"       
## [16] "goji berry"        "gooseberry"        "huckleberry"      
## [19] "lychee"            "mulberry"          "passionfruit"     
## [22] "persimmon"         "pineapple"         "purple mangosteen"
## [25] "raspberry"         "redcurrant"        "salal berry"      
## [28] "strawberry"        "tamarillo"

## find all fruits with a doubled character other than r
fruit[grep("([^r])\\1", fruit)]

##  [1] "apple"             "bell pepper"       "blood orange"     
##  [4] "chili pepper"      "eggplant"          "gooseberry"       
##  [7] "lychee"            "passionfruit"      "persimmon"        
## [10] "pineapple"         "purple mangosteen" "tamarillo"

## find all fruits that end with a doubled character
fruit[grep("(.)\\1$", fruit)]

## [1] "lychee"
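Back references can also be used in the replacement argument of sub and gsub. As a short sketch with illustrative inputs:

```r
# Swap two words by capturing each and reversing their order.
swapped = sub('(\\w+) (\\w+)', '\\2 \\1', 'star fruit')    # "fruit star"

# Mark every doubled character by wrapping it in square brackets.
marked = gsub('(.)\\1', '[\\1\\1]', 'gooseberry')          # "g[oo]sebe[rr]y"
```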

Regex Concepts

You are already familiar with the use of the grep utility from the Linux command line. For more on using regexp with grep skim the man page for GNU grep focusing on the following concepts:

Single character regexp, meta-characters, and escaping
Bracket expressions, range expressions, and character classes
Character classes denoted using a backslash
Repetition
Concatenation, alternation (i.e. logical or), and precedence
Grouping with parentheses and back references
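Several of these concepts can be tried directly in R. The sketch below uses grepl on an illustrative vector to show a range expression, two ways of writing a digit class, and how grouping changes the precedence of alternation.

```r
x = c('cat', 'dog', 'cats', 'hotdog', 'a1', 'b22')

# Range expression: first character between a and c.
range_expr = grepl('^[a-c]', x)             # TRUE FALSE TRUE FALSE TRUE TRUE

# Character classes: POSIX bracket form and backslash form.
two_digits = grepl('[[:digit:]]{2}', x)     # FALSE FALSE FALSE FALSE FALSE TRUE
any_digit  = grepl('\\d', x)                # FALSE FALSE FALSE FALSE TRUE TRUE

# Alternation binds loosely: '^cat|dog$' means "starts with cat" OR "ends with dog".
ungrouped = grepl('^cat|dog$', x)           # TRUE TRUE TRUE TRUE FALSE FALSE

# Parentheses make the anchors apply to both alternatives.
grouped   = grepl('^(cat|dog)$', x)         # TRUE TRUE FALSE FALSE FALSE FALSE
```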

Exercise - regex crosswords

During a Tuesday activity, you will work with your group to solve the intermediate puzzles to help cement your regexp understanding. To practice, solve the basic puzzles at Regex Crossword – you will submit these as a quiz. Consider working on the other puzzles to enhance your understanding of regular expressions.

Resources

“String Manipulation” (Chapter 11) in Matloff’s The Art of R Programming.
Tip sheet for regular expressions in SAS.
Some tips for using regular expressions in Stata:
1. https://www.stata.com/support/faqs/data-management/regular-expressions/
2. https://stats.idre.ucla.edu/stata/faq/how-can-i-extract-a-portion-of-a-string-variable-using-regular-expressions/