In R, you create strings of type character using either single or double quotes. There is no difference (in R) between the two.
string1 = "This is a string."
string2 = 'This is a string.'
all.equal(string1, string2)
## [1] TRUE
typeof(string1)
## [1] "character"
This is not the case in all languages. For instance, consider the following example from bash.
#!/bin/bash
## A short script to illustrate single vs double quotes
# Specify filename and extension
FILE=my_file
EXT=.txt
# Double quotes allow for expansion
echo "Double quotes: $FILE$EXT"
# Single quotes create a literal string
echo 'Single quotes: $FILE$EXT'
Double quotes: my_file.txt
Single quotes: $FILE$EXT
Returning to R, you can mix single and double quotes when you want to include one or the other within your string.
string_single = "These are sometimes called 'scare quotes'."
print(string_single)
## [1] "These are sometimes called 'scare quotes'."
string_double = 'Quoth the Raven "Nevermore."'
print(string_double)
## [1] "Quoth the Raven \"Nevermore.\""
cat(string_double,'\n')
## Quoth the Raven "Nevermore."
You can also include quotes within a string by escaping them:
string_double = "Quoth the Raven \"Nevermore.\""
print(string_double)
## [1] "Quoth the Raven \"Nevermore.\""
cat(string_double,'\n')
## Quoth the Raven "Nevermore."
Observe the difference between print() and cat() in terms of how the escaped characters are handled. Be aware also that because backslash plays this special role as an escape character, it itself needs to be escaped:
backslash = "This is a backslash '\\', this is not '\ '."
writeLines(backslash)
## This is a backslash '\', this is not ' '.
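To make the escaping rule concrete, here is a minimal sketch. The variable name `path_like` and the path itself are made up for illustration. It shows that an escaped backslash is stored as a single character, and that print() and writeLines() differ only in how they display it.

```r
# Each escape sequence is stored as a single character.
path_like = "C:\\temp\\notes.txt"   # hypothetical Windows-style path

n_chars = nchar(path_like)   # each "\\" counts as one character
n_chars

print(path_like)        # displays the escaped form, with doubled backslashes
writeLines(path_like)   # displays the string as it is actually stored

nchar("\n")             # a newline escape is also a single character
```

If the count looks too small, remember that the doubled backslashes exist only in the source code, not in the string.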
The table below collects some common string operations from base R and their parallels in the stringr package. I say parallels and not equivalents because they do not always behave in the same way. If you are getting started in R, I suggest you use stringr for these operations. You should, however, be aware of the common base functions as you may encounter them in code written by others.
operation | base | stringr |
---|---|---|
join | paste | str_c |
subset | substr | str_sub |
split | strsplit | str_split |
search | grep, grepl | str_locate, str_detect |
The functions paste and stringr::str_c are both used to join strings together.
Observe the difference between the sep and collapse arguments in paste.
length(LETTERS)
## [1] 26
paste(LETTERS,collapse="")
## [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
paste(1:26,LETTERS,sep=': ')
## [1] "1: A" "2: B" "3: C" "4: D" "5: E" "6: F" "7: G" "8: H"
## [9] "9: I" "10: J" "11: K" "12: L" "13: M" "14: N" "15: O" "16: P"
## [17] "17: Q" "18: R" "19: S" "20: T" "21: U" "22: V" "23: W" "24: X"
## [25] "25: Y" "26: Z"
paste(1:26,LETTERS,sep=': ',collapse='\n ')
## [1] "1: A\n 2: B\n 3: C\n 4: D\n 5: E\n 6: F\n 7: G\n 8: H\n 9: I\n 10: J\n 11: K\n 12: L\n 13: M\n 14: N\n 15: O\n 16: P\n 17: Q\n 18: R\n 19: S\n 20: T\n 21: U\n 22: V\n 23: W\n 24: X\n 25: Y\n 26: Z"
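A smaller version of the same idea may be easier to read. The vectors `first` and `second` below are invented for illustration. The sketch shows that sep is placed between the arguments element by element, while collapse joins the resulting vector into a single string.

```r
first = c("a", "b", "c")    # hypothetical inputs
second = c("x", "y", "z")

# sep goes between the arguments, element by element
pairs = paste(first, second, sep = "-")
pairs                       # "a-x" "b-y" "c-z"

# collapse then joins the resulting vector into one string
joined = paste(first, second, sep = "-", collapse = "+")
joined                      # "a-x+b-y+c-z"

# paste0 is shorthand for paste with sep = ""
glued = paste0(first, second)
glued                       # "ax" "by" "cz"
```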
Below we see that str_c behaves similarly.
library(stringr)
all.equal(str_c(LETTERS,collapse=""), paste(LETTERS, collapse=""))
## [1] TRUE
all.equal(str_c(1:26,LETTERS,sep=': '), paste(1:26, LETTERS,sep=': '))
## [1] TRUE
all.equal(str_c(1:26,LETTERS,sep=': ', collapse='\n '), paste(1:26, LETTERS,sep=': ', collapse='\n '))
## [1] TRUE
However, these functions differ in the treatment of missing values (NA).
paste(1:3,c(1,NA,3),sep=':', collapse=', ')
## [1] "1:1, 2:NA, 3:3"
str_c(1:3,c(1,NA,3),sep=':', collapse=', ')
## [1] NA
str_c(1:3, str_replace_na(c(1,NA,3)), collapse=', ')
## [1] "11, 2NA, 33"
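If you are working in base R but want missing values to stay missing, one option is to restore them after pasting. This is only a sketch of one approach, and the names `plain` and `strict` are my own.

```r
x = 1:3
y = c(1, NA, 3)

# paste converts NA to the text "NA"
plain = paste(x, y, sep = ":")
plain      # "1:1" "2:NA" "3:3"

# Put a real NA back wherever either input was missing
strict = ifelse(is.na(x) | is.na(y), NA_character_, plain)
strict     # "1:1" NA "3:3"
```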
Recall that length returns the length of a vector. To get the length of a string use nchar or str_length.
length(paste(LETTERS,collapse=""))
## [1] 1
nchar(paste(LETTERS,collapse=""))
## [1] 26
str_length(paste(LETTERS,collapse=""))
## [1] 26
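The distinction is easier to see with a vector holding more than one string. In the sketch below the vector `words` is invented for illustration: length counts elements, while nchar is vectorized and counts characters in each element.

```r
words = c("a", "bb", "", "dddd")   # hypothetical input

n_elements = length(words)   # number of strings in the vector
n_elements                   # 4

n_chars = nchar(words)       # number of characters in each string
n_chars                      # 1 2 0 4
```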
The following functions extract sub-strings at given positions.
substr('Strings',3,7)
## [1] "rings"
str_sub('Strings',1,6)
## [1] "String"
The function stringr::str_sub supports negative indexing.
sprintf('base: %s, stringr: %s', substr('Strings',-5,-1), str_sub('Strings',-5,-1))
## [1] "base: , stringr: rings"
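In base R you can get the same effect by computing positions from nchar. The helper `substr_right` below is a hypothetical name of my own, not a base or stringr function. It is a sketch that assumes k is no larger than the string length.

```r
x = "Strings"
n = nchar(x)

# The last five characters, counted from the end
last_five = substr(x, n - 4, n)
last_five                          # "rings"

# Hypothetical helper: the last k characters of each string in s
substr_right = function(s, k) substr(s, nchar(s) - k + 1, nchar(s))
tails = substr_right(c("Strings", "apple"), 3)
tails                              # "ngs" "ple"
```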
The example below uses the vector fruit from the stringr package.
The base function grep returns the indices of all strings within a vector that contain the requested pattern. The grepl function behaves in the same way but returns a logical vector of the same length as the input x.
head(fruit)
## [1] "apple" "apricot" "avocado" "banana" "bell pepper"
## [6] "bilberry"
grep('fruit', fruit)
## [1] 12 26 35 39 42 57 75 79
which(grepl('fruit', fruit))
## [1] 12 26 35 39 42 57 75 79
head(grepl('fruit', fruit))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
These functions are vectorized over the input but not the pattern.
grep(c('fruit', 'berry'), fruit)
## Warning in grep(c("fruit", "berry"), fruit): argument 'pattern' has length
## > 1 and only the first element will be used
## [1] 12 26 35 39 42 57 75 79
sapply(c('fruit', 'berry'), grep, x=fruit)
## $fruit
## [1] 12 26 35 39 42 57 75 79
##
## $berry
## [1] 6 7 10 11 19 21 29 32 33 38 50 70 73 76
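When what you want is "matches any of these patterns", two base R options are to combine the logical vectors from grepl or to write the alternatives into one regular expression with `|`. The small vector `x` below is invented so the example does not depend on stringr.

```r
x = c("kiwi fruit", "cherry", "blueberry", "fig")   # hypothetical input
patterns = c("fruit", "berry")

# One logical vector per pattern, then combine them with OR
hits = lapply(patterns, grepl, x = x)
either = Reduce(`|`, hits)
which(either)                        # 1 3

# The same search written as a single pattern using alternation
alt = grep("fruit|berry", x)
alt                                  # 1 3
```

Note that "cherry" is not matched because it contains "erry" but not "berry".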
The match function is vectorized over the input, but returns only the first match and requires exact matching.
match('berry',fruit)
## [1] NA
match(c('apple', 'pear'), c(fruit,fruit))
## [1] 1 59
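A short sketch, using a made-up vector `table_vals`, shows the "first match only" behavior alongside two related base idioms: %in% when you only need to know whether a value is present, and which with == when you need every position.

```r
table_vals = c("a", "b", "b", "c")   # hypothetical lookup vector

# match returns the position of the first exact match, or NA
first_pos = match(c("b", "z"), table_vals)
first_pos                            # 2 NA

# %in% reports only whether each value is present
present = c("b", "z") %in% table_vals
present                              # TRUE FALSE

# which() with == returns every matching position
all_pos = which(table_vals == "b")
all_pos                              # 2 3
```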
The corresponding stringr functions are vectorized over both pattern and input, but the vectorization recycles the shorter argument element by element (what other languages call broadcasting), so be careful. Pay attention to the order in which the string and pattern are supplied.
ind_fruit = which(str_detect(fruit, 'fruit'))
ind_berry = which(str_detect(fruit, 'berry'))
ind_either = which(str_detect(fruit, c('fruit','berry')))
setdiff(union(ind_fruit, ind_berry), ind_either)
## [1] 12 26 42 7 11 19 21 29 33 73
ind_odd = seq(1, length(fruit), 2)
ind_even = seq(2, length(fruit), 2)
odd_fruit = ind_odd[str_detect(fruit[ind_odd], 'fruit')]
even_berry = ind_even[str_detect(fruit[ind_even], 'berry')]
setdiff(union(odd_fruit, even_berry), ind_either)
## numeric(0)
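The same behavior is easier to see on a tiny input. In this sketch the vector `x` is invented for illustration. The two patterns are recycled to line up with the four strings, so each string is tested against only one pattern.

```r
library(stringr)

x = c("a", "a", "b", "b")   # hypothetical input

# The pattern is recycled to "a", "b", "a", "b" and compared position by position
paired = str_detect(x, c("a", "b"))
paired        # TRUE FALSE FALSE TRUE

# What an "either pattern" search would give instead
either = str_detect(x, "a") | str_detect(x, "b")
either        # TRUE TRUE TRUE TRUE
```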
The vectorization in this case doesn’t help us to avoid the sapply pattern we used with grep.
sapply(c('fruit', 'berry'), function(x) which(str_detect(fruit,x)))
## $fruit
## [1] 12 26 35 39 42 57 75 79
##
## $berry
## [1] 6 7 10 11 19 21 29 32 33 38 50 70 73 76
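Be careful when running the same check on str_locate, which returns a matrix of start and end positions and recycles the pattern in the same way. The comparison below comes back empty only because union and setdiff treat those matrices as plain sets of position values, not because the patterns are combined with an “OR”.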
ind_fruit = str_locate(fruit, 'fruit')
ind_berry = str_locate(fruit, 'berry')
ind_either = str_locate(fruit, c('fruit','berry'))
setdiff(union(ind_fruit, ind_berry), ind_either)
## integer(0)
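A quick way to see what str_locate does with a vector of patterns is to try it on a tiny input where the answer is obvious. This sketch relies on my reading of the stringr documentation, which describes str_locate as vectorised over string and pattern. The vector `x` is invented for illustration.

```r
library(stringr)

x = c("ab", "ab")   # hypothetical input

# Row 1 is the location of "a" in x[1]; row 2 is the location of "b" in x[2]
loc = str_locate(x, c("a", "b"))
loc
```

If the patterns were combined with an "OR", both rows would report position 1, since "a" is the first match in each string.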
The base function strsplit can be used to split a string into pieces based on a pattern. The example below finds all two-word fruit names from fruit.
fruit_list = strsplit(fruit,' ')
two_ind = which(sapply(fruit_list, length)==2)
fruit_two = lapply(fruit_list[two_ind], paste, collapse=' ')
unlist(fruit_two)
## [1] "bell pepper" "blood orange" "canary melon"
## [4] "chili pepper" "goji berry" "kiwi fruit"
## [7] "purple mangosteen" "rock melon" "salal berry"
## [10] "star fruit" "ugli fruit"
all.equal(fruit_list, str_split(fruit, ' '))
## [1] TRUE
When there are multiple matches at the split point, strsplit and str_split can behave differently, although for the simple input below they return the same pieces.
string = '1;2;3'
strsplit(string, ';')
## [[1]]
## [1] "1" "2" "3"
str_split(string,';')
## [[1]]
## [1] "1" "2" "3"
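One difference I am confident of involves a delimiter at the very end of the string. This may or may not be the case the text above has in mind, so treat it as a separate illustration. Base strsplit drops the empty piece after a trailing delimiter, while str_split keeps it.

```r
library(stringr)

string = "1;2;3;"   # note the trailing delimiter

base_pieces = strsplit(string, ";")[[1]]
base_pieces         # "1" "2" "3"

stringr_pieces = str_split(string, ";")[[1]]
stringr_pieces      # "1" "2" "3" ""
```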
Regular expressions are a way to describe patterns in strings. There is a common regexp vocabulary though some details differ between languages.
The basic idea is illustrated in the examples below using the fruit data.
## find all two word fruits
fruit[grep(" ",fruit)]
## [1] "bell pepper" "blood orange" "canary melon"
## [4] "chili pepper" "goji berry" "kiwi fruit"
## [7] "purple mangosteen" "rock melon" "salal berry"
## [10] "star fruit" "ugli fruit"
## find all fruits with an 'a'
fruit[grep("a", fruit)]
## [1] "apple" "apricot" "avocado"
## [4] "banana" "blackberry" "blackcurrant"
## [7] "blood orange" "breadfruit" "canary melon"
## [10] "cantaloupe" "cherimoya" "cranberry"
## [13] "currant" "damson" "date"
## [16] "dragonfruit" "durian" "eggplant"
## [19] "feijoa" "grape" "grapefruit"
## [22] "guava" "jackfruit" "jambul"
## [25] "kumquat" "loquat" "mandarine"
## [28] "mango" "nectarine" "orange"
## [31] "pamelo" "papaya" "passionfruit"
## [34] "peach" "pear" "physalis"
## [37] "pineapple" "pomegranate" "purple mangosteen"
## [40] "raisin" "rambutan" "raspberry"
## [43] "redcurrant" "salal berry" "satsuma"
## [46] "star fruit" "strawberry" "tamarillo"
## [49] "tangerine" "watermelon"
## find all fruits starting with 'a'
fruit[grep("^a", fruit)]
## [1] "apple" "apricot" "avocado"
## find all fruits ending with 'a'
fruit[grep("a$", fruit)]
## [1] "banana" "cherimoya" "feijoa" "guava" "papaya" "satsuma"
## find all fruits starting with a vowel
fruit[grep("^[aeiou]", fruit)]
## [1] "apple" "apricot" "avocado" "eggplant" "elderberry"
## [6] "olive" "orange" "ugli fruit"
## find all fruits with two consecutive vowels
fruit[grep("[aeiou]{2}", fruit)]
## [1] "blood orange" "blueberry" "breadfruit"
## [4] "cantaloupe" "cloudberry" "dragonfruit"
## [7] "durian" "feijoa" "gooseberry"
## [10] "grapefruit" "guava" "jackfruit"
## [13] "kiwi fruit" "kumquat" "loquat"
## [16] "lychee" "passionfruit" "peach"
## [19] "pear" "pineapple" "purple mangosteen"
## [22] "quince" "raisin" "star fruit"
## [25] "ugli fruit"
## find all fruits ending with two consecutive consonants other than r
fruit[grep("[^aeiour]{2}$", fruit)]
## [1] "blackcurrant" "currant" "eggplant" "peach"
## [5] "redcurrant"
In the examples above, we return all strings matching a simple pattern. We can specify that the pattern be found at the beginning ^a or end a$ using anchors. We can provide multiple options for the match within brackets []. We can negate options within brackets using ^ in a different context. The curly braces ask for a specific number (or range {min, max}) of matches.
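The pieces described above can be checked one at a time on a short vector. The vector `words` is invented for illustration. grepl is used rather than grep so that each result lines up with the input.

```r
words = c("apple", "banana", "pear", "plum")   # hypothetical input

starts_a = grepl("^a", words)          # anchor at the beginning
starts_a                               # TRUE FALSE FALSE FALSE

ends_a = grepl("a$", words)            # anchor at the end
ends_a                                 # FALSE TRUE FALSE FALSE

# Inside brackets, a leading ^ negates: here, "starts with a non-vowel"
starts_cons = grepl("^[^aeiou]", words)
starts_cons                            # FALSE TRUE TRUE TRUE

# Braces give a count: two vowels in a row
two_vowels = grepl("[aeiou]{2}", words)
two_vowels                             # FALSE FALSE TRUE FALSE
```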
In the example below we use . to match any (single) character. This behaves much like ? in Unix file names. We can ask for multiple matches by appending * if we want 0 or more matches and + if we want at least 1 match.
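The three forms differ only in how many characters they allow between the fixed parts of the pattern. The sketch below uses an invented vector `x` to show which strings each one accepts.

```r
x = c("ac", "abc", "abbc")   # hypothetical input

one_char = grepl("a.c", x)    # exactly one character between a and c
one_char                      # FALSE TRUE FALSE

zero_or_more = grepl("a.*c", x)   # any number of characters, including none
zero_or_more                      # TRUE TRUE TRUE

one_or_more = grepl("a.+c", x)    # at least one character
one_or_more                       # FALSE TRUE TRUE
```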
## find all fruits with two consecutive vowels twice (?)
fruit[grep("[aeiou]{2}.[aeiou]{2}", fruit)]
## [1] "feijoa"
## find all fruits with two consecutive vowels twice
fruit[grep("[aeiou]{2}.+[aeiou]{2}", fruit)]
## [1] "breadfruit" "feijoa" "passionfruit"
## find all fruits with exactly three consecutive consonants in the middle
fruit[grep("[aeiou][^aeiou ]{3}[aeiou]", fruit)]
## [1] "apple" "blackberry" "blackcurrant"
## [4] "breadfruit" "dragonfruit" "huckleberry"
## [7] "passionfruit" "pineapple" "purple mangosteen"
## [10] "raspberry"
To match an actual period, use the regular expression \\.
c(fruit, "umich.edu")[grep('\\.', c(fruit, "umich.edu"))]
## [1] "umich.edu"
The double backslash is needed because the regular expression itself is passed as a string and strings also use backslash as an escape character. This is also important to remember when building file paths as strings on a Windows computer.
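A short sketch makes the two layers of escaping visible, and shows two alternatives that avoid the backslash altogether: a bracket expression and the fixed = TRUE argument of grepl. The vector `x` is invented for illustration.

```r
x = c("umich.edu", "umich")   # hypothetical input

nchar("\\.")                  # 2: the regex engine receives a backslash and a period

any_char = grepl(".", x)      # unescaped "." matches any character
any_char                      # TRUE TRUE

escaped = grepl("\\.", x)     # escaped period matches a literal period
escaped                       # TRUE FALSE

bracketed = grepl("[.]", x)   # inside brackets the period is literal
bracketed                     # TRUE FALSE

literal = grepl(".", x, fixed = TRUE)   # treat the pattern as plain text
literal                                 # TRUE FALSE
```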
Matched values can be grouped using parentheses () and referred back to in the order they appear using a back reference \\1.
## find all fruits with a repeated letter
fruit[grep("(.)\\1", fruit)]
## [1] "apple" "bell pepper" "bilberry"
## [4] "blackberry" "blackcurrant" "blood orange"
## [7] "blueberry" "boysenberry" "cherry"
## [10] "chili pepper" "cloudberry" "cranberry"
## [13] "currant" "eggplant" "elderberry"
## [16] "goji berry" "gooseberry" "huckleberry"
## [19] "lychee" "mulberry" "passionfruit"
## [22] "persimmon" "pineapple" "purple mangosteen"
## [25] "raspberry" "redcurrant" "salal berry"
## [28] "strawberry" "tamarillo"
## find all fruits with a repeated letter but exclude double r
fruit[grep("([^r])\\1", fruit)]
## [1] "apple" "bell pepper" "blood orange"
## [4] "chili pepper" "eggplant" "gooseberry"
## [7] "lychee" "passionfruit" "persimmon"
## [10] "pineapple" "purple mangosteen" "tamarillo"
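Back references are also available in the replacement argument of sub and gsub, which is where I find them most useful. The sketch below uses made-up inputs to collapse a doubled letter and to swap two words.

```r
# Replace the first doubled character with a single copy of it
single = sub("(.)\\1", "\\1", "apple")
single                     # "aple"

# Two groups, referred to as \\1 and \\2, swapped in the replacement
swapped = gsub("(\\w+) (\\w+)", "\\2 \\1", c("kiwi fruit", "apple"))
swapped                    # "fruit kiwi" "apple"
```

Strings that do not match the pattern, such as "apple" in the second call, are returned unchanged.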