About R

As described by The R Foundation, “R is a language and environment for statistical computing and graphics.” Importantly, R is open-source, free software distributed under a GNU GPL-3 license.

It is also easily extensible through contributed packages that cover much of modern statistics and data science.

RStudio

RStudio is an “integrated development environment” (IDE) for working with R that simplifies many tasks and makes for a friendlier introduction to R. It provides a nice interface via R Markdown for integrating R code with text to create documents.

RStudio is distributed by a company of the same name that also offers a number of related products for working with data: Shiny for interactive graphics along with enterprise and server editions. I suggest you use RStudio when feasible to streamline your workflow.

Reading

Everyone should read:

Optional reading which may be particularly helpful if you find the above difficult:

R Basics

Objects

Nearly everything in R is an object that can be referred to by name.

Assignment

We generally create objects by assigning values to them:

# This is a comment ignored by R
instructor <- 'Dr. Henderson'
x <- 10
y <- 32
z <- c(x, y) #Form vectors by combining or concatenating elements.

9 -> w # This works, but is bad style.
the_answer = 42 # Most other languages use = for assignment.

The values can be referred to by the object name:

the_answer
## [1] 42

Naming Objects

Object names can be any syntactically valid name. You should, however, avoid clobbering built in R names such as pi, mean, sum, etc.

You also should not use reserved words when naming objects.

You should generally use meaningful, descriptive names for objects you create. There are various styles for creating object names with multiple words:

  • snake_case : separate words with an underscore
  • camelCase : capitalize subsequent words
  • dot.case : separate words with a period

Following the tidyverse style guide, you should use snake_case in code you write for this course. Avoid dot.case and camelCase.

Non-syntactic names

For purposes of presentation, you may occasionally want to use non-syntactic names or names that break the rules above. Use back-ticks (i.e. `object name`) to create non-syntactic names:

`Value ($)` = 1e3
`Value ($)`
## [1] 1000

Style notes

Generally speaking, you do not need to adopt all aspects of my coding style. However, you should develop and use a consistent style of your own.

Still, I will ask you to follow certain common but non-universal style conventions in order to create a cohesive course style. Here are the first few:

  1. Please use snake_case when naming objects.
  2. Do not use CAPITAL letters in object names.
  3. Always include a space after a comma ,.
  4. Use = or <- for assignment universally - do not mix and match.

Value vs Reference

It is important to know that in R objects are (typically) stored by value and not by reference:

x = 10
y = 32
z = c(x, y)
c(x, y, z)
## [1] 10 32 10 32
y = the_answer
c(x, y, z)
## [1] 10 42 10 32

In contrast, if z = c(x, y) were a reference to the contents of x and y, then changing y would also change the value of z.

At a more technical level, R has copy-on-modify semantics and uses lazy evaluation, meaning objects are at times stored by reference. We will learn about these later in the course, but for reasoning about R programs it is generally sufficient to think of objects as holding values rather than references to values.
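As a minimal sketch of value semantics, the snippet below copies a vector and then modifies the copy. The names a and b are hypothetical and chosen only for this illustration.

```r
# Hypothetical objects a and b, used only for this illustration.
a = c(1, 2, 3)
b = a        # b now holds the same values as a
b[1] = 100   # modifying b does not change a
a
b
```

If b were a reference to a, the first element of a would have changed as well.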

Later in this lesson we will learn how to create objects without assigning specific values to them.

Global Environment

An environment is a context in which the names we use to refer to R objects have meaning. For now, it is sufficient to know that the default environment where objects are assigned is called the global environment. You can list objects in the global environment using the function ls() and can remove objects using rm(). [Note the similarity to the Linux shell.]

ls()
## [1] "instructor" "the_answer" "Value ($)"  "w"          "x"         
## [6] "y"          "z"
rm(the_answer)
ls()
## [1] "instructor" "Value ($)"  "w"          "x"          "y"         
## [6] "z"

We can remove multiple objects in a few ways:

remove(instructor, the_answer) # remove and rm are synonyms
## Warning in remove(instructor, the_answer): object 'the_answer' not found
ls()
## [1] "Value ($)" "w"         "x"         "y"         "z"
rm( list = c('x', 'y') ) # Object names are passed to list as strings
ls()
## [1] "Value ($)" "w"         "z"

To remove all objects from the global environment use ‘rm( list = ls() )’:

ls()
## [1] "Value ($)" "w"         "z"
rm( list = ls() )
ls()
## character(0)

Programmatic assignment

Occasionally, you may write a program where you need to access or assign an object whose name you do not know ahead of time. In these cases you may wish to use the get() or assign() functions.

# Ex 1
rm( list = ls() )
ls()
## character(0)
assign("new_int", 9) # i.e. new_int = 9
ls()
## [1] "new_int"
get("new_int")
## [1] 9
# Ex 2
rm( list = ls() )
obj = 'obj_name'; val = 42
assign(obj, value = val) # assign the value in 'val' to the name stored in 'obj'
ls()
## [1] "obj"      "obj_name" "val"
obj
## [1] "obj_name"
obj_name
## [1] 42
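As a further sketch, assign() and get() can be handy when object names are built inside a loop. The names alpha and beta and the _len suffix are made up for this example.

```r
# Build object names programmatically; the names here are hypothetical.
for ( nm in c('alpha', 'beta') ) {
  assign( paste0(nm, '_len'), nchar(nm) )  # creates alpha_len and beta_len
}
get('alpha_len')
get('beta_len')
```

In many cases a named list is a cleaner alternative to creating objects this way.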

Arithmetic operations

R can do arithmetic with objects that contain numeric types.

x = 10; y = 32
z = x + y 
x + y
## [1] 42
z / x
## [1] 4.2
z^2
## [1] 1764
z + 2 * c(y, x) - 10  
## [1] 96 52
11 %% 2   # Modular arithmetic (i.e. what is the remainder)
## [1] 1
11 %/% 2  # Integer division discards the remainder 
## [1] 5

Be careful about mixing vectors of different lengths as R will generally recycle values:

x = 4:6
y = c(0, 1)
x * y
## Warning in x * y: longer object length is not a multiple of shorter object
## length
## [1] 0 5 0
x = 1:4
y * x
## [1] 0 2 0 4
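One way to make recycling explicit, sketched here under the assumption that repeating the shorter vector is what you intend, is to expand it yourself with rep() before doing arithmetic. The name y_full is hypothetical.

```r
x = 1:4
y = c(0, 1)
# Expand y to the length of x so the recycling is visible in the code.
y_full = rep(y, length.out = length(x))
y_full
identical(x * y_full, x * y)
```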

There are a number of common mathematical functions already in R:

mean(x) # average
sum(x)  # summation
sd(x)   # Standard deviation
var(x)  # Variance
exp(x)  # Exponential
sqrt(x) # Square root
log(x)  # Natural log
sin(x)  # Trigonometric functions
cos(pi / 2) # R even contains pi, but only does finite arithmetic
floor(x / 2) #The nearest integer below
ceiling(x / 2) #The nearest integer above

When doing math in R or another computing language, be cognizant of the fact that numeric doubles have finite precision.
This can sometimes lead to unexpected results as seen here:

sqrt(2)^2 - 2
## [1] 4.440892e-16
# floating point addition is not associative
{.1 + .7 + .2} == 1
## [1] TRUE
{.7 + .2 + .1} == 1
## [1] FALSE
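Because of this, it is usually safer to compare doubles up to a tolerance rather than exactly. The sketch below shows two common approaches; the object name total and the tolerance 1e-8 are arbitrary choices for illustration.

```r
total = .7 + .2 + .1          # hypothetical name for the sum from above
total == 1                    # exact comparison, sensitive to rounding
isTRUE( all.equal(total, 1) ) # equality up to a small numerical tolerance
abs(total - 1) < 1e-8         # explicit tolerance; 1e-8 is an arbitrary choice
```

Wrapping all.equal() in isTRUE() matters because all.equal() returns a character description, not FALSE, when its arguments differ.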

Style notes
  • Always include a space on either side of binary operators, e.g. a + b not a+b, with the exception of high precedence operators, e.g. a + b^2, 1:3 + 2.

Exercises

For each snippet of R code below, compute the value of z without using R. Use R to check your work when done.

  1. What is the value of z?
x = 10
y = c(9, 9)
z = x
z = y
z = sum(z)
  2. What is the value of z?
x = -1:1
y = rep(1, 10)
z = mean(x * y)
  3. Which do you think is larger, e0 or e1? Why? What is the value of z?
x0 = 1:10000
y0 = x0 * pi / max(x0)
e0 = sum( abs( cos(y0)^2 + sin(y0)^2 - 1 ) )

x1 = 1:100000
y1 = x1 * pi / max(x1)
e1 = sum( abs( cos(y1)^2 + sin(y1)^2 - 1 ) )

z = floor( e1 / e0 )

The R package system

Much of the utility of R is derived from an extensive collection of user and domain-expert contributed packages. Packages are simply a standardized way for people to share documented code and data. There are thousands of packages, likely more than 10,000 officially distributed through CRAN alone!

Packages are primarily distributed through three sources:

Installing packages

The primary way to install a package is using install.packages("pkg").

#install.packages('lme4') # the package name should be passed as a character string

You can find the default location for your R packages using the .libPaths() function. If you don’t have write permission to this folder, you can set this directory to a personal library instead.

.libPaths() ## The default library location
## [1] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library"
.libPaths('/Users/jbhender/Rlib') #Create the directory first!
.libPaths()
## [1] "/Users/jbhender/Rlib"                                          
## [2] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library"

To install a package to a personal library use the ‘lib’ option.

## install.packages("haven",lib='/Users/jbhender/Rlib')

Use the above with caution and only when necessary.

If your computer has the necessary tools, packages can also be installed from source by downloading the package file and pointing directly to the source tarball (.tar.gz) or Windows binary (.zip).

Using packages in R

Installing a package does not make it available to R. There are three ways to use things from a package:

  • calling library("pkg") to add it to the search path,
  • using the pkg::obj construction to access a package’s exported objects,
  • using the pkg:::obj to access non-exported objects.

These methods are illustrated below using the data set InstEval distributed with the ‘lme4’ package.

#head(InstEval)
## Using the pkg::function construction
head(lme4::InstEval)
##   s    d studage lectage service dept y
## 1 1 1002       2       2       0    2 5
## 2 1 1050       2       1       1    6 2
## 3 1 1582       2       2       0    2 5
## 4 1 2050       2       2       1    3 3
## 5 2  115       2       1       0    5 2
## 6 2  756       2       1       0    5 4

The library("pkg") command adds a package to the search path.

search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"
library(lme4)
## Loading required package: Matrix
search()
##  [1] ".GlobalEnv"        "package:lme4"      "package:Matrix"   
##  [4] "package:stats"     "package:graphics"  "package:grDevices"
##  [7] "package:utils"     "package:datasets"  "package:methods"  
## [10] "Autoloads"         "package:base"
head(InstEval)
##   s    d studage lectage service dept y
## 1 1 1002       2       2       0    2 5
## 2 1 1050       2       1       1    6 2
## 3 1 1582       2       2       0    2 5
## 4 1 2050       2       2       1    3 3
## 5 2  115       2       1       0    5 2
## 6 2  756       2       1       0    5 4

To remove a package from the search path use detach("package:pkg", unload = TRUE).

detach(package:lme4, unload = TRUE)
search()
##  [1] ".GlobalEnv"        "package:Matrix"    "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"
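When a script depends on a package that may not be installed, one option is to check for it before using pkg::obj. This is a minimal sketch using requireNamespace(); the package 'stats' is used as a stand-in only because it ships with R, and the names pkg and med are hypothetical.

```r
pkg = 'stats'  # stand-in package name; 'stats' ships with R
if ( requireNamespace(pkg, quietly = TRUE) ) {
  # Use an exported function without attaching the package.
  med = stats::median( c(1, 5, 9) )
  print(med)
} else {
  message('Package ', pkg, ' is not installed.')
}
```

Unlike library(), requireNamespace() loads the package without adding it to the search path.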

Exercises

  1. Find the default location for R packages on:
    • your personal computer
    • the scs servers
  2. Install the following R packages (at home):
    • tidyverse: a collection of packages developed by Hadley Wickham and others
    • data.table: we will use this later in the course and briefly below

Input and output

R is primarily an in-memory language, meaning it is designed to work with objects stored in working memory (i.e. RAM) rather than on disk. Therefore, it is essential to know how to read data from and write data to disk.

Delimited Data

Data is commonly shared as flat (maybe compressed) text files often delimited by commas (e.g. .csv), tabs or '\t' (e.g. .tab, .txt), or one+ white space characters (e.g. .data, .txt).

Base R

In the base R packages, these can be read using read.table and its wrappers like read.csv. To read the file at ../../data/recs2015_public_v4.csv, use read.table() and assign the input to an object.

recs = read.table( '../../data/recs2015_public_v4.csv', sep = ',', 
                   stringsAsFactors = FALSE, header = TRUE)
dim(recs)
## [1] 5686  759
class(recs)
## [1] "data.frame"

These functions return data.frames, which are special lists whose members all have the same length (i.e. the number of rows).

Likewise, you can write delimited files using write.table or write.csv.

write.table(lme4::InstEval, file = '../../data/InstEval.txt', sep = '\t', 
            row.names = FALSE)
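If you do not have the course data files at hand, the same read and write functions can be tried on a temporary file. This is a small round-trip sketch; the data frame small and its columns are made up for illustration.

```r
# Hypothetical toy data written to, then read back from, a temporary file.
tmp = tempfile(fileext = '.csv')
small = data.frame( id = 1:3, grp = c('a', 'b', 'a'),
                    stringsAsFactors = FALSE )
write.csv(small, file = tmp, row.names = FALSE)
back = read.csv(tmp, stringsAsFactors = FALSE)
all.equal(small, back)
unlink(tmp)  # clean up the temporary file
```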

(Tidyverse) readr

The readr package distributed with the tidyverse offers a more efficient version of the above with (arguably) better defaults.

recs_tib = readr::read_delim('../../data/recs2015_public_v4.csv', delim = ',')
dim(recs_tib)
## [1] 5686  759
class(recs_tib)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Similarly there is a readr::write_delim function.

data.table

My personal favorite functions for reading and writing delimited data are data.table::fread() and data.table::fwrite().

recs_dt = data.table::fread('../../data/recs2015_public_v4.csv')
dim(recs_dt)
## [1] 5686  759
class(recs_dt)
## [1] "data.table" "data.frame"

You can even pass a command line argument to preprocess your data before reading it in:

# Recent versions of data.table recommend passing shell commands via cmd =
recs_dt = data.table::fread(cmd = 'gunzip -c ../../data/recs2015_public_v4.csv.gz')

We will learn more about tibbles and data.tables later in the course.

Native R Binaries

There are two common formats for writing binaries containing native R objects: .RData and .rds.

To save one or more R objects to a file, use save().

df = lme4::InstEval
df_desc = 'The lme4::InstEval data.'
save(df, df_desc, file = '../../data/InstEval.RData')

To restore these objects to the global environment use load().

rm( list = ls() )
ls()
## character(0)
load('../../data/InstEval.RData')
ls()
## [1] "df"      "df_desc"

The function load() invisibly returns the names of the objects it loads, meaning you can capture them as shown below.

foo = load('../../data/InstEval.RData')
foo
## [1] "df"      "df_desc"
# The following construction assigns the InstEval data to a new object.
# Useful when you don't know what the first object of foo is going to be.
assign('InstEval', get(foo[1]) )

Serialized R Data

Generally, it is best to use save() and load() for R objects. However, there are times when it can be helpful to save the data an R object contains without also saving its name. In these cases, you can use saveRDS() and readRDS().

saveRDS(lme4::InstEval, file = '../../data/InstEval.rds')
df = readRDS('../../data/InstEval.rds')
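The round trip can be tried without the lme4 data by writing to a temporary file. In this sketch the object names scores and restored are hypothetical; note that the restored object can be given any name.

```r
# Hypothetical named vector saved to and restored from a temporary .rds file.
tmp = tempfile(fileext = '.rds')
scores = c(a = 1.5, b = 2.5)
saveRDS(scores, file = tmp)
restored = readRDS(tmp)   # the object name is not stored, so we choose one
identical(scores, restored)
unlink(tmp)  # clean up the temporary file
```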

Style notes

You may come across the extensions .Rdata or .rda from time to time. These are generally synonyms for .RData and can usually be loaded using load(). However, the standard is to use .RData. This should be considered course style.

Other formats

You may find the following packages useful for reading and writing data from other formats:

  • readxl for reading Excel files
  • haven for reading other common data formats
  • foreign an alternative to haven that supports additional formats.

Vectors

The material in this section is largely based on the assigned readings:

  • R for Data Science, Chapter 20
  • Advanced R, Vectors and Subsetting.

Vectors are the basic building blocks of many R data structures, including the rectangular data structures (data.frames) most commonly used for analysis.

There are two kinds of vectors: atomic vectors and lists. Every vector has two defining properties aside from the values it holds:

  • A type (or mode) referring to what type of data it holds, and
  • A length referring to the number of elements it contains.

Use typeof() to determine a vector’s type and length() to determine its length.

In R, scalars are just vectors with length one.

Atomic Vectors

The elements in atomic vectors must all be of the same type.
Here are the most important types:

  • Logical: TRUE, FALSE
  • Integer: 1L
  • Double: 2.4
  • Character: "some words"

Two less commonly used types are raw and complex. The mode of a vector of integers or doubles is numeric, the mode of other types is the same as the type.

To create new vectors without assigning values to them use c() or vector(mode, length).

cvec = c()
cvec
## NULL
length(cvec)
## [1] 0
typeof(cvec)
## [1] "NULL"
vvec = vector('character', length = 4)
length(vvec)
## [1] 4
typeof(vvec)
## [1] "character"
vvec
## [1] "" "" "" ""

When the required length can be determined in advance, the latter construction is preferred, especially for long vectors. This is because it pre-allocates the entire vector, which avoids inefficient repeated copying owing to R’s copy-on-modify semantics.
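The two approaches can be sketched side by side. Both loops below produce the same values; the names n, grown, and filled are hypothetical, and any speed difference will depend on your machine and R version.

```r
n = 1e4  # hypothetical vector length

# Grow the vector one element at a time (may copy repeatedly).
grown = c()
for ( i in 1:n ) grown = c(grown, i^2)

# Pre-allocate, then fill in place.
filled = vector('numeric', length = n)
for ( i in 1:n ) filled[i] = i^2

identical(grown, filled)
```

Wrapping each loop in system.time() is one way to compare their run times.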

Subsetting

Access elements of a vector x using x[]. Read about various ways to subset a vector in section 20.4.5 of R for Data Science.
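As a quick reference, here is a sketch of the most common ways to subset an atomic vector. The vector v and its names are made up for illustration.

```r
v = c(a = 10, b = 20, c = 30, d = 40)  # hypothetical named vector
v[2]             # a single position
v[c(1, 3)]       # several positions
v[-1]            # negative positions drop elements
v[v > 15]        # a logical vector keeps elements where it is TRUE
v[c('a', 'd')]   # names select the matching elements
```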

Lists

Lists provide a more flexible data structure which can hold elements of multiple types, including other vectors. This can be useful for bringing together multiple object types into a single place.

Lists are R’s most flexible object structure.

Read through section 20.5 of R for Data Science here.

Make sure you understand the distinction between [] and [[]] when used to subset a list.
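The distinction can be sketched with a small list. The list lst and its element names are hypothetical.

```r
lst = list( num = 1:3, chr = 'a', nested = list(TRUE) )  # hypothetical list
sub = lst['num']     # single bracket: a list of length one
elt = lst[['num']]   # double bracket: the element itself
class(sub)
class(elt)
identical(lst$num, elt)  # $ behaves like [[ for named elements
```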

Attributes

Attributes are metadata associated with a vector.

Some of the most important and commonly used attributes we will learn about are:

  • names
  • class
  • dim

Each of these can be accessed and set using functions of the same name.

Names

There are a few ways to create a vector with names.

## Ex1: assign with names()
x = 1:3
names(x) = c('Uno', 'Dos', 'Tres')
x
##  Uno  Dos Tres 
##    1    2    3
## Ex2: quoted names
x = c( 'Uno' = 1, 'Dos' = 2, 'Tres' = 3 )
x
##  Uno  Dos Tres 
##    1    2    3
## Ex3: bare names
x = c( Uno = 1, Dos = 2, Tres = 3)
x
##  Uno  Dos Tres 
##    1    2    3
## Ex4: access names with names()
names(x)
## [1] "Uno"  "Dos"  "Tres"

class

The class of an object plays an important role in R’s S3 object oriented system and determines how an object is treated by various functions.

The class of a vector is generally the same as its mode or type.

class(FALSE); class(0L); class(1); class('Two'); class( list() )
## [1] "logical"
## [1] "integer"
## [1] "numeric"
## [1] "character"
## [1] "list"

Note the difference after we change the class of x.

x
##  Uno  Dos Tres 
##    1    2    3
class(x) = 'character'
print(x)
##  Uno  Dos Tres 
##  "1"  "2"  "3"

It is generally better to use explicit conversion functions such as as.character() to change an object’s class as simply modifying the attribute will not guarantee the object is a valid member of that class.

dim

We use dim() to access or set the dimensions of an object.
Three special classes where dimensions matter are matrix, array, and data.frame. We will examine just the first two here.

Both matrices and arrays are really just long vectors with this additional dim attribute.

# Matrix class
x = matrix(1:10, nrow = 5, ncol = 2)
dim(x)
## [1] 5 2
class(x)
## [1] "matrix" "array"
dim(x) = c(2, 5)
x
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

The above demonstrates that R matrices are stored in column-major order.
The matrix class is the two-dimensional special case of the array class.

dim(x) = c(5, 1, 2)
class(x)
## [1] "array"
x
## , , 1
## 
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4
## [5,]    5
## 
## , , 2
## 
##      [,1]
## [1,]    6
## [2,]    7
## [3,]    8
## [4,]    9
## [5,]   10
dim(x) = c(5, 2, 1)
x
## , , 1
## 
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
class(x)
## [1] "array"
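Column-major storage, noted above, also explains how a single index into a matrix behaves. This is a small sketch; the names m and m_row are hypothetical.

```r
m = matrix(1:6, nrow = 2)   # filled down the columns
m
m[3]       # the third stored element ...
m[1, 2]    # ... is the entry in row 1, column 2
# Use byrow = TRUE to fill across rows instead.
m_row = matrix(1:6, nrow = 2, byrow = TRUE)
m_row
```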

Arbitrary attributes

See a list of all attributes associated with a vector using attributes(). Access and assign specific attributes using attr().

# Assign a new attribute, color
attr(x, 'color') = 'green'
attributes(x)
## $dim
## [1] 5 2 1
## 
## $color
## [1] "green"
class( attributes(x) )
## [1] "list"
# Access or assign specific attributes
attr(x, 'dim')
## [1] 5 2 1
attr(x, 'dim') = c(5, 2)
class(x)
## [1] "matrix" "array"
x
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
## attr(,"color")
## [1] "green"
# Assign NULL to remove attributes
attr(x, 'color') = NULL
attributes(x)
## $dim
## [1] 5 2

The data.frame class

The data.frame class in R is a list whose elements (columns) all have the same length. A data.frame has three key attributes: names, giving the names of the list elements that constitute its columns; row.names, with values uniquely identifying each row; and class, which is data.frame.

Construct a new data.frame using data.frame(). Here is an example.

df = 
  data.frame( 
    ID = 1:10,
    Group = sample(0:1, 10, replace = TRUE),
    Var1 = rnorm(10),
    Var2 = seq(0, 1, length.out = 10),
    Var3 = rep(c('a', 'b'), each = 5)
  )

names(df)
## [1] "ID"    "Group" "Var1"  "Var2"  "Var3"
dim(df)
## [1] 10  5
length(df)
## [1] 5
nrow(df)
## [1] 10
class(df$Var3)
## [1] "character"
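The description of a data.frame as a list with a few extra attributes can be checked directly. This sketch uses a made-up data frame named small.

```r
# Hypothetical toy data frame.
small = data.frame( id = 1:3, grp = c('a', 'b', 'a') )
is.list(small)              # a data.frame is a list ...
names( attributes(small) )  # ... with names, class, and row.names attributes
attr(small, 'row.names')
```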

We can access the values of a data frame both like a list:

df$ID
##  [1]  1  2  3  4  5  6  7  8  9 10
df[['Var3']]
##  [1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b"

or like an array:

df[1:5, ]
##   ID Group       Var1      Var2 Var3
## 1  1     0  0.1220973 0.0000000    a
## 2  2     0 -0.6835078 0.1111111    a
## 3  3     0  0.5217777 0.2222222    a
## 4  4     0 -1.5620367 0.3333333    a
## 5  5     0 -0.4335780 0.4444444    a
df[, 'Var2']
##  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
##  [8] 0.7777778 0.8888889 1.0000000

This works because the subset operator [ has behavior that depends on the class of the object being subset.

Read more about data frames in Advanced R here.

Exercises

Consider an arbitrary data.frame df with columns a, b, and c.

  1. Which of the following is not the same as the others?
    1. df$a
    2. df[1]
    3. df[['a']]
    4. df[, 1]
  2. Which of the following are equivalent to length(df)? Choose all that apply.
    1. nrow(df)
    2. ncol(df)
    3. 3 * nrow(df)
    4. length(df[['a']])
    5. length(df[1:3])
  3. Which of the following are equivalent to length(df$a)? Choose all that apply.
    1. nrow(df)
    2. ncol(df)
    3. 3 * nrow(df)
    4. length(df[['a']])
    5. length(df[1:3])

Logicals

R has three reserved words of type ‘logical’:

typeof(TRUE)
## [1] "logical"
typeof(FALSE)
## [1] "logical"
typeof(NA)
## [1] "logical"
if ( TRUE && T ) {
  print('Synonyms')
}
## [1] "Synonyms"
if ( FALSE || F ){
  print('Synonyms')
}

While T and F are initially equivalent to TRUE and FALSE, they are ordinary variables rather than reserved words and can be reassigned, so it is best to always use the full words. You should also avoid using T or F as names for objects or function arguments.

Style notes
  • Always use TRUE and FALSE rather than T or F for logical vectors.
  • Avoid naming objects or function arguments T or F. TRUE and FALSE are reserved words and cannot be used as names at all.

Boolean comparisons

Boolean operators are useful for generating values conditionally on other values. Here are the basics:

Operator    Meaning
==          equal
!=          not equal
>, >=       greater than (or equal to)
<, <=       less than (or equal to)
&&          scalar AND
||          scalar OR
&           vectorized AND
|           vectorized OR
!           negation (!TRUE == FALSE and !FALSE == TRUE)
any()       are ANY of the elements true
all()       are ALL of the elements true

Boolean comparisons create logical vectors:

{2 * 3} == 6     # test equality with ==
## [1] TRUE
{2 + 2} != 5     # use != for 'not equal'
## [1] TRUE
sqrt(70) > 8     # comparison operators: >, >=, <, <=
## [1] TRUE
sqrt(64) >= 8  
## [1] TRUE
!{2 == 3}        # Use ! to negate or 'flip' a logical
## [1] TRUE

Comparison operators are vectorized:

1:10 > 5
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

You can combine comparisons using ‘and’ (&) or ‘or’ (|):

{2 + 2} == 4 | {2 + 2} == 5 # An or statement asks if either statement is true
## [1] TRUE
{2 + 2} == 4 & {2 + 2} == 5 # And requires both to be true
## [1] FALSE

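Note the difference between the single (vectorized) and double (scalar) versions. In the output below, produced with R 4.0, the double versions use only the first element of each operand; since R 4.3.0, passing a vector of length greater than one to && or || is an error: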

even = {1:10 %% 2} == 0
div4 = {1:10 %% 4} == 0

even | div4
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
even || div4
## [1] FALSE
even & div4
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
even && div4
## [1] FALSE
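The scalar operators && and || also short-circuit, meaning the right-hand side is evaluated only when it is needed to determine the result. The sketch below uses this to guard a comparison; the name maybe_null is hypothetical.

```r
maybe_null = NULL  # hypothetical input that might be missing
# The right-hand comparison is skipped because the left-hand side is FALSE.
ok = !is.null(maybe_null) && maybe_null > 0
ok
```

This is one reason the scalar versions are generally preferred inside if statements.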

Use any or all to efficiently check for the presence of any TRUE or FALSE in a logical vector.

any(even)
## [1] TRUE
all(even)
## [1] FALSE

Style notes
  • Boolean comparisons use binary operators; always include spaces on either side of binary operators.

Using which

The which() function returns the indices of the elements of a logical vector that are TRUE:

which( {1:5}^2 > 10 )
## [1] 4 5

A combination of which() and a logical vector can be used to subset data frames:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars[ which( mtcars$mpg > 30 ), ]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
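One practical difference between subsetting with which() and subsetting with the logical vector directly is how missing values are handled. This sketch uses a made-up vector v containing an NA.

```r
v = c(3, NA, 12, 20)   # hypothetical data with a missing value
v > 10                 # the comparison is NA where v is NA
v[ v > 10 ]            # logical subsetting keeps an NA placeholder
v[ which(v > 10) ]     # which() returns only the indices that are TRUE
```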

Functions

R identifies functions by the func() construction. Functions are simply collections of commands that do something. Functions take arguments, which specify the objects to operate on and the parameter values to use. You can use help(func) to see what a function is used for and what arguments it expects, e.g. help(sprintf).

Arguments

Functions will often have multiple arguments. Some arguments have default values, others do not. All arguments without default values must be passed to a function. Arguments can be passed by name or position. For instance,

x = runif( n = 5, min = 0, max = 1)
y = runif(5, 0, 1)
z = runif(5)
round( cbind(x, y, z), 1)
##        x   y   z
## [1,] 0.6 0.6 0.3
## [2,] 0.3 0.5 0.2
## [3,] 0.9 0.6 0.5
## [4,] 0.1 0.3 1.0
## [5,] 0.2 0.7 0.6

all three calls generate 5 numbers from a Uniform(0, 1) distribution.

Arguments passed by name need not be in order:

w = runif( min = 0, max = 1, n = 5)
u = runif( min = 0, max = 1, 5) # This also works but is bad style. 
round( rbind(u = u, w = w), 1 )
##   [,1] [,2] [,3] [,4] [,5]
## u  0.8  0.6  0.4  0.6  0.3
## w  0.6  0.4  0.6  0.9  0.1

Style notes
  • Values for function arguments with default values should be passed by name, not position.
  • Commonly used and required function arguments can be passed by position.
  • It’s never bad style to pass by name rather than position.

Writing Functions

You can create your own functions in R. Use functions for tasks that you repeat often in order to make your scripts more easily readable and modifiable. A good rule of thumb is never to copy and paste more than twice; use a function instead.
It can also be a good practice to use functions to break complex processes into parts, especially if these parts are used with control flow statements such as loops or conditionals.

# function to compute z-scores
z_score1 = function(x) {
  #inputs: x - a numeric vector
  #outputs: the z-scores for x
  xbar = mean(x)
  s = sd(x)
  z = (x - xbar) / s  # reuse xbar rather than recomputing the mean
  
  return(z)  
}

The return statement is not strictly necessary, but can make complex functions more readable. It is good practice to avoid creating intermediate objects to store values only used once.

# function to compute z-scores
z_score2 = function(x){
  #inputs: x - a numeric vector
  #outputs: the z-scores for x
  {x - mean(x)} / sd(x)
}
x = rnorm(10, 3, 1) ## generate some normally distributed values
round( cbind(x, 'Z1' = z_score1(x), 'Z2' = z_score2(x) ), 1)
##         x   Z1   Z2
##  [1,] 3.9  0.7  0.7
##  [2,] 2.6 -0.3 -0.3
##  [3,] 5.0  1.5  1.5
##  [4,] 1.2 -1.4 -1.4
##  [5,] 1.9 -0.8 -0.8
##  [6,] 2.1 -0.7 -0.7
##  [7,] 2.0 -0.7 -0.7
##  [8,] 4.6  1.2  1.2
##  [9,] 4.0  0.8  0.8
## [10,] 2.3 -0.5 -0.5

Default parameters

We can set default values for parameters using the construction ‘parameter = xx’ in the function definition.

# function to compute z-scores
z_score3 = function(x, na.rm = TRUE){
  {x - mean(x, na.rm = na.rm)} / sd(x, na.rm = na.rm)
}
x = c(NA, x, NA)
round( cbind(x, 'Z1' = z_score1(x), 'Z2' = z_score2(x), 'Z3' = z_score3(x) ), 1)
##         x Z1 Z2   Z3
##  [1,]  NA NA NA   NA
##  [2,] 3.9 NA NA  0.7
##  [3,] 2.6 NA NA -0.3
##  [4,] 5.0 NA NA  1.5
##  [5,] 1.2 NA NA -1.4
##  [6,] 1.9 NA NA -0.8
##  [7,] 2.1 NA NA -0.7
##  [8,] 2.0 NA NA -0.7
##  [9,] 4.6 NA NA  1.2
## [10,] 4.0 NA NA  0.8
## [11,] 2.3 NA NA -0.5
## [12,]  NA NA NA   NA

Scope

Scoping refers to how R looks up the value associated with an object referred to by name. There are two types of scoping – lexical and dynamic – but we will concern ourselves only with lexical scoping here. There are four keys to understanding scoping:

  • environments
  • name masking
  • variables vs functions
  • dynamic look up and lazy evaluation.

An environment can be thought of as a context in which names are associated with objects. Each time a function is called, it generates a new environment for the computation.

Consider the following examples:

ls()
##  [1] "cvec"     "df"       "df_desc"  "div4"     "even"     "foo"     
##  [7] "InstEval" "u"        "vvec"     "w"        "x"        "y"       
## [13] "z"        "z_score1" "z_score2" "z_score3"
f1 = function() {
  f1_message = "I'm defined inside of f!"  # `message` is a function in base
  ls()
}
f1()
## [1] "f1_message"
exists('f1')
## [1] TRUE
exists('f1_message')
## [1] FALSE
environment()
## <environment: R_GlobalEnv>
f2 = function(){
  environment()
}
f2()
## <environment: 0x7f8c61f1cc00>
rm(f1, f2)

Lexical scoping determines where and in what order R looks for object names.
When we call f1 above, R first looks in the current environment which happens to be the global environment. The call to ls() however, happens within the environment created by the function call and hence returns only the objects defined in the local environment.

When an environment is created, it gets nested within the current environment referred to as the “parent environment”. When an object is referenced we first look in the current environment and move recursively up through parent environments until we find a value bound to that name.
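This recursive look up can be sketched with nested functions. The names outer_fn, inner_fn, and a are hypothetical.

```r
# Hypothetical nested functions illustrating look up through enclosing environments.
outer_fn = function() {
  a = 'defined in outer_fn'
  inner_fn = function() {
    a   # not defined locally, so R looks in the environment of outer_fn
  }
  inner_fn()
}
outer_fn()
exists('a', inherits = FALSE)  # 'a' exists only in outer_fn's environment unless you define one here
```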

Name masking refers to the notion that objects of the same name can exist in different environments. Consider these examples:

y = x = 'I came from outside of f!'
f3 = function(){
  x =  'I came from inside of x!'
  list( x = x, y = y )
}
f3()
## $x
## [1] "I came from inside of x!"
## 
## $y
## [1] "I came from outside of f!"
x
## [1] "I came from outside of f!"
mean = function(x){ sum(x) }
mean(1:10)
## [1] 55
base::mean(1:10)
## [1] 5.5
rm(mean)

R also uses dynamic look up, meaning values are searched for when a function is called, not when it is created. In the example above, y was defined in the global environment rather than within the function body. This means the value returned by f3 depends on the value of y in the global environment. You should generally avoid this, but there are occasions where it can be useful.

y = "I have been reinvented!"
f3()
## $x
## [1] "I came from inside of x!"
## 
## $y
## [1] "I have been reinvented!"

Finally, lazy evaluation means R only evaluates function arguments if and when they are actually used.

f4 = function(x){
  #x
  45
}

f4( x = stop("Let's pass an error.") )
## [1] 45

Move x out of the comment to see what happens if we evaluate it.
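Lazy evaluation also applies to default arguments, which are evaluated inside the function only when first used. In this sketch, with the hypothetical function f5, the default for y is computed after x has been modified.

```r
# Hypothetical function: the default for y refers to another argument, x.
f5 = function(x, y = x * 2) {
  x = x + 1   # runs before y is ever evaluated
  y           # the default x * 2 is evaluated here, using the updated x
}
f5(1)         # returns 4 rather than 2
f5(1, y = 0)  # a supplied value is used as given
```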

Resources

(Optionally) read more about functions here and here.

You can also read much more about functions in Chapter 7 of The Art of R Programming.

Practice

  1. Skim the help pages for median(), mad(), and IQR().
  2. Write a function z_score_robust that accepts a numeric vector and returns robust z-scores centered at the median and scaled by the IQR.
  3. Make the function you wrote robust to vectors containing NA values.
  4. Generate some data from N(4, 2) to test your functions.

Control Statements

for loops

Here is the syntax for a basic for loop in R:

for ( i in 1:10 ) {
   cat(i, '\n')
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6 
## 7 
## 8 
## 9 
## 10

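Note that a for loop does not create its own environment. The loop body and the iterator are evaluated in the environment where the loop runs, here the global environment, so the iterator still exists after the loop finishes and holds its last value.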

for ( var in names(mtcars) ) {
  cat( sprintf('average %s = %4.3f', var, mean(mtcars[, var]) ), '\n')
}
## average mpg = 20.091 
## average cyl = 6.188 
## average disp = 230.722 
## average hp = 146.688 
## average drat = 3.597 
## average wt = 3.217 
## average qsec = 17.849 
## average vs = 0.438 
## average am = 0.406 
## average gear = 3.688 
## average carb = 2.812
var
## [1] "carb"
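
The same holds inside a function: the iterator and anything assigned in the loop body belong to the environment of the function call rather than the global environment. Here is a small sketch, where count_to, j, and last_seen are hypothetical names chosen for illustration.

```r
# count_to is a hypothetical function used only for illustration.
count_to = function(n) {
  for ( j in 1:n ) {
    last_seen = j
  }
  # j and last_seen are still available after the loop,
  # but only within this function call.
  c( j = j, last_seen = last_seen )
}
res = count_to(3)   # both elements of res equal 3

# Assuming no global object named last_seen was created elsewhere,
# the loop inside count_to did not create one either.
leaked = exists('last_seen', envir = globalenv(), inherits = FALSE)
```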

while

A while statement can be useful when you aren’t sure how many iterations are needed. Here is an example that takes a random walk and terminates if the value is more than 10 units from 0.

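maxIter = 1e3 # always limit the total iterations allowed
val = vector( mode = 'numeric', length = maxIter)
val[1] = rnorm(1) ## initialize

k = 1
# Keep stepping until the walk reaches +/- 10 or the iteration limit is hit.
while ( abs(val[k]) < 10 && k <= maxIter ) {
  val[k + 1] = val[k] + rnorm(1)
  k = k + 1
}
val = val[1:{k - 1}] # drop the final step, keeping values within the bounds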

plot(val, type = 'l')
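
If you want to reuse the walk, one option is to wrap the loop in a function. The following is a sketch under stated assumptions rather than part of the original example: random_walk, bound, and max_iter are hypothetical names, and unlike the code above this variant keeps the final step, so the last value may lie outside the bounds.

```r
# random_walk is a hypothetical wrapper around the while loop shown above.
random_walk = function(bound = 10, max_iter = 1e3) {
  val = numeric(max_iter)
  val[1] = rnorm(1)
  k = 1
  # Stop once the walk reaches the bound or the vector is full.
  while ( abs(val[k]) < bound && k < max_iter ) {
    val[k + 1] = val[k] + rnorm(1)
    k = k + 1
  }
  val[1:k]   # includes the terminal value
}

set.seed(42)   # make the example reproducible
walk = random_walk()
length(walk)   # at most max_iter
```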

key words

The following key words are useful within loops:

  • break - break out of the currently executing loop
  • next - move to the next iteration immediately, without executing the rest of this iteration (continue in other languages such as C++)

Here is an example using next:

for ( i in 1:10 ) {
  if ( i %% 2 == 0 ) next
  cat(i,'\n')
}
## 1 
## 3 
## 5 
## 7 
## 9

Here is an example using break:

x = c()
for ( i in 1:1e1 ) {
  if ( i %% 3 == 0 ) break
  x = c(x, i)
}
print(x)
## [1] 1 2
Exercises

The Fibonacci sequence starts 1, 1, 2, … and continues with each new value formed by adding the two previous values.

  1. Write a function ‘Fib1’ which takes an argument ‘n’ and returns the \(n^{th}\) value of the Fibonacci sequence. Use a for loop in the function.

  2. Write a function ‘Fib2’ which does the same thing using a while loop.

  3. Use a switch to write a function that has a parameter loop = c('for', 'while') for calling either Fib1 or Fib2.

Conditionals

In programming, we often need to execute a piece of code only if some condition is true. Here are some of the R tools for doing this.

if statements

The workhorse for conditional execution in R is the if statement.
In the syntax below, note the spacing around the condition enclosed in the parentheses.

if ( TRUE ) {
  print('do something if true')
}
## [1] "do something if true"

Use an else to control the flow without separately checking the condition’s negation:

if ( {2 + 2} == 5 ) {
  print('the statement is true')
} else {
  print('the statement is false')
}
## [1] "the statement is false"
result = c(4, 5)
report = ifelse( {2 + 2} == result, 'true', 'false')
report
## [1] "true"  "false"

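As you can see above, there is also an ifelse() function that can be useful. Unlike if, it is vectorized: it tests each element of the condition and returns a vector of the same length.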
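
As a small sketch of this element-by-element behavior, the example below labels each entry of a numeric vector. The names values and sign_label are hypothetical.

```r
# values and sign_label are hypothetical names used only for illustration.
values = c(-2, 0, 3)
sign_label = ifelse( values < 0, 'negative', 'non-negative' )
sign_label
## [1] "negative"     "non-negative" "non-negative"
```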

For more complex cases, you may want to check multiple conditions:

a = -1
b = 1

if ( a * b > 0 ) {
  print('Zero is not between a and b')
} else if ( a < b ) {
  smaller = a
  larger = b
} else {
  smaller = b
  larger = a
}

c(smaller, larger)
## [1] -1  1
Style notes

In all of the examples above, please pay close attention to the use of spacing and indentation for clarity. Also note that the opening brace for a conditional expression should start on the same line as the if statement.

switch

Use a switch when you have multiple discrete options.

Here is a simple example:

cases = function(x) {
  switch(as.character(x),
    a=1,
    b=2,
    c=3,
    "Neither a, b, nor c."
  )
}
cases("a")
## [1] 1
cases("m")
## [1] "Neither a, b, nor c."
cases(8)
## [1] "Neither a, b, nor c."

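Without the coercion to character, switch treats the numeric value 8 as a position among the alternatives. Because there are only four alternatives, the final call evaluates to NULL.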

cases2 = function(x) {
  switch(x,
    a=1,
    b=2,
    c=3,
    "Neither a, b, nor c."
  )
}
cases2_output = cases2(8)
cases2_output
## NULL
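
When switch is used with a character expression, an alternative that is left empty falls through to the next non-empty one. This lets several options share a result. Here is a minimal sketch; day_type and its labels are hypothetical and used only for illustration.

```r
# day_type is a hypothetical function used only for illustration.
day_type = function(day) {
  switch(day,
    sat = ,               # empty: falls through to the value for sun
    sun = 'weekend',
    mon = ,
    tue = ,
    wed = ,
    thu = ,
    fri = 'weekday',
    stop('Unknown day: ', day)   # unnamed default for anything else
  )
}
day_type('sat')
## [1] "weekend"
day_type('wed')
## [1] "weekday"
```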

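A switch can also be used with a numeric expression, in which case the value selects the alternative at that position and values outside the range of alternatives return NULL: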

for( i in c(-1:3, 9) ) {
  print( switch(i, 1, 2, 3, 4) )
}
## NULL
## NULL
## [1] 1
## [1] 2
## [1] 3
## NULL

Here is a more useful example:

mySummary = function(x) {
  switch(class(x),
    factor = table(x),
    numeric = sprintf('mean=%4.2f, sd=%4.2f', mean(x), sd(x)),
    'Only defined for factor and numeric classes.'
  )
}

for ( var in names(iris) ) {
  cat(var, ':\n', sep = '')
  print( mySummary(iris[, var]) )
}
## Sepal.Length:
## [1] "mean=5.84, sd=0.83"
## Sepal.Width:
## [1] "mean=3.06, sd=0.44"
## Petal.Length:
## [1] "mean=3.76, sd=1.77"
## Petal.Width:
## [1] "mean=1.20, sd=0.76"
## Species:
## x
##     setosa versicolor  virginica 
##         50         50         50
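
One caveat with switching on class(x): an integer vector has class "integer" rather than "numeric", and some objects, such as ordered factors, have a class vector of length two, which switch does not accept. A possible alternative, sketched here with the hypothetical name my_summary2, uses is.factor() and is.numeric() with if statements instead.

```r
# my_summary2 is a hypothetical alternative to mySummary.
my_summary2 = function(x) {
  if ( is.factor(x) ) {
    table(x)
  } else if ( is.numeric(x) ) {
    sprintf('mean=%4.2f, sd=%4.2f', mean(x), sd(x))
  } else {
    'Only defined for factor and numeric classes.'
  }
}
class(1:10)
## [1] "integer"
my_summary2(1:10)
## [1] "mean=5.50, sd=3.03"
```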
Exercises
  1. Read the R code below and determine the values of twos and threes at the end.
twos = 0
threes = 0
for ( i in 1:10 ) {
  if ( i %% 2 == 0 ) {
    twos = twos + i
  } else if ( i %% 3 == 0 ) {
    threes = threes + i 
  }
}
  2. Read the R code below and determine the value of x at the end.
x = 0
for ( i in 1:10 ) {
  x = x + switch(1 + {i %% 3}, 1, 5, 10)
}