Format

This workshop is structured as a number of short thematic lessons. Each lesson includes a brief introduction to a topic followed by some simple exercises. My goal is to spend 15-20 minutes on each lesson, including time for you to try the exercises and for us to review them if necessary.

Topics

R Basics

Lesson 1: Getting Started

Objectives
  • Understand:
    • How objects are created and used
  • Be able to:
    • View and clear the global environment
    • Use R for simple arithmetic calculations
Objects

Everything in R is an object that can be referred to by name. We create objects by assigning values to them:

# This is a comment ignored by R
Instructor <- 'James Henderson'
x <- 10
y <- 32
z <- c(x,y) # This is how we form vectors

9 -> w # This works, but is bad style.
TheAnswer = 42

The values can be referred to by the object name:

TheAnswer
## [1] 42

Objects are stored by value and not reference:

z <- c(x,y)
c(x,y,z)
## [1] 10 32 10 32
y=TheAnswer
c(x,y,z)
## [1] 10 42 10 32
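
As a small sketch of what ‘by value’ means, the example below copies a vector and then changes the original. The object names ‘a’ and ‘b’ are made up for illustration.

```r
a <- c(1, 2, 3)
b <- a         # b receives its own copy of the values in a
a[1] <- 100    # changing a afterwards ...
a
b              # ... leaves b as it was
```
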
Arithmetic

R can do arithmetic with objects that contain numbers.

x + y
## [1] 52
z / x
## [1] 1.0 3.2
z^2
## [1]  100 1024
z + 2*c(y,x) - 10
## [1] 84 42

Be careful when mixing vectors of different lengths. R recycles the values of the shorter vector and warns only when the longer length is not a multiple of the shorter one:

x <- 4:6
y <- c(0,1)
x*y
## Warning in x * y: longer object length is not a multiple of shorter object
## length
## [1] 0 5 0
x <- 1:4
y*x
## [1] 0 2 0 4
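
To see what recycling does, it can help to spell it out with ‘rep()’. This is a sketch with made-up object names; the explicit version should give the same result as letting R recycle silently.

```r
long <- 1:6
short <- c(0, 1)
recycled <- rep(short, length.out = length(long))  # 0 1 0 1 0 1
long * recycled
long * short   # same result, and no warning since 6 is a multiple of 2
```
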

There are a number of common mathematical functions already in R:

mean(x) # average
## [1] 2.5
sum(x)  # summation
## [1] 10
sd(x)   # Standard deviation
## [1] 1.290994
var(x)  # Variance
## [1] 1.666667
exp(x)  # Exponential
## [1]  2.718282  7.389056 20.085537 54.598150
sqrt(x) # Square root
## [1] 1.000000 1.414214 1.732051 2.000000
log(x)  # Natural log
## [1] 0.0000000 0.6931472 1.0986123 1.3862944
Global Environment

The values are stored in a workspace called the global environment. You can view objects in the global environment using the function ‘ls()’ and remove objects using ‘rm()’:

ls()
## [1] "Instructor" "TheAnswer"  "w"          "x"          "y"         
## [6] "z"
rm(w)
ls()
## [1] "Instructor" "TheAnswer"  "x"          "y"          "z"

We can remove multiple objects in a few ways:

remove(Instructor,TheAnswer) # remove and rm are synonyms
ls()
## [1] "x" "y" "z"
rm(list=c('x','y')) # Object names are passed to list as strings
ls()
## [1] "z"

To clear the entire workspace use ‘rm(list=ls())’:

ls()
## [1] "z"
rm(list=ls())
ls()
## character(0)
More on objects

Functions are also objects:

ViewGlobalEnv <- ls
ViewGlobalEnv()
## [1] "ViewGlobalEnv"

Elements of vectors can be given names:

z = c('x'=10,'y'=42)
names(z)
## [1] "x" "y"
names(z) <- c('Xval','Yval'); names(z)
## [1] "Xval" "Yval"
unname(z)
## [1] 10 42
Exercises
  1. Determine whether object names are case sensitive in R.

  2. Ask four people on which day of the month they were born:
  • Store each value as an object with the person’s name
  • Concatenate these objects into a vector ‘birthdays’.
  • Remove the original objects from the global environment.
  • Type “quit(‘yes’)” at the console and then re-open RStudio. What do you notice?
  • Remove ‘birthdays’ from the global environment, type “quit(‘no’)”, and re-open RStudio. What do you notice?

Lesson 2: Scripts

R is primarily a scripting language. It can and often should be used interactively, but nearly everything you type should go in a script rather than directly into the console.

Objectives
  • Understand:
    • The role, importance and utility of scripts
    • Standard elements of good coding style
  • Be able to:
    • View and change the working directory
    • Create, save, and edit scripts
    • Use scripts interactively
    • Set and find shortcut keys
    • Run an entire script
R Scripts: what and why

Scripts are simply text files containing R commands. I say nearly everything you type should be in a script because:

  • scripts provide a sharable record of the data manipulation and analysis steps needed to reproduce an analysis
  • scripts allow you to incrementally build and modify analyses leading to better results
  • used well, scripts will reduce tedious re-typing or looking through menus.
Best Practices

Here are some best practices for working with scripts:

  • Use comments! Comments explain what a particular chunk of code does, helping you and others to understand your script.
  • Start each script with a standard header that includes a brief description of its purpose.
  • Give scripts a descriptive name.
  • Use multiple scripts for key pieces of complex multipart analyses. Use folders or R projects to keep related scripts together.
  • Use descriptive variable names rather than x, y, z. Additional typing now will save headaches later and is minimized by text completion in RStudio.
  • Scripts should be self-contained and always start with a clean workspace.
An example script
cat(readChar('./ExampleScript1.R',nchars=file.size('./ExampleScript1.R')),'\n')
## ## An example script for the Intro to R Workshop ##
## ## Author: James Henderson (jbheder@umich.edu)
## ## Created: June 9, 2017
## ## Modified: June 12, Add a final print command.
## 
## ## prepare your workspace ##
## rm(list=ls())
## 
## ## load any packages you need
## # library(dplyr) 
## 
## ## create some objects ##
## message <- 'Hello World!'
## 
## ## do something ##
## print(message)
## 
## ## save some results
Working Directory

When you ask R to read or write a file without a specified path it defaults to looking in the current working directory. Use ‘getwd()’ and ‘setwd()’ to view and change the current working directory:

getwd()
## [1] "/Users/jbhender/Workshops/Intro_to_R"
startDirectory = getwd()
setwd('/Users/jbhender/Workshops/')
getwd()
## [1] "/Users/jbhender/Workshops"
setwd(startDirectory)
getwd()
## [1] "/Users/jbhender/Workshops/Intro_to_R"

To list the contents in a directory use ‘dir()’:

dir()
## [1] "attitude.csv"            "ExampleScript1.R"       
## [3] "Intro_2_R.html"          "Intro_2_R.Rmd"          
## [5] "message.RData"           "mtcars_displacement.pdf"
dir('./')
## [1] "attitude.csv"            "ExampleScript1.R"       
## [3] "Intro_2_R.html"          "Intro_2_R.Rmd"          
## [5] "message.RData"           "mtcars_displacement.pdf"

When working with scripts it is best to make the working directory the highest level folder for a project and use relative paths to point to subfolders.

dir('./')
## [1] "attitude.csv"            "ExampleScript1.R"       
## [3] "Intro_2_R.html"          "Intro_2_R.Rmd"          
## [5] "message.RData"           "mtcars_displacement.pdf"
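
One way to build relative paths is ‘file.path()’, which joins folder and file names with a path separator. The ‘data’ subfolder below is hypothetical; it is not part of the workshop folder listed above.

```r
# 'data' is a hypothetical subfolder of the project folder
data_file <- file.path('.', 'data', 'attitude.csv')
data_file
file.exists(data_file)   # TRUE only if the file is really there
basename(data_file)      # the file name without the folders
```
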
Running a script

When building an analysis you will often work with scripts interactively, calling each line in turn. At times you will want to run an entire script, which can be done using the ‘source()’ command:

source('./ExampleScript1.R')
## [1] "Hello World!"
Exercises
  1. Change the working directory in RStudio to a folder ‘RWorkshop’ on your desktop.
  2. Create a new script and save it as “Day1_RBasics.R” in the RWorkshop folder.
  3. Add a header with your name, the date, and a short description of the script.
  4. Type ‘print(“Hello world!”)’ in your script and then use ‘source’ to call it from the console.

Lesson 3: Read and Write Data

Objectives:

  • Understand:
    • File types used for data storage
  • Be able to:
    • Read data from and write data to common flat formats: tsv, csv
    • Save and load RData objects
Getting data into R

Data can be read into R from common flat file formats such as comma or tab separated text files. The best starting place is ‘read.table()’ or ‘read.csv()’.

attitude_data <- read.csv('./attitude.csv',sep=',',
                            stringsAsFactors = FALSE)

To write to csv use ‘write.csv()’.

write.csv(attitude,file='./attitude.csv',
          row.names=FALSE)
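
The same pattern works for tab separated files. As a sketch, the example below writes the built-in ‘attitude’ data to temporary csv and tsv files and reads both back in. The temporary file names are chosen here for illustration; in your own work you would use paths inside your project folder.

```r
# write the built-in attitude data to two temporary files
csv_file <- tempfile(fileext = '.csv')
tsv_file <- tempfile(fileext = '.tsv')
write.csv(attitude, file = csv_file, row.names = FALSE)
write.table(attitude, file = tsv_file, sep = '\t',
            row.names = FALSE, quote = FALSE)

# read both copies back in
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
from_tsv <- read.delim(tsv_file, stringsAsFactors = FALSE)
dim(from_csv)
names(from_tsv)
```

Here ‘read.delim()’ is a version of ‘read.table()’ with defaults suited to tab separated files that have a header row.
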
Saving and loading R objects

To save data or other objects in the native .RData format, use ‘save()’.

message='Hello world!'
save(message,attitude_data,file='./message.RData')

To read such data into R use ‘load()’.

rm(list=ls()) ## clearing workspace
foo <- load('./message.RData')
foo
## [1] "message"       "attitude_data"
ls()
## [1] "attitude_data" "foo"           "message"
message
## [1] "Hello world!"
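
Because ‘load()’ restores objects under their original names, it can silently overwrite objects you already have. One possible safeguard, sketched here with a temporary file and made-up object names, is to load into a separate environment first.

```r
rdata_file <- tempfile(fileext = '.RData')
greeting <- 'Hello world!'
save(greeting, file = rdata_file)

greeting <- 'changed after saving'
loaded <- new.env()                      # a separate place to load into
loaded_names <- load(rdata_file, envir = loaded)
loaded_names
greeting                                 # our current value is untouched
get('greeting', envir = loaded)          # the saved value
```
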
Reading data from other formats

When possible, it is best to transfer data from other programs into R by first exporting it to a flat file using the software associated with its native format.

Exercises:
  1. Download the cars data and move it to your workshop folder.
  2. Read it into R using ‘read.csv’.
  3. Write a copy of it using write.csv using the file name ‘mtcars_copy.csv’. Open and inspect your copy in a spreadsheet program.

  4. Save cars as an ‘.RData’ file.
  5. Clear your workspace and reload cars from the ‘.RData’ file.

Lesson 4: Classes

Objectives
  • Understand:
    • Commonly used classes in R
    • How classes impact the way R treats objects
  • Be able to:
    • Create new objects of various classes
    • Determine the class of an object
    • Convert between common classes
    • Access specific elements within objects
Exercises
  1. Use your favorite search engine to find today’s high and low temperatures for three cities of your choice.
  2. Create vectors for: the city names, the low temperatures, the high temperatures.
  3. Use the city names as names for the low and high vectors.
  4. Create a data frame for this information.
  5. Subset the data frame to return cities with more than a 15 degree difference between the high and low temperatures.
  6. Store the low and high temperatures in a matrix. Set the row and column names to be descriptive.
  7. Subset the matrix as before.
R is Classy

Named objects in R are associated with one or more classes that tell us how to understand the information they contain. To see the class(es) associated with an object use ‘class()’. Below are some common single-value classes (aka types):

str <- 'This is a string'
class(str)
## [1] "character"
number <- 4.5
class(number)
## [1] "numeric"
int <- 42
class(int)
## [1] "numeric"
int <- as.integer(42)
class(int)
## [1] "integer"

When we don’t specify the class of an object, R is programmed to supply a default type. There are special functions for declaring and converting between classes:

str <- '42'
str
## [1] "42"
class(str)
## [1] "character"
num <- as.numeric(str)
num
## [1] 42
class(num)
## [1] "numeric"
int <- as.integer(num)
class(int)
## [1] "integer"
is.integer(num)
## [1] FALSE
is.integer(int)
## [1] TRUE
class(is.integer(int))
## [1] "logical"
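
Conversions do not always preserve the value. The sketch below shows two cases worth knowing about: ‘as.integer()’ drops the fractional part, and ‘as.numeric()’ returns ‘NA’ (with a warning) for text that is not a number. The object name ‘not_a_number’ is made up for illustration.

```r
as.integer(4.7)     # the fractional part is dropped, not rounded
as.numeric('4.7')   # text that looks like a number converts cleanly
# text that is not a number becomes NA; the warning is suppressed here
not_a_number <- suppressWarnings(as.numeric('forty-two'))
is.na(not_a_number)
```
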
Multiple values of a single type

Multiple values of a single type can be stored in vectors, matrices, or arrays.

Vectors

Vectors are one dimensional and have a specific ‘length’:

PetNames <- c('Nahla','Oliver')
length(PetNames)
## [1] 2
PetNames <- c(PetNames,'Trixie')
length(PetNames)
## [1] 3

If you try to combine multiple types, R will attempt to convert to a single type:

BirthDays <- c(10,27,29)
c(PetNames,BirthDays)
## [1] "Nahla"  "Oliver" "Trixie" "10"     "27"     "29"
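
The conversion follows a fixed order: logical values can become numbers, and anything combined with a string becomes a string. A short sketch:

```r
class(c(TRUE, 2L))     # logical + integer   gives "integer"
class(c(2L, 2.5))      # integer + numeric   gives "numeric"
class(c(2.5, 'a'))     # numeric + character gives "character"
c(TRUE, FALSE, 2.5)    # TRUE and FALSE become 1 and 0
```
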

You can add a names attribute to vectors:

names(BirthDays) <- PetNames
names(BirthDays)
## [1] "Nahla"  "Oliver" "Trixie"
BirthDays <- c(Nahla=10,Oliver=27,Trixie=29)

Use ‘[]’ to access specific elements by name or position:

BirthDays[3]
## Trixie 
##     29
BirthDays[c(1,2)]
##  Nahla Oliver 
##     10     27
BirthDays[-1] ## Negative indexing 
## Oliver Trixie 
##     27     29
BirthDays['Oliver']
## Oliver 
##     27
Matrices

Matrices are two-dimensional vectors organized into rows and columns. They always contain values of a single type.

Matrices are stored using ‘column-major ordering’ meaning that by default they are filled and operated on by column.

X <- matrix(1:10,nrow=5,ncol=2)
Y <- matrix(1:10,nrow=5,ncol=2,byrow = TRUE)
X
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
Y
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
## [5,]    9   10
class(X) # in R 4.0.0 and later this returns "matrix" "array"
## [1] "matrix"
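
Column-major ordering also determines what a single index into a matrix means. As a small sketch with a made-up matrix ‘M’:

```r
M <- matrix(1:6, nrow = 2, ncol = 3)   # filled down the columns
M
M[3]      # the third stored value ...
M[1, 2]   # ... sits in row 1, column 2
```
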

R can do matrix multiplication and many other linear algebra computations.

X %*% t(Y)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   13   27   41   55   69
## [2,]   16   34   52   70   88
## [3,]   19   41   63   85  107
## [4,]   22   48   74  100  126
## [5,]   25   55   85  115  145
3*X
##      [,1] [,2]
## [1,]    3   18
## [2,]    6   21
## [3,]    9   24
## [4,]   12   27
## [5,]   15   30
c(1,2)*Y
##      [,1] [,2]
## [1,]    1    4
## [2,]    6    4
## [3,]    5   12
## [4,]   14    8
## [5,]    9   20
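
Note that ‘*’ multiplies element by element, while ‘%*%’ is the matrix product. A minimal sketch with a made-up 2 by 2 matrix shows the difference:

```r
A <- matrix(1:4, nrow = 2)
A
A * A      # element-by-element product
A %*% A    # matrix product
```
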

Matrices have both dimension and length.

dim(X)
## [1] 5 2
length(X)
## [1] 10
as.vector(X)
##  [1]  1  2  3  4  5  6  7  8  9 10
c(nrow(X),ncol(X))
## [1] 5 2
colnames(X) <- paste('Col',1:2,sep='')
rownames(X) <- letters[1:5]
X["a",]
## Col1 Col2 
##    1    6
X[1:3,'Col2']
## a b c 
## 6 7 8
Arrays

See ‘help(array)’.

Multiple types
Lists

In R a list is a generic container for storing values of multiple types.

myList <- list(Name='An example list',
               Matrix=diag(5),
               n=5
               )
myList
## $Name
## [1] "An example list"
## 
## $Matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1
## 
## $n
## [1] 5
class(myList)
## [1] "list"
length(myList)
## [1] 3
names(myList)
## [1] "Name"   "Matrix" "n"

You can access a specific element in a list by position or name:

myList[['Name']]
## [1] "An example list"
myList$Matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1

Note the use of double brackets (‘[[’) above and compare to the single bracket case below.

class(myList['n'])
## [1] "list"
class(myList[['n']])
## [1] "numeric"
Data Frames

Data frames are perhaps the most common way to represent a data set in R. A data frame is like a matrix with observations or units in rows and variables in columns, but it doesn’t require the columns to all be of the same type.

df <- data.frame(ID=1:10,
                 Group=
                   sample(0:1,10,replace=TRUE),
                 Var1=rnorm(10),
                 Var2=seq(0,1,length.out=10),
                 Var3=factor(
                   rep(c('a','b'),each=5)
                   )
                )
names(df)
## [1] "ID"    "Group" "Var1"  "Var2"  "Var3"
dim(df)
## [1] 10  5
length(df)
## [1] 5
nrow(df)
## [1] 10

We can access the values of a data frame both like a list:

df$ID
##  [1]  1  2  3  4  5  6  7  8  9 10
df[['Var3']]
##  [1] a a a a a b b b b b
## Levels: a b

or like a matrix:

df[1:5,]
##   ID Group       Var1      Var2 Var3
## 1  1     1 -1.8598392 0.0000000    a
## 2  2     1 -1.2936927 0.1111111    a
## 3  3     0 -0.4659862 0.2222222    a
## 4  4     1 -1.6222289 0.3333333    a
## 5  5     1  0.2157492 0.4444444    a
df[,'Var2']
##  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
##  [8] 0.7777778 0.8888889 1.0000000
Logicals & Indexing
Logicals

R has three reserved words of class ‘logical’:

class(TRUE)
## [1] "logical"
class(FALSE)
## [1] "logical"
class(NA)
## [1] "logical"
if(TRUE & T){
  print('Synonyms')
}
## [1] "Synonyms"
if(FALSE | F){
  print('Synonyms')
}

‘T’ and ‘F’ are shorthand for ‘TRUE’ and ‘FALSE’, but unlike the full words they are ordinary objects that can be overwritten, so it is best to always use the full words. You should also avoid using ‘T’ or ‘F’ as objects or arguments in functions.
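
One reason for this advice is that ‘T’ and ‘F’ can be reassigned, while ‘TRUE’ and ‘FALSE’ cannot. A minimal sketch:

```r
T <- 0        # allowed: T is just an object name
isTRUE(T)     # FALSE, so code relying on T would now misbehave
rm(T)         # remove our copy; T refers to TRUE again
isTRUE(T)
# TRUE <- 0   # not allowed: uncommenting this line gives an error
```
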

Boolean comparisons

Logicals are created by Boolean comparisons:

{2*3} == 6     # test equality with ==
## [1] TRUE
{2+2} != 5     # use != for 'not equal'
## [1] TRUE
sqrt(69) > 8   # comparison operators: >, >=, <, <=
## [1] TRUE
sqrt(64) >= 8  
## [1] TRUE
!{2==3}        # Use ! ('not') to negate or 'flip' a logical
## [1] TRUE

Comparison operators are vectorized:

1:10 > 5
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

You can combine comparisons using ‘and (&)’ or ‘or (|)’:

{2+2}==4 | {2+2}==5 # An or statement asks if either statement is true
## [1] TRUE
{2+2}==4 & {2+2}==5 # And requires both to be true
## [1] FALSE
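
Like the comparison operators, ‘&’ and ‘|’ work element by element on vectors. When a single TRUE or FALSE is wanted, ‘any()’ and ‘all()’ can collapse the result. A brief sketch with a made-up vector:

```r
vals <- c(2, 7, 11)
vals > 5 & vals < 10   # element-wise: FALSE TRUE FALSE
any(vals > 10)         # is at least one comparison TRUE?
all(vals > 1)          # are all comparisons TRUE?
```
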
if statements
if(TRUE){
  print('do something if true')
}
## [1] "do something if true"
if({2+2}==5){
  print('the statement is true')
} else{
  print('the statement is false')
}
## [1] "the statement is false"
result <- c(4,5)
report = ifelse({2+2}==result,'true','false')
report
## [1] "true"  "false"
Using which

The ‘which()’ function returns the positions of the elements of a logical vector that are TRUE:

which({1:5}^2 > 10)
## [1] 4 5
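
One practical difference between indexing with ‘which()’ and indexing with the logical vector directly shows up when there are missing values. The sketch below uses a made-up vector ‘scores’:

```r
scores <- c(12, NA, 7, 15)
scores > 10                  # TRUE NA FALSE TRUE
which(scores > 10)           # positions 1 and 4; the NA is skipped
scores[scores > 10]          # keeps an NA in the result
scores[which(scores > 10)]   # only the values known to exceed 10
```
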

A combination of which and logicals can be used to subset data frames:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars[which(mtcars$mpg>30),]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

You can use ‘with()’ to refer to variables/columns by name:

ind <-with(mtcars, which(mpg > 20 & cyl >=6))
ind
## [1] 1 2 4
mtcars[ind,c('mpg','cyl')]
##                 mpg cyl
## Mazda RX4      21.0   6
## Mazda RX4 Wag  21.0   6
## Hornet 4 Drive 21.4   6
mtcars[which(mtcars[,'am']!=0),]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

The ‘with()’ construction will not work with matrices.

rm(ind) # removing ind 
carsMat <- as.matrix(mtcars)
ind <-with(carsMat, which(mpg > 20 & cyl >=6))
## Error in eval(substitute(expr), data, enclos = parent.frame()): numeric 'envir' arg not of length one
carsMat[ind,c('mpg','cyl')]
## Error in eval(expr, envir, enclos): object 'ind' not found

Instead use explicit indexing by name or position.

X <- matrix(rnorm(100),25,4)
ind <- which({X[,1]>0 | X[,2]>0} & {X[,3]<0 | X[,4]<0})
1*{X[ind,] > 0} # convert logicals to numeric
##       [,1] [,2] [,3] [,4]
##  [1,]    0    1    0    1
##  [2,]    1    0    0    0
##  [3,]    1    0    0    0
##  [4,]    1    1    1    0
##  [5,]    1    0    0    0
##  [6,]    1    1    0    0
##  [7,]    1    1    0    0
##  [8,]    0    1    0    1
##  [9,]    1    1    1    0
## [10,]    0    1    0    0
## [11,]    1    1    0    1
## [12,]    1    0    1    0
## [13,]    1    0    1    0
## [14,]    0    1    1    0
## [15,]    1    1    1    0
## [16,]    0    1    0    0
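
As a further sketch, here is the earlier ‘mtcars’ subset again, this time taken from the matrix version using column names inside ‘[]’. The object name ‘keep’ is made up for illustration.

```r
carsMat <- as.matrix(mtcars)
# refer to columns by name inside [] instead of using with()
keep <- carsMat[, 'mpg'] > 20 & carsMat[, 'cyl'] >= 6
carsMat[keep, c('mpg', 'cyl')]
```
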
Other classes

There are many other classes of objects in R and many packages define special classes. Here are a few other common classes:

class(mean)
## [1] "function"
class(.GlobalEnv)
## [1] "environment"
class(Y~X1+X2)
## [1] "formula"

Lesson 5: Functions

Objectives
  • Understand:
    • How R knows to interpret something as a function
    • How arguments passed to functions are interpreted
  • Be able to:
    • Use ‘help()’ to access R documentation for a function
    • Write and call a function.
Functions

As we saw earlier, R identifies functions by the ‘func()’ construction. Functions are simply collections of commands that do something. Functions take arguments, which can be used to specify which objects to operate on and what parameter values to use. You can use ‘help(func)’ to see what a function is used for and what arguments it expects, e.g.

help(round)

Functions will often have multiple arguments. Some arguments have default values, others do not. All arguments without default values must be passed to a function. Arguments can be passed by name or position. For instance,

x <- runif(n=5,min=0,max=1)
y <- runif(5,0,1)
z <- runif(5)
round(cbind(x,y,z),1)
##        x   y   z
## [1,] 0.7 0.2 1.0
## [2,] 0.2 0.6 0.1
## [3,] 0.2 0.8 0.2
## [4,] 0.2 0.8 0.5
## [5,] 0.0 0.7 0.2

all three calls generate 5 numbers from a Uniform(0,1) distribution.

Arguments passed by name need not be in order:

w <- runif(min=0,max=1,n=5)
u <- runif(min=0,max=1,5) # This also works but is bad style. 
round(rbind(u=u,w=w),1)
##   [,1] [,2] [,3] [,4] [,5]
## u  0.9  0.3  0.2  0.2  0.8
## w  0.2  0.9  0.3  0.6  0.2
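
Since ‘runif()’ gives different numbers on every call, the equivalence is easier to check with a deterministic function. The sketch below uses ‘seq()’ for that purpose and ‘args()’ to look up argument names and defaults.

```r
seq(1, 10, 3)                     # by position: from, to, by
seq(from = 1, to = 10, by = 3)    # by name
seq(by = 3, to = 10, from = 1)    # by name, in a different order
args(runif)                       # shows argument names and default values
```
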
Writing Functions

You can create your own functions in R. Use functions for tasks that you repeat often in order to make your scripts more easily readable and modifiable.

# function to compute z-scores
zScore1 <- function(x){
  xbar <- mean(x)
  s <- sd(x)
  z <- (x-xbar)/s
  return(z)  
}

The return statement is not strictly necessary, but can make complex functions more readable. It is good practice to avoid creating intermediate objects to store values only used once.

# function to compute z-scores
zScore2 <- function(x){
  {x-mean(x)}/sd(x)
}
x <- rnorm(10,3,1) ## generate some normally distributed values
round(cbind(x,'Z1'=zScore1(x),'Z2'=zScore2(x)),1)
##         x   Z1   Z2
##  [1,] 3.6  1.2  1.2
##  [2,] 3.5  1.1  1.1
##  [3,] 2.3 -1.0 -1.0
##  [4,] 3.5  1.2  1.2
##  [5,] 3.1  0.5  0.5
##  [6,] 1.9 -1.6 -1.6
##  [7,] 2.3 -0.8 -0.8
##  [8,] 2.5 -0.6 -0.6
##  [9,] 2.9  0.1  0.1
## [10,] 2.8  0.0  0.0

We can set default values for parameters using the construction ‘parameter = xx’ in the function definition.

# function to compute z-scores
zScore3 <- function(x,na.rm=TRUE){
  {x-mean(x,na.rm=na.rm)}/sd(x,na.rm=na.rm)
}
x <- c(NA,x,NA)
round(cbind(x,'Z1'=zScore1(x),'Z2'=zScore2(x),'Z3'=zScore3(x)),1)
##         x Z1 Z2   Z3
##  [1,]  NA NA NA   NA
##  [2,] 3.6 NA NA  1.2
##  [3,] 3.5 NA NA  1.1
##  [4,] 2.3 NA NA -1.0
##  [5,] 3.5 NA NA  1.2
##  [6,] 3.1 NA NA  0.5
##  [7,] 1.9 NA NA -1.6
##  [8,] 2.3 NA NA -0.8
##  [9,] 2.5 NA NA -0.6
## [10,] 2.9 NA NA  0.1
## [11,] 2.8 NA NA  0.0
## [12,]  NA NA NA   NA
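
A default value is only used when the caller does not supply the argument. As a sketch, the made-up function ‘centerValues’ below has a default of ‘na.rm = TRUE’ that can be overridden in the call.

```r
# subtract the mean from each value; na.rm defaults to TRUE
centerValues <- function(x, na.rm = TRUE){
  x - mean(x, na.rm = na.rm)
}
vals <- c(1, 2, 3, NA)
centerValues(vals)                  # default used: -1 0 1 NA
centerValues(vals, na.rm = FALSE)   # default overridden: all NA
```
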
Exercises
  1. Access and skim the help pages for ‘median()’, ‘mad()’, and ‘IQR()’.
  2. Write a function ‘zScoreRobust’ that accepts a numeric vector and returns robust z-scores.
  3. Make the function you wrote robust to vectors containing “NA” values.
  4. Generate some data from N(4,2) to test your functions.

Packages

Objectives:

  • Understand:
    • Basics of the R package system
    • What it means for a function to be ‘masked’
  • Be able to:
    • Install packages
    • Make a package available to R
    • Call functions from packages without loading
    • Remove packages

The R package system

Much of the utility of R is derived from an extensive collection of user and domain-expert contributed packages. Packages are simply a standardized way for people to share documented code and data. There are thousands of packages!

Packages are primarily distributed through three sources:

  • CRAN
  • Bioconductor
  • GitHub

Installing packages

The primary way to install a package is using ‘install.packages(“pkg”)’.

#install.packages('lme4') # the package name should be passed as a character string

You can find the default location for your R packages using the “.libPaths()” function. If you don’t have write permission to this folder, you can set this directory to a personal library instead.

.libPaths() ## The default library location
## [1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library"
.libPaths('/Users/jbhender/Rlib') #Create the directory first!
.libPaths()
## [1] "/Users/jbhender/Rlib"                                          
## [2] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library"

To install a package to a personal library use the ‘lib’ option.

## install.packages("haven",lib='/Users/jbhender/Rlib')

If your computer has the necessary tools, packages can also be installed from source by downloading the package file and pointing directly to the source tarball (‘.tar.gz’). Pre-built macOS (‘.tgz’) or Windows (‘.zip’) binaries can be installed from a downloaded file in the same way.

Using packages in R

Installing a package does not make it available to R! There are two ways to use things from a package:

  • calling ‘library(“pkg”)’ to add it to the search path
  • using the “pkg::function” construction.

These methods are illustrated below using the data set ‘InstEval’ distributed with the ‘lme4’ package.

#head(InstEval)
## Using the pkg::function construction
head(lme4::InstEval)
##   s    d studage lectage service dept y
## 1 1 1002       2       2       0    2 5
## 2 1 1050       2       1       1    6 2
## 3 1 1582       2       2       0    2 5
## 4 1 2050       2       2       1    3 3
## 5 2  115       2       1       0    5 2
## 6 2  756       2       1       0    5 4

The ‘library(“pkg”)’ command adds a package to the search path.

search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"
library(lme4)
## Loading required package: Matrix
search()
##  [1] ".GlobalEnv"        "package:lme4"      "package:Matrix"   
##  [4] "package:stats"     "package:graphics"  "package:grDevices"
##  [7] "package:utils"     "package:datasets"  "package:methods"  
## [10] "Autoloads"         "package:base"
head(InstEval)
##   s    d studage lectage service dept y
## 1 1 1002       2       2       0    2 5
## 2 1 1050       2       1       1    6 2
## 3 1 1582       2       2       0    2 5
## 4 1 2050       2       2       1    3 3
## 5 2  115       2       1       0    5 2
## 6 2  756       2       1       0    5 4

To remove a library from the search path use ‘detach(“package:pkg”,unload=TRUE)’.

detach(package:lme4,unload=TRUE)
search()
##  [1] ".GlobalEnv"        "package:Matrix"    "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"
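
The search path also explains what it means for a function to be ‘masked’: R uses the first match it finds, so an object defined earlier on the path hides one of the same name later on. The sketch below masks ‘sd’ with a made-up function in the global environment.

```r
sd <- function(x) 'my own sd'   # a made-up function that masks stats::sd
sd(1:4)                         # our version is found first
stats::sd(1:4)                  # pkg::function still reaches the original
rm(sd)                          # remove our version
sd(1:4)                         # back to the version from stats
```
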
Vignettes

As part of their documentation, many packages come with a “vignette” which serves as a short tour of a package’s purpose and functionality.

Exercises
  1. Detach the ‘datasets’ package, using ‘search()’ to check your success.
  2. Reload ‘datasets’ using ‘library()’ and again check with ‘search()’.
  3. Install the following packages:
    • haven (For reading data from other sources)
    • lme4 (For linear mixed models)
    • ggplot2 (The “grammar of graphics”)
    • dplyr (For data manipulation)
    • tidyr (For reshaping and tidying data)

Graphics

Objectives:

  • Understand:
    • The role of plotting ‘devices’ and how R handles images
  • Be able to:
    • Create standard statistical graphics using base R
    • Save graphical output to a file

R has standard functions for producing many statistical graphics.

Scatterplots
plot(mtcars$hp~mtcars$disp)

There are many aesthetic options you can control; see ‘par()’ for a full list.

with(mtcars,
     plot(hp~disp,pch=15,main='Horsepower in mtcars',xlab='displacement',ylab='horsepower',las=1,col='grey')
     )

Use vectors to set values for specific points.

col <- rep('blue',nrow(mtcars))
col[which(mtcars$cyl==6)] <- 'grey'
col[which(mtcars$cyl==8)] <- 'red'

pch <- rep(16,nrow(mtcars))
pch[which(mtcars$am==1)] <- 17

with(mtcars,
     plot(hp~disp,pch=pch,col=col,main='Horsepower in mtcars',xlab='displacement',ylab='horsepower',las=1)
     )
legend('topleft',legend=c('Automatic','Manual'),pch=16:17,col='black',bty='n')
legend('bottomright',legend=paste(c(4,6,8),'cylinders'),col=c('blue','grey','red'),pch=15)

Other standard plots
hist(mtcars[,'hp'],col='lightblue',las=1,xlab='horsepower',main='Histogram of horsepower')

boxplot(mtcars[,'hp']~mtcars[,'cyl'],las=1,xlab='# of cylinders',ylab='horsepower',col=rgb(0,0,1,.5))

qqnorm(mtcars[,'hp'])
qqline(mtcars[,'hp'])

Writing plots to file

By default, plot commands are sent to the default RStudio graphics window. However, you can write graphics directly to a file using ‘pdf()’, ‘jpeg()’, ‘png()’, etc.

pdf('./mtcars_displacement.pdf') #opens the file
  hist(mtcars$disp)
dev.off() ## closes the file
## quartz_off_screen 
##                 2
dir('./')
## [1] "attitude.csv"            "ExampleScript1.R"       
## [3] "Intro_2_R_files"         "Intro_2_R.html"         
## [5] "Intro_2_R.Rmd"           "message.RData"          
## [7] "mtcars_displacement.pdf"

I recommend using pdf as the default because it is a vector-based format, so plots can be resized without losing quality.
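
The size of the output can be set when the file is opened. The sketch below writes to a temporary file (the file name and plot labels are chosen for illustration) and then checks that the file was created.

```r
plot_file <- tempfile(fileext = '.pdf')
pdf(plot_file, width = 6, height = 4)   # width and height in inches
hist(mtcars$disp, main = 'Displacement in mtcars', xlab = 'displacement')
invisible(dev.off())                    # close the device to finish the file
file.exists(plot_file)
```

Forgetting ‘dev.off()’ is a common reason for ending up with an empty or unreadable plot file.
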

Exercises:
  1. Using the mtcars data:
  • create a histogram of the ‘wt’ variable
  • create a qqplot of the ‘mpg’ variable
  • create side-by-side boxplots of mpg grouped by ‘cyl’
  2. Still using mtcars, create a scatter plot of mpg vs wt.
  • Use color and plotting symbol to also include information about cyl and gear in the plot
  • Create descriptive names for the axes and title
  • Add a legend explaining your plot symbols
  3. Write your scatter plot to pdf.

Additional Topics

Control Statements

for loops

Here is the syntax for a basic for loop in R:

for(i in 1:10){
   cat(i,'\n')
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6 
## 7 
## 8 
## 9 
## 10
for(var in names(mtcars)){
  cat(sprintf('average %s = %4.3f',var,mean(mtcars[,var])),'\n')
}
## average mpg = 20.091 
## average cyl = 6.188 
## average disp = 230.722 
## average hp = 146.688 
## average drat = 3.597 
## average wt = 3.217 
## average qsec = 17.849 
## average vs = 0.438 
## average am = 0.406 
## average gear = 3.688 
## average carb = 2.812
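
When a loop stores its results, a common pattern is to create the result vector first and fill it in by position. A minimal sketch, with the made-up name ‘squares’:

```r
squares <- numeric(5)            # create the result vector up front
for(i in seq_along(squares)){    # seq_along gives 1, 2, ..., length(squares)
  squares[i] <- i^2
}
squares
```
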
while

A while statement can be useful when you aren’t sure how many iterations are needed. Here is an example that takes a random walk and terminates once the value is at least 10 units from 0.

maxIter <- 1e3 # always limit the total iterations allowed
val=vector(mode='numeric',length=maxIter)
val[1]=rnorm(1) ## initialize
k=1
while(abs(val[k]) < 10 & k <= maxIter){
  val[k+1] = val[k] + rnorm(1)
  k = k + 1
}
val = val[1:{k-1}]
plot(val)
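
Because the random walk gives a different result on every run, here is a deterministic sketch of the same pattern: keep halving a value until it drops below 1, with a cap on the number of iterations. The names ‘value’ and ‘steps’ are made up for illustration.

```r
value <- 100
steps <- 0
maxIter <- 50                          # always limit the total iterations
while(value >= 1 && steps < maxIter){
  value <- value / 2
  steps <- steps + 1
}
steps    # number of halvings needed
value    # first value below 1
```
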

switch

Use a switch when you have two or more discrete options.

mySummary <- function(x){
  switch(class(x),
         factor=table(x),
         numeric=sprintf('mean=%4.2f,sd=%4.2f',mean(x),sd(x)),
          'Only defined for factor and numeric classes.')
}
for(var in names(iris)){
  cat(var,':\n',sep='')
  print(mySummary(iris[,var]))
}
## Sepal.Length:
## [1] "mean=5.84,sd=0.83"
## Sepal.Width:
## [1] "mean=3.06,sd=0.44"
## Petal.Length:
## [1] "mean=3.76,sd=1.77"
## Petal.Width:
## [1] "mean=1.20,sd=0.76"
## Species:
## x
##     setosa versicolor  virginica 
##         50         50         50
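
A switch can also branch on a character string directly. In the sketch below, the made-up function ‘unitName’ returns a default value when none of the named options match.

```r
unitName <- function(unit){
  switch(unit,
         C = 'Celsius',
         F = 'Fahrenheit',
         'unknown unit')   # the unnamed last value is the default
}
unitName('C')
unitName('K')
```
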
Exercises

The Fibonacci sequence starts 1, 1, 2, … and continues with each new value formed by adding the two previous values.

  1. Write a function ‘Fib1’ which takes an argument ‘n’ and returns the \(n^{th}\) value of the Fibonacci sequence. Use a for loop in the function.

  2. Write a function ‘Fib2’ which does the same thing using a while loop.

  3. Use a switch to write a function that has a parameter ‘loop=c('for','while')’ for calling either ‘Fib1’ or ‘Fib2’.

Apply

Loops in R can be quite slow compared to other programming languages on account of the overhead of many of the conveniences that make it useful for routine data analysis. Often, explicit loops can be avoided by using an ‘apply’ function.

Here is an example:

X = matrix(rep(1:5,each=5),5,5)
apply(X,1,sum)
## [1] 15 15 15 15 15
apply(X,2,sum)
## [1]  5 10 15 20 25

For lists use ‘lapply()’ or ‘sapply()’.

myList=list(x=1:5,y=-5:-1)
lapply(myList,sum)
## $x
## [1] 15
## 
## $y
## [1] -15
sapply(myList,sum)
##   x   y 
##  15 -15

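The values in a data.frame are represented internally as a list, so use ‘lapply()’ or ‘sapply()’ with data frames. In contrast, ‘apply()’ first converts a data frame to a matrix, which forces all columns to a single type.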

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
sapply(iris,class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"
apply(iris,2,class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##  "character"  "character"  "character"  "character"  "character"
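
The difference between the two results above comes from the conversion to a matrix that ‘apply()’ performs. A short sketch of what happens to the values (the name ‘irisMat’ is made up):

```r
irisMat <- as.matrix(iris)            # the kind of conversion apply() performs
class(irisMat[1, 'Sepal.Length'])     # the number is now stored as a string
sapply(iris, is.numeric)              # column by column, types are preserved
```
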

A very powerful construction for data manipulation is the use of apply with an anonymous function, that is, a function defined directly in the call without being given a name.

sapply(mtcars,function(x){
  nVals = length(unique(x))
  return(nVals)
})
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##   25    3   27   22   22   29   30    2    2    3    6
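
As another sketch of the same construction, the function below is written inline to compute the spread (largest minus smallest value) of each numeric column of ‘iris’. The name ‘spread’ is made up for illustration.

```r
# spread (max minus min) of the four numeric columns in iris
spread <- sapply(iris[, 1:4], function(x){
  max(x) - min(x)
})
spread
```
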
Exercises
  1. Use apply to get the class of each variable in the ‘mtcars’ data set.
  2. Use apply to find the row means and column means of the ‘attitude’ data.