About R

As described by The R Foundation, “R is a language and environment for statistical computing and graphics.” Importantly, R is free, open-source software distributed under the GNU General Public License (GPL-2 | GPL-3).

It is also easily extensible through contributed packages that cover much of modern statistics and data science.

RStudio

RStudio is an “integrated development environment” (IDE) for working with R that simplifies many tasks and makes for a friendlier introduction to R. It provides a convenient interface, via R Markdown, for integrating R code with text to create documents. RStudio is distributed by a company of the same name that also offers a number of related products for working with data: Shiny for interactive graphics, along with enterprise and server editions. I suggest you use RStudio when feasible to streamline your workflow.

Computing experience survey

Based on the computing experience survey, most of you have used R before.

As such, we will skip some of the basics and move quickly through other introductory material.

Topics

R Basics

Getting Started

Objectives
  • Understand:
    • How objects are created and used.
  • Be able to:
    • View and clear the global environment.
    • Use R for simple arithmetic calculations.
Objects

Everything in R is an object that can be referred to by name. We create objects by assigning values to them:

# This is a comment ignored by R
Instructor <- 'Dr. Henderson'
x <- 10
y <- 32
z <- c(x,y) # Form vectors by combining or concatenating elements.

9 -> w # This works, but is bad style.
TheAnswer = 42 # Most other languages use = for assignment.

The values can be referred to by the object name:

TheAnswer
## [1] 42

Object names can be any syntactically valid name. You should, however, avoid clobbering built-in R names such as pi, mean or sum. You also cannot use reserved words when naming objects.

Finally, it is important to remember that in R objects are stored by value and not by reference:

z <- c(x,y)
c(x,y,z)
## [1] 10 32 10 32
y=TheAnswer
c(x,y,z)
## [1] 10 42 10 32

In contrast, if z <- c(x,y) were a reference to the contents of x and y, then changing y would also change the value referred to by z.
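
Here is a minimal sketch of the same idea with fresh names (a and b are hypothetical, chosen only for illustration). Assigning one object to another gives the new name its own value, so later changes to the original do not propagate.

```r
a <- c(1, 2, 3)
b <- a          # b now holds the values of a
a[1] <- 100     # modify a afterwards
b_unchanged <- identical(b, c(1, 2, 3))
b_unchanged     # TRUE: b still holds 1 2 3
```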

Arithmetic

R can do arithmetic with objects that contain numeric types.

x + y
## [1] 52
z / x
## [1] 1.0 3.2
z^2
## [1]  100 1024
z + 2*c(y,x) - 10  
## [1] 84 42
11 %% 2   # Modular arithmetic returns the remainder
## [1] 1
11 %/% 2 # Integer division returns the quotient
## [1] 5

Be careful about mixing vectors of different lengths as R will generally recycle values:

x <- 4:6
y <- c(0,1)
x*y
## Warning in x * y: longer object length is not a multiple of shorter object
## length
## [1] 0 5 0
x <- 1:4
y*x
## [1] 0 2 0 4
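
When the longer length is an exact multiple of the shorter one, recycling happens without any warning, which can hide mistakes. The sketch below shows this and one possible guard. The helper strict_multiply is hypothetical, not part of R.

```r
short <- c(0, 1)
long <- 1:6
recycled <- long * short   # short is reused three times, with no warning

# Hypothetical helper: only allow equal lengths or a single value.
strict_multiply <- function(a, b) {
  stopifnot(length(a) == length(b) || length(a) == 1 || length(b) == 1)
  a * b
}
doubled <- strict_multiply(long, 2)   # fine: a single value times a vector
# strict_multiply(long, short)        # would stop with an error
```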

There are a number of common mathematical functions already in R:

mean(x) # average
## [1] 2.5
sum(x)  # summation
## [1] 10
sd(x)   # Standard deviation
## [1] 1.290994
var(x)  # Variance
## [1] 1.666667
exp(x)  # Exponential
## [1]  2.718282  7.389056 20.085537 54.598150
sqrt(x) # Square root
## [1] 1.000000 1.414214 1.732051 2.000000
log(x)  # Natural log
## [1] 0.0000000 0.6931472 1.0986123 1.3862944
sin(x)  # Trigonometric functions
## [1]  0.8414710  0.9092974  0.1411200 -0.7568025
cos(pi/2) # R even contains pi, but only does finite arithmetic
## [1] 6.123234e-17
Global Environment

The values are stored in a workspace called the global environment. You can view objects in the global environment using the function ‘ls()’ and remove objects using ‘rm()’:

ls()
## [1] "Instructor" "TheAnswer"  "w"          "x"          "y"         
## [6] "z"
rm(w)
ls()
## [1] "Instructor" "TheAnswer"  "x"          "y"          "z"

We can remove multiple objects in a few ways:

remove(Instructor,TheAnswer) # remove and rm are synonyms
ls()
## [1] "x" "y" "z"
rm(list=c('x','y')) # Object names are passed to list as strings
ls()
## [1] "z"

To clear the entire workspace use ‘rm(list=ls())’:

ls()
## [1] "z"
rm(list=ls())
ls()
## character(0)
More on objects

Functions are also objects:

ViewGlobalEnv <- ls
ViewGlobalEnv()
## [1] "ViewGlobalEnv"

Elements of vectors can be given names:

z = c('x'=10,'y'=42)
names(z)
## [1] "x" "y"
names(z) <- c('Xval','Yval'); names(z)
## [1] "Xval" "Yval"
unname(z)
## [1] 10 42

Use quit() to quit R. Use help() or ?function to get help.

Logicals & Indexing

Logicals

R has three reserved words of type ‘logical’:

typeof(TRUE)
## [1] "logical"
typeof(FALSE)
## [1] "logical"
typeof(NA)
## [1] "logical"
if(TRUE && T){
  print('Synonyms')
}
## [1] "Synonyms"
if(FALSE || F){
  print('Synonyms')
}

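While ‘T’ and ‘F’ are equivalent to ‘TRUE’ and ‘FALSE’ by default, it is best to always use the full words: ‘TRUE’ and ‘FALSE’ are reserved words, whereas ‘T’ and ‘F’ are ordinary objects that can be overwritten. For the same reason, you should avoid using ‘T’ or ‘F’ as names for objects or function arguments.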
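
A short sketch of why the abbreviations are risky. The names broken and restored are illustrative only.

```r
T <- 0                  # allowed: T is just a name
broken <- isTRUE(T)     # FALSE, because T no longer means TRUE here
rm(T)                   # remove our copy so the default is visible again
restored <- isTRUE(T)   # TRUE
# TRUE <- 0             # not allowed: TRUE is a reserved word
```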

Boolean comparisons

Boolean operators are useful for generating values conditionally on other values. Here are the basics:

Operator   Meaning
==         equal
!=         not equal
>, >=      greater than (or equal to)
<, <=      less than (or equal to)
&&         scalar AND
||         scalar OR
&          vectorized AND
|          vectorized OR
!          negation (!TRUE is FALSE and !FALSE is TRUE)
any()      are ANY of the elements TRUE
all()      are ALL of the elements TRUE

Logicals are created by Boolean comparisons:

{2*3} == 6     # test equality with ==
## [1] TRUE
{2+2} != 5     # use != for 'not equal'
## [1] TRUE
sqrt(69) > 8   # comparison operators: >, >=, <, <=
## [1] TRUE
sqrt(64) >= 8  
## [1] TRUE
!{2==3}        # Use ! to negate or 'flip' a logical
## [1] TRUE

Comparison operators are vectorized:

1:10 > 5
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

You can combine comparisons using ‘and’ (&) or ‘or’ (|):

{2+2}==4 | {2+2}==5 # An or statement asks if either statement is true
## [1] TRUE
{2+2}==4 & {2+2}==5 # And requires both to be true
## [1] FALSE

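Note the difference between the single and double versions. The single versions (&, |) work element by element, while the double versions (&&, ||) are meant for single values. The output below is from an older version of R, in which the double versions silently used only the first element of each vector; since R 4.3.0 they raise an error when given a vector of length greater than one: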

even <- {1:10 %% 2} == 0
div4 <- {1:10 %% 4} == 0

even | div4
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
even || div4
## [1] FALSE
even & div4
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
even && div4
## [1] FALSE

Use any() to check whether at least one element is TRUE and all() to check whether every element is TRUE.

any(even)
## [1] TRUE
all(even)
## [1] FALSE
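
One practical reason to prefer the double operators for single values is that they stop evaluating as soon as the answer is known. The sketch below uses a hypothetical empty input v to show the difference.

```r
v <- NULL   # an empty input, as might come from a failed lookup

# && evaluates left to right and stops once the answer is known,
# so v[1] > 0 is never evaluated here.
guarded <- length(v) > 0 && v[1] > 0

# & evaluates both sides element by element; the empty comparison
# v[1] > 0 makes the whole result empty rather than FALSE.
unguarded <- length(v) > 0 & v[1] > 0

guarded            # FALSE
length(unguarded)  # 0
```

An empty logical such as unguarded cannot be used as the condition of an if statement, so the guarded form is the safer choice there.
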
Using which

The ‘which()’ function returns the indices of the elements of a logical vector that are TRUE:

which({1:5}^2 > 10)
## [1] 4 5

A combination of which and logicals can be used to subset data frames:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
mtcars[which(mtcars$mpg>30),]
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
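
Subsetting with the logical vector directly often gives the same result, but the two approaches differ when missing values are present. A small sketch with made-up data (scores is hypothetical):

```r
scores <- c(a = 12, b = NA, c = 35, d = 8)

with_na <- scores[scores > 10]          # the NA comparison yields an NA entry
no_na   <- scores[which(scores > 10)]   # which() keeps only the TRUE positions

length(with_na)  # 3
names(no_na)     # "a" "c"
```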

Functions

Objectives
  • Understand:
    • How R knows to interpret something as a function.
    • How arguments passed to functions are interpreted.
  • Be able to:
    • Write and call a function.
Functions

As we saw in “Getting Started”, R identifies functions by the ‘func()’ construction. Functions are simply collections of commands that do something. Functions take arguments which can be used to specify which objects to operate on and what values of parameters are used. You can use ‘help(func)’ to see what a function is used for and what arguments it expects, e.g. help(sprintf).

Arguments

Functions will often have multiple arguments. Some arguments have default values, others do not. All arguments without default values must be passed to a function. Arguments can be passed by name or position. For instance,

x <- runif(n=5, min=0, max=1)
y <- runif(5, 0, 1)
z <- runif(5)
round(cbind(x, y, z), 1)
##        x   y   z
## [1,] 0.7 0.2 0.6
## [2,] 0.9 0.9 0.8
## [3,] 0.5 0.2 0.3
## [4,] 0.4 0.9 0.3
## [5,] 0.3 0.5 0.1

all three calls generate 5 numbers from a Uniform(0,1) distribution.

Arguments passed by name need not be in order:

w <- runif(min=0, max=1, n=5)
u <- runif(min=0, max=1, 5) # This also works but is bad style. 
round(rbind(u=u, w=w), 1)
##   [,1] [,2] [,3] [,4] [,5]
## u  0.8  0.2  0.7  0.1  0.3
## w  0.1  0.8  0.8  0.6  0.9
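
The mixed call works because R matches named arguments first and then fills the remaining parameters by position, left to right. The sketch below illustrates this with a hypothetical function describe.

```r
# Hypothetical function used only to show how arguments are matched.
describe <- function(first, second, third) {
  sprintf("first=%s second=%s third=%s", first, second, third)
}

by_position <- describe("a", "b", "c")
# third is matched by name; "a" and "b" then fill first and second in order.
mixed <- describe(third = "c", "a", "b")

identical(by_position, mixed)  # TRUE
```
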
Writing Functions

You can create your own functions in R. Use functions for tasks that you repeat often in order to make your scripts more easily readable and modifiable. A good rule of thumb is never to copy and paste more than twice; use a function instead. It can also be a good practice to use functions to break complex processes into parts, especially if these parts are used with control flow statements such as loops or conditionals.

# function to compute z-scores
zScore1 <- function(x){
  xbar <- mean(x)     # sample mean
  s <- sd(x)          # sample standard deviation
  z <- (x - xbar) / s
  return(z)
}

The return statement is not strictly necessary, but can make complex functions more readable. It is good practice to avoid creating intermediate objects to store values only used once.

# function to compute z-scores
zScore2 <- function(x){
  {x - mean(x)} / sd(x)
}
x <- rnorm(10,3,1) ## generate some normally distributed values
round(cbind(x, 'Z1'=zScore1(x), 'Z2'=zScore2(x)), 1)
##         x   Z1   Z2
##  [1,] 2.4 -0.8 -0.8
##  [2,] 3.6  0.9  0.9
##  [3,] 3.4  0.5  0.5
##  [4,] 2.6 -0.6 -0.6
##  [5,] 1.8 -1.6 -1.6
##  [6,] 2.9 -0.1 -0.1
##  [7,] 2.7 -0.3 -0.3
##  [8,] 4.4  2.0  2.0
##  [9,] 2.7 -0.3 -0.3
## [10,] 3.2  0.4  0.4
Default parameters

We can set default values for parameters using the construction ‘parameter = xx’ in the function definition.

# function to compute z-scores
zScore3 <- function(x, na.rm=TRUE){
  {x - mean(x, na.rm=na.rm)} / sd(x, na.rm=na.rm)
}
x <- c(NA,x,NA)
round(cbind(x,'Z1'=zScore1(x),'Z2'=zScore2(x),'Z3'=zScore3(x)),1)
##         x Z1 Z2   Z3
##  [1,]  NA NA NA   NA
##  [2,] 2.4 NA NA -0.8
##  [3,] 3.6 NA NA  0.9
##  [4,] 3.4 NA NA  0.5
##  [5,] 2.6 NA NA -0.6
##  [6,] 1.8 NA NA -1.6
##  [7,] 2.9 NA NA -0.1
##  [8,] 2.7 NA NA -0.3
##  [9,] 4.4 NA NA  2.0
## [10,] 2.7 NA NA -0.3
## [11,] 3.2 NA NA  0.4
## [12,]  NA NA NA   NA
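
A default is only used when the caller does not supply a value. The sketch below shows both cases with a hypothetical function center, which is simpler than the z-score functions above.

```r
# Hypothetical example: subtract the mean, ignoring NA values by default.
center <- function(x, na.rm = TRUE) {
  x - mean(x, na.rm = na.rm)
}

vals <- c(1, 2, NA, 3)
default_call  <- center(vals)                 # mean of 1, 2, 3 is used
override_call <- center(vals, na.rm = FALSE)  # mean is NA, so every result is NA
```
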
Scope

Scoping refers to how R looks up the value associated with an object referred to by name. There are two types of scoping – lexical and dynamic – but we will concern ourselves only with lexical scoping here. There are four keys to understanding scoping:

  • environments
  • name masking
  • variables vs functions
  • dynamic lookup and lazy evaluation.

An environment can be thought of as a context in which a name for an object makes sense. Each time a function is called, it generates a new environment for the computation.

Consider the following examples:

ls()
##  [1] "div4"          "even"          "u"             "ViewGlobalEnv"
##  [5] "w"             "x"             "y"             "z"            
##  [9] "zScore1"       "zScore2"       "zScore3"
f1 <- function(){
  message = "I'm defined inside of f!"
  ls()
}
f1()
## [1] "message"
exists('f1')
## [1] TRUE
exists('message') # TRUE only because base R has a function named message
## [1] TRUE
environment()
## <environment: R_GlobalEnv>
f2 <- function(){
  environment()
}
f2()
## <environment: 0x7f9877daba70>
rm(f1,f2)

Scoping rules determine where and in what order R looks for object names. When we call f1 above, R first looks for f1 in the current environment, which happens to be the global environment. The call to ls(), however, happens within the environment created by the function call and hence returns only the objects defined in that local environment.

When an environment is created, it gets nested within the current environment, referred to as the “parent environment”. When an object is referenced, R first looks in the current environment and then moves recursively up through parent environments until it finds a value bound to that name.
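
The lookup through nested environments can be sketched with one function defined inside another. The names outer_fun, inner_fun and level are hypothetical.

```r
level <- "global"

outer_fun <- function() {
  level <- "outer"
  inner_fun <- function() {
    # level is not defined in this function, so R looks one environment up
    level
  }
  inner_fun()
}

found <- outer_fun()
found   # "outer": the nearest enclosing definition wins over the global one
```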

Name masking refers to the notion that objects of the same name can exist in different environments. Consider these examples:

y <- x <- 'I came from outside of f!'
f3 <- function(){
  x <- 'I came from inside of x!'
  list(x=x,y=y)
}
f3()
## $x
## [1] "I came from inside of x!"
## 
## $y
## [1] "I came from outside of f!"
x
## [1] "I came from outside of f!"
mean <- function(x){sum(x)}
mean(1:10)
## [1] 55
base::mean(1:10)
## [1] 5.5
rm(mean)

R also uses dynamic lookup, meaning values are searched for when a function is called and not when it is created. In the example above, y was defined in the global environment rather than within the function body. This means the value returned by f3 depends on the value of y in the global environment. You should generally avoid this, but there are occasions where it can be useful.

y <- "I have been reinvented!"
f3()
## $x
## [1] "I came from inside of x!"
## 
## $y
## [1] "I have been reinvented!"

Finally, lazy evaluation means R only evaluates function arguments if/when they are actually used.

f4 <- function(x){
  #x
  45
}
f4(x=stop("Let's pass an error."))
## [1] 45

Uncomment x to see what happens if we evaluate it.
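
Lazy evaluation also applies to default values, which is why a default may refer to other arguments of the same function. A minimal sketch, using a hypothetical function scale_by:

```r
# The default for divisor is not evaluated until divisor is first used,
# at which point x is already available.
scale_by <- function(x, divisor = max(abs(x))) {
  x / divisor
}

auto   <- scale_by(c(-2, 1, 4))               # divisor defaults to 4
manual <- scale_by(c(-2, 1, 4), divisor = 2)  # default is never evaluated
```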

Resources

Read more about functions here and here.

The second link is to Chapter 6 from the optional text “Advanced R”.

You can also read much more about functions in Chapter 7 of “The Art of R Programming.”

Practice
  1. Access and skim the help pages for ‘median()’, ‘mad()’, and ‘IQR()’.
  2. Write a function ‘zScoreRobust’ that accepts a numeric vector and returns robust z-scores.
  3. Make the function you wrote robust to vectors containing “NA” values.
  4. Generate some data from N(4,2) to test your functions.
  5. View the function at this link and answer the questions in the comments.

Conditionals

In programming, we often need to execute a piece of code only if some condition is true. Here are some of the R tools for doing this.

if statements

The workhorse for conditional execution in R is the if statement. In the syntax below, note the spacing around the condition enclosed in the parentheses.

if (TRUE) {
  print('do something if true')
}
## [1] "do something if true"

There are different opinions on whether to use the above or this:

if(TRUE){
  print('do something if true')
}
## [1] "do something if true"

You can use whichever style you prefer, but be consistent. Occasionally, with short statements, it can be idiomatic to put the statement on the same line as the condition without the braces:

if(TRUE) print('do something if true')
## [1] "do something if true"

Use an else to control the flow without separately checking the condition’s negation:

if ({2+2}==5) {
  print('the statement is true')
} else {
  print('the statement is false')
}
## [1] "the statement is false"
result <- c(4,5)
report = ifelse({2+2}==result, 'true', 'false')
report
## [1] "true"  "false"

As you can see above, there is also an ifelse function that can be useful.
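
The key difference is that if expects a single TRUE or FALSE, while ifelse works element by element on a whole vector. A small sketch with made-up values (temps is hypothetical):

```r
temps <- c(-3, 0, 12, 25)

# ifelse() tests every element and returns a vector of the same length.
label <- ifelse(temps > 0, "above zero", "zero or below")
label

# if() needs a single value; in R 4.2.0 and later a longer condition is an error.
# if (temps > 0) print("above zero")
```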

For more complex cases, you may want to check multiple conditions:

a = -1
b = 1
if (a*b > 0) {
  print('Zero is not between a and b')
} else if (a < b) {
  smaller = a
  larger = b
} else {
  smaller = b
  larger = a
}

In all of the examples above, please pay close attention to the use of indentation for clarity.

switch

Use a switch when you have multiple discrete options.

Here is a simple example:

cases = function(x) {
  switch(as.character(x),
    a=1,
    b=2,
    c=3,
    "Neither a, b, nor c."
  )
}
cases("a")
## [1] 1
cases("m")
## [1] "Neither a, b, nor c."
cases(8)
## [1] "Neither a, b, nor c."

Without the coercion, the final call will evaluate to NULL.

cases2 = function(x) {
  switch(x,
    a=1,
    b=2,
    c=3,
    "Neither a, b, nor c."
  )
}
cases2(8) # NULL is returned invisibly, so nothing is printed
is.null(cases2(8))
## [1] TRUE
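
With a character expression, several options can share one result: an option with an empty right-hand side falls through to the next option that has a value. The function day_type below is a hypothetical sketch of this.

```r
day_type <- function(day) {
  switch(day,
    sat = ,                # falls through to sun
    sun = "weekend",
    mon = , tue = , wed = , thu = ,
    fri = "weekday",
    "unknown"              # default when nothing matches
  )
}

day_type("sat")   # "weekend"
day_type("tue")   # "weekday"
day_type("xyz")   # "unknown"
```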

A switch can also be used with a numeric expression,

for(i in c(-1:3, 9))  print(switch(i, 1, 2 , 3, 4))
## NULL
## NULL
## [1] 1
## [1] 2
## [1] 3
## NULL

Here is a more useful example:

mySummary <- function(x){
  switch(class(x),
         factor=table(x),
         numeric=sprintf('mean=%4.2f,sd=%4.2f', mean(x), sd(x)),
         'Only defined for factor and numeric classes.')
}
for ( var in names(iris) ) {
  cat(var, ':\n', sep='')
  print( mySummary(iris[,var]) )
}
## Sepal.Length:
## [1] "mean=5.84,sd=0.83"
## Sepal.Width:
## [1] "mean=3.06,sd=0.44"
## Petal.Length:
## [1] "mean=3.76,sd=1.77"
## Petal.Width:
## [1] "mean=1.20,sd=0.76"
## Species:
## x
##     setosa versicolor  virginica 
##         50         50         50
Practice
  1. Read the R code below and determine the value of twos and threes at the end.
twos = 0
threes = 0
for (i in 1:10) {
  if (i %% 2 == 0) {
    twos = twos + i
  } else if (i %% 3 == 0) {
    threes = threes + i 
  }
}
  2. Read the R code below and determine the value of x at the end.
x = 0
for (i in 1:10) {
  x = x + switch(i %% 3, 1, 5, 10)
}

Control Statements

for loops

Here is the syntax for a basic for loop in R:

for ( i in 1:10 ) {
   cat(i,'\n')
}
## 1 
## 2 
## 3 
## 4 
## 5 
## 6 
## 7 
## 8 
## 9 
## 10

Note that the loop body and the iterator are evaluated within the global environment, so the iterator (here i) still exists, holding its final value, after the loop finishes.

for (var in names(mtcars)) {
  cat( sprintf('average %s = %4.3f', var, mean(mtcars[,var])), '\n')
}
## average mpg = 20.091 
## average cyl = 6.188 
## average disp = 230.722 
## average hp = 146.688 
## average drat = 3.597 
## average wt = 3.217 
## average qsec = 17.849 
## average vs = 0.438 
## average am = 0.406 
## average gear = 3.688 
## average carb = 2.812
while

A while statement can be useful when you aren’t sure how many iterations are needed.
Here is an example that takes a random walk and terminates if the value is more than 10 units from 0.

maxIter <- 1e3 # always limit the total iterations allowed
val = vector(mode='numeric', length=maxIter)
val[1] = rnorm(1) ## initialize
k = 1
while ( abs(val[k]) < 10 && k <= maxIter ) {
  val[k+1] = val[k] + rnorm(1)
  k = k + 1
}
val = val[1:{k-1}]
plot(val, type='l')

key words

The following key words are useful within loops:

  • break - break out of the currently executing loop
  • next - move to the next iteration immediately, without executing the rest of this iteration (continue in other languages such as C++)

Here is an example using next:

for (i in 1:10) {
  if (i %% 2 == 0) next
  cat(i,'\n')
}
## 1 
## 3 
## 5 
## 7 
## 9

Here is an example using break:

x = c()
for (i in 1:1e1) {
  if (i %% 3 == 0) break
  x = c(x,i)
}
print(x)
## [1] 1 2
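
The break example grows x one element at a time with c(x, i), which is fine when the final length is unknown. When the length is known in advance, a common alternative is to preallocate the result and fill it by index. The sketch below assumes a hypothetical task of storing squares.

```r
n <- 10
squares <- numeric(n)        # preallocate a vector of n zeros
for (i in seq_len(n)) {      # seq_len(n) is empty when n is 0, unlike 1:n
  squares[i] <- i^2
}
squares
```
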
Practice

The Fibonacci sequence starts 1, 1, 2, … and continues with each new value formed by adding the two previous values.

  1. Write a function ‘Fib1’ which takes an argument ‘n’ and returns the \(n^{th}\) value of the Fibonacci sequence. Use a for loop in the function.

  2. Write a function ‘Fib2’ which does the same thing using a while loop.

  3. Use a switch to write a function that has a parameter loop = c('for', 'while') for calling either Fib1 or Fib2.

Important Classes

Matrices

Matrices are two-dimensional vectors organized into rows and columns. They always contain values of a single type.

Matrices are stored using ‘column-major ordering’ meaning that by default they are filled and operated on by column.

X <- matrix(1:10,nrow=5,ncol=2)
Y <- matrix(1:10,nrow=5,ncol=2,byrow = TRUE)
X
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
Y
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
## [4,]    7    8
## [5,]    9   10
class(X) # In R 4.0.0 and later this also reports "array"
## [1] "matrix"

R can do matrix multiplication and many other linear algebra computations.

X %*% t(Y)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   13   27   41   55   69
## [2,]   16   34   52   70   88
## [3,]   19   41   63   85  107
## [4,]   22   48   74  100  126
## [5,]   25   55   85  115  145
3*X
##      [,1] [,2]
## [1,]    3   18
## [2,]    6   21
## [3,]    9   24
## [4,]   12   27
## [5,]   15   30
c(1,2)*Y
##      [,1] [,2]
## [1,]    1    4
## [2,]    6    4
## [3,]    5   12
## [4,]   14    8
## [5,]    9   20

Matrices have both dimension and length.

dim(X)
## [1] 5 2
length(X)
## [1] 10
as.vector(X)
##  [1]  1  2  3  4  5  6  7  8  9 10
c(nrow(X),ncol(X))
## [1] 5 2
colnames(X) <- paste('Col',1:2,sep='')
rownames(X) <- letters[1:5]
X["a",]
## Col1 Col2 
##    1    6
X[1:3,'Col2']
## a b c 
## 6 7 8
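
Column-major ordering also means a matrix can be indexed with a single number that counts down the columns. The sketch below checks the correspondence between the two forms of indexing; M, i and j are illustrative names.

```r
M <- matrix(1:10, nrow = 5, ncol = 2)
i <- 3
j <- 2

# Position of element (i, j) when the columns are stacked end to end.
linear_index <- i + (j - 1) * nrow(M)

same <- M[i, j] == M[linear_index]
same   # TRUE
```
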
Arrays

See ‘help(array)’.

Multiple types
Lists

In R a list is a generic container for storing values of multiple types.

myList <- list(Name='An example list',
               Matrix=diag(5),
               n=5
               )
myList
## $Name
## [1] "An example list"
## 
## $Matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1
## 
## $n
## [1] 5
class(myList)
## [1] "list"
length(myList)
## [1] 3
names(myList)
## [1] "Name"   "Matrix" "n"

You can access a specific element in a list by position or name:

myList[['Name']]
## [1] "An example list"
myList$Matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1

Note the use of double brackets (‘[[ ]]’) above and compare it to the single bracket case below.

class(myList['n'])
## [1] "list"
class(myList[['n']])
## [1] "numeric"
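
The distinction matters as soon as you compute with the result. A minimal sketch with a hypothetical list cfg:

```r
cfg <- list(name = "demo", size = 3)

sub <- cfg["size"]       # single brackets: a list of length one
val <- cfg[["size"]]     # double brackets: the stored value itself
by_position <- cfg[[2]]  # elements can also be extracted by position

doubled <- val * 2       # works because val is numeric
# sub * 2                # error: a list cannot be multiplied
```
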
Data Frames

Data frames are perhaps the most common way to represent a data set in R. A data frame is like a matrix with observations or units in rows and variables in columns. Unlike a matrix, it doesn’t require the columns to all be of the same type.

df <- data.frame(ID=1:10,
                 Group=
                   sample(0:1,10,replace=TRUE),
                 Var1=rnorm(10),
                 Var2=seq(0,1,length.out=10),
                 Var3=factor(
                   rep(c('a','b'),each=5)
                   )
                )
names(df)
## [1] "ID"    "Group" "Var1"  "Var2"  "Var3"
dim(df)
## [1] 10  5
length(df)
## [1] 5
nrow(df)
## [1] 10

We can access the values of a data frame both like a list:

df$ID
##  [1]  1  2  3  4  5  6  7  8  9 10
df[['Var3']]
##  [1] a a a a a b b b b b
## Levels: a b

or like a matrix:

df[1:5,]
##   ID Group       Var1      Var2 Var3
## 1  1     0  0.5979480 0.0000000    a
## 2  2     0 -0.9325598 0.1111111    a
## 3  3     1 -1.0903565 0.2222222    a
## 4  4     0  0.9665632 0.3333333    a
## 5  5     1  0.4822307 0.4444444    a
df[,'Var2']
##  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
##  [8] 0.7777778 0.8888889 1.0000000
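
Matrix-style indexing combines naturally with logical vectors to select rows that meet a condition. The data frame toy below is made up for illustration.

```r
toy <- data.frame(id    = 1:4,
                  group = c("a", "b", "a", "b"),
                  value = c(2.5, 3.0, 4.5, 1.0))

group_a   <- toy[toy$group == "a", ]              # rows where group is "a"
large_ids <- toy[toy$value > 2, "id"]             # one column comes back as a vector
large     <- toy[toy$value > 2, c("id", "value")] # several columns stay a data frame
```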

Apply

Loops in R can be quite slow compared to other programming languages on account of the overhead of many of the conveniences that make it useful for routine data analysis. Often, explicit loops can be avoided by using an ‘apply’ function.

Here is an example:

X = matrix(rep(1:5,each=5),5,5)
apply(X,1,sum)
## [1] 15 15 15 15 15
apply(X,2,sum)
## [1]  5 10 15 20 25

For lists use ‘lapply()’ or ‘sapply()’.

myList=list(x=1:5,y=-5:-1)
lapply(myList,sum)
## $x
## [1] 15
## 
## $y
## [1] -15
sapply(myList,sum)
##   x   y 
##  15 -15

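The columns of a data.frame are represented internally as a list, so use lapply or sapply with data frames. In contrast, apply first coerces the data frame to a matrix, which is why every column of iris is reported as character in the final call below.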

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
sapply(iris,class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"
apply(iris,2,class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##  "character"  "character"  "character"  "character"  "character"

A very powerful construction for data manipulation is the use of an apply function with an anonymous function, that is, a function defined in place without a name.

sapply(mtcars,function(x){
  nVals = length(unique(x))
  return(nVals)
})
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##   25    3   27   22   22   29   30    2    2    3    6
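
Because sapply chooses its return type based on the results, it can be surprising inside functions. Base R also provides vapply, which takes a template for the expected result of each call and stops with an error if a result does not match. A sketch of the same computation:

```r
# integer(1) declares that each call must return a single integer.
n_unique <- vapply(mtcars, function(x) length(unique(x)), integer(1))
n_unique[["cyl"]]   # 3
```
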
Exercises
  1. Use apply to get the class of each variable in the ‘mtcars’ data set.
  2. Use apply to find the row means and column means of the ‘attitude’ data.

Packages

Objectives:

  • Understand:
    • Basics of the R package system
    • What it means for a function to be ‘masked’
  • Be able to:
    • Install packages
    • Make a package available to R
    • Call functions from packages without loading
    • Remove packages

The R package system

Much of the utility of R is derived from an extensive collection of user and domain-expert contributed packages. Packages are simply a standardized way for people to share documented code and data. There are thousands of packages!

Packages are primarily distributed through three sources:

Installing packages

The primary way to install a package is using ‘install.packages(“pkg”)’.

#install.packages('lme4') # the package name should be passed as a character string

You can find the default location for your R packages using the .libPaths() function. If you don’t have write permission to this folder, you can set this directory to a personal library instead.

.libPaths() ## The default library location
## [1] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library"
.libPaths('/Users/jbhender/Rlib') #Create the directory first!
.libPaths()
## [1] "/Users/jbhender/Rlib"                                          
## [2] "/Library/Frameworks/R.framework/Versions/3.4/Resources/library"

To install a package to a personal library use the ‘lib’ option.

## install.packages("haven",lib='/Users/jbhender/Rlib')

If your computer has the necessary tools, packages can also be installed from source by downloading the package file and pointing directly to the source tarball (‘.tar.gz’). Precompiled binaries (‘.tgz’ on macOS, ‘.zip’ on Windows) can be installed from a local file in the same way.

Using packages in R

Installing a package does not make it available to R! There are two ways to use things from a package:

  • calling library("pkg") to add it to the search path
  • using the pkg::function construction.

These methods are illustrated below using the data set ‘InstEval’ distributed with the ‘lme4’ package.

#head(InstEval)
## Using the pkg::function construction
head(lme4::InstEval)
##   s    d studage lectage service dept y
## 1 1 1002       2       2       0    2 5
## 2 1 1050       2       1       1    6 2
## 3 1 1582       2       2       0    2 5
## 4 1 2050       2       2       1    3 3
## 5 2  115       2       1       0    5 2
## 6 2  756       2       1       0    5 4

The library("pkg") command adds a package to the search path.

search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"
library(lme4)
## Loading required package: Matrix
search()
##  [1] ".GlobalEnv"        "package:lme4"      "package:Matrix"   
##  [4] "package:stats"     "package:graphics"  "package:grDevices"
##  [7] "package:utils"     "package:datasets"  "package:methods"  
## [10] "Autoloads"         "package:base"
head(InstEval)
##   s    d studage lectage service dept y
## 1 1 1002       2       2       0    2 5
## 2 1 1050       2       1       1    6 2
## 3 1 1582       2       2       0    2 5
## 4 1 2050       2       2       1    3 3
## 5 2  115       2       1       0    5 2
## 6 2  756       2       1       0    5 4

To remove a package from the search path use detach("package:pkg", unload=TRUE).

detach(package:lme4, unload=TRUE)
search()
##  [1] ".GlobalEnv"        "package:Matrix"    "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"
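
When a script needs only one or two functions from a package, one option is to check that the package is installed and then use the pkg::function construction, leaving the search path untouched. The sketch below uses the stats package, which ships with R, so that it runs anywhere; the variable names are illustrative.

```r
pkg <- "stats"

# requireNamespace() returns TRUE or FALSE instead of stopping with an error,
# and it does not add the package to the search path.
if (requireNamespace(pkg, quietly = TRUE)) {
  m <- stats::median(c(1, 5, 9))
} else {
  message("Package ", pkg, " is not installed.")
}
m   # 5
```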