The material in this lesson is largely based on:
Chapter 18 of Advanced R
Chapter 14 of The Art of R Programming
You likely already know that binary data is composed of bits (0 or 1). You may also know that computer systems generally use bytes containing multiple bits as the basic addressable unit.
0In modern computer architectures a byte usually consists of 8 bits. When describing a data storage quantity in terms of bytes we traditionally use prefixes based on powers of 2, so that 1KB \(= 2^{10} (1024)\) bytes, a MB = \(2^{20}\) (1,048,576) bytes etc.
You may also want to review the section ‘Storage’ from Professor Shedden’s notes.
An understanding of virtual memory, segmentation, and caching may be of use at some point in the future but is not required for this course.
A useful tool for exploring memory management in R is Hadley Wickham’s pryr
package. It contains a function pryr::object_size
similar to object.size
but with these differences:
Here are some examples.
library(pryr); library(ggplot2)
nyc14 = data.table::fread('https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv')
# Simple size of objects
object.size(1:10)
## 88 bytes
pryr::object_size(1:10)
## 88 B
object.size(nyc14)
## 21469968 bytes
print(object.size(nyc14), units='Mb')
## 20.5 Mb
object_size(nyc14)
## 21.5 MB
compare_size(nyc14)
## base pryr
## 21469968 21468872
## How to understand the difference?
# Example from object_size documentation.
x = 1:1e5
z = list(x, x, x)
compare_size(z)
## base pryr
## 1200192 400112
object_size(x)
## 400 kB
object_size(z)
## 400 kB
object_size(x,z)
## 400 kB
To understand this last example, we need to recall R’s “copy on modify semantics” discussed during the lessons on data.table
. Here we will use the function tracemem()
to get the memory address of an object. When this changes we can be sure that an object has been copied. Similar functionality is available via pryr::address()
or data.table::address()
..
## Recall R has "copy-on-modify semantics"
invisible(tracingState(on=FALSE))
tracemem(x)
## [1] "<0x10c350000>"
tracemem(z)
## [1] "<0x7ff16873cd68>"
## pryr::address doesn't work well with knitr, but would be the
## same interactively
pryr::address(z)
## [1] "0x7ff165e8af48"
data.table::address(z)
## [1] "0x7ff16873cd68"
inspect(z)
## <VECSXP 0x7ff16873cd68>
## <INTSXP 0x10c2ee000>
## <INTSXP 0x10c28c000>
## <INTSXP 0x10c3b2000>
# Modify z, tracemem causes message to print
z[[2]] = -x
sapply(z, tracemem)
## [1] "<0x10c2ee000>" "<0x10c565000>" "<0x10c3b2000>"
# Having changed z[[2]], the object is larger.
object_size(z)
## 1.2 MB
object_size(x,z)
## 1.6 MB
Be aware that memory profiling is a compile time option, meaning it may not be available on all instances of R. By default, the mac OS and Windows builds distributed through CRAN have memory profiling enabled.
Here is an example of how environments impact object_size
.
f <- function(){
x <- 1:1e5
a ~ b
}
compare_size(f())
## base pryr
## 712 400864
g <- function(){
a ~ b
}
compare_size(g())
## base pryr
## 712 712
h <- function(){
x <- 1:1e5
x~y
}
compare_size(h())
## base pryr
## 712 400808
object_size(a~b)
## 656 B
object_size(1:1e5)
## 400 kB
Two other useful functions from pryr
are mem_used()
which adds up the total size of all objects in memory and mem_change()
which tracks changes to this quantity. When working with mem_change()
ignore anything ~2KB or smaller as this mostly is impacted by changes to .Rhistory.
ls()
## [1] "f" "g" "h" "nyc14" "x" "z"
mem_used()
## 64.1 MB
mem_change({new_vec = 1:1e6})
## 4 MB
# Negatives represent memory freed
mem_change(rm(new_vec))
## -4 MB
mem_change(rm(nyc14))
## -21.5 MB
This example is taken from section 18.1 of Advanced R.
Here we exam the size, in bytes, of R vectors of class integer with lengths 0 through 100.
sizes = sapply(0:100, function(n) object_size(seq_len(n)))
plot(0:100, sizes, xlab='Vector Length', ylab='Size (B)', type='s')
abline(h=40,lty='dashed')
sizes[1:20]
## [1] 40 48 48 56 56 72 72 72 72 88 88 88 88 104 104 104 104
## [18] 168 168 168
It turns out that empty vectors of any type occupy 40 bytes of memory,
sapply(c('numeric', 'logical', 'integer', 'raw', 'list'),
function(x) object.size(vector(mode=x, length=0))
)
## numeric logical integer raw list
## 40 40 40 40 40
These bytes are used to store the following components:
The additional 4 bytes are used for padding to ensure each of these elements starts on an 8 byte boundary. These boundaries are generally required by 64 bit CPU architectures, see here.
How do we interpret the remaining steps in the graph? First, consider the regular steps for the later vectors:
diff(sizes[41:45])
## [1] 8 0 8 0
For vectors beyond 128 bytes in size (excluding overhead) R requests memory from the OS in 8 byte chunks using the C function malloc()
. Since an integer occupies 4 bytes, the memory increases every other integer.
## Adjusted sizes
plot(0:100, sizes-40, xlab='Vector Length', ylab='Size (B) less overhead', type='s')
abline(h=0,lty='dashed')
abline(v=c(41,43), lty='dotted')
abline(h={8*c(1,2,4,6,8,16)}, col='blue', lty='dotted')
For vectors, smaller than 128 bytes in size R performs its own memory management using something called the ‘small vector pool’ to avoid unnecessary requests to the OS for RAM. For simplicity, it only allocates specific multiples of 8 bytes as shown in the plot. Note that these value correspond to the data held by the vector and not the 40 B of overhead. This small vector pool is expanded by a page in increments of 2000 bytes as needed.
Question 1: What vector lengths are shown by the vertical lines in the plot below? Where do the horizontal lines intersect the y-axis?
Question 2: Recall that an integer type is stored using 4 bytes while a numeric type uses 8 bytes. What are the values of a
and b
after running the R code below?
a = object_size(1:15)
b = object_size(as.numeric(1:15))
Question 3: What are the approximate values of mem_a - mem_c below?
x = 1:1e6
z = list(x, x, x)
object_size(x, z)
## 4 MB
mem_a = object_size(z) - object_size(x) # Exact in bytes
mem_b = mem_change(z[[1]] <- rep(2L, 1e6)) # Approximate in Mb
mem_c = mem_change(z[[2]][sample(1:1e6, 1)] <- runif(1)) # Approximate in Mb
We noted above that integers are stored using 4 bytes (32 bits) and doubles 8 bytes (64 bits). What about characters? According to the R documenation on CRAN, R uses a global pool of strings and pointers to them in actual strings.
# pointers to strings take 8 bytes each
x = 'abc'
x1e5 = rep(x,1e5)
object_size(x)
## 96 B
object_size(x1e5)
## 800 kB
object_size(x,x1e5)
## 800 kB
The global pool stores both the encoding of each string and the actual bytes.
In contrast, R objects of class factor
are stored as integers encoding the levels which are in turn strings. Since integers occupy only 4 bytes, if there are only a few levels the factor can have a smaller memory footprint.
x = sample(1:3,1e5, replace=TRUE)
object_size(x) - 40
## 400 kB
x_char = LETTERS[x]
object_size(x_char) - 40
## 800 kB
x_factor = as.factor(x_char)
object_size(x_factor) - 40
## 401 kB
In contrast, if there are many levels the factor may have a larger footprint. Read more from data.table
author Matt Dowle here.
Rprof
and profviz
The Rprof()
function can be used to profile R code for both speed and memory usage by sampling. This works by recording in a log every so often (by default .02 s) what functions are currently on the stack. It will also
Recall our comparisons of various R implementations for screening correlation coefficients for a large number of possible predictors in the rows of a matrix xmat
with a single outcome y
. Here we will add a Fisher transform as well and return just the indices where the sample coefficients is nominally significant.
# Example data
n = 3e2
m = 1e5
yvec = rnorm(n)
xmat = outer(array(1, m), yvec)
rmat = matrix(runif(m, -.8, .8), m, n)
xmat = rmat*xmat + sqrt(1 - rmat^2)*rnorm(n * m)
object_size(xmat)
## 240 MB
object_size(yvec)
## 2.44 kB
# When memory is an issue, be sure to clean up when possible.
mem_change(rm(rmat))
## -240 MB
# Functions to compare
cor_screen_1 = function(yvec, xmat){
r1 = NULL
for (i in 1:m) {
r1[i] = cor(xmat[i, ], yvec)
}
z = {.5*{log(1+r1) - log(1-r1)}}*sqrt(length(yvec)-3)
which(abs(z)>qnorm(.975))
}
cor_screen_2 = function(yvec, xmat){
r2 = apply(xmat, 1, function(v) cor(v, yvec))
z = {.5*{log(1+r2) - log(1-r2)}}*sqrt(length(yvec)-3)
which(abs(z)>qnorm(.975))
}
cor_screen_3 = function(yvec, xmat){
rmn = rowMeans(xmat)
xmat_c = xmat - outer(rmn, array(1, n))
rsd = apply(xmat, 1, sd)
xmat_s = xmat_c / outer(rsd, array(1, n))
yvec_s = {yvec - mean(yvec)} / sd(yvec)
r3 = xmat_s %*% yvec_s / {n - 1}
z = as.vector({.5*{log(1+r3) - log(1-r3)}} * sqrt(length(yvec)-3))
which(abs(z)>qnorm(.975))
}
cor_screen_4 = function(yvec, xmat){
rmn = rowMeans(xmat)
xmat_c = xmat - rmn
rvar = rowSums(xmat_c^2) / {dim(xmat)[2] - 1}
rsd = sqrt(rvar)
xmat_s = xmat_c / rsd
yvec_s = {yvec - mean(yvec)} / sd(yvec)
r4 = xmat_s %*% yvec_s / {n - 1}
z = as.vector({.5*{log(1+r4) - log(1-r4)}} * sqrt(length(yvec)-3))
which(abs(z)>qnorm(.975))
}
# Check that all are equal
s = list(cor_screen_1(yvec, xmat), cor_screen_2(yvec, xmat),
cor_screen_3(yvec, xmat), cor_screen_4(yvec, xmat)
)
sapply(2:4, function(i) setdiff(s[[1]],s[[i]]))
## [[1]]
## integer(0)
##
## [[2]]
## integer(0)
##
## [[3]]
## integer(0)
Here is an example of profiling cor_screen_1
for speed with Rprof()
.
Rprof(memory.profiling = TRUE, interval=.002)
invisible(cor_screen_1(yvec, xmat))
Rprof(NULL)
summaryRprof(memory = 'both')
## $by.self
## self.time self.pct total.time total.pct mem.total
## "cor" 0.216 20.65 0.976 93.31 1841.8
## "is.data.frame" 0.154 14.72 0.154 14.72 273.4
## "stopifnot" 0.130 12.43 0.396 37.86 696.7
## "match.call" 0.110 10.52 0.242 23.14 409.9
## "cor_screen_1" 0.068 6.50 1.046 100.00 1967.0
## "eval" 0.064 6.12 1.046 100.00 1967.0
## "sys.function" 0.056 5.35 0.082 7.84 142.1
## "sys.call" 0.052 4.97 0.068 6.50 112.6
## "sys.parent" 0.042 4.02 0.042 4.02 80.3
## "match.arg" 0.040 3.82 0.186 17.78 413.1
## "formals" 0.036 3.44 0.064 6.12 142.0
## "pmatch" 0.022 2.10 0.022 2.10 23.4
## "parent.frame" 0.014 1.34 0.014 1.34 21.1
## "all" 0.008 0.76 0.008 0.76 7.9
## "c" 0.008 0.76 0.008 0.76 24.6
## "is.atomic" 0.006 0.57 0.006 0.57 16.5
## "anyNA" 0.004 0.38 0.004 0.38 5.4
## "is.pairlist" 0.004 0.38 0.004 0.38 6.9
## "length" 0.004 0.38 0.004 0.38 12.6
## "abs" 0.002 0.19 0.002 0.19 1.5
## "baseenv" 0.002 0.19 0.002 0.19 7.3
## "invisible" 0.002 0.19 0.002 0.19 4.5
## "is.na" 0.002 0.19 0.002 0.19 1.4
##
## $by.total
## total.time total.pct mem.total self.time self.pct
## "cor_screen_1" 1.046 100.00 1967.0 0.068 6.50
## "eval" 1.046 100.00 1967.0 0.064 6.12
## "block_exec" 1.046 100.00 1967.0 0.000 0.00
## "call_block" 1.046 100.00 1967.0 0.000 0.00
## "evaluate_call" 1.046 100.00 1967.0 0.000 0.00
## "evaluate" 1.046 100.00 1967.0 0.000 0.00
## "handle" 1.046 100.00 1967.0 0.000 0.00
## "in_dir" 1.046 100.00 1967.0 0.000 0.00
## "knitr::knit" 1.046 100.00 1967.0 0.000 0.00
## "process_file" 1.046 100.00 1967.0 0.000 0.00
## "process_group.block" 1.046 100.00 1967.0 0.000 0.00
## "process_group" 1.046 100.00 1967.0 0.000 0.00
## "rmarkdown::render" 1.046 100.00 1967.0 0.000 0.00
## "timing_fn" 1.046 100.00 1967.0 0.000 0.00
## "withCallingHandlers" 1.046 100.00 1967.0 0.000 0.00
## "withVisible" 1.046 100.00 1967.0 0.000 0.00
## "cor" 0.976 93.31 1841.8 0.216 20.65
## "stopifnot" 0.396 37.86 696.7 0.130 12.43
## "match.call" 0.242 23.14 409.9 0.110 10.52
## "match.arg" 0.186 17.78 413.1 0.040 3.82
## "is.data.frame" 0.154 14.72 273.4 0.154 14.72
## "sys.function" 0.082 7.84 142.1 0.056 5.35
## "sys.call" 0.068 6.50 112.6 0.052 4.97
## "formals" 0.064 6.12 142.0 0.036 3.44
## "sys.parent" 0.042 4.02 80.3 0.042 4.02
## "pmatch" 0.022 2.10 23.4 0.022 2.10
## "parent.frame" 0.014 1.34 21.1 0.014 1.34
## "all" 0.008 0.76 7.9 0.008 0.76
## "c" 0.008 0.76 24.6 0.008 0.76
## "is.atomic" 0.006 0.57 16.5 0.006 0.57
## "anyNA" 0.004 0.38 5.4 0.004 0.38
## "is.pairlist" 0.004 0.38 6.9 0.004 0.38
## "length" 0.004 0.38 12.6 0.004 0.38
## "abs" 0.002 0.19 1.5 0.002 0.19
## "baseenv" 0.002 0.19 7.3 0.002 0.19
## "invisible" 0.002 0.19 4.5 0.002 0.19
## "is.na" 0.002 0.19 1.4 0.002 0.19
## "which" 0.002 0.19 1.5 0.000 0.00
##
## $sample.interval
## [1] 0.002
##
## $sampling.time
## [1] 1.046
Here are two other options for the memory parameter.
#summaryRprof(memory = 'tseries')
summaryRprof(memory = 'stats')
## index: "rmarkdown::render":"knitr::knit"
## vsize.small max.vsize.small vsize.large max.vsize.large
## 12747 414647 308554 120000044
## nodes max.nodes duplications tot.duplications
## 3282033 35004816 3060 1600184
## samples
## 523
profviz
to visualize profiling informationThe profviz
package is built on Rprof()
but aims to provide more useful summary information.
#install.packages('profvis')
library(profvis)
profvis(cor_screen_1(yvec, xmat), interval = .005)
profvis(cor_screen_2(yvec, xmat), interval = .005)
profvis(cor_screen_3(yvec, xmat), interval = .005)
profvis(cor_screen_4(yvec, xmat), interval = .005)