The material in this lesson is largely based on:
Chapter 18 of the first edition of Advanced R.
Chapter 14 of The Art of R Programming.
You likely already know that binary data is composed of bits (0 or 1). You may also know that computer systems generally use bytes containing multiple bits as the basic addressable unit.
In modern computer architectures a byte usually consists of 8 bits. When describing a data storage quantity in terms of bytes we traditionally use prefixes based on powers of 2, so that 1 KB = \(2^{10}\) (1,024) bytes, 1 MB = \(2^{20}\) (1,048,576) bytes, etc. However, SI prefixes based on powers of 10 are now more common: 1 kB = 1e3 bytes, 1 MB = 1e6 bytes.
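As a quick illustration of the two conventions, the sketch below converts one byte count both ways. The names `binary` and `si` are chosen here for illustration only; the unit labels KiB and MiB are the IEC names for the power-of-2 units.

```r
# One byte count expressed in power-of-2 and power-of-10 units.
bytes = 2^20

binary = c(KiB = bytes / 2^10, MiB = bytes / 2^20)  # powers of 2
si     = c(kB  = bytes / 1e3,  MB  = bytes / 1e6)   # powers of 10
binary
si

# format() for object.size() results accepts either convention.
x = numeric(1e6)
format(object.size(x), units = "MB",  standard = "SI")
format(object.size(x), units = "MiB", standard = "IEC")
```

The same object is reported as a slightly larger number of MB than MiB, which is worth keeping in mind when comparing sizes reported by different tools.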
You may also want to review the section ‘Storage’ from Professor Shedden’s notes.
An understanding of virtual memory, segmentation, and caching may be of use at some point in the future but is not required for this course.
A useful tool for exploring memory management in R is Hadley Wickham’s pryr package. It contains a function pryr::object_size() similar to object.size() but with these differences:
It accounts for shared elements within an object.
It includes the size of environments.
Here are some examples.
library(pryr); library(ggplot2)
nyc14 = data.table::fread('https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv')
# Simple size of objects
object.size(1:10)
## 96 bytes
## 96 B
## 21501712 bytes
## 21.5 MB
## 21.5 MB
## base pryr
## 21501712 21500520
## How to understand the difference?
# Example from object_size documentation.
x = 1:1e5
z = list(x, x, x)
compare_size(z)
## base pryr
## 1200224 400128
## 400 kB
## 400 kB
## 400 kB
## 400 kB
## 1200.2 kB
To understand this last example, we need to recall R’s “copy on modify semantics” discussed during the lessons on data.table. Here we will use the function tracemem() to get the memory address of an object. When this changes we can be sure that an object has been copied. Similar functionality is available via pryr::address() or data.table::address().
## [1] "<0x7fc0882ffcf8>"
## [1] "<0x7fc0856b0eb8>"
## pryr::address doesn't work well with knitr, but would be the
## same interactively
pryr::address(z)
## [1] "0x7fc0856b0eb8"
## [1] "0x7fc0856b0eb8"
## <VECSXP 0x7fc0856b0eb8>
## <INTSXP 0x7fc0882ffcf8>
## [INTSXP 0x7fc0882ffcf8]
## [INTSXP 0x7fc0882ffcf8]
## [1] "<0x7fc0882ffcf8>" "<0x110ad3000>" "<0x7fc0882ffcf8>"
## 800 kB
## 800 kB
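The copy-on-modify behaviour described above can be sketched with base R alone. This is a minimal illustration, not the original code from these notes: it assumes R was built with memory profiling (checked with `capabilities("profmem")`), and it uses `sample()` so that `x` is an ordinary, fully materialised integer vector.

```r
x = sample(1e5)        # integer vector holding the values 1..1e5 in random order
z = list(x, x, x)      # three references to the same data; nothing is copied yet

# Ask R to report whenever the data behind x are duplicated.
if (capabilities("profmem")) tracemem(x)

# Modifying one list element forces a copy of that element only;
# z[[1]] and z[[3]] still share their data with x.
z[[2]][1] = 0L

if (capabilities("profmem")) untracemem(x)

identical(z[[1]], x)   # unchanged
identical(z[[2]], x)   # now differs in its first entry
```

After the modification the list holds two distinct integer vectors rather than one, which is consistent with the size roughly doubling in the output above.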
Here is an example of how environments impact object_size.
## base pryr
## 728 400888
## base pryr
## 728 728
## base pryr
## 728 400832
## 672 B
## 400 kB
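To see why environments matter, here is a small base R sketch of a closure that keeps a large vector alive. The names `make_fn` and `big` are hypothetical and are not the `f`, `g`, `h` used to produce the output above.

```r
# A function factory: the returned function encloses the environment
# in which `big` was created.
make_fn = function() {
  big = numeric(1e5)          # roughly 800 kB of doubles
  function() length(big)
}
g = make_fn()

ls(environment(g))                 # the enclosing environment still holds `big`
object.size(g)                     # base R does not measure the environment
object.size(environment(g)$big)    # the data the closure keeps in memory
```

Because `g` carries a reference to its enclosing environment, `big` cannot be garbage collected while `g` exists, even though `object.size(g)` reports only a small number.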
Two other useful functions from pryr are mem_used(), which adds up the total size of all objects in memory, and mem_change(), which tracks changes to this quantity. When working with mem_change(), ignore anything ~2 kB or smaller, as such changes are mostly due to updates to .Rhistory.
## [1] "f" "g" "h" "nyc14" "x" "z"
## 73.3 MB
## 4 MB
## -4 MB
## -21.5 MB
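If pryr is not available, a rough analogue of this bookkeeping can be sketched with base R's `gc()`, whose second column reports memory in use in Mb. The helper `mem_mb()` below is hypothetical and only approximates what `mem_used()` reports.

```r
# Hypothetical helper: total memory in use (Mb) according to gc().
mem_mb = function() sum(gc()[, 2])

before = mem_mb()
v = numeric(1e6)            # about 8e6 bytes of doubles
after = mem_mb()
grown = after - before      # change caused by creating v

rm(v)
freed = mem_mb() - after    # change caused by removing v
c(grown = grown, freed = freed)
```

As with mem_change(), small differences are noise; only changes on the order of the object's size are meaningful.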
This example is taken from section 18.1 of the first edition of Advanced R.
Here we examine the size, in bytes, of R vectors of class integer with lengths 0 through 100.
sizes = sapply(0:100, function(n) object_size(seq_len(n)))
plot(0:100, sizes, xlab='Vector Length', ylab='Size (B)', type='s')
abline(h=48, lty='dashed')
## [1] 48 56 56 64 64 80 80 80 80 96 96 96 96 112 112 112 112
## [18] 176 176 176
It turns out that empty vectors of any type occupy 48 bytes of memory:
sapply(c('numeric', 'logical', 'integer', 'raw', 'list'),
function(x) object_size(vector(mode=x, length=0))
)
## numeric logical integer raw list
## 48 48 48 48 48
These bytes are used to store the following components:
How do we interpret the remaining steps in the graph? First, consider the regular steps for the later vectors:
## [1] 8 0 8 0
For vectors beyond 128 bytes in size (excluding overhead) R requests memory from the OS in 8 byte chunks using the C function malloc(). Since an integer occupies 4 bytes, the memory increases every other integer.
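The 8-byte rule can be checked numerically. The sketch below uses base R's `object.size()` on `integer(k)` rather than the pryr call used for the plot, and the variable names are chosen here for illustration.

```r
# Integer vectors whose data exceed 128 bytes (lengths 33 and up).
n = 33:100

# Data are stored in whole 8-byte chunks: round 4 * n up to a multiple of 8.
data_bytes = 8 * ceiling(4 * n / 8)

observed = sapply(n, function(k) as.numeric(object.size(integer(k))))

# If the rule holds, the difference is one constant: the fixed overhead.
overhead = unique(observed - data_bytes)
overhead

# Sizes change by either 0 or 8 bytes as the length grows by one.
table(diff(observed))
```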
## Adjusted sizes
plot(0:100, sizes - 48, xlab='Vector Length', ylab='Size (B) less overhead', type='s')
abline(h=0,lty='dashed')
abline(v=c(41, 43), lty='dotted')
abline(h={8*c(1, 2, 4, 6, 8, 16)}, col='blue', lty='dotted')
For vectors of 128 bytes or less, R performs its own memory management using something called the ‘small vector pool’ to avoid unnecessary requests to the OS for RAM. For simplicity, it only allocates vectors whose data occupy 8, 16, 32, 48, 64, or 128 bytes, as shown by the horizontal lines in the plot. Note that these values correspond to the data held by the vector and not the 48 B of overhead. This small vector pool is expanded by a page in increments of 2000 bytes as needed.
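The rounding performed by the small vector pool can be written as a short function. This is only a sketch of the size classes listed in the plotting code above; `pool_sizes` and `pool_bytes()` are hypothetical names, not part of R.

```r
# Size classes (bytes of data) used for small vectors.
pool_sizes = 8 * c(1, 2, 4, 6, 8, 16)

# Round a data size of at most 128 bytes up to the next size class.
pool_bytes = function(data_bytes) {
  stopifnot(data_bytes >= 0, data_bytes <= 128)
  if (data_bytes == 0) return(0)
  min(pool_sizes[pool_sizes >= data_bytes])
}

# Integer vectors (4 bytes per element) of selected lengths.
lens = c(0, 1, 3, 5, 9, 13, 17)
alloc = sapply(4 * lens, pool_bytes)
alloc
# 0 8 16 32 48 64 128

# Compare with what base R reports, net of the empty-vector overhead.
sapply(lens, function(k) as.numeric(object.size(integer(k)))) -
  as.numeric(object.size(integer(0)))
```

Adding the 48 B overhead to these values gives the step heights seen in the first plot.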
Question 1: What vector lengths are shown by the vertical lines in the plot below? Where do the horizontal lines intersect the y-axis?
Question 2: Recall that an integer type is stored using 4 bytes while the double type uses 8 bytes. What are the values of a and b after running the R code below?
## [1] 0 8 16 32 48 64
Question 3: What are the approximate values of mem_a - mem_c below?
## 4 MB
We noted above that integers are stored using 4 bytes (32 bits) and doubles 8 bytes (64 bits). What about characters? According to the R documentation on CRAN, R uses a global pool of strings, and a character vector stores pointers to the strings in that pool.
## 112 B
## 800 kB
## base pryr
## 800104 800104
## 800 kB
The global pool stores both the encoding of each string and the actual bytes.
In contrast, R objects of class factor are stored as integers encoding the levels, which are in turn strings. Since integers occupy only 4 bytes, if there are only a few levels the factor can have a smaller memory footprint.
## 400 kB
## 800 kB
## 401 kB
In contrast, if there are many levels the factor may have a larger footprint. Read more from data.table author Matt Dowle here.
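Both cases can be compared directly with base R. The sketch below is illustrative only; the vectors `few` and `many` are made up for this purpose and are not the data used for the output above.

```r
n = 1e5

# Few distinct strings: the factor stores 4-byte codes plus two levels.
few = sample(c("yes", "no"), n, replace = TRUE)

# Every string distinct: the factor stores the codes and all n levels.
many = as.character(seq_len(n))

sizes = c(
  chr_few  = as.numeric(object.size(few)),
  fct_few  = as.numeric(object.size(factor(few))),
  chr_many = as.numeric(object.size(many)),
  fct_many = as.numeric(object.size(factor(many)))
)
round(sizes / 1e3)   # sizes in kB
```

With few levels the factor is about half the size of the character vector; with all values distinct the factor carries the integer codes on top of the full set of strings.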
Rprof and profvis
The Rprof() function can be used to profile R code for both speed and memory usage by sampling. This works by recording in a log every so often (by default .02 s) what functions are currently on the stack. It will also
Recall our comparisons of various R implementations for screening correlation coefficients for a large number of possible predictors in the rows of a matrix xmat with a single outcome y. Here we will add a Fisher transform as well and return just the indices where the sample coefficient is nominally significant.
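For reference, the Fisher transform of a single correlation can be sketched as follows. The simulated `x` and `y` are assumptions for illustration; the functions in this section apply the same formula to every row of xmat.

```r
set.seed(1)
n = 50
x = rnorm(n)
y = 0.5 * x + rnorm(n)

r = cor(x, y)

# Fisher transform: atanh(r) = 0.5 * (log(1 + r) - log(1 - r)).
# When the true correlation is 0, atanh(r) is approximately normal with
# variance 1 / (n - 3), so z is approximately standard normal.
z = atanh(r) * sqrt(n - 3)

all.equal(atanh(r), .5 * (log(1 + r) - log(1 - r)))

# Nominally significant at the two-sided 5% level?
abs(z) > qnorm(.975)
```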
# Example data
n = 3e2
m = 1e5
yvec = rnorm(n)
xmat = outer(array(1, m), yvec)
rmat = matrix(runif(m, -.8, .8), m, n)
xmat = rmat*xmat + sqrt(1 - rmat^2)*rnorm(n * m)
object_size(xmat)
## 240 MB
## 2.45 kB
## -240 MB
# Functions to compare
cor_screen_1 = function(yvec, xmat){
  # Loop over the rows of xmat, growing r1 one element at a time.
  r1 = NULL
  for (i in seq_len(nrow(xmat))) {
    r1[i] = cor(xmat[i, ], yvec)
  }
  # Fisher transform, scaled to be approximately standard normal.
  z = {.5*{log(1+r1) - log(1-r1)}}*sqrt(length(yvec)-3)
  which(abs(z)>qnorm(.975))
}
cor_screen_2 = function(yvec, xmat){
  # Same computation, using apply() over rows.
  r2 = apply(xmat, 1, function(v) cor(v, yvec))
  z = {.5*{log(1+r2) - log(1-r2)}}*sqrt(length(yvec)-3)
  which(abs(z)>qnorm(.975))
}
cor_screen_3 = function(yvec, xmat){
  # Standardize rows with outer(), then use one matrix product.
  n = ncol(xmat)
  rmn = rowMeans(xmat)
  xmat_c = xmat - outer(rmn, array(1, n))
  rsd = apply(xmat, 1, sd)
  xmat_s = xmat_c / outer(rsd, array(1, n))
  yvec_s = {yvec - mean(yvec)} / sd(yvec)
  r3 = xmat_s %*% yvec_s / {n - 1}
  z = as.vector({.5*{log(1+r3) - log(1-r3)}} * sqrt(length(yvec)-3))
  which(abs(z)>qnorm(.975))
}
cor_screen_4 = function(yvec, xmat){
  # As cor_screen_3, but relying on recycling instead of outer().
  n = ncol(xmat)
  rmn = rowMeans(xmat)
  xmat_c = xmat - rmn
  rvar = rowSums(xmat_c^2) / {n - 1}
  rsd = sqrt(rvar)
  xmat_s = xmat_c / rsd
  yvec_s = {yvec - mean(yvec)} / sd(yvec)
  r4 = xmat_s %*% yvec_s / {n - 1}
  z = as.vector({.5*{log(1+r4) - log(1-r4)}} * sqrt(length(yvec)-3))
  which(abs(z)>qnorm(.975))
}
# Check that all are equal
s = list(cor_screen_1(yvec, xmat), cor_screen_2(yvec, xmat),
cor_screen_3(yvec, xmat), cor_screen_4(yvec, xmat)
)
sapply(2:4, function(i) setdiff(s[[1]],s[[i]]))
## [[1]]
## integer(0)
##
## [[2]]
## integer(0)
##
## [[3]]
## integer(0)
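The only difference between cor_screen_3 and cor_screen_4 is how rows are centered and scaled. The small sketch below, on a made-up 2 x 3 matrix, shows why subtracting a vector of row means works: R stores matrices column by column, so a vector with one entry per row is recycled down each column.

```r
m = matrix(as.numeric(1:6), nrow = 2)   # rows: (1, 3, 5) and (2, 4, 6)
rmn = rowMeans(m)                       # one mean per row

a = m - rmn                             # recycling, as in cor_screen_4
b = m - outer(rmn, rep(1, ncol(m)))     # explicit matrix, as in cor_screen_3
s = sweep(m, 1, rmn)                    # the same operation via sweep()

identical(a, b)
all.equal(a, s)
rowMeans(a)                             # each row is now centered at 0
```

Recycling avoids building the intermediate matrix that outer() creates, which matters when xmat is hundreds of megabytes.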
Here is an example of profiling cor_screen_1 for speed with Rprof().
Rprof(memory.profiling = TRUE, interval=.002)
invisible(cor_screen_1(yvec, xmat))
Rprof(NULL)
summaryRprof(memory = 'both')
## $by.self
## self.time self.pct total.time total.pct mem.total
## "cor" 0.390 24.59 1.486 93.69 1484.4
## "is.data.frame" 0.298 18.79 0.298 18.79 296.5
## "stopifnot" 0.188 11.85 0.268 16.90 255.6
## "match.arg" 0.186 11.73 0.460 29.00 456.0
## "eval" 0.140 8.83 1.584 99.87 1575.9
## "cor_screen_1" 0.096 6.05 1.584 99.87 1575.9
## "pmatch" 0.054 3.40 0.056 3.53 66.6
## "formals" 0.042 2.65 0.100 6.31 102.3
## "sys.function" 0.038 2.40 0.058 3.66 65.2
## "...elt" 0.038 2.40 0.044 2.77 43.2
## "all" 0.024 1.51 0.024 1.51 20.2
## "sys.parent" 0.020 1.26 0.020 1.26 19.2
## "c" 0.014 0.88 0.014 0.88 22.2
## "sys.frame" 0.014 0.88 0.014 0.88 13.0
## "as.character" 0.008 0.50 0.008 0.50 8.9
## "is.na" 0.008 0.50 0.008 0.50 8.3
## "is.atomic" 0.006 0.38 0.006 0.38 5.5
## "...length" 0.004 0.25 0.004 0.25 5.7
## "anyNA" 0.004 0.25 0.004 0.25 10.6
## "invisible" 0.004 0.25 0.004 0.25 1.5
## "is.matrix" 0.004 0.25 0.004 0.25 3.2
## "is.numeric" 0.004 0.25 0.004 0.25 6.9
## "which" 0.002 0.13 0.002 0.13 1.1
##
## $by.total
## total.time total.pct mem.total self.time self.pct
## "block_exec" 1.586 100.00 1576.8 0.000 0.00
## "call_block" 1.586 100.00 1576.8 0.000 0.00
## "evaluate_call" 1.586 100.00 1576.8 0.000 0.00
## "evaluate::evaluate" 1.586 100.00 1576.8 0.000 0.00
## "evaluate" 1.586 100.00 1576.8 0.000 0.00
## "in_dir" 1.586 100.00 1576.8 0.000 0.00
## "knitr::knit" 1.586 100.00 1576.8 0.000 0.00
## "process_file" 1.586 100.00 1576.8 0.000 0.00
## "process_group.block" 1.586 100.00 1576.8 0.000 0.00
## "process_group" 1.586 100.00 1576.8 0.000 0.00
## "rmarkdown::render" 1.586 100.00 1576.8 0.000 0.00
## "withCallingHandlers" 1.586 100.00 1576.8 0.000 0.00
## "eval" 1.584 99.87 1575.9 0.140 8.83
## "cor_screen_1" 1.584 99.87 1575.9 0.096 6.05
## "handle" 1.584 99.87 1575.9 0.000 0.00
## "timing_fn" 1.584 99.87 1575.9 0.000 0.00
## "withVisible" 1.584 99.87 1575.9 0.000 0.00
## "cor" 1.486 93.69 1484.4 0.390 24.59
## "match.arg" 0.460 29.00 456.0 0.186 11.73
## "is.data.frame" 0.298 18.79 296.5 0.298 18.79
## "stopifnot" 0.268 16.90 255.6 0.188 11.85
## "formals" 0.100 6.31 102.3 0.042 2.65
## "sys.function" 0.058 3.66 65.2 0.038 2.40
## "pmatch" 0.056 3.53 66.6 0.054 3.40
## "...elt" 0.044 2.77 43.2 0.038 2.40
## "all" 0.024 1.51 20.2 0.024 1.51
## "sys.parent" 0.020 1.26 19.2 0.020 1.26
## "c" 0.014 0.88 22.2 0.014 0.88
## "sys.frame" 0.014 0.88 13.0 0.014 0.88
## "as.character" 0.008 0.50 8.9 0.008 0.50
## "is.na" 0.008 0.50 8.3 0.008 0.50
## "is.atomic" 0.006 0.38 5.5 0.006 0.38
## "...length" 0.004 0.25 5.7 0.004 0.25
## "anyNA" 0.004 0.25 10.6 0.004 0.25
## "invisible" 0.004 0.25 1.5 0.004 0.25
## "is.matrix" 0.004 0.25 3.2 0.004 0.25
## "is.numeric" 0.004 0.25 6.9 0.004 0.25
## "which" 0.002 0.13 1.1 0.002 0.13
## "sink" 0.002 0.13 0.9 0.000 0.00
## "w$close" 0.002 0.13 0.9 0.000 0.00
##
## $sample.interval
## [1] 0.002
##
## $sampling.time
## [1] 1.586
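When run through knitr, the summary is dominated by knitr's own calls. A minimal sketch of a way to keep the output manageable is to write the profile to a file and look only at the top of `$by.self`. The function `busy()` is a made-up workload and the file is a temporary one; neither comes from the original notes.

```r
# Made-up workload that keeps the CPU busy for about `secs` seconds.
busy = function(secs = 0.5) {
  t0 = proc.time()[["elapsed"]]
  while (proc.time()[["elapsed"]] - t0 < secs) sum(sqrt(runif(1e4)))
  invisible(NULL)
}

prof_file = tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start sampling the call stack
busy()
Rprof(NULL)                         # stop profiling

prof = summaryRprof(prof_file)
head(prof$by.self, 3)               # functions with the most self time
prof$sampling.time
unlink(prof_file)
```

Rows of `$by.self` are ordered by self time, so the first few rows usually identify where the time is actually spent.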
Here are two other options for the memory parameter.
## index: "rmarkdown::render":"knitr::knit"
## vsize.small max.vsize.small vsize.large max.vsize.large
## 67539 4328616 1691032 975262272
## nodes max.nodes duplications tot.duplications
## 1615126 42318304 505 400192
## samples
## 793
profvis to visualize profiling information
The profvis package is built on Rprof() but aims to provide more useful summary information.