The Cholesky decomposition of a symmetric, positive-definite, real-valued matrix \(\Sigma\) expresses it as the product of an upper-triangular matrix \(R\) and its transpose \(R'\) such that:
\[
\Sigma = R'R
\] In R we can compute the Cholesky decomposition using the chol function:
## example covariance matrix
sigma = c(1, .5, .2, .5, 1, .3, .2, .3, 1)
dim(sigma) = c(3,3)
sigma
## [,1] [,2] [,3]
## [1,] 1.0 0.5 0.2
## [2,] 0.5 1.0 0.3
## [3,] 0.2 0.3 1.0
## Cholesky Decomposition
R = chol(sigma)
R
## [,1] [,2] [,3]
## [1,] 1 0.5000000 0.2000000
## [2,] 0 0.8660254 0.2309401
## [3,] 0 0.0000000 0.9521905
## check that we get sigma back
t(R) %*% R
## [,1] [,2] [,3]
## [1,] 1.0 0.5 0.2
## [2,] 0.5 1.0 0.3
## [3,] 0.2 0.3 1.0
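The Cholesky factor also provides an efficient route to the inverse: base R's chol2inv() computes \((R'R)^{-1} = \Sigma^{-1}\) directly from \(R\). A quick illustrative check (not part of the original example; the printed value should be essentially zero):
# invert sigma using its Cholesky factor
sigma_inv = chol2inv(R)
# compare with the direct inverse
max(abs(sigma_inv - solve(sigma)))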
The Cholesky decomposition can be used to transform iid standard normal draws into draws from a multivariate normal distribution with covariance \(\Sigma\):
# generate iid normal variates
n = 1e5
p = 3
X = rnorm(n*p)
dim(X) = c(n, p)
# Correlations are approximately zero
round(cor(X), 2)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
# Transform to multivariate normal
M = X %*% R
round(cor(M), 2)
## [,1] [,2] [,3]
## [1,] 1.0 0.5 0.2
## [2,] 0.5 1.0 0.3
## [3,] 0.2 0.3 1.0
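If a nonzero mean vector is also needed, it can simply be added after applying the Cholesky factor. A minimal sketch, where mu and M_shift are illustrative names and values not used elsewhere in these notes:
# add an (illustrative) mean vector after the transformation
mu = c(10, 20, 30)
M_shift = X %*% R + matrix(mu, n, p, byrow = TRUE)
# column means should be close to mu; the correlations are unchanged
round(colMeans(M_shift), 1)
round(cor(M_shift), 2)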
The Cholesky decomposition can also be used to solve systems of linear equations, such as the normal equations for a least squares regression:
\[ X'X \beta = X'Y. \] Using the Cholesky decomposition \(X'X = R'R\), the above equations can be solved very efficiently using forward and backward substitution, owing to the triangularity of \(R'\) and \(R\). We first solve for \(z = R\beta\) using forward substitution in:
\[ R'z = X'Y \]
and then solve for \(\beta\) using back substitution in:
\[ R\beta = z. \]
Here’s how this would look in R, along with a comparison of timing relative to lm.
beta = 1:p; dim(beta) = c(p, 1)
Y = M %*% beta + rnorm(n)
## use lm to check
tm_lm = system.time({
beta_lm = coef(lm(Y~M))
})
# add column for intercept
X = cbind(1, M)
tm_chol = system.time({
# Cholesky decomposition of X'X
R = chol(t(X) %*% X)
# forward substitution
z = solve(t(R), t(X) %*% Y)
# back substitution
beta_chol = solve(R, z)
})
cbind('lm'=beta_lm, 'chol'=as.vector(beta_chol))
## lm chol
## (Intercept) 5.082727e-05 5.082727e-05
## M1 9.944854e-01 9.944854e-01
## M2 2.000368e+00 2.000368e+00
## M3 2.996838e+00 2.996838e+00
rbind('chol'=tm_chol, 'lm'=tm_lm)
## user.self sys.self elapsed user.child sys.child
## chol 0.005 0.000 0.005 0 0
## lm 0.194 0.008 0.203 0 0
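As a side note, solve() does not take advantage of the triangular structure of \(R\), while base R's backsolve() does; with transpose = TRUE it solves against \(R'\) without explicitly forming the transpose. A sketch of the same two solves (this variant is not the code timed above):
# forward substitution in R'z = X'Y, then back substitution in R beta = z
z_bs = backsolve(R, t(X) %*% Y, transpose = TRUE)
beta_bs = backsolve(R, z_bs)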
The Cholesky decomposition is very efficient to compute and allows systems of linear equations, such as those arising from least squares, to be solved very quickly. However, it is somewhat less numerically stable and can compound error when applied to ill-conditioned matrices, which can arise from multicollinearity in regression models.
The QR decomposition of a matrix \(X\) is the product of an orthogonal matrix \(Q\) and an upper triangular matrix \(R\):
\[ X = QR, \qquad Q'Q = I. \]
It is commonly used to solve the normal equations in linear least squares as it is numerically more stable than the Cholesky decomposition. It is also reasonably efficient to compute.
In R we can compute the QR decomposition of a matrix using the qr function:
# compute the QR decomposition
QR = qr(X)
# access Q and R
Q = qr.Q(QR)
R = qr.R(QR)
dim(qr.Q(QR))
## [1] 100000 4
dim(R)
## [1] 4 4
# R is upper triangular
R
## [,1] [,2] [,3] [,4]
## [1,] -316.2278 -1.23658 -1.219765 -0.6835824
## [2,] 0.0000 -316.26938 -158.686971 -64.5067253
## [3,] 0.0000 0.00000 273.348300 73.1714989
## [4,] 0.0000 0.00000 0.000000 -302.0640720
# Q is orthogonal
t(Q) %*% Q
## [,1] [,2] [,3] [,4]
## [1,] 1.000000e+00 -2.334084e-17 5.478779e-17 -1.017244e-17
## [2,] -2.334084e-17 1.000000e+00 -8.769502e-17 -7.084838e-17
## [3,] 5.478779e-17 -8.769502e-17 1.000000e+00 -1.468586e-17
## [4,] -1.017244e-17 -7.084838e-17 -1.468586e-17 1.000000e+00
There are two ways to solve the normal equations with a QR decomposition, using either 1) \(X = QR\) or 2) \(X'X = QR\). Let’s do the first:
\[ \begin{align} X'X\beta &= X'Y \\ R'Q'QR\beta & = R'Q'Y \\ R'R\beta & = R'Q'Y\\ R\beta &= Q'Y. \end{align} \] Here is an implementation in R:
tm_qrX = system.time({
QR = qr(X)
beta_qrX = solve(qr.R(QR), t(qr.Q(QR)) %*% Y)
})
cbind('lm'=beta_lm, 'qrX'=as.vector(beta_qrX))
## lm qrX
## (Intercept) 5.082727e-05 5.082727e-05
## M1 9.944854e-01 9.944854e-01
## M2 2.000368e+00 2.000368e+00
## M3 2.996838e+00 2.996838e+00
rbind('qrX' = tm_qrX, 'chol'=tm_chol, 'lm'=tm_lm)
## user.self sys.self elapsed user.child sys.child
## qrX 0.060 0.003 0.064 0 0
## chol 0.005 0.000 0.005 0 0
## lm 0.194 0.008 0.203 0 0
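Base R can also perform both steps of method 1 for us: qr.coef() applies \(Q'\) and does the triangular solve internally. A quick cross-check (not included in the timings above):
# least squares coefficients directly from the QR object
beta_qrcoef = qr.coef(qr(X), Y)
cbind('qrX' = as.vector(beta_qrX), 'qr.coef' = as.vector(beta_qrcoef))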
Similarly, we could start with a QR decomposition of \(X'X\). In this case, \[ \begin{align} X'X\beta &= X'Y \\ QR\beta & = X'Y \\ Q'QR\beta & = Q'X'Y\\ R\beta &= Q'X'Y. \end{align} \]
Here is the second method implemented in R:
tm_qrXX = system.time({
QR = qr(t(X) %*% X)
beta_qrXX= solve(qr.R(QR), t(qr.Q(QR)) %*% t(X) %*% Y)
})
cbind('lm'=beta_lm, 'qrXX'=as.vector(beta_qrXX))
## lm qrXX
## (Intercept) 5.082727e-05 5.082727e-05
## M1 9.944854e-01 9.944854e-01
## M2 2.000368e+00 2.000368e+00
## M3 2.996838e+00 2.996838e+00
rbind('qrXX' = tm_qrXX, 'qrX' = tm_qrX, 'chol'=tm_chol, 'lm'=tm_lm)
## user.self sys.self elapsed user.child sys.child
## qrXX 0.007 0.001 0.009 0 0
## qrX 0.060 0.003 0.064 0 0
## chol 0.005 0.000 0.005 0 0
## lm 0.194 0.008 0.203 0 0
The approach using the QR decomposition of \(X'X\) is the one implemented in the lm.fit function used by R. Using the QR decomposition of \(X\) is not favored in practice as it is less numerically stable than other approaches, including the Cholesky decomposition.
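For reference, lm.fit() can be called directly on a design matrix, skipping the formula and model-frame overhead in lm(). A minimal sketch using the X (which already contains an intercept column) and Y from above; how much of the lm() time this saves will depend on the problem:
fit = lm.fit(x = X, y = Y)
fit$coefficients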
The singular value decomposition, or SVD, is a generalized version of the eigen decomposition discussed below. The SVD breaks a matrix \(X\) into three parts: two matrices with orthonormal columns, \(U\) and \(V\), and a diagonal matrix \(D\):
\[ X = UDV' \] By convention \(U\), \(D\), and \(V\) are ordered so that the diagonal of \(D\) is largest in the upper left and smallest in the lower right. The diagonal entries of \(D\) are called singular values, the columns of \(U\) are called left singular vectors, and the columns of \(V\) right singular vectors.
In R, we can compute the SVD using the svd function:
M.svd = svd(M)
names(M.svd)
## [1] "d" "u" "v"
M.svd$d
## [1] 411.1463 288.2727 220.3318
dim(M.svd$u)
## [1] 100000 3
dim(M.svd$v)
## [1] 3 3
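As a quick sanity check (not in the original notes), multiplying the three pieces back together should reproduce M up to floating point error:
# reconstruct M from U, D, and V'; the printed value should be essentially zero
max(abs(M - with(M.svd, u %*% diag(d) %*% t(v))))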
Like the previous decompositions, the SVD can be used to solve the least squares normal equations. The SVD approach is the most numerically stable in the presence of multicollinearity. Here is the math using the SVD of \(X\):
\[ \begin{align} X'X\beta &= X'Y \\ (VD'U')UDV' \beta &= VD'U'Y \\ VD'DV'\beta &= VD'U'Y \\ DV'\beta &= U'Y \\ \beta &= VD^{-1}U'Y \end{align} \] And here is an implementation in R:
tm_svd = system.time({
X.svd = svd(X)
beta_svd = with(X.svd,
v %*% diag(1/d) %*% t(u) %*% Y
)
})
cbind('lm'=beta_lm, 'svd'=as.vector(beta_svd))
## lm svd
## (Intercept) 5.082727e-05 5.082727e-05
## M1 9.944854e-01 9.944854e-01
## M2 2.000368e+00 2.000368e+00
## M3 2.996838e+00 2.996838e+00
rbind('svd' = tm_svd, 'qrXX' = tm_qrXX, 'qrX' = tm_qrX, 'chol'=tm_chol, 'lm'=tm_lm)
## user.self sys.self elapsed user.child sys.child
## svd 0.017 0.004 0.022 0 0
## qrXX 0.007 0.001 0.009 0 0
## qrX 0.060 0.003 0.064 0 0
## chol 0.005 0.000 0.005 0 0
## lm 0.194 0.008 0.203 0 0
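In practice, SVD-based solvers often guard against a nearly singular \(X\) by dropping singular values below a tolerance, which is essentially how the Moore-Penrose pseudoinverse in MASS::ginv() behaves. A minimal sketch of that idea; the function name b_svd_tol and the tolerance are illustrative choices, not part of the original code:
b_svd_tol = function(X, Y, tol = 1e-10) {
  s = svd(X)
  # keep only singular values above a relative tolerance
  keep = s$d > tol * s$d[1]
  s$v[, keep, drop = FALSE] %*% ((t(s$u[, keep, drop = FALSE]) %*% Y) / s$d[keep])
}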
Let’s do a quick comparison to illustrate the relative speed of the various methods for solving the normal equations \(X'X\beta = X'Y\).
First, we write functions to obtain \(\beta\) from \(X\) and \(Y\) using each method above.
## Functions
b_chol <- function(X, Y){
R = chol(t(X) %*% X)
z = solve(t(R), t(X) %*% Y)
solve(R, z)
}
b_qrXX <- function(X, Y){
QR = qr(X)
solve(qr.R(QR), t(qr.Q(QR)) %*% Y)
}
b_qrX <- function(X, Y){
QR = qr(t(X) %*% X)
solve(qr.R(QR), t(qr.Q(QR)) %*% t(X) %*% Y)
}
b_svd <- function(X, Y){
with(svd(X),
v %*% diag(1/d) %*% t(u) %*% Y
)
}
Next, we use the microbenchmark package to compare the computations for a particular set of \(X\) and \(Y\):
library(microbenchmark)
n = 1e3
p = 25
X = rnorm(n*p); dim(X) = c(n, p)
Y = rnorm(n)
mb = microbenchmark(
b_chol(X, Y),
b_qrX(X, Y),
b_qrXX(X, Y),
b_svd(X, Y)
)
print(mb, digits=3, order='median' )
## Unit: microseconds
## expr min lq mean median uq max neval
## b_chol(X, Y) 596 659 1015 806 978 4595 100
## b_qrX(X, Y) 1104 1215 1480 1379 1513 5033 100
## b_qrXX(X, Y) 1747 1940 3070 2436 3003 52119 100
## b_svd(X, Y) 2478 2675 3215 2958 3680 6452 100
ggplot2::autoplot(mb)
dim(mb)
## [1] 400 2
Next we compare these methods in terms of their numerical stability by sampling ill-conditioned matrices \(X\). We will do this by generating the columns of \(X\) with a common pairwise correlation rho close to one (using the Cholesky transformation from earlier) and recording the condition number of each simulated \(X\):
sim_data = function(n, p, rho){
# Compute Chol of Correlation Matrix
sigma = matrix(rho, p, p)
diag(sigma) = 1
R = chol(sigma)
# Compute data
X = rnorm(n*p); dim(X) = c(n, p)
X = X %*% R
Y = X %*% rep(1, p) + rnorm(n) # true coefficients are all equal to 1
cn = with(svd(X), d[1]/d[p])
list(X=X,Y=Y,cn=cn)
}
sim_beta = function(n, p, rho){
handle_fail = function(error) rep(NA, p)
with(sim_data(n, p, rho),
tibble(
chol = mean( { tryCatch(b_chol(X=X, Y=Y), error=handle_fail) -1 }^2),
qrX = mean( { tryCatch(b_qrX(X=X, Y=Y), error=handle_fail) - 1 }^2),
qrXX = mean( { tryCatch(b_qrXX(X=X, Y), error=handle_fail) - 1 }^2),
svd = mean( { tryCatch(b_svd(X, Y), error=handle_fail) - 1 }^2),
cn = cn
)
)
}
Now we can run sim_beta() many times in parallel.
## Load libraries
library(doParallel)
library(doRNG)
library(dplyr)
## parameters
n = 1e2
p = 99
rho = .9999
## Establish Cluster
cl = makeCluster(3)
registerDoParallel(cl)
## reproducible parallel results
set.seed(9)
results =
foreach(i=1:1e3, .packages=c('dplyr'), .combine='bind_rows') %dorng%
{
sim_beta(n, p, rho)
}
stopCluster(cl)
And then summarize the result:
results %>%
summarize(chol_fail = sum(is.na(chol)),
qrX_fail = sum(is.na(qrX)),
qrXX_fail = sum(is.na(qrXX)),
svd_fail = sum(is.na(svd)),
chol_var = mean(chol, na.rm=T),
qrX_var = mean(qrX, na.rm=T),
qrXX_var = mean(qrXX, na.rm=T),
svd_var = mean(svd, na.rm=T),
cn = mean(cn, na.rm=T)
) %>%
mutate(chol_var = chol_var / svd_var,
qrX_var = qrX_var / svd_var,
qrXX_var = qrXX_var / svd_var,
svd_var = svd_var / svd_var
)
## # A tibble: 1 x 9
## chol_fail qrX_fail qrXX_fail svd_fail chol_var qrX_var qrXX_var svd_var
## <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 1.000003 218.9914 1 1
## # ... with 1 more variables: cn <dbl>
The spectral or eigen decomposition of a real-valued, symmetric positive definite matrix \(\Sigma\) consists of an orthogonal matrix of eigenvectors \(\Gamma\) and a diagonal matrix of eigenvalues \(\Lambda\) such that:
\[ \Sigma = \Gamma \Lambda \Gamma'. \] The eigen decomposition of \(X'X\) and the SVD of \(X\) are closely related. Suppose \(X = UDV'\), then
\[ X'X = VDU'UDV' = VD^2V'. \]
In R, you can compute an eigen decomposition using the eigen function:
## Generate some data
n = 100; p = 4
X = rnorm(n*p); dim(X) = c(n, p)
XX = t(X) %*% X
## Compute the Eigen decomposition
XX.eigen = eigen(XX)
XX.eigen
## eigen() decomposition
## $values
## [1] 138.94141 109.68122 91.52755 75.27914
##
## $vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.88997782 -0.09240942 0.31554619 -0.3159598
## [2,] -0.26511479 -0.16818903 0.92266080 0.2238831
## [3,] -0.35942069 -0.34045877 0.04520488 -0.8676757
## [4,] -0.09203473 0.92046769 0.21698889 -0.3117445
round(with(XX.eigen, t(vectors) %*% vectors), 8)
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
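We can also verify the connection to the SVD noted above: the eigenvalues of \(X'X\) should equal the squared singular values of \(X\) (a quick numerical check, not in the original notes):
sv = svd(X)
# the printed value should be essentially zero
max(abs(XX.eigen$values - sv$d^2))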
Eigen decompositions have many uses which we will not cover here. Two prominent uses within statistics are in dimensionality reduction: principal components analysis (PCA) and multidimensional scaling (MDS).
MDS is a technique for converting a matrix of pairwise dissimilarities into a low-dimensional map that preserves the distances as well as possible.
The first step in any MDS is choosing a (dis)similarity measure.
Often this will be a metric or distance, for example:
- Euclidean distance for continuous variables
- Manhattan distance or Jaccard dissimilarity for binary variables
Similarity measures can be converted to dissimilarity measures by inverting them (\(x \to 1/x\)) or by subtracting them from 1 (\(x \to 1-x\)).
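For example, base R's dist() computes both distances mentioned above (and a Jaccard-type dissimilarity for binary data via method = 'binary'), and a similarity such as a correlation can be flipped into a dissimilarity as just described. Z below is a small illustrative matrix, not data used elsewhere in these notes:
# a toy matrix of 4 observations on 3 continuous variables
Z = matrix(rnorm(12), 4, 3)
dist(Z, method = 'euclidean')
dist(Z, method = 'manhattan')
# Jaccard-type dissimilarity after dichotomizing
dist(Z > 0, method = 'binary')
# turning a similarity (correlation between rows) into a dissimilarity
1 - cor(t(Z))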
Given a matrix of dissimilarities \(D = \{d_{ij}\}_{i,j}\), metric MDS finds new coordinates \(X = \{(x_{i1}, x_{i2})\}_i\) that minimize the “stress” (or “strain”), \[ \left[\sum_{i,j} \left(d_{ij} - \|x_i - x_j\|\right)^2\right]^{1/2}. \]
The steps to perform a classical MDS are (a from-scratch sketch follows this list):
- Obtain a matrix of squared pairwise dissimilarities
- Double center this matrix by subtracting the row/column mean from each row/column
- Compute the eigen decomposition of the double-centered matrix
- Use the leading eigenvectors, scaled by the square roots of their eigenvalues, as the new coordinates
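Here is a from-scratch sketch of those steps on a toy configuration, checked against the built-in cmdscale(); the names Z2, D2, B and the toy data are illustrative only. Classical MDS coordinates are only determined up to sign flips of the axes, so the comparison uses absolute values:
set.seed(1)
Z2 = matrix(rnorm(10 * 3), 10, 3)           # true 3-d coordinates
D2 = as.matrix(dist(Z2))^2                  # squared pairwise dissimilarities
J = diag(10) - matrix(1 / 10, 10, 10)       # centering matrix
B = -0.5 * J %*% D2 %*% J                   # double-centered matrix
eB = eigen(B)
# leading eigenvectors scaled by sqrt(eigenvalues) give the 2-d coordinates
coords = eB$vectors[, 1:2] %*% diag(sqrt(eB$values[1:2]))
# the printed value should be essentially zero (up to axis sign flips)
max(abs(abs(coords) - abs(cmdscale(dist(Z2), k = 2))))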
The table below shows distances, in miles, between several major US cities.
data(UScitiesD) #Internal R data
UScitiesMDS <- cmdscale(UScitiesD)
mat <- as.matrix(UScitiesD)
colnames(mat) <- c('Atl','Chi','Den','Hou','LA','Mia','NY','SF','Sea','DC')
# Use Knitr to produce a nicer table.
knitr::kable(mat, format='html') %>%
kableExtra::kable_styling(bootstrap_options=c('striped','bordered'),font_size=18)
|              | Atl  | Chi  | Den  | Hou  | LA   | Mia  | NY   | SF   | Sea  | DC   |
|--------------|------|------|------|------|------|------|------|------|------|------|
| Atlanta      | 0    | 587  | 1212 | 701  | 1936 | 604  | 748  | 2139 | 2182 | 543  |
| Chicago      | 587  | 0    | 920  | 940  | 1745 | 1188 | 713  | 1858 | 1737 | 597  |
| Denver       | 1212 | 920  | 0    | 879  | 831  | 1726 | 1631 | 949  | 1021 | 1494 |
| Houston      | 701  | 940  | 879  | 0    | 1374 | 968  | 1420 | 1645 | 1891 | 1220 |
| LosAngeles   | 1936 | 1745 | 831  | 1374 | 0    | 2339 | 2451 | 347  | 959  | 2300 |
| Miami        | 604  | 1188 | 1726 | 968  | 2339 | 0    | 1092 | 2594 | 2734 | 923  |
| NewYork      | 748  | 713  | 1631 | 1420 | 2451 | 1092 | 0    | 2571 | 2408 | 205  |
| SanFrancisco | 2139 | 1858 | 949  | 1645 | 347  | 2594 | 2571 | 0    | 678  | 2442 |
| Seattle      | 2182 | 1737 | 1021 | 1891 | 959  | 2734 | 2408 | 678  | 0    | 2329 |
| Washington.DC| 543  | 597  | 1494 | 1220 | 2300 | 923  | 205  | 2442 | 2329 | 0    |
We can use MDS to obtain a 2-dimensional map preserving distances as well as possible.
plot(UScitiesMDS, pch='', xlab='Coord1', ylab='Coord2',
xlim=-c(-1750,1250), bg='grey')
text(UScitiesMDS, attr(UScitiesD, 'Labels'))
For interpretation, it can be helpful to change the sign on the axes.
UScitiesMDS <- -1*UScitiesMDS
plot(UScitiesMDS, pch='', xlab='Coord1', ylab='Coord2', xlim=c(-1750,1250))
text(UScitiesMDS, attr(UScitiesD, 'Labels'))
You can also aid in interpretation by assigning a name to each axis using subject matter knowledge. I also recommend removing the scales as only relative distances between points are meaningful.
plot(UScitiesMDS, pch='', xlab='<<< West East >>>',
ylab='<<< South North >>>', xaxt='n', yaxt='n', xlim=c(-1750,1250))
text(UScitiesMDS, attr(UScitiesD,'Labels'))
As a second example we will compare the defensive value of MLB shortstops from 2016. We will use a collection of “advanced” defensive metrics from fangraphs.com as our starting data.
We’ll start by importing and cleaning the data.
## Read in Data ##
# Data from www.fangraphs.com #
df = read.csv('./SS_Def_2016.csv', stringsAsFactors = FALSE)
# save composite variables #
defMain <- df[, c('Inn','DRS','UZR','Def')]
# remove unwanted variables #
df[,c('Team','Pos','Inn','rSB','rARM','BIZ','Plays','FSR','UZR.150',
'OOZ','RZR','playerid','CPP','RPP','TZL','ARM')] <- NULL
# df2 also removes composites
df2 <- df
df2[, c('DRS','UZR','Def')] <- NULL
# convert to a matrix to please cmdscale #
defenseMat <- as.matrix(df[,-1])
rownames(defenseMat) <- df$Name
# metrics have different scales, so convert to z-scores #
zScore <- function(x) (x - mean(x)) / sd(x)
defenseZ <- apply(defenseMat, 2, zScore)
head(defenseMat)
## rGDP rGFP rPM DRS DPR RngR ErrR UZR Def
## Brandon Crawford 0 3 16 19 1.4 16.2 3.7 21.3 28.0
## Francisco Lindor -2 -2 21 17 -0.8 18.0 3.6 20.8 27.8
## Freddy Galvis 0 2 3 5 0.7 9.1 5.3 15.1 22.0
## Addison Russell -1 0 20 19 -1.1 14.5 2.0 15.4 21.9
## Andrelton Simmons 2 0 16 18 0.6 12.9 1.9 15.4 20.8
## Jose Iglesias 2 2 -1 3 1.9 2.6 7.2 11.6 17.6
All of these metrics are in units of ‘runs’, but have varying scales so we will work with z-scores.
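As a side note, base R's scale() performs the same column-wise standardization in one call, so it should agree with the apply()/zScore approach used above:
# column z-scores via scale(); all.equal() should return TRUE
defenseZ_alt = scale(defenseMat)
all.equal(as.vector(defenseZ_alt), as.vector(defenseZ))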
Here is a heat map of the transformed values:
## heatmap ##
cols <- colorRampPalette(c('blue','white','red'))(999)
gplots::heatmap.2(defenseZ, col=cols, trace='none', sepcolor='grey', cexRow=.5)
Our first step in MDS is computing the (Euclidean) distances between pairs of players using the z-scores:
## compute pairwise distances ##
defenseDist <- as.matrix(dist(defenseZ, diag=T, upper=T))
cols2 <- colorRampPalette(c('white','red'))(999)
gplots::heatmap.2(defenseDist, col=cols2, trace='none',
sepcolor='grey', cexRow=.5, cexCol=.5, scale='none',
symm=TRUE, dendrogram='none')
Given the distances, an MDS algorithm returns a set of coordinates which can be plotted.
## compute MDS results ##
defMDS2 <- -1*cmdscale(defenseDist, 2)
## convert to data frames ##
dfMDS2 <-
cbind(
data.frame(Player=df$Name,
Coord1=round(defMDS2[,1],2),
Coord2=round(defMDS2[,2],2)
),
df
)
# Correlation of Coord1 with original variables #
#cor(dfMDS2$Coord1,defenseZ)
library(ggplot2)
p1 <- ggplot(dfMDS2, aes(x=Coord1, y=Coord2, Name=Player)) +
geom_point(aes(col=Def))
# We can make the plot interactive using
# the plotly library
plotly::ggplotly(p1)
As before, it is generally helpful to use subject matter knowledge to create names or concepts for each coordinate. Below we look at the correlation of the first coordinate with each of the original variables.
r1 <- cor(dfMDS2$Coord1, defenseZ)
o <- order(r1)
r1ord <- r1[o]; names(r1ord) <- colnames(r1)[o]
barplot(r1ord, las=1, ylab='Correlation', main='Coordinate 1')
In this case, the first coordinate tracks overall defensive value which is closely tied to scores based on range. The second coordinate tracks other aspects of value, primarily value from turning double plays.
r2 <- cor(dfMDS2$Coord2, defenseZ)
o <- order(r2)
r2ord <- r2[o]; names(r2ord) <- colnames(r2)[o]
barplot(r2ord, las=1, ylab='Correlation', main='Coordinate 2')
Plot aspects such as color, symbol, and marker size can be used with the new coordinate system to help tell a coherent story.
# Here, we use marker size for total defensive value
# and color for one measure of value from double plays.
p3 <- ggplot(dfMDS2, aes(x=Coord1, y=Coord2, Name=Player)) +
geom_point(aes(col=DPR,size=Def)) +
xlab('Overall Defense / Range') +
ylab('Double Play Value') +
ggtitle('MLB Shortstops, 2016.')
plotly::ggplotly(p3)