Matrix decompositions play a fundamental role in data science and statistics.
For one, they are typically used to solve least-squares problems; as a consequence, they also play a role in generalized linear models, where the loss function is optimized numerically via the iteratively re-weighted least squares algorithm.
Matrix decompositions also have a prominent role in multivariate data analysis. In particular, they form the basis for principal components analysis and related dimension reduction techniques.
In this lecture we will define and discuss some of the most prominent matrix decompositions, including:
- the LU decomposition
- the Cholesky decomposition
- the QR decomposition
- the singular value decomposition (SVD)
- the spectral (eigen) decomposition
The LU decomposition of a square matrix \(A\) is
\[ A = LU \] where \(L\) is lower-triangular and \(U\) is upper-triangular. It is commonly used to solve systems of linear equations and to find the determinant of a matrix.
Sometimes you will see this written as \(A = LDU\), where \(D\) is diagonal and both \(L\) and \(U\) are unit triangular, i.e. have diagonal elements all equal to one. When the factorization is written as \(A = LU\), it is conventional to require \(L\) to be unit lower-triangular and to absorb \(D\) into \(U\).
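As a quick illustration (a minimal sketch using the Matrix package, which is introduced more fully below; the objects A, f, L, U, D, and U1 here are just for this example), the diagonal factor \(D\) can be pulled out of \(U\) after the fact:
# Sketch: recovering the A = LDU form from an LU factorization.
A = matrix(c(2, 4, 1, 3), 2, 2)
f = Matrix::expand(Matrix::lu(A))
L = as.matrix(f$L); U = as.matrix(f$U)   # L is unit lower-triangular
D  = diag(diag(U))                       # the diagonal of U becomes D
U1 = diag(1 / diag(U)) %*% U             # rescale rows so U1 is unit upper-triangular
all.equal(D %*% U1, U)  # should be TRUE: A factors as L D U1, up to the row permutation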
Consider the linear system \(Ax = b\) with \(A \in \mathbb{R}^{p \times p}\) and known, \(b \in \mathbb{R}^{p}\) and known, and \(x \in \mathbb{R}^{p}\) unknown. Then, we can solve for \(x\) by first using forward substitution to solve,
\[ Ly = b \] for \(y\) and then using backward substitution to solve
\[ Ux = y. \]
In R the Matrix package provides implementations (via S4 methods) for LU factorizations.
# Construct an example square matrix: -----------------------------------------
A = rnorm(16)
dim(A) = c(4, 4)
# Compute the lu decomposition and explore the S4 class returned: -------------
A_lu = Matrix::lu(A)
names( attributes(A_lu) )
## [1] "x" "perm" "Dimnames" "Dim" "class"
P = Matrix::expand(A_lu)$P
U = Matrix::expand(A_lu)$U
L = Matrix::expand(A_lu)$L
# Permutations: ---------------------------------------------------------------
ind = apply(P, 1, which)
ind
## [1] 1 4 3 2
all(P %*% A == A[ind, ])
## [1] TRUE
# Representation: -------------------------------------------------------------
A - {L %*% U}[ind, ]
## 4 x 4 Matrix of class "dgeMatrix"
## [,1] [,2] [,3] [,4]
## [1,] 0 0.000000e+00 0 0.000000e+00
## [2,] 0 0.000000e+00 0 1.387779e-17
## [3,] 0 -2.081668e-17 0 0.000000e+00
## [4,] 0 0.000000e+00 0 0.000000e+00
x = rnorm(4); dim(x) = c(4, 1)
b = A %*% x
# Generic S3 method: ----------------------------------------------------------
solve(A, b) - x
## [,1]
## [1,] 0.000000e+00
## [2,] 2.220446e-16
## [3,] -1.110223e-16
## [4,] 0.000000e+00
solve
## function (a, b, ...)
## UseMethod("solve")
## <bytecode: 0x7fe8edb5a990>
## <environment: namespace:base>
y = solve(L[ind, ], b)
solve(U, y) - x
## [,1]
## [1,] 0.000000e+00
## [2,] 2.220446e-16
## [3,] -1.110223e-16
## [4,] 0.000000e+00
# Viewed differently: ---------------------------------------------------------
inv_ind = order(ind)
y = forwardsolve(L, b[inv_ind])
backsolve(U, y) - x
## [,1]
## [1,] 0.000000e+00
## [2,] 2.220446e-16
## [3,] -1.110223e-16
## [4,] 0.000000e+00
The Cholesky decomposition of a symmetric, positive-definite, real-valued matrix \(\Sigma\) factors it as the product of an upper-triangular matrix \(R\) and its transpose \(R'\):
\[
\Sigma = R'R.
\]
In R we can compute the Cholesky decomposition using chol():
## example covariance matrix
sigma = c(1, .5, .2, .5, 1, .3, .2, .3, 1)
dim(sigma) = c(3,3)
sigma
## [,1] [,2] [,3]
## [1,] 1.0 0.5 0.2
## [2,] 0.5 1.0 0.3
## [3,] 0.2 0.3 1.0
## Cholesky Decomposition
R = chol(sigma)
R
## [,1] [,2] [,3]
## [1,] 1 0.5000000 0.2000000
## [2,] 0 0.8660254 0.2309401
## [3,] 0 0.0000000 0.9521905
## check that we get sigma back
t(R) %*% R
## [,1] [,2] [,3]
## [1,] 1.0 0.5 0.2
## [2,] 0.5 1.0 0.3
## [3,] 0.2 0.3 1.0
The Cholesky decomposition can be used to transform draws from an iid Gaussian distribution into draws from a multivariate normal distribution:
# generate iid normal variates
n = 1e5
p = 3
X = rnorm(n*p)
dim(X) = c(n, p)
# Correlations are approximately zero
round(cor(X), 2)
## [,1] [,2] [,3]
## [1,] 1.00 0 -0.01
## [2,] 0.00 1 0.00
## [3,] -0.01 0 1.00
# Transform to multivariate normal
M = X %*% R
round(cor(M), 2)
## [,1] [,2] [,3]
## [1,] 1.00 0.5 0.19
## [2,] 0.50 1.0 0.30
## [3,] 0.19 0.3 1.00
The Cholesky decomposition can also be used to solve systems of linear equations, such as the normal equations for a least squares regression:
\[ X'X \beta = X'Y. \] Using the Cholesky decomposition \(X'X = R'R\), the above equations can be solved very efficiently by forward and backward substitution, owing to the triangular structure of \(R'\) and \(R\). We first solve for \(z = R\beta\) using forward substitution in:
\[ R'z = X'Y \]
and then solve for \(\beta\) using back substitution in:
\[ R\beta = z. \]
Here’s how this would look in R, along with a comparison of timing relative to lm():
beta = 1:p; dim(beta) = c(p, 1)
Y = M %*% beta + rnorm(n)
## use lm to check
tm_lm = system.time({
beta_lm = coef(lm(Y~M))
})
# add column for intercept
X = cbind(1, M)
tm_chol = system.time({
# Cholesky decomposition of X'X
R = chol(t(X) %*% X)
# forward substitution
z = solve(t(R), t(X) %*% Y)
beta_chol = solve(R, z)
})
cbind('lm'=beta_lm, 'chol'=as.vector(beta_chol))
## lm chol
## (Intercept) 0.003987378 0.003987378
## M1 1.004931781 1.004931781
## M2 1.996141957 1.996141957
## M3 2.996529412 2.996529412
rbind('chol'=tm_chol, 'lm'=tm_lm)
## user.self sys.self elapsed user.child sys.child
## chol 0.005 0.001 0.009 0 0
## lm 0.200 0.011 0.218 0 0
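As an aside, the forward and backward substitution steps can also be carried out with base R's dedicated triangular solvers rather than the generic solve() calls above; here is a brief sketch (beta_chol2 is just an illustrative name) that uses the transpose argument of backsolve() for the forward-substitution step:
# Sketch: the same Cholesky-based solution using triangular solvers directly.
XtY = crossprod(X, Y)                    # X'Y
z = backsolve(R, XtY, transpose = TRUE)  # solves R'z = X'Y (forward substitution)
beta_chol2 = backsolve(R, z)             # solves R beta = z (back substitution)
range(beta_chol2 - beta_chol)            # should be numerically zero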
The Cholesky decomposition is very efficient to compute and allows systems of linear equations, such as those arising in least squares, to be solved very quickly. However, it is somewhat less numerically stable than the alternatives below and can compound rounding error when used with ill-conditioned matrices, which can result from multicollinearity in regression models.
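The issue is that forming \(X'X\) squares the condition number of the problem; a quick sketch using the X from above shows this with kappa():
# Forming X'X squares the condition number of X.
kappa(X, exact = TRUE)             # condition number of X
kappa(crossprod(X), exact = TRUE)  # approximately the square of the above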
The QR decomposition of a matrix \(X\) is the product of an orthogonal matrix \(Q\) and an upper triangular matrix \(R\):
\[ X = QR, \qquad Q'Q = I. \]
It is commonly used to solve the normal equations in linear least squares as it is numerically more stable than the Cholesky decomposition. It is also reasonably efficient to compute.
In R we can compute the QR decomposition of a matrix using the qr function:
# compute the QR decomposition
QR = qr(X)
# access Q and R
Q = qr.Q(QR)
R = qr.R(QR)
dim(qr.Q(QR))
## [1] 100000 4
dim(R)
## [1] 4 4
# R is upper triangular
R
## [,1] [,2] [,3] [,4]
## [1,] -316.2278 -1.220807 -1.22078 -0.8502981
## [2,] 0.0000 -315.995405 -158.28308 -61.0737297
## [3,] 0.0000 0.000000 -274.41712 -72.4483448
## [4,] 0.0000 0.000000 0.00000 301.4832780
# Q is orthogonal
t(Q) %*% Q
## [,1] [,2] [,3] [,4]
## [1,] 1.000000e+00 -1.683563e-17 3.564992e-17 -2.432594e-17
## [2,] -1.683563e-17 1.000000e+00 3.969959e-17 2.634230e-17
## [3,] 3.564992e-17 3.969959e-17 1.000000e+00 3.687727e-17
## [4,] -2.432594e-17 2.634230e-17 3.687727e-17 1.000000e+00
There are two ways to solve the normal equations with a QR decomposition, starting from either 1) \(X = QR\) or 2) \(X'X = QR\). Let’s do the first:
\[ \begin{align} X'X\beta &= X'Y \\ R'Q'QR\beta & = R'Q'Y \\ R'R\beta & = R'Q'Y\\ R\beta &= Q'Y. \end{align} \] Here is an implementation in R:
tm_qrX = system.time({
QR = qr(X)
beta_qrX = solve(qr.R(QR), t(qr.Q(QR)) %*% Y)
})
cbind('lm'=beta_lm, 'qrX'=as.vector(beta_qrX))
## lm qrX
## (Intercept) 0.003987378 0.003987378
## M1 1.004931781 1.004931781
## M2 1.996141957 1.996141957
## M3 2.996529412 2.996529412
rbind('qrX' = tm_qrX, 'chol'=tm_chol, 'lm'=tm_lm)
## user.self sys.self elapsed user.child sys.child
## qrX 0.059 0.003 0.062 0 0
## chol 0.005 0.001 0.009 0 0
## lm 0.200 0.011 0.218 0 0
Similarly, we could start with a QR decomposition of \(X'X\). In this case, \[ \begin{align} X'X\beta &= X'Y \\ QR\beta & = X'Y \\ Q'QR\beta & = Q'X'Y\\ R\beta &= Q'X'Y. \end{align} \]
Here is the second method implemented in R:
tm_qrXX = system.time({
QR = qr(t(X) %*% X)
beta_qrXX= solve(qr.R(QR), t(qr.Q(QR)) %*% t(X) %*% Y)
})
cbind('lm'=beta_lm, 'qrXX'=as.vector(beta_qrXX))
## lm qrXX
## (Intercept) 0.003987378 0.003987378
## M1 1.004931781 1.004931781
## M2 1.996141957 1.996141957
## M3 2.996529412 2.996529412
rbind('qrXX' = tm_qrXX, 'qrX' = tm_qrX, 'chol'=tm_chol, 'lm'=tm_lm)
## user.self sys.self elapsed user.child sys.child
## qrXX 0.007 0.001 0.008 0 0
## qrX 0.059 0.003 0.062 0 0
## chol 0.005 0.001 0.009 0 0
## lm 0.200 0.011 0.218 0 0
The approach implemented in R's own lm.fit function (used by lm) is a QR decomposition of \(X\) itself. Although decomposing the \(n \times p\) matrix \(X\) is slower than working with the \(p \times p\) matrix \(X'X\), it is more numerically stable: \(X'X\), whose condition number is the square of the condition number of \(X\), is never formed. The Cholesky and QR-of-\(X'X\) approaches trade some of that stability for speed.
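For reference, here is a brief sketch of that approach using base R's qr.coef() and lm.fit() directly (beta_qr and beta_lmfit are just illustrative names):
# Sketch: least squares via the QR decomposition of X, as lm.fit() does internally.
beta_qr = qr.coef(qr(X), Y)   # forms Q'Y and back-solves R beta = Q'Y
beta_lmfit = lm.fit(x = X, y = Y)$coefficients
cbind('lm' = beta_lm, 'qr.coef' = as.vector(beta_qr), 'lm.fit' = as.vector(beta_lmfit))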
The singular value decomposition, or SVD, is a generalization of the eigen decomposition discussed below. The SVD breaks a matrix \(X\) into three parts: two matrices with orthonormal columns, \(U\) and \(V\), and a diagonal matrix \(D\):
\[ X = UDV'. \] By convention \(U\), \(D\), and \(V\) are ordered so that the diagonal of \(D\) is largest in the upper left and smallest in the lower right. The diagonal entries of \(D\) are called the singular values, the columns of \(U\) the left singular vectors, and the columns of \(V\) the right singular vectors.
In R, we can compute the SVD using the svd function:
M.svd = svd(M)
names(M.svd)
## [1] "d" "u" "v"
M.svd$d
## [1] 409.6556 288.9366 220.8709
dim(M.svd$u)
## [1] 100000 3
dim(M.svd$v)
## [1] 3 3
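A quick check that the factors reproduce M:
# Verify the factorization M = UDV' (up to floating-point error).
all.equal(M, with(M.svd, u %*% diag(d) %*% t(v)), check.attributes = FALSE)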
Like the previous decompositions, the SVD can be used to solve the least squares normal equations. The SVD approach is most numerically stable in the presence of multicollinearity. Here is the math using the SVD of \(X\):
\[ \begin{align} X'X\beta &= X'Y \\ (VDU')(UDV')\beta &= VDU'Y \\ VD^2V'\beta &= VDU'Y \\ DV'\beta &= U'Y \\ \beta &= VD^{-1}U'Y. \end{align} \] And here is an implementation in R:
tm_svd = system.time({
X.svd = svd(X)
beta_svd = with(X.svd,
v %*% diag(1/d) %*% t(u) %*% Y
)
})
cbind('lm'=beta_lm, 'svd'=as.vector(beta_svd))
## lm svd
## (Intercept) 0.003987378 0.003987378
## M1 1.004931781 1.004931781
## M2 1.996141957 1.996141957
## M3 2.996529412 2.996529412
rbind('svd' = tm_svd, 'qrXX' = tm_qrXX, 'qrX' = tm_qrX, 'chol'=tm_chol, 'lm'=tm_lm)
## user.self sys.self elapsed user.child sys.child
## svd 0.016 0.003 0.020 0 0
## qrXX 0.007 0.001 0.008 0 0
## qrX 0.059 0.003 0.062 0 0
## chol 0.005 0.001 0.009 0 0
## lm 0.200 0.011 0.218 0 0
Let’s do a quick comparison to illustrate the relative speed of the various means for solving the normal equations \(X'X\beta = X'Y\).
First, we write functions to obtain \(\beta\) from \(X\) and \(Y\) using each method above.
## Functions
b_chol <- function(X, Y){
R = chol(t(X) %*% X)
z = solve(t(R), t(X) %*% Y)
solve(R, z)
}
b_chol2 <- function(X, Y){
R = chol( crossprod(X) )
z = solve(t(R), crossprod(X, Y))
solve(R, z)
}
b_qrX <- function(X, Y){
QR = qr(X)
solve(qr.R(QR), t(qr.Q(QR)) %*% Y)
}
b_qrXX_method <- function(X, Y){
solve.qr( qr( crossprod(X) ), crossprod(X, Y) )
}
b_qrXX <- function(X, Y){
QR = qr(t(X) %*% X)
solve(qr.R(QR), t(qr.Q(QR)) %*% t(X) %*% Y)
}
b_svd <- function(X, Y){
with(svd(X),
v %*% diag(1/d) %*% t(u) %*% Y
)
}
Next, we use the microbenchmark package to compare the computations for a particular set of \(X\) and \(Y\):
library(microbenchmark)
n = 1e3
p = 25
X = rnorm(n*p); dim(X) = c(n, p)
Y = rnorm(n)
mb = microbenchmark(
b_chol(X, Y),
b_chol2(X, Y),
b_qrX(X, Y),
b_qrXX(X, Y),
b_qrXX_method(X, Y),
b_svd(X, Y)
)
print(mb, digits=3, order='median' )
## Unit: microseconds
## expr min lq mean median uq max neval cld
## b_chol2(X, Y) 420 441 553 465 502 5728 100 a
## b_qrXX_method(X, Y) 434 466 585 493 546 3566 100 a
## b_chol(X, Y) 611 662 1038 765 979 5285 100 a
## b_qrXX(X, Y) 1120 1216 1729 1379 1650 6558 100 a
## b_qrX(X, Y) 1712 1988 3598 2654 3748 69723 100 b
## b_svd(X, Y) 2498 2642 4252 2997 3930 82087 100 b
ggplot2::autoplot(mb)
dim(mb)
## [1] 600 2
Next we compare these methods in terms of their numerical stability by sampling ill-conditioned matrices \(X\). We will do this by generating design matrices whose columns are highly correlated (the correlation is controlled by the parameter rho below), so that \(X'X\) is nearly singular.
sim_data = function(n, p, rho){
# Compute Chol of Correlation Matrix
sigma = matrix(rho, p, p)
diag(sigma) = 1
R = chol(sigma)
# Compute data
X = rnorm(n*p); dim(X) = c(n, p)
X = X %*% R
Y = X %*% matrix(1, nrow = p) + rnorm(n)
cn = with(svd(X), d[1]/d[p])
list(X=X,Y=Y,cn=cn)
}
sim_beta = function(n, p, rho){
handle_fail = function(error) rep(NA, p)
with(sim_data(n, p, rho),
tibble(
chol = mean( { tryCatch(b_chol(X=X, Y=Y), error=handle_fail) -1 }^2),
qrX = mean( { tryCatch(b_qrX(X=X, Y=Y), error=handle_fail) - 1 }^2),
qrXX = mean( { tryCatch(b_qrXX(X=X, Y), error=handle_fail) - 1 }^2),
svd = mean( { tryCatch(b_svd(X, Y), error=handle_fail) - 1 }^2),
cn = cn
)
)
}
Now we can run sim_beta() many times in parallel.
## Load libraries
library(doParallel)
library(doRNG)
library(dplyr)
## parameters
n = 1e2
p = 99
rho = .999995
## Establish Cluster
cl = makeCluster(3)
registerDoParallel(cl)
## reproducible parallel results
set.seed(42)
results =
foreach(i=1:1e3, .packages=c('dplyr'), .combine='bind_rows') %dorng%
{
sim_beta(n, p, rho)
}
stopCluster(cl)
And then summarize the result:
results %>%
summarize(chol_fail = sum(is.na(chol)),
qrX_fail = sum(is.na(qrX)),
qrXX_fail = sum(is.na(qrXX)),
svd_fail = sum(is.na(svd)),
chol_var = mean(chol, na.rm=T),
qrX_var = mean(qrX, na.rm=T),
qrXX_var = mean(qrXX, na.rm=T),
svd_var = mean(svd, na.rm=T),
cn = mean(cn, na.rm=T)
) %>%
mutate(chol_var = chol_var / svd_var,
qrX_var = qrX_var / svd_var,
qrXX_var = qrXX_var / svd_var,
svd_var = svd_var / svd_var
)
## # A tibble: 1 x 9
## chol_fail qrX_fail qrXX_fail svd_fail chol_var qrX_var qrXX_var svd_var
## <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 1.000003 1 218.9914 1
## # ... with 1 more variables: cn <dbl>
The spectral or eigen decomposition of a real-valued, symmetric positive definite matrix \(\Sigma\) consists of an orthogonal matrix of eigenvectors \(\Gamma\) and a diagonal matrix of eigenvalues \(\Lambda\) such that:
\[ \Sigma = \Gamma \Lambda \Gamma'. \] The eigen decomposition of \(X'X\) and the SVD of \(X\) are closely related. Suppose \(X = UDV'\), then
\[ X'X = VDU'UDV' = VD^2V'. \]
In R, you can compute an eigen decomposition using the eigen function:
## Generate some data
n = 100; p = 4
X = rnorm(n*p); dim(X) = c(n, p)
XX = t(X) %*% X
## Compute the Eigen decomposition
XX.eigen = eigen(XX)
XX.eigen
## eigen() decomposition
## $values
## [1] 138.94141 109.68122 91.52755 75.27914
##
## $vectors
## [,1] [,2] [,3] [,4]
## [1,] 0.88997782 -0.09240942 0.31554619 -0.3159598
## [2,] -0.26511479 -0.16818903 0.92266080 0.2238831
## [3,] -0.35942069 -0.34045877 0.04520488 -0.8676757
## [4,] -0.09203473 0.92046769 0.21698889 -0.3117445
round(with(XX.eigen, t(vectors) %*% vectors), 8)
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
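We can also verify numerically the relationship between the eigen decomposition of \(X'X\) and the SVD of \(X\) stated earlier:
# The eigenvalues of X'X are the squared singular values of X;
# the eigenvectors match the right singular vectors up to sign.
all.equal(XX.eigen$values, svd(X)$d^2)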
Eigen decompositions have many uses which we will not cover here. Two prominent uses within statistics are the dimension reduction techniques principal components analysis (PCA) and multidimensional scaling (MDS).
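As a brief sketch of the first connection: the principal component directions and variances are the eigenvectors and eigenvalues of the sample covariance matrix. (Base R's prcomp() computes the same quantities via the SVD of the centered data.)
# Sketch: PCA via the eigen decomposition of the sample covariance matrix.
Xc = scale(X, center = TRUE, scale = FALSE)  # column-centered data
pc = eigen(cov(X), symmetric = TRUE)
scores = Xc %*% pc$vectors                   # principal component scores
pc$values                                    # component variances; agree with prcomp(X)$sdev^2
# The scores agree with prcomp(X)$x up to the sign of each column.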
MDS is a technique for converting a matrix of pairwise dissimilarities into a low-dimensional map that preserves the distances as well as possible.
The first step in any MDS is choosing a (dis)similarity measure.
Often this will be a metric or distance, e.g.:
- Euclidean distance for continuous variables
- Manhattan distance or Jaccard dissimilarity for binary variables
Similarity measures can become dissimilarity measures by inverting \(x \to 1/x\) or subtracting from 1 \(x \to 1-x\).
Given a matrix of dissimilarities \(D = \{d_{ij}\}_{i,j}\), metric MDS finds new coordinates \(X = \{(x_{i1}, x_{i2})\}_i\) that minimize the “stress” (or “strain”),
\[ \left( \sum_{i,j} \left[ d_{ij} - \|x_i - x_j\| \right]^2 \right)^{1/2}. \]
The steps to perform a classical (metric) MDS are as follows (a sketch implementing them appears below):
- Obtain the matrix of squared pairwise dissimilarities.
- Double center this matrix: subtract the row and column means, add back the grand mean, and multiply by \(-1/2\).
- Compute the eigen decomposition of the double-centered matrix.
- Take as coordinates the leading eigenvectors, scaled by the square roots of the corresponding eigenvalues.
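Here is a minimal sketch of these steps by hand (the function name classical_mds is just illustrative); its coordinates agree with those returned by cmdscale(), used below, up to the sign of each column:
# Sketch: classical MDS by hand, for comparison with cmdscale().
classical_mds = function(d, k = 2) {
  D2 = as.matrix(d)^2                  # squared pairwise dissimilarities
  n  = nrow(D2)
  J  = diag(n) - matrix(1 / n, n, n)   # centering matrix
  B  = -0.5 * J %*% D2 %*% J           # double-centered matrix
  e  = eigen(B, symmetric = TRUE)
  # coordinates: leading eigenvectors scaled by the square roots of the
  # (assumed positive) leading eigenvalues
  e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
}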
The table below shows distances, in miles, between several major US cities.
data(UScitiesD) #Internal R data
UScitiesMDS <- cmdscale(UScitiesD)
mat <- as.matrix(UScitiesD)
colnames(mat) <- c('Atl','Chi','Den','Hou','LA','Mia','NY','SF','Sea','DC')
# Use Knitr to produce a nicer table.
knitr::kable(mat, format='html') %>%
kableExtra::kable_styling(bootstrap_options=c('striped','bordered'),font_size=18)
|               | Atl  | Chi  | Den  | Hou  | LA   | Mia  | NY   | SF   | Sea  | DC   |
|---------------|------|------|------|------|------|------|------|------|------|------|
| Atlanta       | 0    | 587  | 1212 | 701  | 1936 | 604  | 748  | 2139 | 2182 | 543  |
| Chicago       | 587  | 0    | 920  | 940  | 1745 | 1188 | 713  | 1858 | 1737 | 597  |
| Denver        | 1212 | 920  | 0    | 879  | 831  | 1726 | 1631 | 949  | 1021 | 1494 |
| Houston       | 701  | 940  | 879  | 0    | 1374 | 968  | 1420 | 1645 | 1891 | 1220 |
| LosAngeles    | 1936 | 1745 | 831  | 1374 | 0    | 2339 | 2451 | 347  | 959  | 2300 |
| Miami         | 604  | 1188 | 1726 | 968  | 2339 | 0    | 1092 | 2594 | 2734 | 923  |
| NewYork       | 748  | 713  | 1631 | 1420 | 2451 | 1092 | 0    | 2571 | 2408 | 205  |
| SanFrancisco  | 2139 | 1858 | 949  | 1645 | 347  | 2594 | 2571 | 0    | 678  | 2442 |
| Seattle       | 2182 | 1737 | 1021 | 1891 | 959  | 2734 | 2408 | 678  | 0    | 2329 |
| Washington.DC | 543  | 597  | 1494 | 1220 | 2300 | 923  | 205  | 2442 | 2329 | 0    |
We can use MDS to obtain a 2-dimensional map preserving distances as well as possible.
plot(UScitiesMDS, pch='', xlab='Coord1', ylab='Coord2',
xlim=-c(-1750,1250), bg='grey')
text(UScitiesMDS, attr(UScitiesD, 'Labels'))
For interpretation, it can be helpful to change the sign on the axes.
UScitiesMDS <- -1*UScitiesMDS
plot(UScitiesMDS, pch='', xlab='Coord1', ylab='Coord2', xlim=c(-1750,1250))
text(UScitiesMDS, attr(UScitiesD, 'Labels'))
You can also aid in interpretation by assigning a name to each axis using subject matter knowledge. I also recommend removing the scales as only relative distances between points are meaningful.
plot(UScitiesMDS, pch='', xlab='<<< West East >>>',
ylab='<<< South North >>>', xaxt='n', yaxt='n', xlim=c(-1750,1250))
text(UScitiesMDS, attr(UScitiesD,'Labels'))
As a second example we will compare the defensive value of MLB shortstops from 2016. We will use a collection of “advanced” defensive metrics from fangraphs.com as our starting data.
We’ll start by importing and cleaning the data.
## Read in Data ##
# Data from www.fangraphs.com #
df = read.csv('../data/SS_Def_2016.csv', stringsAsFactors = FALSE)
# save composite variables #
defMain <- df[, c('Inn','DRS','UZR','Def')]
# remove unwanted variables #
df[,c('Team','Pos','Inn','rSB','rARM','BIZ','Plays','FSR','UZR.150',
'OOZ','RZR','playerid','CPP','RPP','TZL','ARM')] <- NULL
# df2 also removes composites
df2 <- df
df2[, c('DRS','UZR','Def')] <- NULL
# convert to a matrix to please cmdscale #
defenseMat <- as.matrix(df[,-1])
rownames(defenseMat) <- df$Name
# metrics have different scales, so convert to z-scores #
zScore <- function(x) (x - mean(x)) / sd(x)
defenseZ <- apply(defenseMat, 2, zScore)
head(defenseMat)
## rGDP rGFP rPM DRS DPR RngR ErrR UZR Def
## Brandon Crawford 0 3 16 19 1.4 16.2 3.7 21.3 28.0
## Francisco Lindor -2 -2 21 17 -0.8 18.0 3.6 20.8 27.8
## Freddy Galvis 0 2 3 5 0.7 9.1 5.3 15.1 22.0
## Addison Russell -1 0 20 19 -1.1 14.5 2.0 15.4 21.9
## Andrelton Simmons 2 0 16 18 0.6 12.9 1.9 15.4 20.8
## Jose Iglesias 2 2 -1 3 1.9 2.6 7.2 11.6 17.6
All of these metrics are in units of ‘runs’, but have varying scales so we will work with z-scores.
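Base R's scale() performs the same column-wise standardization as the zScore helper above; for example:
# Equivalent standardization using scale():
defenseZ_alt = scale(defenseMat)  # centers and scales each column
all.equal(as.vector(defenseZ_alt), as.vector(defenseZ))  # should be TRUE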
Here is a heat map of the transformed values:
## heatmap ##
cols <- colorRampPalette(c('blue','white','red'))(999)
gplots::heatmap.2(defenseZ, col=cols, trace='none', sepcolor='grey', cexRow=.5)
Our first step in MDS is computing the (Euclidean) distances between pairs of players using the z-scores:
## compute pairwise distances ##
defenseDist <- as.matrix(dist(defenseZ, diag=T, upper=T))
cols2 <- colorRampPalette(c('white','red'))(999)
gplots::heatmap.2(defenseDist, col=cols2, trace='none',
sepcolor='grey', cexRow=.5, cexCol=.5, scale='none',
symm=TRUE, dendrogram='none')
Given the distances, an MDS algorithm returns a set of coordinates which can be plotted.
## compute MDS results ##
defMDS2 <- -1*cmdscale(defenseDist, 2)
## convert to data frames ##
dfMDS2 <-
cbind(
data.frame(Player=df$Name,
Coord1=round(defMDS2[,1],2),
Coord2=round(defMDS2[,2],2)
),
df
)
# Correlation of Coord1 with original variables #
#cor(dfMDS2$Coord1,defenseZ)
library(ggplot2)
p1 <- ggplot(dfMDS2, aes(x=Coord1, y=Coord2, Name=Player)) +
geom_point(aes(col=Def))
# We can make the plot interactive using
# the plotly library
plotly::ggplotly(p1)
As before, it is generally helpful to use subject matter knowledge to create names or concepts for each coordinate. Below we look at the correlation of the first coordinate with each of the original variables.
r1 <- cor(dfMDS2$Coord1, defenseZ)
o <- order(r1)
r1ord <- r1[o]; names(r1ord) <- colnames(r1)[o]
barplot(r1ord, las=1, ylab='Correlation', main='Coordinate 1')
In this case, the first coordinate tracks overall defensive value which is closely tied to scores based on range. The second coordinate tracks other aspects of value, primarily value from turning double plays.
r2 <- cor(dfMDS2$Coord2, defenseZ)
o <- order(r2)
r2ord <- r2[o]; names(r2ord) <- colnames(r2)[o]
barplot(r2ord, las=1, ylab='Correlation', main='Coordinate 2')
Plot aspects such as color, symbol, and marker size can be used with the new coordinate system to help tell a coherent story.
# Here, we use marker size for total defensive value
# and color for one measure of value from double plays.
p3 <- ggplot(dfMDS2, aes(x=Coord1, y=Coord2, Name=Player)) +
geom_point(aes(col=DPR,size=Def)) +
xlab('Overall Defense / Range') +
ylab('Double Play Value') +
ggtitle('MLB Shortstops, 2016.')
plotly::ggplotly(p3)