6/6/2017

Overview

  • Dimension reduction refers to transforming data with many related variables into data with fewer variables for the purpose of visualization and analysis.

  • Common methods we will discuss today:
    • Principal Components Analysis (PCA)
    • Factor Analysis (FA)
    • Multidimensional Scaling (MDS)

Objectives

  • Participants will be familiar with common techniques for dimension reduction
  • Participants will understand the similarities and differences among the presented techniques
  • I will use case studies and examples to introduce methods while keeping mathematical details to a minimum.

Examples

Comparison of methods

  • In a principal components analysis we create a new set of uncorrelated variables which are linear combinations of the original variables. These new variables - the principal components - are chosen sequentially to maximize the variance explained.

  • In a factor analysis we look for a small number of latent factors which explain the correlation structure among the original variables.

  • Multidimensional scaling is a technique for converting a matrix of pairwise distances into a low-dimensional map that preserves the distances as well as possible.

Principal Components Analysis

PCA Toy Example

Diabetes

  • Our first toy example uses data on 145 adults from three groups: controls, 'chemical' diabetics, and overt diabetics.
  • Here is a correlation matrix for the three variables: glucose, insulin, and steady state plasma glucose (sspg):
##         glucose insulin  sspg
## glucose    1.00    0.96 -0.40
## insulin    0.96    1.00 -0.35
## sspg      -0.40   -0.35  1.00

Scatter Plot Matrix

  • Here is a scatter plot matrix.

Three variables in two dimensions

  • The high correlation between insulin and glucose indicates that most of the variation in the data occurs along two directions.

Three variables in two dimensions

  • The goal of PCA is to find the directions of maximum variation.

Mathematical Details

  • Consider a data matrix \(D\) of continuous variables (assumed centered).
    Mathematically, PCA solves the following problem:

\[ w_1 := \arg\max_{||w||=1} \textrm{var}(Dw) = \arg\max_{||w||=1} w'D'Dw \]

\[ w_k := \arg\max_{||w||=1} \textrm{var}(D_kw) \]

where \(D_k := D - \sum_{j<k} Dw_jw_j'\) is the residual relative to the first \(k-1\) components.

  • In practice, this is accomplished using (see the sketch below):
    • The eigen-decomposition of the covariance matrix cov(\(D\)) or correlation matrix cor(\(D\))
    • The singular value decomposition of \(D\)
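  • A minimal sketch in R, assuming a data frame D of continuous variables (the name is illustrative):

e <- eigen(cor(D))               # route 1: eigen-decomposition of the correlation matrix
e$values                         # component variances (eigenvalues)
e$vectors                        # loadings (the weights w)

pca <- prcomp(D, scale. = TRUE)  # route 2: the SVD, as used by prcomp(); scale. = TRUE works with correlations
summary(pca)                     # explained variance per component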

Eigen-decomposition

  • This is the eigen-decomposition of a (symmetric) matrix: \[ \Sigma = \Gamma \Lambda \Gamma' \]
  • The eigenvectors \(\Gamma\) (loadings) can be viewed as weights for creating new orthogonal variables from the original variables: \(D_{new} = D\Gamma\).
  • The eigenvalues for each component are proportional to the explained variance.

Explained Variance

  • The eigenvalues for each component are proportional to the explained variance.
  • The diabetes correlation matrix has eigenvalues: 2.196, 0.769, 0.035
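  • Since the correlation matrix is printed above, these eigenvalues can be reproduced directly; a short sketch:

R <- matrix(c( 1.00,  0.96, -0.40,
               0.96,  1.00, -0.35,
              -0.40, -0.35,  1.00), nrow = 3, byrow = TRUE,
            dimnames = list(c("glucose", "insulin", "sspg"),
                            c("glucose", "insulin", "sspg")))
e <- eigen(R)
e$values                  # approximately 2.196, 0.769, 0.035
e$values / sum(e$values)  # proportion of variance explained by each component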

Component Loadings

  • Each of the new components is a linear combination of the original variables
  • These are the weights for the first two components:

Plotting the new variables

  • After dimension reduction, we can create a scatter plot of the new variables.

Biplots

  • You will often see the component loadings and new variables portrayed as a 'biplot'.
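  • A sketch, continuing the hypothetical prcomp() object pca from the earlier code:

biplot(pca, cex = 0.7)   # arrows show the loadings; points show the new variables (scores)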

Covariance vs Correlation

  • In the diabetes example, we use the eigen-decomposition of the correlation matrix because the variables have different levels of natural variation.
  • Working with correlations is equivalent to first transforming all variables into z-scores.
  • This means that each of the original variables is weighted equally.
  • When the original variables are on the same scale, it can make sense to use the eigen-decomposition of the covariance matrix so that variables with higher variance are weighted more heavily.
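  • A short sketch of this equivalence, again assuming a hypothetical data frame D:

all.equal(cor(D), cov(scale(D)))      # correlations are the covariances of the z-scores
pca_cor <- prcomp(D, scale. = TRUE)   # correlation-based PCA
pca_z   <- prcomp(scale(D))           # PCA of the z-scores: same components (up to sign)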

PCA Example

Epithelial Genes in CTC

  • In this example, the variables are eight genes (measured by qPCR) known to be epithelial markers.
  • The samples are the results of an assay for capturing circulating tumor cells (CTCs) in men with metastatic prostate cancer.
  • The goal of this analysis is to use data from positive and negative controls to create a predictive model for identifying samples with CTCs present.

Epithelial Genes

  • Here are the data as a scatter plot matrix:

Choosing not to scale

  • In this example, I chose not to scale the variables since the values were already on a normalized scale.
  • Here are the results, after flipping the sign for both components:
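  • A sketch of this choice, assuming the gene measurements are in a data frame ctc (name illustrative):

pca <- prcomp(ctc, center = TRUE, scale. = FALSE)   # covariance-based PCA: no rescaling

pca$rotation[, 1:2] <- -pca$rotation[, 1:2]   # the sign of each component is arbitrary;
pca$x[, 1:2]        <- -pca$x[, 1:2]          # flip loadings and scores together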

Defining an 'epithelial score'

  • In this case, we can use the first component as an 'epithelial score' with weights:

Defining an 'epithelial score'

  • In the actual application, this score is used to classify which samples contain CTCs.

Factor Analysis

Factor Analysis

  • Factor analysis is similar to PCA, but takes a different point of view.
  • Rather than looking for directions of maximum variation, in a factor analysis we look for a small number of latent factors to explain the relationships among the observed variables.

Example

  • For example, we might think grades in 5 high school subjects are primarily explained by students' reading comprehension and quantitative preparation.

Mathematical Details (1/2)

  • Consider a data matrix \(D\) with \(p\) variables (columns) that have been centered.
  • In a factor analysis, we look for a small number \(k\) of latent factors \(F = [F_1, \dots, F_k]\) such that: \[ D \sim FL' + \epsilon \]
  • The loading matrix \(L\) has \(p\) rows and \(k\) columns. Each row of \(L\) explains how one of the original variables is related to the latent factors.
  • The matrix of factor scores \(F\) has \(n\) rows and \(k\) columns meaning that each observation has a unique set of factor scores.
  • We usually assume the factors \(F\) are uncorrelated and have mean zero, and that \(F\) and \(\epsilon\) are independent.

Mathematical Details (2/2)

  • In a factor analysis, we look for a small number \(k\) of latent factors \(F = [F_1, \dots, F_k]\) such that: \[ D \sim FL' + \epsilon \]

  • Another way to look at this is in terms of covariance matrices, \[ \textrm{cov}(D) = LL' + \Psi \] with \(\Psi\) describing the unique or unexplained variance for each original variable.

Example: Measures of Glycemic Control

  • For diabetics, blood sugar control has important health consequences.
  • Continuous glucose monitors are instruments that measure someone's glucose levels at regular intervals (~5 min).
  • Many 'metrics' for glycemic control have been proposed for mapping this time series to a single summary statistic.

  • The following example works with a collection of metrics computed on baseline data from a JDRF clinical trial (n=443) that have been normalized using Box-Cox transformations.

Glycemic Control Metrics: Correlation

  • A heat map of the correlation matrix shows there are 2-4 distinct groups of metrics.

Examining the Loadings

  • We will begin with a model using three factors.
  • Here are the loadings (L):
                 Factor1 Factor2 Factor3
    DySF          -0.018   0.060   0.784
    Mean           0.920  -0.355   0.114
    SD             0.931   0.208   0.234
    SD.Slope       0.554   0.201   0.781
    Time.Hyper     0.942  -0.283   0.097
    Time.Hypo     -0.076   0.976   0.137
    AUC.Hyper      0.877  -0.057   0.102
    AUC.Hypo       0.126   0.871   0.126
    CONGA.4        0.876   0.252   0.329
    MODD           0.909   0.230   0.238
    ADRR           0.722   0.433   0.468
    HBGI           0.966  -0.197   0.149
    LBGI          -0.134   0.974   0.122
    M.Value        0.296  -0.198  -0.033
    MAGE           0.921   0.233   0.226
    MAG            0.518   0.256   0.772
    GRADE          0.959  -0.110   0.169
  • Recall: Metric \(\sim L_1F_1 + L_2F_2 + L_3F_3\)
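  • A minimal sketch of how loadings like these can be obtained, assuming the transformed metrics are in a data frame metrics (name illustrative):

fa <- factanal(metrics, factors = 3, scores = "regression")
print(fa$loadings, cutoff = 0)   # the loading matrix L
fa$uniquenesses                  # the diagonal of Psi (unexplained variance)
head(fa$scores)                  # factor scores F, one row per observation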

Loadings and explained variance

  • For each variable, the squared loadings tell us the proportion of its variance explained by each factor.
               Factor1 Factor2 Factor3
    DySF          0.00    0.00    0.61
    Mean          0.85    0.13    0.01
    SD            0.87    0.04    0.05
    SD.Slope      0.31    0.04    0.61
    Time.Hyper    0.89    0.08    0.01
    Time.Hypo     0.01    0.95    0.02
    AUC.Hyper     0.77    0.00    0.01
    AUC.Hypo      0.02    0.76    0.02
    CONGA.4       0.77    0.06    0.11
    MODD          0.83    0.05    0.06
    ADRR          0.52    0.19    0.22
    HBGI          0.93    0.04    0.02
    LBGI          0.02    0.95    0.01
    M.Value       0.09    0.04    0.00
    MAGE          0.85    0.05    0.05
    MAG           0.27    0.07    0.60
    GRADE         0.92    0.01    0.03

Cumulative variance

  • The overall variance explained by each factor (as a percent of the total) is simply the average of these squared loadings across variables.
                            Factor1 Factor2 Factor3
    Variance Explained         52.3    20.4    14.4
    Cumulative Variance        52.3    72.7    87.1
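  • Continuing the earlier sketch, these quantities follow directly from the loadings:

L2 <- unclass(fa$loadings)^2           # squared loadings
round(100 * colMeans(L2), 1)           # percent of total variance explained by each factor
round(100 * cumsum(colMeans(L2)), 1)   # cumulative percent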

Choosing the number of factors

  • We can use the amount of variance explained by each factor to select how many to retain.
  • There are goodness-of-fit tests as well, but these are more appropriate when designing a single measurement scale.

Choosing the number of factors

  • Adding a third factor increases the explained variance by 15%, while a fourth factor explains less than 2% of additional variance.
                  Factor 1 Factor 2 Factor 3 Factor 4
    2 factors         56.7     79.2
    3 factors         52.3     72.7     87.1
    4 factors         52.0     72.5     86.4     88.9
    (entries are cumulative percent variance explained for models with 2, 3, and 4 factors)

Plotting the explained variance

  • A bar chart of the explained variance (squared loadings) allows us to visualize the relations among the variables and latent factors.

Plotting the loadings

  • It is also useful to plot the raw loadings in factor space.

Factor scores

  • The latent factor scores can be used in downstream analyses.

Multidimensional Scaling

MDS

  • Multidimensional scaling is a technique for converting a matrix of pairwise dissimilarities into a low-dimensional map that preserves the distances as well as possible.

  • The first step in any MDS is choosing a (dis)similarity measure. Often this will be a metric or distance (see the sketch below), e.g.:
    • Euclidean distance for continuous variables
    • Manhattan distance or Jaccard dissimilarity for binary variables
  • Similarity measures can often become dissimilarity measures by inverting \(x \to 1/x\) or subtracting from 1 \(x \to 1-x\).
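  • A sketch of these choices with R's dist(), for a hypothetical data matrix X:

d_euc <- dist(X, method = "euclidean")   # continuous variables
d_man <- dist(X, method = "manhattan")
d_bin <- dist(X, method = "binary")      # Jaccard-style dissimilarity for 0/1 variables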

Mathematical Details (1/2)

  • Consider a matrix of dissimilarities \(D = \{d_{ij}\}_{i,j}\)
  • Metric MDS finds new coordinates \(X = \{(x_{i1}, x_{i2})\}_i\) that minimize the "stress" or "strain", \[ \left( \sum_{i,j} \left( d_{ij} - ||x_i - x_j|| \right)^2 \right)^{1/2}. \]

Mathematical Details (2/2)

  • The steps to perform a classical MDS are (sketched in code below):
    • Obtain a matrix of squared pairwise dissimilarities
    • Double center this matrix by subtracting the row/column mean from each row/column
    • Compute the eigen decomposition of the double-centered dissimilarities
  • In metric MDS a (possibly non-Euclidean) distance is used and the optimization problem is solved using an iterative majorization algorithm (e.g., SMACOF).
  • Non-metric MDS can be used when the dissimilarities are obtained directly rather than computed from other variables, or otherwise relate to the underlying distances nonlinearly; only their rank order is preserved.
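  • A minimal sketch of the classical MDS steps, for a hypothetical dist object d:

D2 <- as.matrix(d)^2                     # squared pairwise dissimilarities
n  <- nrow(D2)
J  <- diag(n) - matrix(1/n, n, n)        # centering matrix
B  <- -0.5 * J %*% D2 %*% J              # double-centered matrix
e  <- eigen(B)
X  <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))   # 2-d coordinates
# cmdscale(d, k = 2) carries out the same computation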

MDS Example 1

Distances between US cities

  • The table below shows distances, in miles, between several major US cities.
                   Atl  Chi  Den  Hou   LA  Mia   NY   SF  Sea   DC
    Atlanta          0  587 1212  701 1936  604  748 2139 2182  543
    Chicago        587    0  920  940 1745 1188  713 1858 1737  597
    Denver        1212  920    0  879  831 1726 1631  949 1021 1494
    Houston        701  940  879    0 1374  968 1420 1645 1891 1220
    LosAngeles    1936 1745  831 1374    0 2339 2451  347  959 2300
    Miami          604 1188 1726  968 2339    0 1092 2594 2734  923
    NewYork        748  713 1631 1420 2451 1092    0 2571 2408  205
    SanFrancisco  2139 1858  949 1645  347 2594 2571    0  678 2442
    Seattle       2182 1737 1021 1891  959 2734 2408  678    0 2329
    Washington.DC  543  597 1494 1220 2300  923  205 2442 2329    0

MDS coordinates

  • We can use MDS to obtain a 2-dimensional map preserving distances as well as possible.
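  • The distances above appear to match R's built-in UScitiesD data, so a short sketch is:

coords <- cmdscale(UScitiesD, k = 2)
plot(coords, type = "n", asp = 1, xlab = "", ylab = "")
text(coords, labels = rownames(coords))
# the axes may need a sign flip to match the usual map orientation (see the next slide)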

Orienting the axes

  • For interpretation, it can be helpful to change the sign on the axes.

Naming the axes

  • You can also aid in interpretation by assigning a name to each axis using subject matter knowledge.
  • I recommend removing the scales as only the distances between points are meaningful.

MDS Example 2

Shortstop Defense

  • As a final example we will compare the defensive value of MLB shortstops from 2016.
  • We will use a collection of advanced defensive metrics from www.fangraphs.com as our starting data.
##                   rGDP rGFP rPM DRS  DPR RngR ErrR  UZR  Def
## Brandon Crawford     0    3  16  19  1.4 16.2  3.7 21.3 28.0
## Francisco Lindor    -2   -2  21  17 -0.8 18.0  3.6 20.8 27.8
## Freddy Galvis        0    2   3   5  0.7  9.1  5.3 15.1 22.0
## Addison Russell     -1    0  20  19 -1.1 14.5  2.0 15.4 21.9
## Andrelton Simmons    2    0  16  18  0.6 12.9  1.9 15.4 20.8
## Jose Iglesias        2    2  -1   3  1.9  2.6  7.2 11.6 17.6

Choosing a scale

  • All of these metrics are in units of 'runs', but have varying scales so we will work with z-scores.
  • Here is a heat map of the transformed values:

Computing distances

  • Our first step is computing the Euclidean distance between each pair of players using the z-scores:
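  • A sketch, assuming the metrics above are in a data frame ss with one row per player (name illustrative):

z <- scale(ss)   # convert each metric to a z-score
d <- dist(z)     # Euclidean distance between each pair of players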

Plotting the new coordinates

  • Given the distances, an MDS algorithm returns a set of coordinates which can be plotted.

Naming the Coordinates (1/2)

  • As before, it is generally helpful to use subject matter knowledge to create names or concepts for each coordinate.

  • Below we look at the correlation of the first coordinate with each of the original variables.
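  • Continuing that sketch:

mds <- cmdscale(d, k = 2)    # the new coordinates
round(cor(mds[, 1], z), 2)   # correlation of coordinate 1 with each original metric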

Naming the Coordinates (2/2)

  • In this case, the first coordinate tracks overall defense value which is closely tied to scores based on range.
  • The second coordinate tracks other aspects of value, primarily value from turning double plays.

The End Result

  • Plot aspects such as color, symbol, and marker size can be used with the new coordinate system to help tell a coherent story.

Summary

  • The goal of dimension reduction is to reduce multivariate data to a smaller number of variables while maintaining and elucidating structure or relations between variables.
  • PCA and factor analysis summarize the relationships/similarities among variables using correlation or covariance.
  • MDS makes use of dissimilarities between the units of interest.
  • In their classical forms, all methods make use of a matrix decomposition to form new variables from old.
  • Often a small number of new variables suffice to account for most of the important information – this leads to dimension reduction.