Introduction to (Classical) Multidimensional Scaling

Multidimensional scaling (MDS) is a dimension reduction technique that is most useful for visualizing high-dimensional data during the exploratory phase. MDS “represents measurements of similarity (or dissimilarity) among pairs of objects as distances between points” [1]. Simply put, objects closer together (in terms of MDS coordinates) are more similar than objects farther apart. There are two types of MDS: metric, which deals with quantitative dissimilarities, and non-metric, which deals with qualitative (ordinal) dissimilarities. Within metric MDS, classical MDS involves minimizing a loss function called Strain, which is equivalent to performing eigenvalue decomposition on a Gram matrix constructed from the dissimilarity measure. When the dissimilarity measure is Euclidean distance, classical MDS is the same as Principal Components Analysis. However, the dissimilarity measure need not be Euclidean distance, as we demonstrate below. Indeed, other dissimilarity measures may do a better job of approximating the original data.

In this tutorial, we will use two R functions, torgerson from the smacof package and cmdscale from the stats package (which is available in base R), plus the mds command from Stata, to perform classical MDS.

Data Description

The data used in the following examples come from the Cars93 dataset available in R’s MASS package. The observations of the dataset are 93 randomly selected 1993 passenger car models listed in the Consumer Reports issue and PACE Buying Guide. Information contained in the dataset includes various automobile metrics such as price, horsepower, and luggage capacity. For raw data click here.

Our three examples will involve briefly cleaning the data, calculating dissimilarities among our chosen variables in four different ways, performing classical MDS, and plotting the resulting coordinates in two-dimensional space. The four measures of dissimilarity we will be using are Euclidean distance, Manhattan distance, Maximum distance, and a dissimilarity measure based on the correlation matrix.

Euclidean distance between the attributes of objects \(i\) and \(j\) is calculated as: \[ \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ik} - x_{jk})^2} \]

Manhattan distance between the attributes of objects \(i\) and \(j\) is calculated as: \[ |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ik} - x_{jk}| \]

Maximum distance between the attributes of objects \(i\) and \(j\) is calculated as: \[ \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|,\ldots,|x_{ik} - x_{jk}|) \]

The correlation-based measure of dissimilarity used by Stata (and implemented by us in R) for a matrix \(X\) is calculated as: \[ \sqrt{2\cdot(1-corr(X))} \]

The variables we will be using for MDS are presented in the table below.

Variable Description
EngineSize Engine size in liters
Horsepower Maximum horsepower
Length Length in inches
Luggage.room Luggage capacity in cubic feet
Max.Price Price for a premium version in $1,000
Min.Price Price for a basic version in $1,000
MPG.city City miles per US gallon by EPA rating
MPG.highway Highway miles per US gallon by EPA rating
Price Average of minimum and maximum price in $1,000
Rear.seat.room Rear seat room in inches
Rev.per.mile Engine revolutions per mile in highest gear
RPM Revs per minute at maximum horsepower
Turn.circle U-turn space in feet
Wheelbase Wheelbase in inches
Width Width in inches

Implementation

R (smacof & data.table)

In this section we will use R packages smacof and data.table to perform MDS on the Cars93 dataset. The smacof package is used to implement various approaches for multidimensional scaling while data.table is used for data manipulation. First, we load the appropriate libraries.

# Libraries 
library(MASS)
library(tidyverse)
library(ggplot2)
library(data.table)
library(smacof)

The dataset for this example is stored in the MASS package. The following code loads the data in as class data.table which allows for data manipulation using the data.table package.

# Get data
Cars93 = as.data.table(MASS::Cars93)

The only variables with missing values are Rear.seat.room and Luggage.room. We drop the observations containing these missing values.

Cars93 = Cars93[!is.na(Rear.seat.room) & !is.na(Luggage.room)]

Next, we select our variables and standardize them using the scale function to give all variables equal weighting.

Cars93_cont_scaled = Cars93[,.(Min.Price, Price, Max.Price,
                               MPG.city, MPG.highway, EngineSize,
                               Horsepower, RPM, Rev.per.mile,
                               Length, Wheelbase, Width,
                               Turn.circle, Rear.seat.room,
                               Luggage.room)] %>%
                      scale()

Then we create different types of dissimilarity matrices using the code below. Any of these dissimilarity measures can be used for MDS.

for (method in c('euclidean', 'manhattan', 'maximum')) {
  assign(paste("distance", method, sep='_'), dist(Cars93_cont_scaled, method = method))
}
distance_correlation = sqrt(2*(1-cor(t(Cars93_cont_scaled))))

Lastly, we perform MDS using the torgerson function from the smacof package. This function takes the dissimilarity matrix as its argument and performs classical MDS. It is worth noting that torgeson has an optional argument, p, which is used to specify the dimensionality of the resulting coordinates. We chose the default of 2 dimensions because it is easiest to visualize. However, using more dimensions can better approximate the original data. The code below performs MDS using Euclidian distance and then plots the resulting coordinates.

# Euclidean Distance
torgerson_euclidean = torgerson(distance_euclidean)
df_euclidean = data.frame(
  Z1=torgerson_euclidean[,1], Z2=torgerson_euclidean[,2],
  Model=Cars93$Model)
ggplot(df_euclidean, aes(x=Z1,y=Z2),main='Main') + 
  geom_point() + 
  geom_text(aes(label=Model), hjust=-0.2, size=3)

The code below performs MDS using Manhattan distance and then plots the resulting coordinates.

# Manhattan Distance
torgerson_manhattan = torgerson(distance_manhattan)
df_manhattan = data.frame(
  Z1=torgerson_manhattan[,1], Z2=torgerson_manhattan[,2],
  Model=Cars93$Model)
ggplot(df_manhattan, aes(x=Z1,y=Z2),main='Main') + 
  geom_point() + 
  geom_text(aes(label=Model), hjust=-0.2, size=3)

The code below performs MDS using Maximum distance and then plots the resulting coordinates.

# Maximum Distance 
torgerson_maximum = torgerson(distance_maximum)
df_maximum = data.frame(
  Z1=torgerson_maximum[,1], Z2=torgerson_maximum[,2],
  Model=Cars93$Model)
ggplot(df_maximum, aes(x=Z1,y=Z2),main='Main') + 
  geom_point() + 
  geom_text(aes(label=Model), hjust=-0.2, size=3)

The code below performs MDS using the correlation-based dissimilarity measure and then plots the resulting coordinates.

# Correlation
torgerson_correlation = torgerson(distance_correlation)
df_correlation = data.frame(
  Z1=torgerson_correlation[,1], Z2=torgerson_correlation[,2],
  Model=Cars93$Model)
ggplot(df_correlation, aes(x=Z1,y=Z2),main='Main') + 
  geom_point() + 
  geom_text(aes(label=Model), hjust=-0.2, size=3)

R (stats & dplyr)

In this section, we will use R packages stats (included in base R) and dplyr to perform MDS on the Cars93 dataset. The stats package will be used for multidimensional scaling and the dplyr package will be used for data manipulation. The dataset, Cars 93, is stored in the MASS package.

# Libraries
library(MASS)
library(tidyverse)
library(dplyr)
library(ggplot2)

The following code loads the data in as class tibble so we can manipulate the data using dplyr.

# Data
cars93 = 
  as_tibble(Cars93)

The variables ‘Rear.seat.room’ and ‘Luggage.room’ have observations with missing values so we will drop those.

# Drop missing values
cars93 = filter(cars93,!is.na(Rear.seat.room) & !is.na(Luggage.room))

We select our variables of interest and standardize them using the scale function to give all variables equal weighting.

# Select variables of interest
cars93_scaled = 
  cars93 %>%
  select(Min.Price, Price, Max.Price,
         MPG.city, MPG.highway, EngineSize,
         Horsepower, RPM, Rev.per.mile,
         Length, Wheelbase, Width,
         Turn.circle, Rear.seat.room,
         Luggage.room) %>%
  scale() %>%
  as_tibble()

Next, we create different types of dissimilarity matrices. As stated in the introduction, we are interested in Euclidean distance, Manhattan distance, Maximum distance, and a correlation matrix.

# Dissimilarity matrices
for (method in c('euclidean', 'manhattan', 'maximum')) {
  assign(paste("distance", method, sep='_'), dist(cars93_scaled, method = method))
}
distance_correlation = sqrt(2*(1-cor(t(cars93_scaled))))

Lastly, we perform MDS for each of the matrices above using the cmdscale function from the stats package. The cmdscale function accepts six arguments with the first two being ‘d’ and ‘k’. The argument ‘d’ is a distance structure or in our case, a dissimilarity matrix. The argument ‘k’ represents the dimension we wish to use to represent the data. We chose to use the default ‘k=2’ because it is the easiest to visualize.

The code below performs MDS using Euclidian distance and then plots the resulting coordinates.

# Euclidean Distance
euclidean = cmdscale(distance_euclidean, k=2) 
euclidean_df = data_frame(
  Z1=euclidean[,1], Z2=euclidean[,2],
  Model=Cars93$Model)
ggplot(euclidean_df, aes(x=Z1,y=Z2), main='Main') + 
  geom_point() + 
  geom_text(aes(label=Model), hjust=-0.2, size=3)

The code below performs MDS using Manhattan distance and then plots the resulting coordinates.

# Manhattan Distance
manhattan = cmdscale(distance_manhattan, k=2)
manhattan_df = data.frame(
  Z1=manhattan[,1], Z2=manhattan[,2],
  Model=Cars93$Model)
ggplot(manhattan_df, aes(x=Z1,y=Z2),main='Main') + 
  geom_point() + 
  geom_text(aes(label=Model), hjust=-0.2, size=3)

The code below performs MDS using Maximum distance and then plots the resulting coordinates.

# Maximum Distance 
maximum = cmdscale(distance_maximum, k=2)
maximum_df = data.frame(
  Z1=maximum[,1], Z2=maximum[,2],
  Model=Cars93$Model)
ggplot(maximum_df, aes(x=Z1,y=Z2),main='Main') + 
  geom_point() + 
  geom_text(aes(label=Model), hjust=-0.2, size=3)

The code below performs MDS using the correlation-based dissimilarity measure and then plots the resulting coordinates.

# Correlation
correlation = cmdscale(distance_correlation, k=2)
correlation_df = data.frame(
  Z1=correlation[,1], Z2=correlation[,2],
  Model=Cars93$Model)
ggplot(correlation_df, aes(x=Z1,y=Z2),main='Main') + 
  geom_point() + 
  geom_text(aes(label=Model), hjust=-0.2, size=3)

Stata (mds)

In this section we will use the Stata command mds to perform MDS on the Cars93 dataset. First, we import a CSV version of the data.

//Import data 
import delim using "Cars93.csv", varnames(1) clear

The only variables with missing values are rearseatroom and luggageroom. We drop the observations containing these missing values. Because the missing values were encoded as “NA” strings, these two variables were imported as string variables, and we have to destring these variables to make them numeric before continuing onto our analysis.

//Drop observations with missing values in rearseatroom or luggageroom
drop if rearseatroom == "NA" | luggageroom == "NA"
destring rearseatroom luggageroom, replace

This step is not necessary, but to save ourselves from having to use the lengthy list of variables twice in the upcoming mds step, we store our variables of interest in a macro. Remember that Stata macros only exist temporarily, so you would not want run this code separately from the remaining code.

//Macro of continuous variables
local varlist minprice price maxprice mpgcity mpghighway enginesize /*
    */ horsepower rpm revpermile fueltankcapacity length wheelbase width /*
    */ turncircle rearseatroom luggageroom weight

Lastly, we perform MDS using the mds command. This command uses our list of variables plus some very important options. The ‘id’ option is used to identify unique observations and label them when plotted. The ‘std’ option standardizes the variables so that they are all given equal weight. We also rely on the default value of ‘classical’ for the ‘method’ option to ensure classical MDS is performed. Alternatively we could have specified ‘strain’ for the ‘loss’ option and ‘identity’ for the ‘transform’ option to perform classical MDS. Finally, you will see us use the ‘measure’ option to specify the dissimarlity measure (the default is Euclidean). It is also worth noting that mds has an additional option, dimension, which is used to specify the dimensionality of the resulting coordinates. We chose the default of 2 dimensions because it is easiest to visualize. However, using more dimensions can better approximate the original data. The code below performs MDS using Euclidian distance and then plots the resulting coordinates.

//MDS: Euclidean distances
mds `varlist', id(model) std(`varlist') noplot
mdsconfig, autoaspect ynegate msize(vsmall) mlabsize(vsmall)
Euclidean

Euclidean

The code below performs MDS using Manhattan distance (as the ‘measure’ option is specificed as ‘manhattan’) and then plots the resulting coordinates.

//MDS: Manhattan distances
mds `varlist', id(model) std(`varlist') measure(manhattan) noplot
mdsconfig, autoaspect ynegate msize(vsmall) mlabsize(vsmall)
Manhattan

Manhattan

The code below performs MDS using Maximum distance (as the ‘measure’ option is specified as ‘maximum’) and then plots the resulting coordinates.

//MDS: Maximum distances
mds `varlist', id(model) std(`varlist') measure(maximum) noplot
mdsconfig, autoaspect xnegate ynegate msize(vsmall) mlabsize(vsmall)
Maximum

Maximum

The code below performs MDS using the correlation-based dissimilarity measure (as the ‘measure’ option is specified as ‘correlation’) and then plots the resulting coordinates.

//MDS: Correlation
mds `varlist', id(model) std(`varlist') measure(correlation) noplot
mdsconfig, autoaspect ynegate msize(vsmall) mlabsize(vsmall)
Correlation

Correlation

Conclusion

MDS is a convenient way to learn about data. Using this technique, we were able to condense the information provided by 15 different variables into a simple two-dimensional plot. One can look at this plot and immediately draw conclusions. For example, the Cadillac Seville is consistently plotted close to the Infiniti Q45, suggesting that the two models were very similar. Indeed, both were 5-passenger luxury midsize sedans with 8-cylinder engines and even looked quite alike. Another way to use MDS is to look for clusters that can have important implications for additional analyses. In our data there are no clusters that really jump out, but we could do further exploration of the limited separation we see by coloring the points according to the categorical variables available in the data. Finally, we want to emphasize that one is not restricted to two-dimensional classical MDS. All three packages we used for our examples are capable of expanding the dimensionality of the MDS coordinates, and two of the packages, R’s smacof and Stata’s mds, can perform a different type of metric MDS called “distance MDS,” or, in Stata, “modern MDS.” This involves minimizing a loss function called Stress instead of Strain.

References

[1] https://link.springer.com/chapter/10.1007/978-1-4757-2711-1_1
[2] https://en.wikipedia.org/wiki/Multidimensional_scaling
[3] http://www-stat.wharton.upenn.edu/~buja/PAPERS/paper-mds-jcgs.pdf
[4] https://www.stata.com/manuals13/mvmds.pdf