Multidimensional scaling (MDS) is a dimension reduction technique that is most useful for visualizing high-dimensional data during the exploratory phase. MDS “represents measurements of similarity (or dissimilarity) among pairs of objects as distances between points” [1]. Simply put, objects closer together (in terms of MDS coordinates) are more similar than objects farther apart. There are two types of MDS: metric, which deals with quantitative dissimilarities, and non-metric, which deals with qualitative (ordinal) dissimilarities. Within metric MDS, classical MDS involves minimizing a loss function called Strain, which is equivalent to performing eigenvalue decomposition on a Gram matrix constructed from the dissimilarity measure. When the dissimilarity measure is Euclidean distance, classical MDS is the same as Principal Components Analysis. However, the dissimilarity measure need not be Euclidean distance, as we demonstrate below. Indeed, other dissimilarity measures may do a better job of approximating the original data.
In this tutorial, we will use two R functions, torgerson
from the smacof
package and cmdscale
from the stats
package (which is available in base R), plus the mds
command from Stata, to perform classical MDS.
The data used in the following examples come from the Cars93 dataset available in R’s MASS
package. The observations of the dataset are 93 randomly selected 1993 passenger car models listed in the Consumer Reports issue and PACE Buying Guide. Information contained in the dataset includes various automobile metrics such as price, horsepower, and luggage capacity. For raw data click here.
Our three examples will involve briefly cleaning the data, calculating dissimilarities among our chosen variables in four different ways, performing classical MDS, and plotting the resulting coordinates in two-dimensional space. The four measures of dissimilarity we will be using are Euclidean distance, Manhattan distance, Maximum distance, and a dissimilarity measure based on the correlation matrix.
Euclidean distance between the attributes of objects \(i\) and \(j\) is calculated as: \[ \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ik} - x_{jk})^2} \]
Manhattan distance between the attributes of objects \(i\) and \(j\) is calculated as: \[ |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ik} - x_{jk}| \]
Maximum distance between the attributes of objects \(i\) and \(j\) is calculated as: \[ \max(|x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|,\ldots,|x_{ik} - x_{jk}|) \]
The correlation-based measure of dissimilarity used by Stata (and implemented by us in R) for a matrix \(X\) is calculated as: \[ \sqrt{2\cdot(1-corr(X))} \]
The variables we will be using for MDS are presented in the table below.
Variable | Description |
---|---|
EngineSize | Engine size in liters |
Horsepower | Maximum horsepower |
Length | Length in inches |
Luggage.room | Luggage capacity in cubic feet |
Max.Price | Price for a premium version in $1,000 |
Min.Price | Price for a basic version in $1,000 |
MPG.city | City miles per US gallon by EPA rating |
MPG.highway | Highway miles per US gallon by EPA rating |
Price | Average of minimum and maximum price in $1,000 |
Rear.seat.room | Rear seat room in inches |
Rev.per.mile | Engine revolutions per mile in highest gear |
RPM | Revs per minute at maximum horsepower |
Turn.circle | U-turn space in feet |
Wheelbase | Wheelbase in inches |
Width | Width in inches |
smacof
& data.table
)In this section we will use R packages smacof
and data.table
to perform MDS on the Cars93 dataset. The smacof
package is used to implement various approaches for multidimensional scaling while data.table
is used for data manipulation. First, we load the appropriate libraries.
# Libraries
library(MASS)
library(tidyverse)
library(ggplot2)
library(data.table)
library(smacof)
The dataset for this example is stored in the MASS
package. The following code loads the data in as class data.table which allows for data manipulation using the data.table
package.
# Get data
Cars93 = as.data.table(MASS::Cars93)
The only variables with missing values are Rear.seat.room and Luggage.room. We drop the observations containing these missing values.
Cars93 = Cars93[!is.na(Rear.seat.room) & !is.na(Luggage.room)]
Next, we select our variables and standardize them using the scale
function to give all variables equal weighting.
Cars93_cont_scaled = Cars93[,.(Min.Price, Price, Max.Price,
MPG.city, MPG.highway, EngineSize,
Horsepower, RPM, Rev.per.mile,
Length, Wheelbase, Width,
Turn.circle, Rear.seat.room,
Luggage.room)] %>%
scale()
Then we create different types of dissimilarity matrices using the code below. Any of these dissimilarity measures can be used for MDS.
for (method in c('euclidean', 'manhattan', 'maximum')) {
assign(paste("distance", method, sep='_'), dist(Cars93_cont_scaled, method = method))
}
distance_correlation = sqrt(2*(1-cor(t(Cars93_cont_scaled))))
Lastly, we perform MDS using the torgerson
function from the smacof
package. This function takes the dissimilarity matrix as its argument and performs classical MDS. It is worth noting that torgeson
has an optional argument, p, which is used to specify the dimensionality of the resulting coordinates. We chose the default of 2 dimensions because it is easiest to visualize. However, using more dimensions can better approximate the original data. The code below performs MDS using Euclidian distance and then plots the resulting coordinates.
# Euclidean Distance
torgerson_euclidean = torgerson(distance_euclidean)
df_euclidean = data.frame(
Z1=torgerson_euclidean[,1], Z2=torgerson_euclidean[,2],
Model=Cars93$Model)
ggplot(df_euclidean, aes(x=Z1,y=Z2),main='Main') +
geom_point() +
geom_text(aes(label=Model), hjust=-0.2, size=3)
The code below performs MDS using Manhattan distance and then plots the resulting coordinates.
# Manhattan Distance
torgerson_manhattan = torgerson(distance_manhattan)
df_manhattan = data.frame(
Z1=torgerson_manhattan[,1], Z2=torgerson_manhattan[,2],
Model=Cars93$Model)
ggplot(df_manhattan, aes(x=Z1,y=Z2),main='Main') +
geom_point() +
geom_text(aes(label=Model), hjust=-0.2, size=3)
The code below performs MDS using Maximum distance and then plots the resulting coordinates.
# Maximum Distance
torgerson_maximum = torgerson(distance_maximum)
df_maximum = data.frame(
Z1=torgerson_maximum[,1], Z2=torgerson_maximum[,2],
Model=Cars93$Model)
ggplot(df_maximum, aes(x=Z1,y=Z2),main='Main') +
geom_point() +
geom_text(aes(label=Model), hjust=-0.2, size=3)
The code below performs MDS using the correlation-based dissimilarity measure and then plots the resulting coordinates.
# Correlation
torgerson_correlation = torgerson(distance_correlation)
df_correlation = data.frame(
Z1=torgerson_correlation[,1], Z2=torgerson_correlation[,2],
Model=Cars93$Model)
ggplot(df_correlation, aes(x=Z1,y=Z2),main='Main') +
geom_point() +
geom_text(aes(label=Model), hjust=-0.2, size=3)
stats
& dplyr
)In this section, we will use R packages stats
(included in base R) and dplyr
to perform MDS on the Cars93 dataset. The stats
package will be used for multidimensional scaling and the dplyr
package will be used for data manipulation. The dataset, Cars 93, is stored in the MASS
package.
# Libraries
library(MASS)
library(tidyverse)
library(dplyr)
library(ggplot2)
The following code loads the data in as class tibble so we can manipulate the data using dplyr
.
# Data
cars93 =
as_tibble(Cars93)
The variables ‘Rear.seat.room’ and ‘Luggage.room’ have observations with missing values so we will drop those.
# Drop missing values
cars93 = filter(cars93,!is.na(Rear.seat.room) & !is.na(Luggage.room))
We select our variables of interest and standardize them using the scale
function to give all variables equal weighting.
# Select variables of interest
cars93_scaled =
cars93 %>%
select(Min.Price, Price, Max.Price,
MPG.city, MPG.highway, EngineSize,
Horsepower, RPM, Rev.per.mile,
Length, Wheelbase, Width,
Turn.circle, Rear.seat.room,
Luggage.room) %>%
scale() %>%
as_tibble()
Next, we create different types of dissimilarity matrices. As stated in the introduction, we are interested in Euclidean distance, Manhattan distance, Maximum distance, and a correlation matrix.
# Dissimilarity matrices
for (method in c('euclidean', 'manhattan', 'maximum')) {
assign(paste("distance", method, sep='_'), dist(cars93_scaled, method = method))
}
distance_correlation = sqrt(2*(1-cor(t(cars93_scaled))))
Lastly, we perform MDS for each of the matrices above using the cmdscale
function from the stats
package. The cmdscale
function accepts six arguments with the first two being ‘d’ and ‘k’. The argument ‘d’ is a distance structure or in our case, a dissimilarity matrix. The argument ‘k’ represents the dimension we wish to use to represent the data. We chose to use the default ‘k=2’ because it is the easiest to visualize.
The code below performs MDS using Euclidian distance and then plots the resulting coordinates.
# Euclidean Distance
euclidean = cmdscale(distance_euclidean, k=2)
euclidean_df = data_frame(
Z1=euclidean[,1], Z2=euclidean[,2],
Model=Cars93$Model)
ggplot(euclidean_df, aes(x=Z1,y=Z2), main='Main') +
geom_point() +
geom_text(aes(label=Model), hjust=-0.2, size=3)
The code below performs MDS using Manhattan distance and then plots the resulting coordinates.
# Manhattan Distance
manhattan = cmdscale(distance_manhattan, k=2)
manhattan_df = data.frame(
Z1=manhattan[,1], Z2=manhattan[,2],
Model=Cars93$Model)
ggplot(manhattan_df, aes(x=Z1,y=Z2),main='Main') +
geom_point() +
geom_text(aes(label=Model), hjust=-0.2, size=3)
The code below performs MDS using Maximum distance and then plots the resulting coordinates.
# Maximum Distance
maximum = cmdscale(distance_maximum, k=2)
maximum_df = data.frame(
Z1=maximum[,1], Z2=maximum[,2],
Model=Cars93$Model)
ggplot(maximum_df, aes(x=Z1,y=Z2),main='Main') +
geom_point() +
geom_text(aes(label=Model), hjust=-0.2, size=3)
The code below performs MDS using the correlation-based dissimilarity measure and then plots the resulting coordinates.
# Correlation
correlation = cmdscale(distance_correlation, k=2)
correlation_df = data.frame(
Z1=correlation[,1], Z2=correlation[,2],
Model=Cars93$Model)
ggplot(correlation_df, aes(x=Z1,y=Z2),main='Main') +
geom_point() +
geom_text(aes(label=Model), hjust=-0.2, size=3)
mds
)In this section we will use the Stata command mds
to perform MDS on the Cars93 dataset. First, we import a CSV version of the data.
//Import data
import delim using "Cars93.csv", varnames(1) clear
The only variables with missing values are rearseatroom and luggageroom. We drop the observations containing these missing values. Because the missing values were encoded as “NA” strings, these two variables were imported as string variables, and we have to destring these variables to make them numeric before continuing onto our analysis.
//Drop observations with missing values in rearseatroom or luggageroom
drop if rearseatroom == "NA" | luggageroom == "NA"
destring rearseatroom luggageroom, replace
This step is not necessary, but to save ourselves from having to use the lengthy list of variables twice in the upcoming mds
step, we store our variables of interest in a macro. Remember that Stata macros only exist temporarily, so you would not want run this code separately from the remaining code.
//Macro of continuous variables
local varlist minprice price maxprice mpgcity mpghighway enginesize /*
*/ horsepower rpm revpermile fueltankcapacity length wheelbase width /*
*/ turncircle rearseatroom luggageroom weight
Lastly, we perform MDS using the mds
command. This command uses our list of variables plus some very important options. The ‘id’ option is used to identify unique observations and label them when plotted. The ‘std’ option standardizes the variables so that they are all given equal weight. We also rely on the default value of ‘classical’ for the ‘method’ option to ensure classical MDS is performed. Alternatively we could have specified ‘strain’ for the ‘loss’ option and ‘identity’ for the ‘transform’ option to perform classical MDS. Finally, you will see us use the ‘measure’ option to specify the dissimarlity measure (the default is Euclidean). It is also worth noting that mds
has an additional option, dimension, which is used to specify the dimensionality of the resulting coordinates. We chose the default of 2 dimensions because it is easiest to visualize. However, using more dimensions can better approximate the original data. The code below performs MDS using Euclidian distance and then plots the resulting coordinates.
//MDS: Euclidean distances
mds `varlist', id(model) std(`varlist') noplot
mdsconfig, autoaspect ynegate msize(vsmall) mlabsize(vsmall)
The code below performs MDS using Manhattan distance (as the ‘measure’ option is specificed as ‘manhattan’) and then plots the resulting coordinates.
//MDS: Manhattan distances
mds `varlist', id(model) std(`varlist') measure(manhattan) noplot
mdsconfig, autoaspect ynegate msize(vsmall) mlabsize(vsmall)
The code below performs MDS using Maximum distance (as the ‘measure’ option is specified as ‘maximum’) and then plots the resulting coordinates.
//MDS: Maximum distances
mds `varlist', id(model) std(`varlist') measure(maximum) noplot
mdsconfig, autoaspect xnegate ynegate msize(vsmall) mlabsize(vsmall)
The code below performs MDS using the correlation-based dissimilarity measure (as the ‘measure’ option is specified as ‘correlation’) and then plots the resulting coordinates.
//MDS: Correlation
mds `varlist', id(model) std(`varlist') measure(correlation) noplot
mdsconfig, autoaspect ynegate msize(vsmall) mlabsize(vsmall)
MDS is a convenient way to learn about data. Using this technique, we were able to condense the information provided by 15 different variables into a simple two-dimensional plot. One can look at this plot and immediately draw conclusions. For example, the Cadillac Seville is consistently plotted close to the Infiniti Q45, suggesting that the two models were very similar. Indeed, both were 5-passenger luxury midsize sedans with 8-cylinder engines and even looked quite alike. Another way to use MDS is to look for clusters that can have important implications for additional analyses. In our data there are no clusters that really jump out, but we could do further exploration of the limited separation we see by coloring the points according to the categorical variables available in the data. Finally, we want to emphasize that one is not restricted to two-dimensional classical MDS. All three packages we used for our examples are capable of expanding the dimensionality of the MDS coordinates, and two of the packages, R’s smacof
and Stata’s mds
, can perform a different type of metric MDS called “distance MDS,” or, in Stata, “modern MDS.” This involves minimizing a loss function called Stress instead of Strain.