In the real world, grouping similar items is often useful: for example, recommending similar articles to users interested in particular news topics, or recommending specific products to groups of customers with similar behavior. Hierarchical clustering is one effective method for achieving this goal.
Hierarchical clustering is an unsupervised learning algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters in which each cluster is distinct from the others and the objects within each cluster are broadly similar to each other.
There are two types of hierarchical clustering strategies: agglomerative and divisive. In agglomerative clustering, each observation starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy, based on the distance between them. Divisive clustering works in the opposite direction: all observations start in a single cluster, which is split recursively as one moves down the hierarchy.
Common methods for agglomerative hierarchical clustering include single linkage, complete linkage, average linkage, Ward's method, and the centroid method. The most commonly used linkages are single, complete, and average. We adopt average linkage here because it overcomes the limitations of the other two (Yim & Ramdeen, 2015):
Concern with single linkage: because it defines the distance between two clusters as the minimum distance between any case in one cluster and any case in the other, two clusters can be merged simply because a single pair of cases happens to be close. This can produce a chaining effect and lead to results that deviate from the true structure.
Concern with complete linkage: although this method avoids the chaining effect, it is strongly affected by outliers. For example, if one case in cluster 2 is considerably far from both the rest of cluster 2 and the cases in cluster 1, then clusters 1 and 2 cannot be joined even though all of their cases are quite close apart from that outlier.
Average linkage is a compromise between the measures above and provides a more balanced evaluation of inter-cluster distance, so this tutorial adopts average linkage for hierarchical clustering. In average linkage, the distance between two clusters is defined as the average distance between all data points in the first cluster and all data points in the second. Based on this definition, at each stage of the process we merge the two clusters with the smallest average linkage distance.
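To make the comparison concrete, write \(d(a, b)\) for the distance between two individual cases (notation introduced here, not in the original text). The three linkage criteria described above can then be written as \[ d_{\text{single}}(A, B) = \min_{a \in A,\, b \in B} d(a, b), \qquad d_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} d(a, b), \qquad d_{\text{average}}(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b). \] At each merge step, agglomerative clustering joins the pair of clusters with the smallest value of the chosen criterion.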
The elbow method is one of the stopping rules for deciding the optimal number of clusters. As described on Wikipedia, the marginal reduction in intra-cluster variance from introducing an additional cluster diminishes as the number of clusters grows; at some point, a new cluster no longer adds much information. Hence, when we plot the total intra-cluster variation (WSS) against the number of clusters, there is a bend (the "elbow") in the curve, which we take as the cut-off point for the elbow method.
The total intra-cluster variation (WSS) is calculated as \[ \mathrm{WSS} = \sum_{i=1}^{k} \sum_{\mathbf{x} \in S_i} \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2, \] where the observations \(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\) are partitioned into \(k\) (\(\leq n\)) clusters \(S = \{S_1, S_2, \ldots, S_k\}\), and \(\boldsymbol{\mu}_i\) is the mean of the points in \(S_i\).
We’ll use the built-in R data set USArrests, which contains statistics on arrests per 100,000 residents for assault (Assault), murder (Murder), and rape (Rape) in each of the 50 US states in 1973. It also gives the percent of the population living in urban areas (UrbanPop). Thus, it is a data frame with 50 observations on 4 variables.
Hierarchical clustering is well suited to situations where the underlying data has a hierarchical structure. Its time complexity is generally on the order of \(O(n^2 \log n)\), where \(n\) is the number of data points, so when the data set is very large we may not want to rely on hierarchical clustering.
R packages required:
### use package 'cluster' to do hierarchical clustering
library('cluster')
### use packages 'factoextra' and 'dplyr' for the elbow method
library('dplyr')
library('factoextra')
## Loading required package: ggplot2
First, we load the USArrests data set into a data frame called df.
df <- USArrests
Then, we view the first six rows of the data frame with head().
head(df)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
Next, summarize the data to see basic statistical properties of the data set.
summary(df)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
As we don’t want the clustering algorithm to depend on arbitrary variable units, we start by standardizing the data using scale() and then view the new data.
df <- scale(df)
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Great! Now we can come to the clustering. In R, we use dist() to calculate the Euclidean distance between each pair of observations and hclust() to merge them into clusters according to the average linkage distance. Finally, we plot the dendrogram to illustrate the outcome of the hierarchical clustering.
### compute the dissimilarity values
d <- dist(df, method = "euclidean")
### Hierarchical clustering using Average Linkage
hc1 <- hclust(d, method = "average" )
# Plot the obtained dendrogram
plot(hc1, cex = 0.6, hang = -1)
The height at which we cut the dendrogram controls the number of clusters obtained; it plays the same role as k in k-means clustering. Thus, we need to decide on the value of k first.
We use the elbow method to determine the number of clusters. First, we write a function f to compute the WSS for each k.
### write a function to compute the WSS for a given k
f <- function(k) {
  sub_grp <- cutree(hc1, k)            # cluster assignment for each state
  df <- as.data.frame(df)
  ## cluster means of each variable
  df2 <- df %>%
    mutate(cluster = sub_grp) %>%
    group_by(cluster) %>%
    summarize(M = mean(Murder), A = mean(Assault), U = mean(UrbanPop), R = mean(Rape))
  ## attach the cluster assignment and cluster means to each observation
  df1 <- df %>%
    mutate(cluster = sub_grp)
  df3 <- left_join(df1, df2, by = "cluster")
  ## squared distance of each observation to its cluster mean
  D <- (df3$Murder - df3$M)^2 + (df3$Assault - df3$A)^2 +
    (df3$UrbanPop - df3$U)^2 + (df3$Rape - df3$R)^2
  df4 <- cbind(df3, D)
  WSS <- sum(df4$D)
  WSS
}
Then, we plot WSS against k to find the elbow.
### plot WSS against each k from 1 to 10
WSS <- c()
for (i in 1:10) { WSS[i] <- f(i) }
K <- 1:10
SSW_K <- as.data.frame(cbind(WSS, K))
p <- ggplot(SSW_K, aes(x = K, y = WSS)) + geom_line() + geom_point()
p + scale_x_continuous(breaks = K, labels = K) + labs(title = "Elbow Method")
From the plot above we can see that WSS drops quickly for k < 5, while for k > 5 the decrease becomes much slower. Thus, we set the number of clusters to 5.
Now we divide the states into 5 clusters based on the outcomes of agglomerative hierarchical clustering.
plot(hc1, cex = 0.6)
rect.hclust(hc1, k = 5, border = 2:5)
As in R and Python, we first need to load the data into STATA. We use import delimited to do that.
clear
set more off
import delimited "./USArrests.csv"
The list in 1/6 command allows us to view the first 6 rows of the data. We use the rename command to change the variable name and label it using label.
list in 1/6
+-------------------------------------------------+
| v1 murder assault urbanpop rape |
|-------------------------------------------------|
1. | Alabama 13.2 236 58 21.2 |
2. | Alaska 10 263 48 44.5 |
3. | Arizona 8.1 294 80 31 |
4. | Arkansas 8.8 190 50 19.5 |
5. | California 9 276 91 40.6 |
|-------------------------------------------------|
6. | Colorado 7.9 204 78 38.7 |
+-------------------------------------------------+
rename v1 state
label variable state "State"
describe displays the storage type of each variable, and summarize shows summary statistics of the data.
describe
Contains data
obs: 50
vars: 5
size: 1,250
--------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------------------------------------------------------------
state str14 %14s State
murder float %9.0g Murder
assault int %8.0g Assault
urbanpop byte %8.0g UrbanPop
rape float %9.0g Rape
--------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
state | 0
murder | 50 7.788 4.35551 .8 17.4
assault | 50 170.76 83.33766 45 337
urbanpop | 50 65.54 14.47476 32 91
rape | 50 21.232 9.366384 7.3 46
As the variable state is stored as a string, no summary statistics are reported for it.
As we don’t want the clustering algorithm to depend on arbitrary variable units, we start by standardizing the data. To standardize the data in STATA, we use the egen function std(). Here we use a loop to standardize each variable.
foreach var in murder assault urbanpop rape {
egen `var'_new = std(`var')
}
This is our new data.
list in 1/6
+-----------------------------------------------------------------------------------------------+
| state murder assault urbanpop rape murder~w assaul~w urbanpo~w rape_new |
|-----------------------------------------------------------------------------------------------|
1. | Alabama 13.2 236 58 21.2 1.242564 .7828394 -.5209066 -.0034164 |
2. | Alaska 10 263 48 44.5 .5078625 1.106822 -1.211764 2.484203 |
3. | Arizona 8.1 294 80 31 .0716335 1.478803 .99898 1.042878 |
4. | Arkansas 8.8 190 50 19.5 .2323494 .230868 -1.073593 -.1849166 |
5. | California 9 276 91 40.6 .2782682 1.262814 1.758923 2.06782 |
|-----------------------------------------------------------------------------------------------|
6. | Colorado 7.9 204 78 38.7 .0257146 .3988593 .8608086 1.864967 |
+-----------------------------------------------------------------------------------------------+
Great! Now we can come to the clustering. In STATA, we use cluster averagelinkage followed by the variable list, with the name() option naming the resulting cluster analysis, and the cluster dendrogram command helps us visualize the clusters.
cluster averagelinkage murder_new assault_new urbanpop_new rape_new, name(cluster2)
cluster dendrogram cluster2, labels(state) xlabel(, angle(90) labsize(*.75)) title("Hierarchical Clustering Dendrogram")
Then we generate the 1- to 10-cluster solutions for the states, in preparation for the elbow method comparison, using the groups() function of cluster generate. Each state is assigned to a group in every solution.
cluster gen avg = groups(1/10), name(cluster2)
The result will be like this:
list in 1/6
+-----------------------------------------------------------------------------------+
| state avg1 avg2 avg3 avg4 avg5 avg6 avg7 avg8 avg9 avg10 |
|-----------------------------------------------------------------------------------|
1. | Alabama 1 1 1 1 1 1 1 1 1 1 |
2. | Alaska 1 1 2 3 3 3 4 4 5 6 |
3. | Arizona 1 1 1 2 2 2 2 2 2 3 |
4. | Arkansas 1 2 3 4 4 4 5 5 6 7 |
5. | California 1 1 1 2 2 2 3 3 4 5 |
|-----------------------------------------------------------------------------------|
6. | Colorado 1 1 1 2 2 2 2 2 2 3 |
+-----------------------------------------------------------------------------------+
Below is how we create the within-cluster sum of squares (WSS) matrix. We use a forvalues loop to iterate over the cluster solutions and compute the WSS for each.
matrix WSS = J(10,2,.) // create an empty matrix with 10 rows 2 columns.
matrix colnames WSS = k WSS // set column name
forvalues k = 1(1)10 {
scalar ws`k' = 0
foreach v of varlist murder_new assault_new urbanpop_new rape_new {
quietly anova `v' avg`k' // use anova to get the WSS stored in "rss"
scalar ws`k' = ws`k' + e(rss)
}
matrix WSS[`k', 1] = `k' // fill in the first column as number of clusters
matrix WSS[`k', 2] = ws`k' // the second column of the matrix is the WSS value
}
Here is our Elbow plot.
_matplot WSS, columns(2 1) connect(l) xlabel(#10) name(plot1, replace) title ("Optimal number of cluster")
As the graph suggests, beyond 5 clusters the WSS decreases much more slowly. Hence, the best number of clusters is 5 according to the elbow method. We use the yline() option to superimpose a horizontal line on the dendrogram at the corresponding cut height.
cluster dendrogram cluster2, labels(state) xlabel(, angle(90) labsize(*.75)) title("Hierarchical Clustering Dendrogram") yline (2)
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn import preprocessing
import numpy as np
import pandas as pd
%matplotlib inline
crimerate_data = pd.read_csv('USArrests.csv')
Since Murder, Assault, and Rape are in units per 100,000 residents, while UrbanPop is a percentage, we first need to normalize all the features.
statesList = crimerate_data.iloc[:, 0].values # label
data = crimerate_data.iloc[:, 1:5].values
data_scaled = preprocessing.scale(data)
We can check the normalized results by
data_scaled.mean(axis=0)
array([-7.10542736e-17, 1.38777878e-16, -4.39648318e-16, 8.59312621e-16])
data_scaled.std(axis=0)
array([1., 1., 1., 1.])
# generate the linkage matrix ('average' method is selected)
linked = linkage(data_scaled, 'average')
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist # computing the distance
c, coph_dists = cophenet(linked, pdist(data_scaled))
The cophenetic correlation coefficient compares the pairwise distances implied by the dendrogram with the original pairwise distances; the closer the value is to 1, the better the clustering preserves the original distances.
c
0.7180382379320472
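For reference (using notation not in the original text), this coefficient is the Pearson correlation between the original pairwise distances \(d(i, j)\) and the cophenetic distances \(t(i, j)\), i.e. the merge heights at which observations \(i\) and \(j\) first appear in the same cluster; with \(\bar{d}\) and \(\bar{t}\) denoting their means, \[ c = \frac{\sum_{i<j}\bigl(d(i,j)-\bar{d}\bigr)\bigl(t(i,j)-\bar{t}\bigr)}{\sqrt{\sum_{i<j}\bigl(d(i,j)-\bar{d}\bigr)^{2}\,\sum_{i<j}\bigl(t(i,j)-\bar{t}\bigr)^{2}}}. \]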
The linkage matrix is an array with n - 1 rows, where n is the number of starting singleton clusters. Row linked[i] records which two clusters are merged at iteration i, and at what distance, according to the selected linkage criterion.
For example:
linked[0]
array([14. , 28. , 0.2079438, 2. ])
Each row has the format [idx1, idx2, dist, sample_count]. Therefore, at iteration 0, since the observations with indices 14 and 28 are only a distance of 0.2079438 apart, they are merged first.
linked[:10]
array([[14. , 28. , 0.2079438 , 2. ],
[12. , 31. , 0.35377437, 2. ],
[13. , 15. , 0.43312429, 2. ],
[22. , 48. , 0.49909939, 2. ],
[35. , 52. , 0.53043045, 3. ],
[19. , 30. , 0.54082482, 2. ],
[18. , 50. , 0.57847034, 3. ],
[36. , 46. , 0.59956023, 2. ],
[45. , 54. , 0.67765914, 4. ],
[40. , 47. , 0.71809843, 2. ]])
Notice that idx2 of linked[4] is 52, yet we only have 50 data points in total. In fact, any index >= 50 refers to the cluster generated in row linked[idx - 50]. For this case, it simply means that sample 35 (Oklahoma) is merged with the cluster of samples 13 (Indiana) and 15 (Kansas) that was formed in linked[2].
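To make this index convention concrete, here is a small helper sketch (not part of the original code) that recursively expands a cluster index from the linkage matrix into the original sample indices; it only assumes the linked matrix and statesList defined above.
def expand_cluster(idx, linkage_matrix, n):
    # indices below n are original observations; larger indices refer to
    # the cluster created in row (idx - n) of the linkage matrix
    idx = int(idx)
    if idx < n:
        return [idx]
    left, right = linkage_matrix[idx - n, 0], linkage_matrix[idx - n, 1]
    return expand_cluster(left, linkage_matrix, n) + expand_cluster(right, linkage_matrix, n)

# cluster index 52 was created in linked[2] and should contain Indiana and Kansas
members = expand_cluster(52, linked, len(statesList))
print([statesList[m] for m in members])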
plt.figure(figsize=(12, 6))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('States')
plt.ylabel('distance')
dendrogram(linked,
orientation='top', #The direction to plot the dendrogram
#The root at the top, and descendent links going downwards
labels=statesList,
distance_sort='descending',
show_leaf_counts=True)
plt.show()
Based on this linkage, we can also visualize the distances between observations as a heatmap, with rows and columns reordered according to the dendrogram.
from scipy.spatial.distance import squareform
# Compute and plot first dendrogram.
fig = plt.figure(figsize=(9,9))
# axes rectangle: [left, bottom, width, height]
ax1 = fig.add_axes([0.05,0.1,0.2,0.6])
Y = linked
Z1 = dendrogram(Y, orientation='right', labels=statesList)  # left-hand dendrogram
ax1.set_xticks([])
# Compute and plot second dendrogram.
ax2 = fig.add_axes([0.3,0.71,0.6,0.2])
Z2 = dendrogram(Y)
ax2.set_xticks([])
ax2.set_yticks([])
#Compute and plot the heatmap
axmatrix = fig.add_axes([0.3,0.1,0.6,0.6])
idx1 = Z1['leaves']
idx2 = Z2['leaves']
D = squareform(pdist(data_scaled))
D = D[idx1,:]
D = D[:,idx2]
im = axmatrix.matshow(D, aspect='auto', origin='lower', cmap=plt.cm.YlGnBu)
axmatrix.set_xticks([])
axmatrix.set_yticks([])
# Plot colorbar.
axcolor = fig.add_axes([0.91,0.1,0.02,0.6])
plt.colorbar(im, cax=axcolor)
<matplotlib.colorbar.Colorbar at 0x1a14d702b0>
As with the R and STATA implementations, the goal is to minimize the total intra-cluster variation (WSS). Therefore, we can implement the elbow method as a reference for determining the optimal number of clusters:
from sklearn.cluster import AgglomerativeClustering
def wss_calculation(K, data):
    WSS = []
    for i in range(K):
        cluster = AgglomerativeClustering(n_clusters=i + 1, affinity='euclidean', linkage='average')
        cluster.fit_predict(data)
        # cluster index (label) of each observation
        label = cluster.labels_
        wss = []
        for j in range(i + 1):
            # extract each cluster according to its index
            idx = [t for t, e in enumerate(label) if e == j]
            cluster_points = data[idx, :]
            # calculate the WSS of this cluster
            cluster_mean = cluster_points.mean(axis=0)
            distance = np.sum(np.abs(cluster_points - cluster_mean) ** 2, axis=-1)
            wss.append(sum(distance))
        WSS.append(sum(wss))
    return WSS
WSS=wss_calculation(10, data_scaled)
cluster_range = range(1, 11)
plt.figure(figsize=(10,5))
plt.title('Optimal number of cluster')
plt.xlabel('Number of cluster (k)')
plt.ylabel('Total intra-cluster variation')
plt.plot(cluster_range, WSS, marker = "x")
[<matplotlib.lines.Line2D at 0x108a21c18>]
The additional reduction in WSS shrinks noticeably as we go from 4 to 5 to 6 clusters (the "elbow" is at k = 5). So the optimal number of clusters could be 5.
Alternatively, we can just focus on finding where the acceleration of distance growth is the biggest:
last = linked[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.figure(figsize=(10,5))
plt.title('Optimal number of cluster')
plt.xlabel('Number of cluster')
plt.plot(idxs, last_rev, marker = "o", label="distance")
accele = np.diff(last, 2) # 2nd derivative of the distances
accele_rev = accele[::-1]
plt.plot(idxs[:-2] + 1, accele_rev, marker = "x", label = "2nd derivative of distance growth")
plt.legend()
plt.show()
k = accele_rev.argmax() + 2 # if idx 0 is the max of this we want 2 clusters
print ("clusters:", k)
clusters: 5
To illustrate the cut-off distance that determines the selected number of clusters, we can draw a horizontal line in the dendrogram; merges above this line are kept as separate clusters. When there is a large number of samples, it is more concise to plot an annotated, truncated dendrogram with the cut-off line using fancy_dendrogram().
# the following code is from
# [https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/]
def fancy_dendrogram(*args, **kwargs):
    plt.figure(figsize=(10, 5))
    max_d = kwargs.pop('max_d', None)
    if max_d and 'color_threshold' not in kwargs:
        kwargs['color_threshold'] = max_d
    annotate_above = kwargs.pop('annotate_above', 0)
    ddata = dendrogram(*args, **kwargs)
    if not kwargs.get('no_plot', False):
        plt.title('Hierarchical Clustering Dendrogram (truncated)')
        plt.xlabel('sample index or (cluster size)')
        plt.ylabel('distance')
        # annotate each merge above the threshold with its distance
        for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
            x = 0.5 * sum(i[1:3])
            y = d[1]
            if y > annotate_above:
                plt.plot(x, y, 'o', c=c)
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                             textcoords='offset points',
                             va='top', ha='center')
        if max_d:
            # draw the horizontal cut-off line
            plt.axhline(y=max_d, c='k')
    return ddata
According to the result of the elbow method, we select 5 clusters for our dataset; a cut-off distance of 2.0 achieves this, as shown in the following figure.
fancy_dendrogram(
linked,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=1,
max_d=2, # a horizontal cut-off line
)
plt.show()
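To obtain the flat cluster assignments implied by this cut-off (the counterpart of cutree() in R and cluster gen in STATA), a short sketch using scipy's fcluster is shown below; the threshold of 2 mirrors the cut-off line above, while criterion='maxclust' requests 5 clusters directly.
from scipy.cluster.hierarchy import fcluster

# cut the dendrogram at distance 2 (the cut-off line drawn above)
labels_dist = fcluster(linked, t=2, criterion='distance')
# or ask directly for 5 clusters, as suggested by the elbow method
labels_k5 = fcluster(linked, t=5, criterion='maxclust')
# list the states in each of the 5 clusters
clusters = pd.DataFrame({'state': statesList, 'cluster': labels_k5})
print(clusters.groupby('cluster')['state'].apply(list))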
From the five clusters shown in the dendrogram above, we can see that the outcome of hierarchical clustering is related to the geographic locations of the 50 states. For example, most southern states fall into one cluster, while Alaska forms a cluster by itself.