K-means Clustering is one of unsupervised learning algorithm used when you have unlabeled data. The goal of this algorithm is to find groups of data. It works iterativelly to assign each point to one of K groups based on their feature similarity.
K-means Clustering is applicable and powerful in many fields. For instance, we can cluster the behavior of customers through purchase history when doing business; in the field of healthcare, similar patients can be identified based on their attributes to explore costs, treatments, or outcomes; we can also use the features to cluster the images based on the colors in real time to segment sequence of images/video.
The source of the following algorithm is from wikipedia. Given an initial set of k means \(m_1^1,...,m_k^1,\) the algorithm proceeds by alternating between two steps.
Assignment step: Assign each observation to the cluster whose mean has the least squared Euclidean distance, this is intuitively the “nearest” mean. \[S_{i}^{(t)} = \lbrace x_{p}:\|x_{p}-m_{i}^{(t)}\|^{2} \leqslant \|x_{p}-m_{j}^{(t)}\|^{2} \forall j, 1\leqslant j \leqslant k \rbrace\] where each \(x_p\) is assigned to exactly one \(S^{(t)}\), even if it could be assigned to two or more of them.
Update step: Calculate the new means to be the centroids of the observations in the new clusters \[m_{i}^{t+1} = \frac{1}{|S_{i}^{t}|\sum_{x_{j}\in S_{i}^{(t)}}x_{j} }\]
The algorithm has converged when the assignments no longer change. However, there is no guarantee that the optimum is found using this algorithm.
For our data analysis below, we will use the Seeds Data from UCI Machine Learning Respository. There are 210 instances of dimension 7 along with a categorical label (1, 2, or 3) indicating three different varieties of wheat (Kama, Rosa or Canadian). The seven continuous variables are geometric parameters of wheat kernels including area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove. We are going to perform K-means Clustering and assign cluster labels to all instances. Performance is measured by comparing cluster labels with true labels. The following tutorial demonstrates the method using three languages: R, Python and Stata.
library(dplyr)
library(tidyr)
data_seeds = data.frame(read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt'))
names(data_seeds) = c('area','perimeter','compactness','length','width','coefficient','groove','types')
First, we use function eudist to calculate the Euclidean distance of each points with the cluster centers.
eudist = function(x1, x2){
l = sqrt(sum((x1[1:7]-x2[1:7])^2))
return(l)}
Second, function clus assign each point to a group through finding the minimum distance with the cluster centers.
clus = function(x, p){
index = apply(p, 1, eudist, x)
group = which.min(unname(index))
return(group)
}
Through calculating the mean of original groups, we can renew the cluster centers and then asign the points to new groups by function newgroup. Repeat this step until the result converging and we will get the final groups. As the number of iteration goes to 6, the cluster centers don’t change anymore which means convergence so we stop the iteration.
pmat = data_seeds[1:3,]
newgroup = function(data_seeds, pmat){
group_ind = apply(data_seeds, 1, clus, pmat)
data_new = cbind(data_seeds, group_ind)
p1 = data_new %>% filter(group_ind == 1) %>% dplyr::select(c(area,perimeter,compactness,length,width,coefficient,groove)) %>% apply(2,mean)
p2 = data_new %>% filter(group_ind == 2) %>% dplyr::select(c(area,perimeter,compactness,length,width,coefficient,groove)) %>% apply(2,mean)
p3 = data_new %>% filter(group_ind == 3) %>% dplyr::select(c(area,perimeter,compactness,length,width,coefficient,groove)) %>% apply(2,mean)
pmat = rbind(p1,p2,p3)
pmat = rbind(p1,p2,p3)
return(pmat)
}
pmatnew = list()
for (i in 1:6){
if (i == 1) {pmatnew[[i]] = newgroup(data_seeds, pmat)}
else pmatnew[[i]] = newgroup(data_seeds, pmatnew[[i-1]])
}
group = apply(data_seeds, 1, clus, pmatnew[[6]])
seed_new = cbind(data_seeds, group)
seeds = seed_new %>% group_by(types, group) %>% summarize(count=n()) %>% spread(types, count, fill=0)
knitr::kable(seeds, caption='Types Fitted')
group | 1 | 2 | 3 |
---|---|---|---|
1 | 1 | 60 | 0 |
2 | 57 | 10 | 0 |
3 | 12 | 0 | 70 |
Prediction accuracy: 90%
There is a function kmeans built in R. We can compare the result with our previous result.
fit = kmeans(data_seeds, 3)
seed_new = cbind(data_seeds, group = fit$cluster) %>% group_by(types, group) %>% summarize(count=n()) %>% spread(types, count, fill=0)
knitr::kable(seed_new, caption='Types Fitted through Kmeans Package')
group | 1 | 2 | 3 |
---|---|---|---|
1 | 1 | 60 | 0 |
2 | 5 | 0 | 70 |
3 | 64 | 10 | 0 |
Prediction accuracy: 92%
The following function read_data read in the dataset from UCI website and return instances and true labels.
import urllib2
def read_data(url):
""" Reads instances and labels from a file. """
f = urllib2.urlopen(url).readlines()
instances = []
labels = []
for line in f:
# read both feature values and label
instance_and_label = [float(x) for x in line.split()]
# Remove label (last item) from instance_and_label and append it to labels
labels.append(instance_and_label[-1])
instance_and_label.pop()
# Append the instance to instances
instances.append(instance_and_label)
return instances, labels
For a given dataset, we set the number of clusters K = number of unique labels. The following function num_unique_labels returns the number of unique elements in the list labels.
def num_unique_labels(labels):
""" Return number of unique elements in the list labels. """
return len(set(labels))
We first write a function to compute the Euclidean Distance between two points.
def euclidean_squared(p1, p2):
""" Return squared Euclidean distance between two points p1 and p2. """
return sum([abs(x-y)**2 for (x, y) in zip(p1, p2)])
Given instances, we write a function to find the “nearest” center for each observtion. The following function assign_cluster_ids returns a list cluster_ids such that cluster_ids[i] is the index of the center closest to instances[i].
def assign_cluster_ids(instances, centers):
""" Assigns each instance the id of the center closest to it. """
n = len(instances)
cluster_ids = n*[0] # create list of zeros
for i in range(n):
# Compute distances of instances[i] to each of the centers using a list
# comprehension. Make use of the euclidean_squared function defined
# above.
distances = [euclidean_squared(instances[i],centers[j]) for j in range(len(centers))]
# Find the minimum distance.
min_distance = min(distances)
# Set the cluster id to be the index at which min_distance
# is found in the list distances.
cluster_ids[i] = distances.index(min_distance)
return cluster_ids
The following function recompute_centers recomputes the new centers as the centroids of the observations in the new clusters.
def recompute_centers(instances, cluster_ids, centers):
""" Compute centers (means) given cluster ids. """
K = len(centers)
n = len(cluster_ids)
for i in range(K):
# Find indices of of those instances whose cluster id is i.
# Use a single list comprehension.
one_cluster = [j for j in range(n) if cluster_ids[j] == i]
cluster_size = len(one_cluster)
if cluster_size == 0: # empty cluster
raise Exception("kmeans: empty cluster created.")
# Suppose one_cluster is [i1, i2, i3, ... ]
# Compute the mean of the points instances[i1], instances[i2], ...
# using a call to reduce().
# Supply the right 1st arg: a lambda function (this should take two
# points [represented as lists] as arguments and return their sum) and
# the right 2nd arg: a list (computed using a list comprehension)
sum_cluster = reduce(lambda x, y: [p+q for p,q in zip(x,y)], [instances[j] for j in one_cluster])
centers[i] = [x/cluster_size for x in sum_cluster]
Now we are ready to perform K-means Clustering. The following function cluster_using_kmeans perform Assignment step and Update step iteratively until the iteration converges.
import random
def cluster_using_kmeans(instances, K):
""" Cluster instances using the K-means algorithm."""
# Choose initial centers at random from the given instances
centers = random.sample(instances, K)
# create initial cluster ids
cluster_ids = assign_cluster_ids(instances, centers)
converged = False
while not converged:
# recompute centers; note function returns None, modifies centers
# directly
recompute_centers(instances, cluster_ids, centers)
# re-assign cluster ids
new_cluster_ids = assign_cluster_ids(instances, centers)
if new_cluster_ids == cluster_ids: # no change in clustering
converged = True
else:
cluster_ids = new_cluster_ids
return cluster_ids, centers
We are now ready to perform K-means Clustering on our dataset.
from pandas import crosstab
import numpy as np
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt'
instances, labels = read_data(url)
print 'Read %d instances and %d labels from file %s.' % (len(instances), len(labels), data_file)
# Raise exception if data file is not in good format
if len(instances) != len(labels):
raise Exception('Expected equal number of instances and labels.')
else:
n = len(instances)
# Find number of clusters by finding out how many unique elements are there in labels.
K = num_unique_labels(labels)
print 'Found %d unique labels.' % K
# Run k-means clustering to cluster the instances.
cluster_ids, centers = cluster_using_kmeans(instances, K)
# Print the confusion matrix of provided labels and the found clustering
crosstab(np.array(labels), np.array(cluster_ids), rownames=['label'], colnames=['cluster'])
The two-way cross table is as follows,
We have 22 misclassified cases out of 210 cases in total. Prediction accuracy is 90%.
In Stata we could use cluster function to do k-means clustering. There are already 3 groups in origin data set, so we let K equal to 3.
## First we import data in stata and rename them.
import delimited http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt
rename v1 areaA
rename v2 perimeterP
rename v3 compactness
rename v4 lengthKernel
rename v5 width
rename lengthKernel length
rename v6 coef
rename v7 lengthGroove
rename v8 group
## We use areaA perimeterP compactness length coef width and lengthGroove variables to do cluster.
cluster kmeans areaA perimeterP compactness length coef width lengthGroove, k(3) measure(L2) start(group(group)
rename _clus_1 cluster
## Display result
tabulate group cluster
The two-way cross table is as follows. We have 22 misclassified instances. Prediction accuracy is 90%.