Introduction

K-means Clustering is an unsupervised learning algorithm used when the data are unlabeled. The goal of the algorithm is to find groups in the data: it works iteratively, assigning each point to one of K groups based on feature similarity.

Use in real life

K-means Clustering is applicable and powerful in many fields. In business, for instance, we can cluster customers by their purchase histories; in healthcare, similar patients can be grouped by their attributes to explore costs, treatments, or outcomes; in image processing, pixels can be clustered by color in real time to segment a sequence of images or a video.

Algorithm

The following algorithm is taken from Wikipedia. Given an initial set of k means \(m_1^{(1)},\ldots,m_k^{(1)}\), the algorithm proceeds by alternating between two steps.

  • Assignment step: Assign each observation to the cluster whose mean has the least squared Euclidean distance; this is intuitively the “nearest” mean. \[S_{i}^{(t)} = \lbrace x_{p}:\|x_{p}-m_{i}^{(t)}\|^{2} \leqslant \|x_{p}-m_{j}^{(t)}\|^{2}\ \forall j,\ 1\leqslant j \leqslant k \rbrace\] where each \(x_p\) is assigned to exactly one \(S_{i}^{(t)}\), even if it could be assigned to two or more of them.

  • Update step: Calculate the new means to be the centroids of the observations in the new clusters: \[m_{i}^{(t+1)} = \frac{1}{|S_{i}^{(t)}|}\sum_{x_{j}\in S_{i}^{(t)}}x_{j}\]
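
As a small illustration of these two steps (a toy example, not part of the dataset used below): cluster the one-dimensional points \(\{1, 2, 10, 12\}\) with \(k=2\) and initial means \(m_1^{(1)}=1\), \(m_2^{(1)}=2\). The first assignment step gives \(S_1^{(1)}=\{1\}\) and \(S_2^{(1)}=\{2,10,12\}\), and the update step moves the means to \(m_1^{(2)}=1\) and \(m_2^{(2)}=8\). The next iteration reassigns the point 2 to the first cluster and updates the means to \(1.5\) and \(11\), after which the assignments no longer change.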

The algorithm has converged when the assignments no longer change. However, there is no guarantee that the global optimum is found by this algorithm.
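
To connect the two steps to code, here is a minimal, self-contained R sketch (names such as kmeans_sketch, X, and k are illustrative, empty clusters are not handled, and the built-in kmeans function used later in this tutorial is the more robust choice):

kmeans_sketch = function(X, k, max_iter = 100){
  # X: numeric matrix with one observation per row; k: number of clusters
  centers = X[sample(nrow(X), k), , drop = FALSE]        # initial means drawn from the data
  groups = rep(0, nrow(X))
  for (iter in 1:max_iter){
    # Assignment step: squared Euclidean distance from every point to every center
    d2 = sapply(1:k, function(j)
      rowSums((X - matrix(centers[j, ], nrow(X), ncol(X), byrow = TRUE))^2))
    new_groups = apply(d2, 1, which.min)
    if (all(new_groups == groups)) break                  # assignments unchanged: converged
    groups = new_groups
    # Update step: each center becomes the centroid of the points assigned to it
    centers = t(sapply(1:k, function(j) colMeans(X[groups == j, , drop = FALSE])))
  }
  list(cluster = groups, centers = centers)
}

For example, on the seeds data loaded below this could be called as kmeans_sketch(as.matrix(data_seeds[, 1:7]), 3).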

Example

For the data analysis below, we will use the Seeds Data from the UCI Machine Learning Repository. There are 210 instances of dimension 7, along with a categorical label (1, 2, or 3) indicating three varieties of wheat (Kama, Rosa, or Canadian). The seven continuous variables are geometric parameters of the wheat kernels: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. We perform K-means Clustering and assign a cluster label to every instance; performance is measured by comparing the cluster labels with the true labels. The following tutorial demonstrates the method in three languages: R, Python and Stata.

R

Load Data
library(dplyr)
library(tidyr)
data_seeds = data.frame(read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt'))
names(data_seeds) = c('area','perimeter','compactness','length','width','coefficient','groove','types')
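
As a quick check that the file was read correctly, the data should contain 210 rows (seven features plus the types label) with 70 kernels of each variety:

dim(data_seeds)           # 210 rows and 8 columns
table(data_seeds$types)   # 70 observations of each of the three varieties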
Assignment Step

First, we write the function eudist to calculate the Euclidean distance between a point and a cluster center.

eudist = function(x1, x2){
  # Euclidean distance between two rows, using the seven feature columns only
  l = sqrt(sum((x1[1:7] - x2[1:7])^2))
  return(l)
}

Second, the function clus assigns each point to a group by finding the cluster center with the minimum distance.

clus = function(x, p){
  index = apply(p, 1, eudist, x)
  group = which.min(unname(index))
  return(group)
}
Iterative Step

By calculating the mean of each current group, we renew the cluster centers and then assign the points to new groups with the function newgroup. We repeat this step until the result converges, which gives the final groups. After 6 iterations the cluster centers no longer change, meaning the algorithm has converged, so we stop the iteration there.

pmat = data_seeds[1:3,]
newgroup = function(data_seeds, pmat){
  group_ind = apply(data_seeds, 1, clus, pmat)
  data_new = cbind(data_seeds, group_ind)
  p1 = data_new %>% filter(group_ind == 1) %>% dplyr::select(c(area,perimeter,compactness,length,width,coefficient,groove)) %>% apply(2,mean)
  p2 = data_new %>% filter(group_ind == 2) %>% dplyr::select(c(area,perimeter,compactness,length,width,coefficient,groove)) %>% apply(2,mean)
  p3 = data_new %>% filter(group_ind == 3) %>% dplyr::select(c(area,perimeter,compactness,length,width,coefficient,groove)) %>% apply(2,mean)
  pmat = rbind(p1,p2,p3)
  return(pmat)
}
pmatnew = list()
for (i in 1:6){
  if (i == 1) {pmatnew[[i]] = newgroup(data_seeds, pmat)}
  else pmatnew[[i]] = newgroup(data_seeds, pmatnew[[i-1]])
}
group = apply(data_seeds, 1, clus, pmatnew[[6]])
seed_new = cbind(data_seeds, group) 
seeds = seed_new %>% group_by(types, group) %>% summarize(count=n()) %>% spread(types, count, fill=0)
knitr::kable(seeds, caption='Types Fitted')
Types Fitted (rows: fitted group, columns: true type)
group   type 1   type 2   type 3
    1        1       60        0
    2       57       10        0
    3       12        0       70

Prediction accuracy: about 89% (187 of 210 instances are grouped with their own variety)
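
The accuracy figure is obtained by matching each fitted group with the variety it mostly contains (group 1 with type 2, group 2 with type 1, group 3 with type 3) and counting the agreements. A short sketch of that calculation from the seed_new data frame built above:

tab = table(seed_new$group, seed_new$types)   # fitted groups vs. true types
sum(apply(tab, 1, max)) / nrow(seed_new)      # 187/210, about 0.89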

Comparison

R has a built-in function kmeans. We can compare its result with our previous result.

fit = kmeans(data_seeds, 3)
seed_new = cbind(data_seeds, group = fit$cluster) %>% group_by(types, group) %>% summarize(count=n()) %>% spread(types, count, fill=0)
knitr::kable(seed_new, caption='Types Fitted through Kmeans Package')
Types Fitted through Kmeans Package (rows: fitted group, columns: true type)
group   type 1   type 2   type 3
    1        1       60        0
    2        5        0       70
    3       64       10        0

Prediction accuracy: 92%
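
Note that the call above passes the whole data frame, including the types label, to kmeans. A variant that clusters on the seven measured features only, with several random starts to reduce the dependence on the initial centers, might look like the following (the seed and nstart values here are arbitrary choices, and the resulting table can differ from the one above):

set.seed(506)
fit2 = kmeans(data_seeds[, 1:7], centers = 3, nstart = 20)
table(fit2$cluster, data_seeds$types)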

Python

Read in data

The following function read_data reads the dataset from the UCI website and returns the instances and the true labels.

import urllib2

def read_data(url):
    """ Reads instances and labels from a file. """

    f = urllib2.urlopen(url).readlines()
    instances = []
    labels = []

    for line in f:

        # read both feature values and label
        instance_and_label = [float(x) for x in line.split()]
        
        # Remove label (last item) from instance_and_label and append it to labels
        labels.append(instance_and_label[-1])
        instance_and_label.pop()
        
        # Append the instance to instances
        instances.append(instance_and_label)

    return instances, labels
    
Determine K

For a given dataset, we set the number of clusters K = number of unique labels. The following function num_unique_labels returns the number of unique elements in the list labels.

def num_unique_labels(labels):
    """ Return number of unique elements in the list labels. """
    
    return len(set(labels))
    
Assignment Step

We first write a function to compute the squared Euclidean distance between two points.

def euclidean_squared(p1, p2):
    """ Return squared Euclidean distance between two points p1 and p2. """

    return sum([abs(x-y)**2 for (x, y) in zip(p1, p2)])

Given the instances, we write a function to find the “nearest” center for each observation. The following function assign_cluster_ids returns a list cluster_ids such that cluster_ids[i] is the index of the center closest to instances[i].

def assign_cluster_ids(instances, centers):
    """ Assigns each instance the id of the center closest to it. """

    n = len(instances)
    cluster_ids = n*[0]  # create list of zeros

    for i in range(n):

        # Compute distances of instances[i] to each of the centers using a list
        # comprehension. Make use of the euclidean_squared function defined
        # above.
        distances = [euclidean_squared(instances[i],centers[j]) for j in range(len(centers))]

        # Find the minimum distance.
        min_distance = min(distances)

        # Set the cluster id to be the index at which min_distance
        # is found in the list distances.
        cluster_ids[i] = distances.index(min_distance)

    return cluster_ids
Update Step

The following function recompute_centers recomputes the new centers as the centroids of the observations in the new clusters.

def recompute_centers(instances, cluster_ids, centers):
    """ Compute centers (means) given cluster ids. """

    K = len(centers)
    n = len(cluster_ids)

    for i in range(K):

        # Find indices of of those instances whose cluster id is i.
        # Use a single list comprehension.
        one_cluster = [j for j in range(n) if cluster_ids[j] == i]
        cluster_size = len(one_cluster)
        if cluster_size == 0:  # empty cluster
            raise Exception("kmeans: empty cluster created.")

        # Suppose one_cluster is [i1, i2, i3, ... ]
        # Compute the mean of the points instances[i1], instances[i2], ...
        # using a call to reduce().
        # Supply the right 1st arg: a lambda function (this should take two
        # points [represented as lists] as arguments and return their sum) and
        # the right 2nd arg: a list (computed using a list comprehension)
        sum_cluster = reduce(lambda x, y: [p+q for p,q in zip(x,y)], [instances[j] for j in one_cluster])
        centers[i] = [x/cluster_size for x in sum_cluster]
Iteratively Update Clusters

Now we are ready to perform K-means Clustering. The following function cluster_using_kmeans performs the assignment step and the update step iteratively until the cluster assignments converge.

import random

def cluster_using_kmeans(instances, K):
    """ Cluster instances using the K-means algorithm."""
    
    # Choose initial centers at random from the given instances
    centers = random.sample(instances, K)

    # create initial cluster ids
    cluster_ids = assign_cluster_ids(instances, centers)

    converged = False
    while not converged:

        # recompute centers; note function returns None, modifies centers
        # directly
        recompute_centers(instances, cluster_ids, centers)

        # re-assign cluster ids
        new_cluster_ids = assign_cluster_ids(instances, centers)

        if new_cluster_ids == cluster_ids:  # no change in clustering
            converged = True
        else:
            cluster_ids = new_cluster_ids

    return cluster_ids, centers
Implement K-means Clustering on Seeds Dataset

We are now ready to perform K-means Clustering on our dataset.

from pandas import crosstab
import numpy as np

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt'
instances, labels = read_data(url)
print 'Read %d instances and %d labels from %s.' % (len(instances), len(labels), url)

# Raise exception if data file is not in good format
if len(instances) != len(labels):
    raise Exception('Expected equal number of instances and labels.')
else:
    n = len(instances)

# Find number of clusters by finding out how many unique elements are there in labels.
K = num_unique_labels(labels)
print 'Found %d unique labels.' % K

# Run k-means clustering to cluster the instances.
cluster_ids, centers = cluster_using_kmeans(instances, K)

# Print the confusion matrix of provided labels and the found clustering
crosstab(np.array(labels), np.array(cluster_ids), rownames=['label'], colnames=['cluster'])
    

The two-way cross table produced by crosstab above shows 22 misclassified cases out of 210 in total. Prediction accuracy is about 90%.

Stata

In Stata we can use the cluster kmeans command to do k-means clustering. There are already 3 groups in the original data set, so we set K equal to 3.

* First we import the data into Stata and rename the variables.

import delimited http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt

rename v1 areaA
rename v2 perimeterP
rename v3 compactness
rename v4 length
rename v5 width
rename v6 coef
rename v7 lengthGroove
rename v8 group

* We cluster on the variables areaA, perimeterP, compactness, length, coef, width, and lengthGroove.
cluster kmeans areaA perimeterP compactness length coef width lengthGroove, k(3) measure(L2) start(group(group))
rename _clus_1 cluster

* Display the result
tabulate group cluster

The two-way cross table produced by tabulate again shows 22 misclassified instances. Prediction accuracy is about 90%.