Introduction to K Nearest Neighbor Classification

For KNN classification, the input consists of the k closest training examples in the feature space and the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
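
As a minimal sketch of this voting rule (using made-up one-dimensional data rather than the seeds data analysed below), the following base-R snippet classifies a new point by the most common class among its k nearest training points:

train_x = c(1.0, 1.2, 3.1, 3.3, 3.4)        # hypothetical feature values
train_y = c("a", "a", "b", "b", "b")        # their class labels
new_x = 2.9                                 # point to classify
k = 3
d = abs(train_x - new_x)                    # distances to the new point
nearest = order(d)[1:k]                     # indices of the k closest training points
names(which.max(table(train_y[nearest])))   # majority class among them: "b"

With k = 3 the three closest training points all belong to class "b", so the new point is assigned to "b"; with k = 1 only the single closest point (also "b" here) would decide.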

Data Description

We use the seeds data set (https://archive.ics.uci.edu/ml/datasets/seeds). Seven geometric parameters of wheat kernels are given:
1. area A
2. perimeter P
3. compactness C = \(4\pi A/P^2\)
4. length of kernel
5. width of kernel
6. asymmetry coefficient
7. length of kernel groove
All of these parameters are real-valued and continuous. Each observation also has a categorical variable, the class: the seeds are labelled 1, 2, or 3, one label for each wheat variety.

Different Analyses

R DATA ANALYSIS EXAMPLES

This example uses the following packages. Make sure that you can load them before trying to run the examples on this page. If you do not have a package installed, run: install.packages("packagename").

require(ggplot2)
require(class)
require(sampling)
require(e1071)
require(caret)

Version info: Code for this page was tested in R version 3.4.2

Data Preparation

seed = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt")
colnames(seed) = c('Area', 'Perimeter', 'Compactness', 'Kernel_len', 'Kernel_wid', 'Asymmetry_coef', 'Groove_len', 'Class')
head(seed)
##    Area Perimeter Compactness Kernel_len Kernel_wid Asymmetry_coef
## 1 15.26     14.84      0.8710      5.763      3.312          2.221
## 2 14.88     14.57      0.8811      5.554      3.333          1.018
## 3 14.29     14.09      0.9050      5.291      3.337          2.699
## 4 13.84     13.94      0.8955      5.324      3.379          2.259
## 5 16.14     14.99      0.9034      5.658      3.562          1.355
## 6 14.38     14.21      0.8951      5.386      3.312          2.462
##   Groove_len Class
## 1      5.220     1
## 2      4.956     1
## 3      4.825     1
## 4      4.805     1
## 5      5.175     1
## 6      4.956     1
seed$Class = as.factor(seed$Class)
dim(seed)
## [1] 210   8
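
As a quick check of the compactness formula \(C = 4\pi A/P^2\) given in the data description, we can recompute it from Area and Perimeter and compare it with the stored Compactness column (a small sketch; any differences should only reflect rounding in the published values):

# recompute compactness from area and perimeter and compare with the stored column
c_check = 4 * pi * seed$Area / seed$Perimeter^2
summary(c_check - seed$Compactness)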

Training Sets and Test Sets

We choose 80% of the samples as the training set and the remainder as the test set. Stratified sampling is used to make sure we have enough samples of each class in the training set. Since KNN is based on distances between samples, it is usually better to scale all variables before training; the scaling step is shown below but left commented out, so this example runs on the unscaled data.
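
If the variables are scaled, the test set should be scaled with the training-set means and standard deviations rather than its own. A minimal sketch (not run in the results below; it uses the seedtrainx and seedtestx objects created in the next code block):

# scale train and test with the training-set means and standard deviations
ctr = colMeans(seedtrainx)
csd = apply(seedtrainx, 2, sd)
seedtrainx_scaled = scale(seedtrainx, center = ctr, scale = csd)
seedtestx_scaled = scale(seedtestx, center = ctr, scale = csd)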

set.seed(2017)
# scale the data
#seed[,1:7]=apply(seed[,1:7], 2, scale)
# stratified sampling 
classes = lapply(levels(seed$Class), function(x) which(seed$Class == x))
train = lapply(classes, function(class) sample(class, 0.8*length(class), replace = F))
train = unlist(train)
test = (1:nrow(seed))[-train]
seedtrainx = seed[train,1:7]
seedtrainy = seed[train,8]
seedtestx = seed[test,1:7]
seedtesty = seed[test,8]
seedknn = knn(train = seedtrainx, cl = seedtrainy, test = seedtestx, k = 5)
summary(seedknn)
##  1  2  3 
##  9 17 16
table(seedtesty, seedknn)
##          seedknn
## seedtesty  1  2  3
##         1  8  3  3
##         2  0 14  0
##         3  1  0 13

We use the function knn in the package class to get the result. The function argument cl gives the true classifications of the training set. We simply set k = 5 as an example. We can see that 7 of the 42 test samples are misclassified, so the error rate is 16.7%.
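
Rather than counting the off-diagonal entries of the table by hand, the error rate can also be computed directly from the objects above (a one-line sketch):

# proportion of misclassified test samples: 7/42 = 0.167
mean(seedknn != seedtesty)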

Cross-Validation to Select k

To choose the best parameter k, we can use cross-validation. Here, we divide the training set into 10 folds.

set.seed(2017)
knn1 = tune.knn(x= seedtrainx, y = seedtrainy, k = 1:20, tunecontrol=tune.control(sampling = "cross"), cross = 10)
summary(knn1)
## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##   k
##  16
## 
## - best performance: 0.05992647 
## 
## - Detailed performance results:
##     k      error dispersion
## 1   1 0.07242647 0.09410057
## 2   2 0.07794118 0.06423720
## 3   3 0.08970588 0.07078625
## 4   4 0.09595588 0.07739919
## 5   5 0.07757353 0.06434582
## 6   6 0.08382353 0.06498103
## 7   7 0.07757353 0.06434582
## 8   8 0.07757353 0.06434582
## 9   9 0.07757353 0.06434582
## 10 10 0.07169118 0.06294894
## 11 11 0.06580882 0.06089219
## 12 12 0.07794118 0.07721172
## 13 13 0.06580882 0.06089219
## 14 14 0.06580882 0.06089219
## 15 15 0.06580882 0.06089219
## 16 16 0.05992647 0.06438315
## 17 17 0.06580882 0.06089219
## 18 18 0.07794118 0.07721172
## 19 19 0.07169118 0.06294894
## 20 20 0.07757353 0.07006654
ggplot(knn1$performances, aes(x = k, y = error)) + geom_line(linetype = 1) + ylab("error rate")

We use the function tune.knn in the package e1071 to get the result. The function arguments tunecontrol and cross specify 10-fold cross-validation. We can see from the table or figure that the best k is 16.
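
Having selected k by cross-validation, a natural follow-up (a sketch, using the k = 16 chosen above) is to refit knn with that k and evaluate it once on the held-out test set:

# refit with the cross-validated k and compute the test error rate
seedknn16 = knn(train = seedtrainx, cl = seedtrainy, test = seedtestx, k = 16)
mean(seedknn16 != seedtesty)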

Use Function train to Do KNN

Besides the function knn, we can also use the function train in the package caret to fit KNN.

set.seed(2017)
knn2 = train(seedtrainx, seedtrainy, method = "knn", 
             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
             tuneGrid = expand.grid(.k=1:20))
knn2
## k-Nearest Neighbors 
## 
## 168 samples
##   7 predictor
##   3 classes: '1', '2', '3' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 150, 151, 151, 150, 153, 150, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9342347  0.9012122
##    2  0.9207271  0.8808625
##    3  0.9194771  0.8791241
##    4  0.9151879  0.8726031
##    5  0.9246623  0.8869019
##    6  0.9182734  0.8771618
##    7  0.9267620  0.8899326
##    8  0.9311601  0.8966291
##    9  0.9311765  0.8965800
##   10  0.9250191  0.8873681
##   11  0.9244172  0.8864907
##   12  0.9287228  0.8929798
##   13  0.9326580  0.8988949
##   14  0.9284749  0.8925947
##   15  0.9345234  0.9016483
##   16  0.9246786  0.8868879
##   17  0.9282435  0.8922474
##   18  0.9200953  0.8799887
##   19  0.9281046  0.8920148
##   20  0.9241694  0.8861363
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 15.
ggplot(knn2$results, aes(x = k, y = 1 - Accuracy)) + geom_line(linetype = 1) + ylab("error rate")

In the argument trControl we choose the method repeatedcv, which repeats the cross-validation several times to obtain a more stable error estimate. We can see from the table or figure that the selected k is 15.
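
The fitted caret object can also be used to classify the held-out test set. A short sketch (the predictions come from the final k = 15 model selected above):

# predict the test set with the final caret model and compute its error rate
pred2 = predict(knn2, newdata = seedtestx)
mean(pred2 != seedtesty)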

PYTHON DATA ANALYSIS EXAMPLES

Data Preparation

The first step is data preparation.

import numpy as np
names = ['Area', 'Perimeter', 'Compactness', 'Kernel_len', 'Kernel_wid', 'Asymmetry_coef', 'Groove_len', 'Class']
instances = []
labels = []
# read the whitespace-separated file; the last column is the class label
with open('seeds_dataset.txt', 'r') as f:
    for line in f:
        instance_and_label = [float(x) for x in line.split()]
        labels.append(instance_and_label.pop())
        instances.append(instance_and_label)
x = np.array(instances)
y = np.array(labels)

Training Sets and Test Sets

Then separate the data set into training and testing data.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

Evaluate Accuracy

Choose an arbitrary K = 1, fit KNN to our training data, and evaluate the accuracy on the testing data.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)
pred = knn.predict(x_test)
print(round(float(sum(pred == y_test)) / len(y_test), 2))

It gives 93% accuracy.

Cross-Validation to Select k

Use 10-fold cross-validation to find the optimal K, the one that gives the lowest average misclassification error.

from sklearn.model_selection import cross_val_score
mylist = list(range(1,50))
cv_scores = []
for k in mylist:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
    
import matplotlib.pyplot as plt
error = [1 - x for x in cv_scores]

optimal_k = mylist[error.index(min(error))]
print "The optimal number of neighbors is %d" % optimal_k

plt.plot(mylist, error)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()


The figure above shows the misclassification error versus K. The optimal number of neighbors is 18.

Implement KNN algorithm

We could also write our own KNN classification algorithm.

import math
import random

def euclidean_squared(p1, p2):
    return sum([abs(x-y)**2 for (x, y) in zip(p1, p2)])

def knn_classifier(test_data_point, train_data, train_labels, k, possible_labels):
    """ Implements K Nearest Neighbors classification

    Find K nearest neighbors of point in train_data. Then
    the most frequently appearing train_labels (among the labels
    of these neighbors) are calculated. If there are 
    more than one with max frequency, then return one at random.

    possible_labels is the number of possible labels.
    (Possible labels are assumed to be 1, ..., possible_labels.)
    """

    # compute distances of training examples to this point
    distances = [math.sqrt(euclidean_squared(x, test_data_point)) for x in train_data]

    # put distances, training examples and training labels together
    all_three = zip(distances, train_data, train_labels)

    # sort the triples by distances
    all_three_sorted = sorted(all_three, key = lambda x: x[0])

    # after sorting extract the labels from the first k triples
    nearest_k_labels = [x[2] for x in all_three_sorted[0:k]]

    # initialize label frequencies to 0's
    freq = [0]*(possible_labels + 1)

    # compute label frequencies in nearest_k_labels
    for label in range(1, possible_labels + 1):
        freq[label] = len([x for x in nearest_k_labels if x == label])

    # find the labels that have maximum frequency
    max_freq = max(freq)
    max_freq_labels = [x for x in range(possible_labels + 1) if freq[x] == max_freq]

    # if there are several labels with max frequency,
    # return one of them at random
    return random.choice(max_freq_labels)

pred = []
for i in range(len(x_test)):
    a = knn_classifier(x_test[i], x_train, y_train, 1, 3)
    pred.append(a)
print(pred)

The list pred gives the same classifications as knn.predict(x_test) with K = 1.

STATA DATA ANALYSIS EXAMPLES

Data Preparation

Import and clean data.

set more off

import delimited "/Users/Demi/Desktop/Project/Group/seeds_dataset.csv", clear

label variable a "area"
label variable p "perimeter"
label variable c "compactness"
label variable l_kernel "length of kernel"
label variable w_kernel "width of kernel"
label variable coefficient "asymmetry coefficient"
label variable l_groove "length of kernel groove"

Proportion Chosen

Randomly reorder the observations so that the first third of the data (70 of the 210 observations) can be used for the discriminant analysis.

set seed 123
generate u = runiform()
sort u

Cross-Validation to Select k

Loop over k from 1 to 35. In the resulting matrix, the first column is the k value and the remaining columns are the error rates (one per class, plus the total) under that k.

local i 1
matrix knn_mat = J(35,5,0)
matlist knn_mat

forvalues k = 1/35 {
    
    discrim knn a-l_groove in 1/70, group(wheat) k(`k') notable lootable priors(proportional)
    estat errorrate
    
    matrix knn_mat[`k',1] = `k'                // store k in column 1
    matrix knn_mat[`k',2] = r(erate_count)     // store the error rates from estat errorrate starting in column 2

    local i=`i'+1
}

matlist knn_mat

Convert the matrix to a dataset, then rename the variables.

clear
svmat knn_mat, names(knn)

rename knn1 k
rename knn2 wheat1
rename knn3 wheat2
rename knn4 wheat3
rename knn5 total

Plot the error rate against k.

graph twoway line total k, graphregion(color(white)) ylabel(,nogrid)

Things to consider