Introduction

Lasso

Lasso (least absolute shrinkage and selection operator, also written LASSO) is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. It was introduced by Robert Tibshirani in 1996, building on Leo Breiman’s nonnegative garrote. Lasso was originally formulated for least squares models, and this simple case reveals a substantial amount about the behavior of the estimator, including its relationship to ridge regression and best subset selection and the connection between lasso coefficient estimates and so-called soft thresholding. It also reveals that (as in standard linear regression) the coefficient estimates need not be unique if covariates are collinear.

The coefficient estimate is shown below; like ridge regression, it combines a loss term with a penalty term. The advantage of the lasso penalty is that it allows regression coefficients to shrink to exactly zero. This in turn makes the model easier to interpret, since only a few important coefficients are kept.

\[\hat\beta=\underset{\beta}{\operatorname{argmin}}\left\{\frac{1}{N}\|y-X\beta\|_2^2+\lambda\|\beta\|_1\right\}\]
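To make the soft-thresholding connection concrete: assuming an orthonormal design (\(X^TX=I\)) and the commonly used rescaling \(\frac{1}{2}\|y-X\beta\|_2^2+\lambda\|\beta\|_1\) of the objective above, each lasso coefficient is simply the least-squares coefficient passed through the soft-thresholding operator \[\hat\beta_j=\operatorname{sign}(\hat\beta_j^{OLS})\big(|\hat\beta_j^{OLS}|-\lambda\big)_+\] so any coefficient whose least-squares estimate is smaller than \(\lambda\) in absolute value is set exactly to zero. Ridge regression, by contrast, only scales coefficients toward zero and never sets them exactly to zero.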

Multinomial Logistic Regression

Multinomial logistic regression is a particular solution to the classification problem that assumes that a linear combination of the observed features and some problem-specific parameters can be used to determine the probability of each particular outcome of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken).

For the multinomial model, suppose the response variable has \(K\) levels \(G=\{1,2,3,...,K\}\). Here we model \[Pr(G=k|X=x)=\frac{e^{\beta_{0k}+\beta_{k}^{T} x}}{\sum_{l=1}^{K}e^{\beta_{0l}+\beta_{l}^T x}}\] Let \(Y\) be the \(N\times K\) indicator response matrix, with elements \(y_{il}=I(g_i=l)\). Combining the multinomial model with the lasso penalty, the penalized negative log-likelihood becomes \[l(\{\beta_{0k},\beta_k\}_1^K)=-\left[\frac{1}{N}\sum_{i=1}^{N}\left( \sum_{k=1}^{K}y_{ik}(\beta_{0k}+x_i^T\beta_{k})-\log\Big(\sum_{l=1}^{K}e^{\beta_{0l}+x_i^T\beta_{l}}\Big) \right)\right]+\lambda\sum_{j=1}^{p}\|\beta_j\|_1\] where \(\beta_j\) is the vector of the \(K\) coefficients for predictor \(j\). The standard Newton algorithm can be tedious here. Instead, we use a so-called partial Newton algorithm, making a partial quadratic approximation to the log-likelihood and allowing only (\(\beta_{0k}\), \(\beta_k\)) to vary for a single class at a time. For each value of \(\lambda\), we first cycle over all classes indexed by \(k\), each time computing a partial quadratic approximation about the parameters of the current class. The inner procedure is then almost the same as for the binomial case.
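The class probabilities above are just a softmax over the per-class linear scores. As a minimal numpy sketch of that formula (the function name and the tiny made-up numbers are illustrative only, not part of glmnet):

import numpy as np

def softmax_probs(x, intercepts, coefs):
    # Pr(G = k | X = x) for each class k, following the formula above.
    # x: feature vector of length p; intercepts: the K values beta_{0k};
    # coefs: K x p matrix whose k-th row is beta_k.
    scores = intercepts + coefs @ x      # beta_{0k} + beta_k^T x for every class k
    scores -= scores.max()               # stabilize before exponentiating
    exps = np.exp(scores)
    return exps / exps.sum()             # normalize so the K probabilities sum to 1

# Tiny illustrative example with K = 3 classes and p = 2 features (made-up numbers)
probs = softmax_probs(np.array([1.0, -0.5]),
                      np.array([0.1, 0.0, -0.2]),
                      np.array([[0.5, 0.3], [-0.2, 0.8], [0.0, 0.0]]))
print(probs, probs.sum())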

Library: Glmnet

Glmnet is a package that fits a generalized linear model via penalized maximum likelihood. The regularization path is computed for the lasso or elastic-net penalty at a grid of values of the regularization parameter lambda. The algorithm is extremely fast and can exploit sparsity in the input matrix x. It fits linear, logistic, multinomial, Poisson, and Cox regression models. A variety of predictions can be made from the fitted models.

Regularization is a technique that constrains the values of the estimated coefficients, setting some of them to zero, which helps the interpretability of the final model. In this way, model selection is performed automatically. A tuning parameter (\(\lambda\) in the formulation above) controls the amount of regularization, and the value that best balances underfitting and overfitting is chosen.
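As a rough illustration of this trade-off (a hedged sketch using scikit-learn's L1-penalized logistic regression on synthetic data, rather than glmnet itself), one can watch the number of nonzero coefficients shrink as the penalty gets stronger:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 50 features, of which only 5 are actually informative
X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

# In scikit-learn, C is the inverse of the regularization strength (roughly 1/lambda),
# so a smaller C means a stronger L1 penalty and a sparser model.
for C in (1.0, 0.1, 0.01):
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C).fit(X, y)
    print(C, int(np.sum(clf.coef_ != 0)), "nonzero coefficients")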

In the case of the MNIST data, a multinomial family is selected with the response being the digits 0-9.

Dataset: MNIST

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. Every MNIST data point has two parts: an image of a handwritten digit and a corresponding label. We’ll call the images “x” and the labels “y”. Both the training set and the test set contain images and their corresponding labels. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. The original black and white (bilevel) images from NIST were size-normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were then centered in a 28x28 image by computing the center of mass of the pixels and translating the image so as to position this point at the center of the 28x28 field.

Handwritten Digits Recognition

R

load("C:/Users/zzr/mnist.RData")

The data set has two parts: a training set with 60,000 observations and a test set with 10,000 observations. We can visualize individual digits with the show_digit helper function:

par(mfrow=c(2,3))
show_digit(train[1,-1])
show_digit(train[2,-1])
show_digit(train[3,-1])
show_digit(train[4,-1])
show_digit(train[5,-1])
show_digit(train[6,-1])

Load Libraries

library(glmnet)

Cross Validation

As for the model \[\hat\beta=\underset{\beta}{\operatorname{argmin}}\left\{\frac{1}{N}\|y-X\beta\|_2^2+\lambda\|\beta\|_1\right\}\], \(\lambda\) is not a fixed parameter; we use cross-validation on the training set to choose the \(\lambda\) that minimizes the cross-validated misclassification error. Here we only use 1,000 observations to demonstrate the cross-validation, because of the heavy computation required by the full data. cv.glmnet is used for cross-validation of the lasso model.
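As a reminder of what cv.glmnet is doing (a generic description of cross-validation, with \(M\) used for the number of folds to avoid clashing with the \(K\) classes above; cv.glmnet uses 10 folds by default), the training data are split into \(M\) folds and, for each candidate \(\lambda\), the misclassification error on each held-out fold is averaged: \[CV(\lambda)=\frac{1}{M}\sum_{m=1}^{M}\mathrm{Err}_m(\lambda),\qquad \hat\lambda=\underset{\lambda}{\operatorname{argmin}}\ CV(\lambda)\]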

train1=train[1:1000,]
Y=train1$Y
X=as.matrix(train1[,2:785])
fit=cv.glmnet(X,Y,family = "multinomial",type.measure = "class")
plot(fit)

According to the plot, we can find the \(\lambda\) that minimizes the misclassification error. Hence, we can get the best \(\lambda\) by cross-validation.

Coefficients

par(mfrow=c(1,2))
plot(fit$glmnet.fit)

The coefficient paths for each response class are shown above. We can see how the lasso shrinks the coefficients toward zero as the penalty increases.

Prediction

The fitted lasso model can be used for prediction (recognition) on the test set. We can get the best \(\lambda\) as lambda.min in R.

testx=as.matrix(test[,2:785])
testy=test$Y
pred=predict(fit,testx,s=fit$lambda.min,type="class")
par(mfrow=c(2,3))
for(i in 1:6){
  show_digit(testx[i,])
  title(sprintf("prediction = %s",pred[i]))
}

The prediction accuracy:

mean(testy==pred)
## [1] 0.8356

The accuracy can be much higher if we use the whole training set.

Python

Load Data From Sklearn

%matplotlib inline
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
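# Note: fetch_mldata has been removed from recent scikit-learn releases;
# fetch_openml('mnist_784', version=1) from sklearn.datasets is the modern replacement.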

The MNIST dataset consists of pairs: a handwritten digit image and a label. The digits range from 0 to 9, giving 10 classes in total.

handwritten digit image: a grayscale image of size 28 x 28 pixels.

label: the actual digit that the handwritten image represents, from 0 to 9.

print("Image Data Shape" , mnist.data.shape)
print("Label Data Shape", mnist.target.shape)
Image Data Shape (70000, 784)
Label Data Shape (70000,)

There are 70,000 examples in MNIST in total.
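As a quick optional check (a small sketch assuming mnist has been loaded as above), we can confirm that the 70,000 labels cover the ten digit classes:

import numpy as np
# Count how many examples there are of each digit
digits, counts = np.unique(mnist.target, return_counts=True)
print(dict(zip(digits.astype(int), counts)))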

Split Data into Training and Testing Sets

Here we use the train_test_split function in sklearn to split arrays or matrices into random train and test subsets.

from sklearn.model_selection import train_test_split
train_img, test_img, train_lbl, test_lbl = train_test_split(
    mnist.data, mnist.target, test_size=1/7.0, random_state=0)

The training set contains 60,000 images and the test set contains 10,000 images. Each image is 28 x 28 pixels.

print(train_img.shape)
(60000, 784)
print(train_lbl.shape)
(60000,)
print(test_img.shape)
(10000, 784)
print(test_lbl.shape)
(10000,)

Show the Images and Labels

Below we show the first 5 images in the training set.

import numpy as np
import matplotlib.pyplot as plt
plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(train_img[0:5], train_lbl[0:5])):
    plt.subplot(1, 5, index + 1)
    plt.imshow(np.reshape(image, (28,28)), cmap=plt.cm.gray)
    plt.title('Training: %i\n' % label, fontsize = 20)

Train the Model with Cross Validation

The cvglmnet function uses cross-validation to fit a penalized multinomial logistic regression. As in the R example, a 1,000-image subset of the training data is used to keep the computation manageable.

from cvglmnet import cvglmnet
train_img_sub = train_img[0:1000]
train_lbl_sub = train_lbl[0:1000]
model = cvglmnet(x = train_img_sub, y = train_lbl_sub, family = 'multinomial', ptype = 'class')

Prediction

Make the predictions using the cvglmnetPredict function. 'lambda_min' is the \(\lambda\) at which the minimal cross-validated error is achieved.

from cvglmnetPredict import cvglmnetPredict
predictions = cvglmnetPredict(model, newx = test_img, s = 'lambda_min', ptype = 'class')

Pick the first 5 images and compare them with the predicted labels to check how well the model fits. We could pick any images we want.

plt.figure(figsize=(20,4))
for index, (image, label) in enumerate(zip(test_img[0:5], predictions[0:5])):
    plt.subplot(1, 5, index + 1)
    plt.imshow(np.reshape(image, (28,28)), cmap=plt.cm.gray)
    plt.title('Prediction: %i\n' % label, fontsize = 20)

Prediction Accuracy

Accuracy is defined as the fraction of correct predictions among all predictions.

np.sum(predictions == test_lbl)/predictions.shape
array([ 0.833])

Accuracy in this model equals 0.833.

MATLAB

Data Load

First the data is loaded into MATLAB using helper functions that can read the MNIST data file format; these functions were developed by Yann LeCun from NYU. More details on the methodology can be found there (3). The training images and labels, as well as the test data, are loaded.

% Load datasets
train_images = loadMNISTImages('train-images.idx3-ubyte');
train_labels = loadMNISTLabels('train-labels.idx1-ubyte');
Xtrain = train_images.';
Ytrain = train_labels;
test_images = loadMNISTImages('t10k-images.idx3-ubyte');
test_labels = loadMNISTLabels('t10k-labels.idx1-ubyte');
Xtest = test_images.';
Ytest = test_labels;
% We are using display_network from the autoencoder code
display_network(train_images(:,1:100)); % Show the first 100 images
disp(train_labels(1:10));

Image and data preview

A small sample of the data and images can be seen below.

Each image is represented by a vector of length 784, the flattened 28x28 pixel matrix of the centered and scaled handwritten digit. The training set contains 60,000 such vectors. The regression is performed on this matrix, with each column used as a covariate. Here is a sample of the raw matrix.

Glmnet and CV on train

A glmnet regression analysis is conducted on the training data, using the default cross-validation to get the best model for the training set. The source for this code and methodology is explained in more detail on the Stanford website (1). For demonstration purposes, a 1,000-sample subset of the training set is used in this example because of the computation time required with the full dataset. Results from this demonstration should not be taken as a true validation of the lasso methodology, but as a guide to the setup procedure and interpretation.

% Take a 1000 sample subset of the train and test data
Xtrain_sub = Xtrain(1:1000,:);
Ytrain_sub = Ytrain(1:1000,:);
Xtest_sub = Xtest(1:1000,:);
Ytest_sub = Ytest(1:1000,:);
 
% Use glmnet with cross validation to calculate model
cvfit=cvglmnet(Xtrain_sub, Ytrain_sub, 'multinomial', 'grouped');
cvglmnetPlot(cvfit)
Ytrain_predict = cvglmnetPredict(cvfit, Xtrain_sub, 'lambda_1se', 'class');
Ytest_predict = cvglmnetPredict(cvfit, Xtest_sub, 'lambda_1se', 'class');

Glmnet lambda plot

The glmnet lambda plot shows the cross-validated error as a function of the regularization parameter lambda; the recommended lambda setting, which balances overfitting and underfitting of the model, is highlighted in blue. Two vertical lines are drawn: the first marks the lambda with the minimum cross-validated error, and the second marks the largest lambda within one standard error of that minimum. The one-standard-error lambda is used in the prediction model, as recommended by the creators of the algorithm. The top axis shows the degrees of freedom, i.e. the number of variables selected, at the corresponding lambda values.

Prediction accuracy

% error rate / accuracy
train_accuracy = sum(Ytrain_predict == Ytrain_sub)/ length(Ytrain_sub)*100;
test_accuracy = sum(Ytest_predict == Ytest_sub)/length(Ytest_sub)*100;

As can be seen in the output above, the prediction accuracy using 1,000 samples is fairly good: around 95% on the training subset and around 79% on the test subset. If more samples are used and randomization is included, this technique can achieve even better predictions.