R

library(MASS)
data("Boston")
#install.packages("neuralnet")
library(neuralnet)
# discretize medv into 3 classes: 1 for [5, 20], 2 for (20, 35], 3 for (35, 50]
Boston$medv[Boston$medv>=5 & Boston$medv<=20]<-1
Boston$medv[Boston$medv>20 & Boston$medv<=35]<-2
Boston$medv[Boston$medv>35 & Boston$medv<=50]<-3
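
A quick, optional check (not part of the original analysis) shows how many observations fall into each class:

table(Boston$medv)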

train_idx<-read.csv(file="train_idx.csv")
test_idx<-read.csv(file="test_idx.csv")
train_idx<-as.numeric(train_idx[[1]])
test_idx<-as.numeric(test_idx[[1]])
train<-Boston[train_idx,]
test<-Boston[test_idx,]

The min-max scale method is used to normalize the Boston, train and test data: each variable x is rescaled to (x - min(x)) / (max(x) - min(x)), so every column lies in [0, 1]. Normalization matters because skipping it can lead to useless results or to a very difficult training process (most of the time the algorithm will not converge within the maximum number of iterations allowed).

#scale data by min-max method
maxs <- apply(Boston, 2, max) 
mins <- apply(Boston, 2, min)
scaled <- as.data.frame(scale(Boston, center = mins, scale = maxs - mins))
train_ <- scaled[train_idx,]
test_ <- scaled[test_idx,]
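
A quick, optional check (not part of the original analysis) confirms that every column of the scaled data now lies between 0 and 1:

# each column of the min-max scaled data should range from 0 to 1
apply(scaled, 2, range)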

neuralnet function

Usage

neuralnet(formula, data, hidden = 1, threshold = 0.01, stepmax = 1e+05, rep = 1, startweights = NULL, learningrate.limit = NULL, learningrate.factor = list(minus = 0.5, plus = 1.2), learningrate = NULL, lifesign = "none", lifesign.step = 1000, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, exclude = NULL, constant.weights = NULL, likelihood = FALSE)

Arguments

formula: a symbolic description of the model to be fitted.
data: a data frame containing the variables specified in formula.
hidden: a vector of integers specifying the number of hidden neurons (vertices) in each layer.
algorithm: a string containing the algorithm type used to calculate the neural network. The following types are possible: 'backprop', 'rprop+', 'rprop-', 'sag', or 'slr'. 'backprop' refers to backpropagation, 'rprop+' and 'rprop-' refer to resilient backpropagation with and without weight backtracking, while 'sag' and 'slr' induce the usage of the modified globally convergent algorithm (grprop).

Application of neuralnet

The formula y ~ . is not accepted by the neuralnet() function, so we first build the formula explicitly and then pass it as an argument to the fitting function.

n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))

ONE LAYER

With one hidden layer, 8 neurons are selected, so the configuration is 13:8:1. The set.seed function is used to obtain the same result on every trial: the neural network is initialised with random weights, and without set.seed each training run would start from a different point.

#1 layer
set.seed(200)
nn1 <- neuralnet(f,data=train_,hidden=c(8),algorithm='sag',linear.output=T)
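
As a side note, the fitted object can be inspected before making predictions; a minimal sketch (plot(nn1) draws the network in a graphics device):

# visualise the trained network and look at the training summary
plot(nn1)
nn1$result.matrix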

We scale back "medv" of the test data in order to make the analysis below meaningful.

test.r <- (test_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)

The net outputs a normalized prediction, so we need to scale it back in order to compare it with the test data. The predicted values are then rounded to the nearest integer so they can be compared with the test medv classes.

pr.nn1_ <- compute(nn1,test_[,1:13])#normalized
pr.nn1 <- pr.nn1_$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#unnormalized
actual1 = test.r
prediction1= pr.nn1
prediction1<-round(prediction1)
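
Because the network output is continuous, rounding can occasionally produce a value outside the three defined classes (the column labelled 4 in the table below). If desired, the rounded predictions can be clamped to the valid range; this optional sketch stores the result in a new variable so the original analysis is unchanged:

# optional: a clamped copy of the predictions, forced into classes 1-3
prediction1_clamped <- pmin(pmax(prediction1, 1), 3)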

table(actual1,prediction1)
##        prediction1
## actual1  1  2  3  4
##       1 40  8  0  0
##       2  6 35  3  0
##       3  0  2  4  2

From this table, we can see how many predictions match the test data. The F1 measure is then derived to evaluate the accuracy of the prediction.

p11<-40/(40+6)
p12<-36/(8+36+2)
p13<-4/(3+4)
r11<-40/(40+8)
r12<-36/(6+36+3)
r13<-4/(2+4+2)
f11<-2*p11*r11/(p11+r11)
f12<-2*p12*r12/(p12+r12)
f13<-2*p13*r13/(p13+r13)
print(c(f11,f12,f13))
## [1] 0.8510638298 0.7912087912 0.5333333333
F1score1<-(f11+f12+f13)/3
F1score1
## [1] 0.7252019848

Since medv is divided into three groups, we obtain one F1 measure per group. Averaging these three values gives the Macro F1 measure, which is 72.52% with 1 hidden layer.
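
The per-class arithmetic above can also be written as a small generic helper. This is only a sketch (not part of the original analysis); it assumes the confusion matrix has the actual classes as row names, the predicted classes as column names, and that every actual class also appears as a predicted column:

# macro F1 from a confusion matrix (rows = actual, columns = predicted)
macro_f1 <- function(cm) {
  f1 <- sapply(rownames(cm), function(k) {
    tp <- cm[k, k]                   # correctly predicted as class k
    precision <- tp / sum(cm[, k])   # among everything predicted as k
    recall    <- tp / sum(cm[k, ])   # among everything actually k
    2 * precision * recall / (precision + recall)
  })
  mean(f1)
}
# e.g. macro_f1(table(actual1, prediction1))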

TWO LAYERS

With two hidden layers, we choose 5 and 3 neurons, so the configuration is 13:5:3:1.

#2 layers
set.seed(200)
nn2 <- neuralnet(f,data=train_,hidden=c(5,3),algorithm='sag',linear.output=T)
pr.nn2_ <- compute(nn2,test_[,1:13])#normalized
pr.nn2 <- pr.nn2_$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#unnormalized
actual2 = test.r
prediction2 = pr.nn2
prediction2<-round(prediction2)

table(actual2,prediction2)
##        prediction2
## actual2  1  2  3
##       1 34 14  0
##       2  6 35  3
##       3  0  2  6
p21<-34/(34+6)
p22<-36/(14+36+2)
p23<-6/(3+6)
r21<-34/(34+14)
r22<-36/(6+36+3)
r23<-6/(2+6)
f21<-2*p21*r21/(p21+r21)
f22<-2*p22*r22/(p22+r22)
f23<-2*p23*r23/(p23+r23)
print(c(f21,f22,f23))
## [1] 0.7727272727 0.7422680412 0.7058823529
F1score2<-(f21+f22+f23)/3
F1score2
## [1] 0.7402925556

The Macro F1 measure with 2 layers is 74.03%.

THREE LAYERS

With three hidden layers, we choose 13 neurons in each hidden layer, so the configuration is 13:13:13:13:1.

#3 layers
set.seed(200)
nn3 <- neuralnet(f,data=train_,hidden=c(13,13,13),algorithm='sag',linear.output=T)
pr.nn3_ <- compute(nn3,test_[,1:13])#normalized
pr.nn3 <- pr.nn3_$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#unnormalized
actual3 = test.r
prediction3 = pr.nn3
prediction3 = round(prediction3)
table(actual3,prediction3)
##        prediction3
## actual3  1  2  3  4
##       1 40  8  0  0
##       2  6 35  3  0
##       3  0  2  4  2
p31<-40/(40+6)
p32<-36/(8+36+2)
p33<-4/(3+4)
r31<-40/(40+8)
r32<-36/(6+36+3)
r33<-4/(2+4+2)
f31<-2*p31*r31/(p31+r31)
f32<-2*p32*r32/(p32+r32)
f33<-2*p33*r33/(p33+r33)
print(c(f31,f32,f33))
## [1] 0.8510638298 0.7912087912 0.5333333333
F1score3<-(f31+f32+f33)/3
F1score3
## [1] 0.7252019848

The Macro F1 measure with 3 layers is 72.52%.

In conclusion, the Macro F1 measures are 72.52%, 74.03% and 72.52% for 1 layer (13:8:1), 2 layers (13:5:3:1) and 3 layers (13:13:13:13:1), respectively.
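
Since the three experiments above differ only in the hidden argument, they can also be run in one loop. The following is a sketch (not part of the original analysis) that reuses the macro_f1 helper defined earlier together with f, train_, test_, test.r and Boston; the exact scores depend on the random initialisation:

configs <- list(c(8), c(5, 3), c(13, 13, 13))
macroF1 <- sapply(configs, function(h) {
  set.seed(200)
  nn <- neuralnet(f, data = train_, hidden = h, algorithm = 'sag', linear.output = TRUE)
  # scale the normalized prediction back to the original class range and round it
  pred <- compute(nn, test_[, 1:13])$net.result
  pred <- round(pred * (max(Boston$medv) - min(Boston$medv)) + min(Boston$medv))
  macro_f1(table(actual = test.r, prediction = pred))
})
macroF1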

Python

Prepare data

The Boston data can also be loaded through a Python library. The following code takes the Boston data from the sklearn library and splits it into training and testing sets at an 8:2 ratio. (Note that load_boston has been removed from recent versions of scikit-learn, so this snippet requires an older version.)

from sklearn import datasets
from sklearn.model_selection import train_test_split

dataset = datasets.load_boston()
data, target = dataset.data, dataset.target

x_train1, x_test1, y_train1, y_test1 = train_test_split(data, target, train_size=0.8)


However, we separated the indices for the training and testing data in advance so that the results can be compared across languages. The following code splits the data using those indices; x_(data) denotes the input variables and y_(data) denotes the response variable MEDV.

import pandas as pd

# read csv files for indexing testing and training data set
train_data = pd.read_csv('train_data.csv')
test_data = pd.read_csv('test_data.csv')

y_train = train_data[['medv']]
x_train = train_data.drop('medv', axis=1)

y_test = test_data[['medv']]
x_test = test_data.drop('medv', axis=1)

x_train=x_train.values ; y_train=y_train.values ; x_test=x_test.values; y_test=y_test.values

# convert two dimensional array into one dimensional array
y_test=y_test.flatten(); y_train=y_train.flatten()



Data preprocessing

We preprocess the input variables using the min-max scale method to improve the model performance. The sklearn library allows us to scale data easily with preprocessing.MinMaxScaler(), which rescales each input variable to lie between 0 and 1 by default.

from sklearn import preprocessing

data_scaler = preprocessing.MinMaxScaler()
# fit the scaler on the training data only and apply the same scale to the test data
x_train = data_scaler.fit_transform(x_train) ; x_test = data_scaler.transform(x_test)



Next, we divide the response variable into three groups. To include the observations with MEDV = 5, we set the lower bin edge to 4.99. An observation with 5 <= MEDV <= 20 is then assigned the value 0; similarly, 20 < MEDV <= 35 becomes 1 and 35 < MEDV <= 50 becomes 2.

bins = [4.99,20,35,50]
group_names = [0,1,2]
y_train = pd.cut(y_train, bins, labels=group_names)
y_test = pd.cut(y_test, bins, labels=group_names)



Produce an ANN classifier with 1-hidden layer (13 nodes)

Now we build an ANN classifier with the sklearn library. In the code, MLP stands for Multilayer Perceptron. We use the logistic sigmoid function as the activation function. After fitting on the training data, we predict on the testing data and generate a confusion matrix so that we can calculate the F1 score. For this model we use one hidden layer with 13 nodes and obtain a macro F1 score of 82.69%, as shown in the following result.

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import precision_recall_fscore_support

# ANN Classifier 
mlp1 = MLPClassifier(hidden_layer_sizes=(13),max_iter=100000000, activation='logistic',solver='lbfgs')
mlp1.fit(x_train,y_train)

pred1 = mlp1.predict(x_test)
print(confusion_matrix(y_test,pred1))
precision_recall_fscore_support(y_test,pred1,average="macro")




2-hidden layer (5-3 nodes)

In the same way we build a new model, this time with two hidden layers of 5 and 3 nodes, and obtain a macro F1 score of 84.27%, as shown below.

mlp2 = MLPClassifier(hidden_layer_sizes=(5,3),max_iter=100000000, activation='logistic',solver='lbfgs')
mlp2.fit(x_train,y_train)

pred2 = mlp2.predict(x_test)
print(confusion_matrix(y_test,pred2))
precision_recall_fscore_support(y_test,pred2,average="macro")



3-hidden layers (13-13-13 nodes)

Finally, we build a model with three hidden layers of 13 nodes each and obtain a macro F1 score of 80.73%, as shown below.

mlp3 = MLPClassifier(hidden_layer_sizes=(13,13,13),max_iter=100000000, activation='logistic',solver='lbfgs')
mlp3.fit(x_train,y_train)

pred3 = mlp3.predict(x_test)
print(confusion_matrix(y_test,pred3))
precision_recall_fscore_support(y_test,pred3,average="macro")

Matlab

Prepare relevant functions

To repeat the modeling process conveniently, we first write two helper functions (in Matlab, such functions are stored in .m files; put these files in the same directory as your main script and call them by name).

Function to construct neural network
function Y = my_network(input,output_train,testInput,layers,activ_func,train_func)

% function name: my_network
% argument:
  % input        -  normalized input matrix (predictors) from train data set
  % output_train -  correct output (class labels) recorded in train data set
  % testInput    -  normalized input matrix (predictors) from test data set
  % layers       -  layer configuration. [x1 x2 y] means two hidden layers
  %                 with x1 and x2 nodes respectively; y is the number of
  %                 classes in the output and must be included
  % activ_func   -  activation function for each layer, e.g.
  %                 {func1 func2 funcY} if there are two hidden layers
  % train_func   -  training function for the whole network, passed as a character string


% build the target matrix: an s x 3 one-hot encoding of the training labels
s = length(output_train) ;
output = zeros( s , 3 ) ; % there are 3 classes in total
for i = 1 : s
   output( i , output_train( i ) ) = 1 ;
end

% define the network to use
net = newff( minmax(input) , layers , activ_func, train_func ) ;

% set training parameters (after newff, so that they are not overwritten)
net.trainParam.show = 50 ;
net.trainParam.epochs = 500 ;
net.trainParam.goal = 0.01 ;
net.trainParam.lr = 0.01 ;

% start training
net = train( net, input , output' ) ;

% predict
Y = sim( net , testInput );

end
Function to calculate F1 scores
function  F1 = Cal_F1(class,size,Pred_label,True_label)

% function name: Cal_F1
% argument:
  % class        -  which class of interest
  % size         -  size of the output, i.e. the number of observations in the test data set
  % Pred_label   -  predicted class label of each obs in test data set
  % True_label   -  true class label of each obs in test data set


% initialize three relevant variable
  % TP - true positive
  % FP - false positive
  % FN - false negative

[TP,FP,FN] = deal(0);% all set to 0

% for-loop to calculate these 3 values
for i = 1 : size
    if(True_label(i)==class)
        if(Pred_label(i)==class)
            TP = TP+1;
        else
            FN = FN+1;
        end
    else
        if(Pred_label(i)==class)
            FP = FP+1;
        end
    end
end

F1 = 2*TP/(2*TP+FP+FN);

end

Main function

Import data and some data processing
%% read data
data = csvread('boston.csv',1,0);
feature = data(:,1:13);
class = data(:,14);

%% read train-test index
test_idx = csvread('test_idx.csv',1,0);
train_idx = csvread('train_idx.csv',1,0);

%% pick out trainset and testset
input_train=feature(train_idx,:);
output_train=class(train_idx,:);
input_test=feature(test_idx,:);
output_test=class(test_idx,:);

Using '%%' divides the script into multiple sections so we can debug conveniently. We first have two files, train_idx and test_idx, which record the shuffled observation indices; the proportion of train to test is 8:2. Note that when calling csvread we set the second argument to 1 (skip one row, then read) because we do not need to read the column names. Finally we have 4 matrices: the prefix 'input' denotes a matrix of predictors, and 'output' denotes the response (class labels).

Normalize input data
%% normalization of train features
[input,minI,maxI] = premnmx( input_train')  ;

%% normalization of test features
testInput = tramnmx ( input_test' , minI, maxI ) ;

Here we use the premnmx function to normalize the input_train data. It preprocesses the inputs so that they fall in the interval [-1, 1]; minI and maxI contain the minimum and maximum of each feature (each row of the transposed matrix). We then apply the same scaling to input_test with tramnmx to obtain testInput.

Predict and calculate F1 score

In this section we test neural networks with 3 different layer configurations. We use the log-sigmoid function as the activation function for each hidden layer, 'traingdx' (gradient descent with momentum and an adaptive learning rate) as the training function for the whole network, and the pure linear function to generate the output.

Hidden layers and nodes : [8]
%% predict
Y = my_network(input,output_train,testInput,[8 3],{ 'logsig' 'purelin' }, 'traingdx');
[s1 , s2] = size( Y ) ;

Prediction = zeros(s2,1);
for i = 1 : s2
    [m , Index] = max( Y( : ,  i ) ) ;
    Prediction(i) = Index;
end

True_label = output_test;

%% calculate F1 score 

% initialize a vector to store the F1 score of each class
F1_mat = zeros(3,1);

% calculate F1 score for each class
for i = 1 : 3
   F1_mat(i) = Cal_F1(i,s2,Prediction,True_label) ;
end


F1_1 = mean(F1_mat)

First we call the pre-defined function my_network to generate the prediction matrix Y. Y is a 3x103 matrix, where 3 is the number of classes and 103 is the number of observations in the test data. Each entry in Y is a floating-point number, and the largest value in each column indicates the predicted class of that observation. For example, if the first column is

obs1
1.1
-0.4
0.6

then obs1 in the test data is predicted to be in class 1. We therefore use a for-loop to mark the predicted label of each observation and generate a 103x1 vector Prediction.

At last we calculate the F1 scores for classes 1 to 3 and take their average, which gives the macro-F1 score of this model: 0.8177.

Hidden layers and nodes : [5 3]
%% predict
Y_2 = my_network(input,output_train,testInput,[5 3 3],{ 'logsig' 'logsig' 'purelin' }, 'traingdx');
[s1 , s2] = size( Y_2 ) ;

Prediction = zeros(s2,1);
for i = 1 : s2
    [m , Index] = max( Y_2( : ,  i ) ) ;
    Prediction(i) = Index;
end

True_label = output_test;

%% calculate F1 score 

% initialize a vector to store the F1 score of each class
F1_mat = zeros(3,1);

% calculate F1 score for each class
for i = 1 : 3
   F1_mat(i) = Cal_F1(i,s2,Prediction,True_label) ;
end


F1_2 = mean(F1_mat)

This time we try a neural network with 2 hidden layers, 5 nodes in the first and 3 in the second. The macro-F1 score of this model is 0.7883.

Hidden layers and nodes : [13 13 13]
%% predict
Y_3 = my_network(input,output_train,testInput,[13 13 13 3],{ 'logsig' 'logsig' 'logsig' 'purelin' }, 'traingdx');
[s1 , s2] = size( Y_3 ) ;

Prediction = zeros(s2,1);
for i = 1 : s2
    [m , Index] = max( Y_3( : ,  i ) ) ;
    Prediction(i) = Index;
end

True_label = output_test;

%% calculate F1 score 

% initialize a vector to store the F1 score of each class
F1_mat = zeros(3,1);

% calculate F1 score for each class
for i = 1 : 3
   F1_mat(i) = Cal_F1(i,s2,Prediction,True_label) ;
end


F1_3 = mean(F1_mat)

At last we try a neural network with 3 hidden layers, 13 nodes in each. The macro-F1 score of this model is 0.8436.

Result of different layer-configuration

Item                      network1   network2   network3
number of hidden layers   1          2          3
nodes of each layer       [8]        [5 3]      [13 13 13]
Macro-F1 score            0.8177     0.7883     0.8436