Machine Learning and Sci-Kit Learn¶

Stats 507, Fall 2021

James Henderson, PhD
November 4 & 9, 2021

Overview¶

  • Machine Learning
  • Scikit-learn
  • Isolet Demo
  • Training, Validation, and Testing
  • Cross-Validation
  • SVD
  • Regularization
  • Takeaways

Machine Learning¶

  • The Wikipedia entry on Machine Learning begins ...

Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data ... Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.

ML Domains¶

  • Supervised learning uses labeled data and is akin to regression methods in statistics in that one (or more) variables are treated as dependent.

    • Regression - continuous (or at least interval-valued) labels,
    • Classification - discrete/categorical (often binary) labels.
  • Unsupervised learning includes clustering, visualization, and distance-based methods. It seeks to understand structure rather than use some variables to predict others.

Machine Learning¶

A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning.

  • All or most of what we cover in this class can be thought of as statistical learning.

  • If you want to know more, I highly recommend reading and referring to The Elements of Statistical Learning.

ML vs Statistics¶

  • There is no clear boundary between ML and statistics.
  • ML tends to focus more on prediction and less on inference.
  • ML tends to rely on hold-out data rather than sampling theory for model evaluation.
  • As a practical matter, I like to make the following distinction:

    If you evaluate your model using hold-out data, you're doing ML.

Scikit-learn¶

  • Many ML algorithms and models are available in scikit-learn (SKL).
  • We'll cover a very small fraction of what's available:
    • regularized regression: ridge, lasso, and elastic-net,
    • random forests,
    • gradient-boosted trees.

Scikit-learn¶

  • I'll be using sklearn version 1.0.1.
  • I'll typically import individual estimators and functions from sklearn.
  • Most of what I'll use comes from the linear_model and ensemble APIs.
from sklearn.linear_model import LogisticRegression
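
For example, estimators like those below come from these two modules (a sketch; the demo may use a different subset):

from sklearn.linear_model import LogisticRegression, ElasticNet
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier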

Isolet Demo¶

  • These slides are light on examples and are intended to be used alongside the Isolet demo from the course repo.

Training, Validation, and Testing¶

  • In supervised ML, we evaluate our models based on their ability to make predictions on new cases.
  • Test data is data set aside to evaluate our (final) model(s).
  • Training data is used to learn model parameters (to train our model).
  • A validation dataset is data set aside to compare different models fit to the training data and for tuning hyper-parameters.
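
In scikit-learn, a test/train/validation split might look like the sketch below; the data and split proportions are made up for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# simulated data standing in for a real feature matrix X and labels y
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# set aside 20% as test data, then split the rest into training and validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)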

Over-fitting¶

  • We use out-of-sample data for validation and testing to get unbiased estimates of the out-of-sample error.
  • This helps us avoid over-fitting, which occurs when our model fits the training data well but does so in a way that doesn't generalize to new data.
  • Over-fitting is an instance of a bias-variance tradeoff.

Cross-Validation¶

  • Cross-validation is often used in place of a designated validation dataset.
  • Cross-validation makes efficient use of available non-test data by repeatedly interchanging which observations are considered training and which validation.

Cross-Validation¶

  • Cross-validation is (typically) done by dividing the data into sub-groups called folds.
  • If we have k of these groups we refer to it as k-fold cross-validation.
  • The special case when k equals the total number of non-test samples is known as leave-one-out cross-validation.
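
A minimal sketch of k-fold cross-validation in scikit-learn; the estimator and data here are placeholders:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 5-fold cross-validation; LeaveOneOut() would give the leave-one-out special case
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())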

Cross-Validation¶

  • When dividing data into folds it is important that observations be randomly distributed among the groups.
  • For example, if you had previously sorted your data set you would not want to assign folds using adjacent rows.
  • You can avoid this by randomly shuffling the rows (cases).
  • This is also true when generating the test-train split.
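
One way to do the shuffling yourself, sketched with numpy (the case count and fold count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 100  # hypothetical number of cases

# shuffle the row indices, then assign (roughly) equal-sized folds
perm = rng.permutation(n)
folds = np.array_split(perm, 5)

Passing shuffle=True to KFold or train_test_split accomplishes the same thing.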

Cross-Validation¶

  • If your data are not iid, additional care is needed when splitting the data.
  • If you have block-structured data, with dependence isolated within blocks and distinct blocks independent, you generally want to keep each block together in a single fold.
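
scikit-learn's GroupKFold is one way to keep blocks intact; here is a small sketch with made-up group labels:

import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = rng.integers(0, 2, size=12)
groups = np.repeat([0, 1, 2, 3], 3)  # block id for each case

gkf = GroupKFold(n_splits=4)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    # all cases from a given block land in the same fold
    print(groups[val_idx])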

Singular Value Decomposition¶

  • The singular value decomposition or SVD is a generalization of the eigendecomposition.
  • The SVD breaks a matrix $X$ into three parts - two orthonormal matrices $U$ and $V$ and a diagonal matrix $D$: $X = UDV'$.
  • By convention $U$, $D$, and $V$ are ordered so that the diagonal of $D$ is largest in the upper left and smallest in the lower right.
  • The values of $D$ are called singular values, the columns of $U$ are called left singular vectors, and the columns of $V$ right singular vectors.
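
A quick numpy sketch of the (thin) SVD and the ordering convention:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# thin SVD: X = U @ diag(d) @ Vt, with singular values d in decreasing order
U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(np.allclose(X, U @ np.diag(d) @ Vt))  # True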

Regularization¶

  • Regularization is used to limit over-fitting and to make models identifiable.
  • One way to regularize is to limit the number of singular vectors (or other features) used in your model.
  • Another common way to achieve regularization is to penalize the loss function used to measure fit.
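
For instance, we might keep only the scores on the first k singular vectors as features (a sketch, not necessarily how the demo does it):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

# use scores on the top-k singular vectors as the design matrix
U, d, Vt = np.linalg.svd(X, full_matrices=False)
k = 10  # arbitrary choice; in practice tuned by cross-validation
X_k = U[:, :k] * d[:k]

fit = LogisticRegression(max_iter=1000).fit(X_k, y)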

Ridge Regression¶

  • In ridge regression - or $L_2$ regularization - we penalize the loss using the sum of the squared coefficients (the squared $L_2$ norm).
  • Logistic regression with a ridge penalty has the following objective function (with $g$ the logistic function):

    $$ \mathscr{L}(b) = \sum_{i=1}^n -y_i \log g(x_ib) - (1 - y_i)\log(1 - g(x_ib)) + \lambda \sum_{k=1}^p b_k^2. $$

  • $\lambda$ is a hyper-parameter that controls the amount of regularization.
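
In scikit-learn, ridge-penalized logistic regression is available through LogisticRegression; note that its C parameter is (roughly) the inverse of $\lambda$, so smaller C means more regularization. A sketch with simulated data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# L2 (ridge) penalty; C = 1 / lambda, so smaller C => stronger penalty
fit_l2 = LogisticRegression(penalty='l2', C=0.1, max_iter=1000).fit(X, y)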

Lasso¶

  • In the Lasso - or $L_1$ regularized regression - we penalize the loss using the sum of the absolute value of the coefficients (the $L_1$ norm).
  • Logistic regression with a Lasso penalty has the following objective function:

    $$ \mathscr{L}(b) = \sum_{i=1}^n -y_i \log g(x_ib) - (1 - y_i)\log(1 - g(x_ib)) + \lambda \sum_{k=1}^p |b_k|. $$
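
The same estimator supports the Lasso penalty with a compatible solver; a sketch, again with simulated data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# L1 (Lasso) penalty; 'liblinear' and 'saga' are the solvers that support it
fit_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', max_iter=1000).fit(X, y)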

Elastic-net¶

  • The elastic-net interpolates between the $L_1$ and $L_2$ penalties.

    $$ \mathscr{L}(b) = \sum_{i=1}^n -y_i \log g(x_ib) - (1 - y_i)\log(1 - g(x_ib)) + \alpha \lambda \sum_{k=1}^p |b_k| + (1 - \alpha)\frac{\lambda}{2}\sum_{k=1}^p b_k^2. $$
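
In scikit-learn, the elastic-net penalty for logistic regression requires the 'saga' solver, with l1_ratio playing the role of $\alpha$ above; a sketch:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# elastic-net penalty: l1_ratio interpolates between ridge (0) and lasso (1)
fit_en = LogisticRegression(
    penalty='elasticnet', solver='saga', C=0.1, l1_ratio=0.5, max_iter=5000
).fit(X, y)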

Takeaways¶

  • Set aside test data to evaluate your ML models.
  • Use regularization to navigate the bias-variance tradeoff and avoid over-fitting.
  • Use cross-validation (or a dedicated validation dataset) to tune hyper-parameters and make model-building decisions.