Introduction to Cubic Spline Regression

Cubic regression splines are a form of linear regression on a spline basis. The fit is typically represented with a B-spline basis: a series of basis functions supported on intervals between chosen knots. Cubic regression splines are widely used for modeling nonlinear relationships and interactions between variables. Unlike traditional methods such as polynomial regression or broken stick regression, cubic regression splines take both smoothness and local influence into consideration (Faraway, 2015). According to Faraway, each basis function should be a continuous cubic polynomial that is nonzero on an interval defined by the knots and zero everywhere else, which ensures local influence. In addition, the first and second derivatives of the spline should be continuous at each knot to achieve overall smoothness (Faraway, 2015). More details on cubic regression splines can be found at link
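Concretely, with interior knots \(\xi_1 < \dots < \xi_K\), one equivalent way to write a cubic regression spline (in the truncated power basis, which spans the same functions as the B-spline basis used later) is

\[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k=1}^{K} \theta_k (x - \xi_k)_+^3, \qquad (u)_+^3 = \begin{cases} u^3 & \text{if } u > 0 \\ 0 & \text{otherwise.} \end{cases} \]

Each truncated cubic term is zero to the left of its knot, which gives the fit its local character, while \(f\), \(f'\) and \(f''\) remain continuous at every knot, which gives the overall smoothness.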

We provide three worked examples: R with the package splines, Stata with the package bspline, and Python with the packages statsmodels, patsy and sklearn.metrics. In each example, we first clean the data and remove outliers, then fit OLS and polynomial regression models as alternatives, and finally fit cubic regression spline models and compare their goodness of fit to the alternatives.

The results show that the cubic splines fit best compared with OLS and polynomial regression, which, as stated above, follows from their focus on local influence. In addition, the fitted cubic spline curves are smooth, and their advantage over the other two methods can become more pronounced for higher-dimensional data. Given the scope of this tutorial, we focus only on the local-influence advantage of cubic splines.

Data Summary

In this project, we’ll be using the dataset uswages from the R package faraway. The dataset can be found at link. Uswages contains weekly wage data for US male workers collected from the Current Population Survey in 1988. It has 2000 observations and 10 variables. Detailed descriptions of the variables are as follows:

kable(var_explain, caption = "Variable description")
Variable description
Variables Type Explanation
wage double Real weekly wages in dollars
educ integer Years of education
exper integer Years of experience
race integer 1 if Black, 0 if White (other races not in sample)
smsa integer 1 if living in Standard Metropolitan Statistical Area, 0 if not
ne integer 1 if living in the North East
mw integer 1 if living in the Midwest
so integer 1 if living in the South
we integer 1 if living in the West
pt integer 1 if working part time, 0 if not

In this tutorial, we focus on the relationship between the response variable wage and the predictor variable exper (years of experience). This two-dimensional relationship is easy to present and lets us illustrate the cubic regression spline method clearly.

I. Applying Cubic Regression Spline with R

1. Data Cleaning

Regression spline methods are easy to apply with the R package splines. To load the data, manipulate it, and draw plots, the packages faraway, ggplot2, dplyr and knitr are also needed.

# Load packages and data
library(splines)
library(faraway)
library(ggplot2)
library(dplyr)
library(knitr)
data(uswages)
ggplot(uswages, aes(exper, wage))+geom_point(col = "slategrey")+
  labs(x = "Experience in year", y = "Weekly wage")+
  ggtitle("Relationship between weekly wages and years of experience")

From the plot, we can observe that there are a few outliers with extremely high wages. To avoid the influence of outliers on the regression models, we remove observations with wage larger than 4000.

# Remove outliers
uswages = filter(uswages, wage<4000)

2. Benchmark

First, let’s try to capture the relationship with an ordinary least squares (OLS) model.

# Fit an ols model as benchmark
fit1 = lm(wage~exper, data = uswages)
plot(uswages$exper, uswages$wage, xlab = "Experience", 
     ylab = "Weekly wage", main = "OLS model", col = "slategrey")
abline(fit1, col = "red")

From the plot, we can see that the OLS model fails to capture most of the variance in the data.

3. Alternative models: Polynomial regression

Polynomial regression is a natural alternative in this case. Linear models with polynomials of degree 2 and degree 4 are shown below:

# Fit polynomial regression models with degree 2 and 4
g2 = lm(wage~poly(exper, 2), data = uswages)
g4 = lm(wage~poly(exper, 4), data = uswages)
uswages = mutate(uswages, degree2 = fitted(g2), degree4 = fitted(g4))
ggplot(uswages, aes(exper, wage)) +
  labs(x = "Experience in year", y = "Weekly wage")+
  geom_point( col = "slategrey") + 
  geom_line(aes(exper, degree2,color = "2"))+
  geom_line(aes(exper, degree4,color = "4")) +
  scale_color_manual(values = c(
    '2' = 'darkblue',
    '4' = 'red')) +
  labs(color = 'Polynomial degree')+
  ggtitle("Polynomial regression models")

4. Cubic Regression Splines

Polynomial regression models are smooth, but they tend to overfit and each data point affects the fit globally. Cubic regression splines mitigate this problem by splitting the predictor's range into subintervals at chosen knots, so that each basis function has only local influence. We first use the 25%, 50% and 75% quantiles of exper as the set of knots. The result of the cubic spline regression is as follows:
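As a quick check (a minimal sketch on the cleaned data), the knot locations can be read off the sample quartiles of exper:

# sample quartiles of experience, used below as knots (roughly 8, 15 and 27)
quantile(uswages$exper, probs = c(0.25, 0.50, 0.75))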

# Fit regression spline model with chosen knots
# 8, 15 and 27 are approximately the quartiles of exper
cubic_spline = lm(wage~bs(exper, knots = c(8,15,27)), data = uswages)
uswages = mutate(uswages, smooth = fitted(cubic_spline))
ggplot(uswages, aes(exper, wage)) + 
  labs(x = "Experience in year", y = "Weekly wage")+
  geom_point(col = "slategrey") +
  geom_line(aes(exper, smooth), col = "red") + 
  ggtitle("Cubic regression spline model")

summary(cubic_spline)
## 
## Call:
## lm(formula = wage ~ bs(exper, knots = c(8, 15, 27)), data = uswages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -718.23 -232.31  -65.92  152.56 2712.39 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        108.70      80.59   1.349   0.1775    
## bs(exper, knots = c(8, 15, 27))1   278.89     137.20   2.033   0.0422 *  
## bs(exper, knots = c(8, 15, 27))2   371.32      75.70   4.905 1.01e-06 ***
## bs(exper, knots = c(8, 15, 27))3   653.38      97.27   6.717 2.41e-11 ***
## bs(exper, knots = c(8, 15, 27))4   716.96     101.21   7.084 1.94e-12 ***
## bs(exper, knots = c(8, 15, 27))5   392.52     137.69   2.851   0.0044 ** 
## bs(exper, knots = c(8, 15, 27))6    85.35     152.83   0.558   0.5766    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 381.8 on 1990 degrees of freedom
## Multiple R-squared:  0.1422, Adjusted R-squared:  0.1396 
## F-statistic: 54.97 on 6 and 1990 DF,  p-value: < 2.2e-16

Alternatively, we can split the range of the predictor variable experience into 4 equal-width intervals and use the breakpoints as the knots. In this way, we get roughly 15, 30 and 45.
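A minimal sketch of how such breakpoints could be computed (assuming the cleaned uswages data; the values come out close to, though not exactly, 15, 30 and 45):

# interior breakpoints from splitting the range of exper into 4 equal-width pieces
seq(min(uswages$exper), max(uswages$exper), length.out = 5)[2:4]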

# Fit regression spline model with chosen knots
cubic_spline2 = lm(wage~bs(exper, knots = c(15, 30, 45)), data = uswages)
uswages = mutate(uswages, smooth2 = fitted(cubic_spline2))
ggplot(uswages, aes(exper, wage)) + 
  labs(x = "Experience in year", y = "Weekly wage")+
  geom_point(col = "slategrey") +
  geom_line(aes(exper, smooth2), col = "red") + 
  ggtitle("Cubic regression spline model")

summary(cubic_spline2)
## 
## Call:
## lm(formula = wage ~ bs(exper, knots = c(15, 30, 45)), data = uswages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -709.67 -231.18  -63.73  151.18 2698.87 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         177.60      54.36   3.267   0.0011 ** 
## bs(exper, knots = c(15, 30, 45))1   233.22      96.66   2.413   0.0159 *  
## bs(exper, knots = c(15, 30, 45))2   583.20      56.93  10.244  < 2e-16 ***
## bs(exper, knots = c(15, 30, 45))3   618.75      86.19   7.179 9.86e-13 ***
## bs(exper, knots = c(15, 30, 45))4   427.26      88.51   4.827 1.49e-06 ***
## bs(exper, knots = c(15, 30, 45))5    58.48     152.05   0.385   0.7006    
## bs(exper, knots = c(15, 30, 45))6   151.68     226.77   0.669   0.5037    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 381.8 on 1990 degrees of freedom
## Multiple R-squared:  0.1419, Adjusted R-squared:  0.1393 
## F-statistic: 54.84 on 6 and 1990 DF,  p-value: < 2.2e-16

5. Summary

To better compare the goodness of fit of each model, we report the root mean squared error (RMSE) of each model in the table below.

rmse = function(model){
  round(sqrt(mean(model$residuals^2)), 3)
}
rmse_ols = rmse(fit1)
rmse_poly1 = rmse(g2)
rmse_poly2 = rmse(g4)
rmse_cubic = rmse(cubic_spline)
rmse_cubic2 = rmse(cubic_spline2)

model_name = c("OLS", "Polynomial with degree 2", "Polynomial with degree 4", "Cubic Spline 1", "Cubic Spline 2")
rmses = c(rmse_ols, rmse_poly1, rmse_poly2, rmse_cubic, rmse_cubic2)
rmse_table = cbind(model_name, rmses)
colnames(rmse_table) = c("Models", "RMSE")
kable(rmse_table, caption = "Root mean squared errors for different models", digits = 3)
Root mean squared errors for different models
Models RMSE
OLS 404.744
Polynomial with degree 2 381.383
Polynomial with degree 4 381.196
Cubic Spline 1 381.091
Cubic Spline 2 381.156

From this table, we can see that the cubic spline regression with knots (8, 15, 27) has the best fit. In general, the cubic splines fit slightly better than the polynomial regressions, which in turn fit better than OLS.

II. Applying Cubic Regression Spline with Python

1. Loading packages

In this Python tutorial, pandas is required for data manipulation, statsmodels and patsy are required for building the regression models, sklearn.metrics is used to compute RMSE, and matplotlib is used for plotting.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from math import sqrt
from sklearn.metrics import mean_squared_error
from patsy import dmatrix
from scipy import interpolate

2. Import dataset using pandas

# read dataset
data = pd.read_csv("uswages.csv")
print(data.head())
##    Unnamed: 0    wage  educ  exper  race  smsa  ne  mw  so  we  pt
## 0        6085  771.60    18     18     0     1   1   0   0   0   0
## 1       23701  617.28    15     20     0     1   0   0   0   1   0
## 2       16208  957.83    16      9     0     1   0   0   1   0   0
## 3        2720  617.28    12     24     0     1   1   0   0   0   0
## 4        9723  902.18    14     12     0     1   0   1   0   0   0

3. Data Exploration

We can see that wage and experience have a non-linear relationship.

## exper vs wage ##
data_x = data[['exper']]
data_y = data[['wage']]
#visualize the relationship between experience and wage
plt.scatter(data_x, data_y, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.suptitle('Fig 1. Relationship between experience and wage', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

Looking at Fig 1., there seem to be a few outliers (e.g., the wage above 7000). Therefore, in this tutorial, we remove these outliers before continuing our model building by keeping only observations with wage <= 4000.

# remove outlier
data_ylim = data.loc[data['wage']<= 4000]
wage = data_ylim[['wage']]
exper_x = data_ylim[['exper']]
#visualize the relationship between experience and wage
plt.clf()
plt.scatter(exper_x, wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.suptitle('Fig 2. Relationship between experience and wage (remove outlier)', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

4. Simple Linear Regression

First we add an intercept (constant term) to our simple linear regression model, then use sm.OLS from statsmodels to fit it.

#add an intercept (beta_0) to our model
exper_x = sm.add_constant(exper_x)  
# model fitting
model = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions1 = model.predict(exper_x) 

Looking at Fig 3., we clearly cannot use simple linear regression to explain the relationship between wage and experience, since the association between them is non-linear.

# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(exper_x['exper'], predictions1, color = 'green', linewidth = 1.5)
plt.suptitle('Fig 3. Simple linear regression', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

Calculating RMSE value:

rms1 = sqrt(mean_squared_error(wage, predictions1))
print(rms1)
## 404.7442876375701

5. Polynomial Regression

Next we try polynomial regression to assess the relationship by adding “experience” with degree = 2. Again, we use sm.OLS from statsmodels for this analysis.

# refit model using polynomial regression ("exper" with degree = 2)
exper_x['exper2'] = np.power(exper_x['exper'], 2)
# model fitting
model2 = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions2 = model2.predict(exper_x) 

The polynomial curve on Fig 4. does explain the non-linear relationship between wage and experience.

# create a grid of 100 evenly spaced points over the range of exper for plotting
x_lim = np.linspace(start = exper_x['exper'].min(), stop = exper_x['exper'].max(), num = 100)
x_lim_df = pd.DataFrame({'exper':x_lim})
x_lim_df['exper2'] = np.power(x_lim_df['exper'], 2)
x_lim_df = sm.add_constant(x_lim_df) 
# find fitted value using x_lim
fit_reduce = model2.predict(x_lim_df)
# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(x_lim_df[['exper']], fit_reduce, color = 'blue', linewidth = 1.5, label='experience with degree = 2')
plt.legend()
plt.suptitle('Fig 4. Polynomial regression (degree = 2)', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

Calculating RMSE value:

rms2 = sqrt(mean_squared_error(wage, predictions2))
print(rms2)
## 381.38251012666694

6. Polynomial regression with degree up to 4

Next, we add higher degree of experience in the model.

# "exper" with degree = 4
exper_x['exper3'] = np.power(exper_x['exper'], 3)
exper_x['exper4'] = np.power(exper_x['exper'], 4)
# model fitting
model3 = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions3 = model3.predict(exper_x)
# create a grid of 100 evenly spaced points over the range of exper for plotting
x_lim = np.linspace(start = exper_x['exper'].min(), stop = exper_x['exper'].max(), num = 100)
x_lim_df = pd.DataFrame({'exper':x_lim})
x_lim_df['exper2'] = np.power(x_lim_df['exper'], 2)
x_lim_df['exper3'] = np.power(x_lim_df['exper'], 3)
x_lim_df['exper4'] = np.power(x_lim_df['exper'], 4)
x_lim_df = sm.add_constant(x_lim_df) 
# find fitted value using x_lim
fit_reduce2 = model3.predict(x_lim_df)

Fig. 5 shows the polynomial regression curve when degree of experience is up to 4. The curve looks similar to Fig 4.

# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(x_lim_df[['exper']], fit_reduce2, color = 'purple', linewidth = 1.5, label='experience with degree up to 4')
plt.legend()
plt.suptitle('Fig 5. Polynomial regression (degree = 4)', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

Calculating RMSE value:

rms3 = sqrt(mean_squared_error(wage, predictions3))
print(rms3)
## 381.1963530129414

7. Cubic Regression, knots = 8, 15, 27

We select 8, 15 and 27 as our knots, which are the 25th, 50th and 75th percentiles of the experience variable, and use sm.GLM from statsmodels to build the cubic regression spline model.

# cubic spline with 3 interior knots at 8, 15, 27
cubic_x1 = dmatrix("bs(data, knots = (8, 15, 27), include_intercept = False)", {"data": exper_x[['exper']]}, return_type = 'dataframe')
# model fitting
model4 = sm.GLM(wage, cubic_x1).fit()
# find fitted value
predictions4 = model4.predict(cubic_x1)
print(model4.summary())
##                  Generalized Linear Model Regression Results                  
## ==============================================================================
## Dep. Variable:                   wage   No. Observations:                 1997
## Model:                            GLM   Df Residuals:                     1990
## Model Family:                Gaussian   Df Model:                            6
## Link Function:               identity   Scale:                      1.4574e+05
## Method:                          IRLS   Log-Likelihood:                -14702.
## Date:                Fri, 07 Dec 2018   Deviance:                   2.9003e+08
## Time:                        16:06:46   Pearson chi2:                 2.90e+08
## No. Iterations:                     3   Covariance Type:             nonrobust
## ===========================================================================================================================
##                                                               coef    std err          z      P>|z|      [0.025      0.975]
## ---------------------------------------------------------------------------------------------------------------------------
## Intercept                                                 108.7029     80.588      1.349      0.177     -49.247     266.653
## bs(data, knots=(8, 15, 27), include_intercept=False)[0]   278.8916    137.196      2.033      0.042       9.992     547.792
## bs(data, knots=(8, 15, 27), include_intercept=False)[1]   371.3220     75.698      4.905      0.000     222.957     519.687
## bs(data, knots=(8, 15, 27), include_intercept=False)[2]   653.3815     97.270      6.717      0.000     462.735     844.028
## bs(data, knots=(8, 15, 27), include_intercept=False)[3]   716.9594    101.208      7.084      0.000     518.595     915.323
## bs(data, knots=(8, 15, 27), include_intercept=False)[4]   392.5233    137.685      2.851      0.004     122.665     662.382
## bs(data, knots=(8, 15, 27), include_intercept=False)[5]    85.3464    152.831      0.558      0.577    -214.197     384.890
## ===========================================================================================================================
# create a grid of 100 points over the range of exper for plotting
x_lim = np.linspace(exper_x[['exper']].min(), exper_x[['exper']].max(), 100)
# find fitted value using x_lim
fit_reduce3 = model4.predict(dmatrix(formula_like = "bs(train, knots = (8, 15, 27), include_intercept = False)", data = {"train": x_lim}, return_type = 'dataframe'))

The spline curve on Fig 6. does explain the non-linear relationship between wage and experience.

# plot spline
plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(x_lim, fit_reduce3, color='r', linewidth = 1.5, label='Specifying 3 knots, knots = (8, 15, 27)')
plt.legend()
plt.suptitle('Fig 6. Cubic regression, knots = (8, 15, 27)', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

Calculating RMSE value:

rms4 = sqrt(mean_squared_error(wage, predictions4))
print(rms4)
## 381.1963530129414

8. Cubic Regression, knots = 15, 30, 45

Next, we select three other knots, 15, 30 and 45, which split the range of experience into equal-width intervals, and fit the cubic regression spline again.

# cubic spline with 3 interior knots at 15, 30, 45
cubic_x2 = dmatrix("bs(data, knots = (15, 30, 45), include_intercept = False)", {"data": exper_x[['exper']]}, return_type = 'dataframe')
# model fitting
model5 = sm.GLM(wage, cubic_x2).fit()
# find fitted value
predictions5 = model5.predict(cubic_x2)
print(model5.summary())
##                  Generalized Linear Model Regression Results                  
## ==============================================================================
## Dep. Variable:                   wage   No. Observations:                 1997
## Model:                            GLM   Df Residuals:                     1990
## Model Family:                Gaussian   Df Model:                            6
## Link Function:               identity   Scale:                      1.4579e+05
## Method:                          IRLS   Log-Likelihood:                -14702.
## Date:                Fri, 07 Dec 2018   Deviance:                   2.9012e+08
## Time:                        16:06:46   Pearson chi2:                 2.90e+08
## No. Iterations:                     3   Covariance Type:             nonrobust
## ============================================================================================================================
##                                                                coef    std err          z      P>|z|      [0.025      0.975]
## ----------------------------------------------------------------------------------------------------------------------------
## Intercept                                                  177.5977     54.356      3.267      0.001      71.062     284.134
## bs(data, knots=(15, 30, 45), include_intercept=False)[0]   233.2164     96.665      2.413      0.016      43.757     422.675
## bs(data, knots=(15, 30, 45), include_intercept=False)[1]   583.2001     56.933     10.244      0.000     471.613     694.787
## bs(data, knots=(15, 30, 45), include_intercept=False)[2]   618.7530     86.186      7.179      0.000     449.831     787.675
## bs(data, knots=(15, 30, 45), include_intercept=False)[3]   427.2639     88.511      4.827      0.000     253.785     600.743
## bs(data, knots=(15, 30, 45), include_intercept=False)[4]    58.4825    152.055      0.385      0.701    -239.539     356.504
## bs(data, knots=(15, 30, 45), include_intercept=False)[5]   151.6777    226.770      0.669      0.504    -292.783     596.139
## ============================================================================================================================
# create a grid of 100 points over the range of exper for plotting
x_lim = np.linspace(exper_x[['exper']].min(), exper_x[['exper']].max(), 100)
# find fitted value using x_lim
fit_reduce4 = model5.predict(dmatrix(formula_like = "bs(train, knots = (15, 30, 45), include_intercept = False)", data = {"train": x_lim}, return_type = 'dataframe'))

The spline curve in Fig 7 is slightly different from Fig 6 at higher experience values; this might be due to the smaller number of observations beyond the last knot.

plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(x_lim, fit_reduce4, color='y', linewidth = 1.5, label='Specifying 3 knots, knots = (15, 30, 45)')
plt.legend()
plt.suptitle('Fig 7. Cubic regression, knots = (15, 30, 45)', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

Calculating RMSE value:

rms5 = sqrt(mean_squared_error(wage, predictions5))
print(rms5)
## 381.1563127080508

9. Summary

Looking at Fig 8, we can see that the polynomial and spline curves largely overlap, and their RMSE values are similar, with the splines’ RMSE slightly smaller.

# overlay three regression curve
plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(exper_x['exper'], predictions1, color = 'green', linewidth = 1.5, label = 'Simple Linear Regression')
plt.plot(x_lim_df['exper'], fit_reduce, color = 'blue', linewidth = 1.5, label='Polynomial Regression, experience degree = 2')
plt.plot(x_lim_df['exper'], fit_reduce2, color = 'purple', linewidth = 1.5, label='Polynomial Regression, experience up to degree = 4')
plt.plot(x_lim, fit_reduce3, color='r', linewidth = 1.5, label='Cubic Regression, knots = (8, 15, 27)')
plt.plot(x_lim, fit_reduce4, color='y', linewidth = 1.5, label='Cubic Regression, knots = (15, 30, 45)')
plt.legend()
plt.suptitle('Fig 8. Relationship between experience and wage', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()

Comparing the RMSE values:

# compare RMSE
model = ['SLR', 'Polynomial (degree = 2)', 'Polynomial (degree = 4)', 'Spline(knots = 8, 15, 27)', 'Spline(knots = 15, 30, 45)']
RMSE = [rms1, rms2, rms3, rms4, rms5]
compare = pd.DataFrame({'Model':model, 'RMSE':RMSE})
print(compare)
##                         Model        RMSE
## 0                         SLR  404.744288
## 1     Polynomial (degree = 2)  381.382510
## 2     Polynomial (degree = 4)  381.196353
## 3   Spline(knots = 8, 15, 27)  381.196353
## 4  Spline(knots = 15, 30, 45)  381.156313

III. Applying Cubic Regression Spline with STATA

1. Loading Packages

In this section, Stata is used to perform the analysis. To build the cubic regression spline model, a package called bspline is needed. More information about the package can be found here (https://www.stata-journal.com/sjpdf.html?articlenum=sg151_2)

ssc install bspline
ssc install markstat
ssc install whereis

2. Import Dataset

We start by loading the data and presenting the first 4 observations.

import delimited uswages.csv
list in 1/4

3. Data Exploration

This dataset explores how wages in the U.S. are affected by certain variables. For simplicity, we will regress wage on experience and look at the resulting relationship. First we visualize how these 2 variables are related. We notice that wage and experience have a non-linear relationship. We can also see that there seem to be a few outliers. We remove data points where wage is greater than 4000.

import delimited uswages.csv
summarize
summarize wage exper, detail
import delimited uswages.csv
twoway (scatter wage exper), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 1. Relationship between experience and wage") title("Wage vs Exper") name(Scatter)
import delimited uswages.csv
drop if wage > 4000
twoway (scatter wage exper), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 2. Relationship between experience and wage (remove outlier)") title("Wage vs Exper") name(Scatter2)

4. Simple Linear Regression

First we will do OLS Regression. From Fig 3., it does not appear that linear regression is suitable to explain the relationship between wage and experience. We note from the output that the RMSE value is 404.95.

import delimited uswages.csv
regress wage exper
predict linear_wage_exper
twoway (scatter wage exper)(line linear_wage_exper exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 3. Relationship between experience and wage: using simple linear regression") title("Linear") name(Linear)

5. Polynomial Regression

Next we perform polynomial regression and fit a quadratic by setting “experience” to degree = 2. The polynomial seems to explain the non-linearity somewhat better than linear regression. We note from the output that the RMSE value is 381.67.

import delimited uswages.csv
regress wage c.exper##c.exper
predict quad_wage_exper
twoway (scatter wage exper)(line quad_wage_exper exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 4. Relationship between experience and wage: using polynomial regression") title("Polynomial: Degree 2") name(Degree2)

6. Polynomial Regression to the 4th Degree

Next, we’ll try adding a higher degree of experience to the model.

import delimited uswages.csv
regress wage c.exper##c.exper##c.exper##c.exper
predict degfour_wage_exper
twoway (scatter wage exper)(line degfour_wage_exper exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 5. Relationship between experience and wage: using polynomial regression") title("Polynomial: Degree 4") name(Degree4)

The curve looks similar to Fig 4. Note that the RMSE is 381.67.

7. Cubic Regression, Knots = 8, 15, 27

For the cubic regression model, we select 8, 15, and 27 as our knots, which are the 25th, 50th, and 75th percentiles of the experience variable. Here we use the bspline package.

import delimited uswages.csv
bspline, xvar(exper) knots(8,15,27) gen(cubic_wage_exper_spline) power(3)
regress wage cubic_wage_exper_spline*
predict cubic_wage_exper_spline
twoway (scatter wage exper)(line cubic_wage_exper_spline exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 6. Relationship between experience and wage: using cubic regression. Knots at (8,15,27)") title("Cubic: Knots at Quartiles") name(Cubic)

The cubic regression spline also seems to model the non-linear relationship better than a linear fit, although it does not look significantly different from the polynomial regressions performed earlier. Note the RMSE value of 382.23.

One may notice that the coefficients produced from the cubic regression in Stata compared to Python and R look quite different. We attempt to explain this in the Discussion section below.

8. Cubic Regression, knots = 15, 30, 45

Next, we select evenly spaced knots at 15, 30, and 45 and perform the cubic regression again.

import delimited uswages.csv
bspline, xvar(exper) knots(15,30,45) gen(cubic_wage_exper_spline_even) power(3)
regress wage cubic_wage_exper_spline_even*
predict cubic_wage_exper_spline_even
twoway (scatter wage exper)(line cubic_wage_exper_spline_even exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 7. Relationship between experience and wage: using cubic regression. Knots at (15,30,45)") title("Cubic: Evenly Spaced Knots") name(Cubic2)

The spline curve from Fig 7. is slightly different from that of Fig 6. at higher values of experience. We suspect this might be due to lower sample size at the last knot. Note the RMSE value of 381.95.

9. Summary

Comparing all of the fitted curves, we can see that the spline and polynomial curves overlap somewhat. In Python and R, the splines’ RMSE is slightly smaller than the polynomials’. However, in Stata, the spline’s RMSE is slightly larger than the polynomial’s. We still might expect the cubic spline fit to be a little more accurate because of the local influence property of the knots.

Model RMSE
OLS 404.95
Polynomial(Degree = 2) 381.67
Polynomial(Degree = 4) 381.67
Spline(knots = 8,15,27) 382.23
Spline(knots = 15,30,45) 381.95

Discussion

The coefficients for the cubic spline regression in Stata look quite different from the coefficients from R and Python. We believe this is due to how the basis functions are scaled in the bspline package. If \(y(t) = \sum_i x_i(t)\beta_i\), then the same fit can also be written as \(y(t) = \sum_i \frac{x_i(t)}{C}\,(C\beta_i)\) for some constant \(C\): rescaling the basis functions leaves the fitted curve unchanged but rescales the reported coefficients. We believe this is what is happening here.

We can try to reproduce the cubic spline regression results with knots at 8, 15, 27 from scratch using the basis functions. We can then summarize the predictions from the bspline package and those built from scratch and compare them. More information about the rescaling of the basis functions can be found here (https://www.researchgate.net/publication/4794073_BSPLINE_Stata_modules_to_compute_B-splines_parameterized_by_their_values_at_reference_points)
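In the truncated power basis, the from-scratch model fitted below is

\[ wage = \beta_0 + \beta_1\,exper + \beta_2\,exper^2 + \beta_3\,exper^3 + \theta_1 (exper - 8)_+^3 + \theta_2 (exper - 15)_+^3 + \theta_3 (exper - 27)_+^3 + \varepsilon, \]

where \((u)_+^3\) equals \(u^3\) when \(u > 0\) and 0 otherwise; the generated variables k8, k15 and k27 are exactly these truncated cubic terms.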

import delimited uswages.csv
gen exper2 = exper^2
gen exper3 = exper^3
gen k8 = cond(exper > 8, (exper - 8)^3, 0)
gen k15 = cond(exper > 15, (exper - 15)^3, 0)
gen k27 = cond(exper > 27, (exper - 27)^3, 0)
regress wage exper exper2 exper3 k8 k15 k27
predict cubic_wage_exper
bspline, xvar(exper) knots(8,15,27) gen(cubic_wage_exper_spline) power(3)
regress wage cubic_wage_exper_spline*
predict cubic_wage_exper_spline
sum cubic_wage_exper cubic_wage_exper_spline

References

Faraway, J. J. (2015). Linear Models with R (2nd ed.). Boca Raton, FL: Taylor & Francis.

R package: splines link

Python package: statsmodels link, patsy link

Stata package: bspline link