A cubic regression spline is a flexible regression technique used within (generalized) linear models. It is typically represented with a B-spline basis: a set of basis functions defined on intervals between chosen knots. Cubic regression splines are widely used to model nonlinear relationships and interactions between variables. Unlike traditional methods such as polynomial regression or broken-stick regression, a cubic regression spline takes both smoothness and local influence into consideration (Faraway, 2015). According to Julian J. Faraway, each basis function should be a continuous cubic polynomial that is nonzero only on an interval defined by the knots and zero everywhere else, which ensures local influence; in addition, the first and second derivatives of the fitted piecewise polynomial should be continuous at each knot, which ensures overall smoothness (Faraway, 2015). More details on cubic regression splines can be found at link
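To make these two conditions concrete, one explicit way to write a cubic regression spline with interior knots \(k_1 < \dots < k_K\) is the truncated power basis (a standard textbook representation, not the exact basis that R's bs() constructs):
\[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{j=1}^{K} \theta_j (x - k_j)_+^3, \qquad (x - k_j)_+ = \max(0, x - k_j). \]
Each truncated term has value zero and zero first and second derivatives at its own knot, so \(f\), \(f'\) and \(f''\) are continuous everywhere. The B-spline basis used by software such as R's bs() spans the same family of curves but uses basis functions that are nonzero only between a few neighboring knots, which is what gives the local-influence property described above.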
We will provide three analysis examples: R with the package splines, Stata with the package bspline, and Python with the packages statsmodels, sklearn.metrics and patsy. In each example, we first clean the data and remove outliers, fit OLS and polynomial regression models as alternatives, and finally fit cubic regression spline models and compare their goodness of fit to the alternatives.
The results show that the cubic splines fit best compared with OLS and polynomial regression, which benefits from their focus on local influence as stated above. In addition, the fitted cubic spline curves are smooth, and the advantage over the other two methods can be even larger on higher-dimensional data. Given the scale of this tutorial, we focus only on the local-influence advantage of cubic splines.
In this project, we use the dataset uswages from the R package faraway. The dataset can be found at link. uswages contains weekly wage data for US male workers collected from the 1988 Current Population Survey. It has 2000 observations and 10 variables. The variables are described as follows:
kable(var_explain, caption = "Variable description")
Variables | Type | Explanation |
---|---|---|
wage | double | Real weekly wages in dollars |
educ | integer | Years of education |
exper | integer | Years of experience |
race | integer | 1 if Black, 0 if White (other races not in sample) |
smsa | integer | 1 if living in Standard Metropolitan Statistical Area, 0 if not |
ne | integer | 1 if living in the North East |
mw | integer | 1 if living in the Midwest |
so | integer | 1 if living in the South |
we | integer | 1 if living in the West |
pt | integer | 1 if working part time, 0 if not |
In this tutorial, we focus on the relationship between the response variable wage and the predictor variable exper (years of experience). This two-dimensional relationship is easy to plot and lets us illustrate the cubic regression spline method clearly.
Regression splines are easy to fit with the R package splines. To load the data, manipulate it and draw plots, the packages faraway, dplyr and ggplot2 are also needed.
# Load packages and data
library(splines)
library(faraway)
library(ggplot2)
library(dplyr)
data(uswages)
ggplot(uswages, aes(exper, wage))+geom_point(col = "slategrey")+
labs(x = "Experience in year", y = "Weekly wage")+
ggtitle("Relationship between weekly wages and years of experience")
From the plot, we can observe a few outliers with extremely high wages. To limit their influence on the regression models, we remove observations with wage larger than 4000.
# Remove outliers
uswages = filter(uswages, wage<4000)
First, let’s try to capture the relationship with an ordinary least squares (OLS) model.
# Fit an ols model as benchmark
fit1 = lm(wage~exper, data = uswages)
plot(uswages$exper, uswages$wage, xlab = "Experience in years",
ylab = "Weekly wage", main = "OLS model", col = "slategrey")
abline(fit1, col = "red")
From the plot, we can see that the OLS model fails to capture most of the variation in the data.
Polynomial regression is a good alternative in this case. Models with polynomial terms of degree 2 and degree 4 are fitted as follows:
# Fit polynomial regression models with degree 2 and 4
g2 = lm(wage~poly(exper, 2), data = uswages)
g4 = lm(wage~poly(exper, 4), data = uswages)
uswages = mutate(uswages, degree2 = fitted(g2), degree4 = fitted(g4))
ggplot(uswages, aes(exper, wage)) +
labs(x = "Experience in year", y = "Weekly wage")+
geom_point( col = "slategrey") +
geom_line(aes(exper, degree2,color = "2"))+
geom_line(aes(exper, degree4,color = "4")) +
scale_color_manual(values = c(
'2' = 'darkblue',
'4' = 'red')) +
labs(color = 'Polynomial degree')+
ggtitle("Polynomial regression models")
Polynomial regression models are smooth, but they can overfit and every data point affects the fit globally. A cubic regression spline, by splitting the data into subsets at the knots, greatly mitigates this problem. We first use the 25%, 50% and 75% quantiles of experience (8, 15 and 27 years) as the set of knots; a quick check of these quantiles and the resulting spline fit are shown below.
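The quartile knots quoted in the code comment below can be checked directly (a one-line sketch; output omitted, the values are approximately 8, 15 and 27):
# quartiles of experience, used below as the spline knots
quantile(uswages$exper, probs = c(0.25, 0.5, 0.75))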
# Fit regression spline model with chosen knots
# 8, 15 and 27 are the quartiles of exper
cubic_spline = lm(wage~bs(exper, knots = c(8,15,27)), data = uswages)
uswages = mutate(uswages, smooth = fitted(cubic_spline))
ggplot(uswages, aes(exper, wage)) +
labs(x = "Experience in year", y = "Weekly wage")+
geom_point(col = "slategrey") +
geom_line(aes(exper, smooth), col = "red") +
ggtitle("Cubic regression spline model")
summary(cubic_spline)
##
## Call:
## lm(formula = wage ~ bs(exper, knots = c(8, 15, 27)), data = uswages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -718.23 -232.31 -65.92 152.56 2712.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108.70 80.59 1.349 0.1775
## bs(exper, knots = c(8, 15, 27))1 278.89 137.20 2.033 0.0422 *
## bs(exper, knots = c(8, 15, 27))2 371.32 75.70 4.905 1.01e-06 ***
## bs(exper, knots = c(8, 15, 27))3 653.38 97.27 6.717 2.41e-11 ***
## bs(exper, knots = c(8, 15, 27))4 716.96 101.21 7.084 1.94e-12 ***
## bs(exper, knots = c(8, 15, 27))5 392.52 137.69 2.851 0.0044 **
## bs(exper, knots = c(8, 15, 27))6 85.35 152.83 0.558 0.5766
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 381.8 on 1990 degrees of freedom
## Multiple R-squared: 0.1422, Adjusted R-squared: 0.1396
## F-statistic: 54.97 on 6 and 1990 DF, p-value: < 2.2e-16
Alternatively, we can split the range of the predictor experience into four roughly equal intervals and use the breakpoints as the knots, which gives 15, 30 and 45.
# Fit regression spline model with chosen knots
cubic_spline2 = lm(wage~bs(exper, knots = c(15, 30, 45)), data = uswages)
uswages = mutate(uswages, smooth2 = fitted(cubic_spline2))
ggplot(uswages, aes(exper, wage)) +
labs(x = "Experience in year", y = "Weekly wage")+
geom_point(col = "slategrey") +
geom_line(aes(exper, smooth2), col = "red") +
ggtitle("Cubic regression spline model")
summary(cubic_spline2)
##
## Call:
## lm(formula = wage ~ bs(exper, knots = c(15, 30, 45)), data = uswages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -709.67 -231.18 -63.73 151.18 2698.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 177.60 54.36 3.267 0.0011 **
## bs(exper, knots = c(15, 30, 45))1 233.22 96.66 2.413 0.0159 *
## bs(exper, knots = c(15, 30, 45))2 583.20 56.93 10.244 < 2e-16 ***
## bs(exper, knots = c(15, 30, 45))3 618.75 86.19 7.179 9.86e-13 ***
## bs(exper, knots = c(15, 30, 45))4 427.26 88.51 4.827 1.49e-06 ***
## bs(exper, knots = c(15, 30, 45))5 58.48 152.05 0.385 0.7006
## bs(exper, knots = c(15, 30, 45))6 151.68 226.77 0.669 0.5037
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 381.8 on 1990 degrees of freedom
## Multiple R-squared: 0.1419, Adjusted R-squared: 0.1393
## F-statistic: 54.84 on 6 and 1990 DF, p-value: < 2.2e-16
To better compare the goodness of fit of the models, we attach an RMSE (root mean squared error) table below.
# helper: root mean squared error (RMSE) of a fitted model, rounded to 3 digits
mse = function(model){
  rmse = sqrt(mean(model$residuals^2))
  round(rmse, 3)
}
mse_ols = mse(fit1)
mse_poly1 = mse(g2)
mse_poly2 = mse(g4)
mse_cubic = mse(cubic_spline)
mse_cubic2 = mse(cubic_spline2)
model_name = c("OLS", "Polynomial with degree 2", "Polynomial with degree 4", "Cubic Spline 1", "Cubic Spline 2")
mses = c(mse_ols, mse_poly1, mse_poly2, mse_cubic, mse_cubic2)
mse_table = cbind(model_name, mses)
colnames(mse_table) = c("Models", "RMSE")
kable(mse_table, caption = "Root mean squared errors (RMSE) for different models", digits = 3)
Models | RMSE |
---|---|
OLS | 404.744 |
Polynomial with degree 2 | 381.383 |
Polynomial with degree 4 | 381.196 |
Cubic Spline 1 | 381.091 |
Cubic Spline 2 | 381.156 |
From this table, we can see that the cubic regression spline with knots (8, 15, 27) fits best. In general, the cubic splines fit better than the polynomial regressions, which in turn fit better than OLS.
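As an optional extra check (a small sketch, assuming the five fitted model objects above are still in the workspace), AIC can be compared in a single call to see whether the splines' small RMSE advantage survives the penalty for their extra coefficients:
# compare all five models on AIC (lower is better)
AIC(fit1, g2, g4, cubic_spline, cubic_spline2)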
In this Python tutorial, pandas is required for data manipulation, statsmodels for building regression models, patsy for constructing the spline basis, sklearn.metrics for computing errors, and matplotlib for plotting.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from math import sqrt
from sklearn.metrics import mean_squared_error
from patsy import dmatrix
from scipy import interpolate
# read dataset
data = pd.read_csv("uswages.csv")
print(data.head())
## Unnamed: 0 wage educ exper race smsa ne mw so we pt
## 0 6085 771.60 18 18 0 1 1 0 0 0 0
## 1 23701 617.28 15 20 0 1 0 0 0 1 0
## 2 16208 957.83 16 9 0 1 0 0 1 0 0
## 3 2720 617.28 12 24 0 1 1 0 0 0 0
## 4 9723 902.18 14 12 0 1 0 1 0 0 0
We can see that wage and experience have a non-linear relationship.
## exper vs wage ##
data_x = data[['exper']]
data_y = data[['wage']]
#visualize the relationship between experience and wage
plt.scatter(data_x, data_y, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.suptitle('Fig 1. Relationship between experience and wage', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
Looking at Fig 1, there seem to be a few outliers (including a wage above 7000), so we remove them by keeping only observations with wage <= 4000 and then continue with model building.
# remove outlier
data_ylim = data.loc[data['wage']<= 4000]
wage = data_ylim[['wage']]
exper_x = data_ylim[['exper']]
#visualize the relationship between experience and wage
plt.clf()
plt.scatter(exper_x, wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.suptitle('Fig 2. Relationship between experience and wage (remove outlier)', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
First we will add an intercept in our simple linear regression model, then use sm.OLS from statsmodels to do the analysis.
#add an intercept (beta_0) to our model
exper_x = sm.add_constant(exper_x)
# model fitting
model = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions1 = model.predict(exper_x)
Looking at Fig 3, simple linear regression clearly cannot explain the relationship between wage and experience, since the association between the two is non-linear.
# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(exper_x['exper'], predictions1, color = 'green', linewidth = 1.5)
plt.suptitle('Fig 3. Simple linear regression', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
Calculating RMSE value:
rms1 = sqrt(mean_squared_error(wage, predictions1))
print(rms1)
## 404.7442876375701
Next we will try polynomial regression to assess the relationship, setting “experience” to degree = 2. Again, we will use sm.OLS from statsmodels for this analysis.
# refit model using polynomial regression ("exper" with degree = 2)
exper_x['exper2'] = np.power(exper_x['exper'], 2)
# model fitting
model2 = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions2 = model2.predict(exper_x)
The polynomial curve on Fig 4. does explain the non-linear relationship between wage and experience.
# build a grid of 100 evenly spaced experience values for plotting
x_lim = np.linspace(start = exper_x['exper'].min(), stop = exper_x['exper'].max(), num = 100)
x_lim_df = pd.DataFrame({'exper':x_lim})
x_lim_df['exper2'] = np.power(x_lim_df['exper'], 2)
x_lim_df = sm.add_constant(x_lim_df)
# find fitted value using x_lim
fit_reduce = model2.predict(x_lim_df)
# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(x_lim_df[['exper']], fit_reduce, color = 'blue', linewidth = 1.5, label='experience with degree = 2')
plt.legend()
plt.suptitle('Fig 4. Polynomial regression (degree = 2)', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
Calculating RMSE value:
rms2 = sqrt(mean_squared_error(wage, predictions2))
print(rms2)
## 381.38251012666694
Next, we add higher-degree terms of experience to the model.
# "exper" with degree = 4
exper_x['exper3'] = np.power(exper_x['exper'], 3)
exper_x['exper4'] = np.power(exper_x['exper'], 4)
# model fitting
model3 = sm.OLS(wage, exper_x).fit()
# find fitted value
predictions3 = model3.predict(exper_x)
# build a grid of 100 evenly spaced experience values for plotting
x_lim = np.linspace(start = exper_x['exper'].min(), stop = exper_x['exper'].max(), num = 100)
x_lim_df = pd.DataFrame({'exper':x_lim})
x_lim_df['exper2'] = np.power(x_lim_df['exper'], 2)
x_lim_df['exper3'] = np.power(x_lim_df['exper'], 3)
x_lim_df['exper4'] = np.power(x_lim_df['exper'], 4)
x_lim_df = sm.add_constant(x_lim_df)
# find fitted value using x_lim
fit_reduce2 = model3.predict(x_lim_df)
Fig 5 shows the polynomial regression curve with experience terms up to degree 4. The curve looks similar to Fig 4.
# data visualization
plt.clf()
plt.scatter(exper_x['exper'], wage, facecolor = 'None', edgecolor = 'k', alpha = 0.3)
plt.plot(x_lim_df[['exper']], fit_reduce2, color = 'purple', linewidth = 1.5, label='experience with degree up to 4')
plt.legend()
plt.suptitle('Fig 5. Polynomial regression (degree = 4)', fontsize=12)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
Calculating RMSE value:
rms3 = sqrt(mean_squared_error(wage, predictions3))
print(rms3)
## 381.1963530129414
We select 8, 15 and 27 as our knots, which are the 25th, 50th and 75th percentiles of the experience variable, and use sm.GLM from statsmodels to build the cubic regression spline model.
# cubic spline with 3 interior knots at 8, 15, 27
cubic_x1 = dmatrix("bs(data, knots = (8, 15, 27), include_intercept = False)", {"data": exper_x[['exper']]}, return_type = 'dataframe')
# model fitting
model4 = sm.GLM(wage, cubic_x1).fit()
# find fitted value
predictions4 = model4.predict(cubic_x1)
print(model4.summary())
## Generalized Linear Model Regression Results
## ==============================================================================
## Dep. Variable: wage No. Observations: 1997
## Model: GLM Df Residuals: 1990
## Model Family: Gaussian Df Model: 6
## Link Function: identity Scale: 1.4574e+05
## Method: IRLS Log-Likelihood: -14702.
## Date: Fri, 07 Dec 2018 Deviance: 2.9003e+08
## Time: 16:06:46 Pearson chi2: 2.90e+08
## No. Iterations: 3 Covariance Type: nonrobust
## ===========================================================================================================================
## coef std err z P>|z| [0.025 0.975]
## ---------------------------------------------------------------------------------------------------------------------------
## Intercept 108.7029 80.588 1.349 0.177 -49.247 266.653
## bs(data, knots=(8, 15, 27), include_intercept=False)[0] 278.8916 137.196 2.033 0.042 9.992 547.792
## bs(data, knots=(8, 15, 27), include_intercept=False)[1] 371.3220 75.698 4.905 0.000 222.957 519.687
## bs(data, knots=(8, 15, 27), include_intercept=False)[2] 653.3815 97.270 6.717 0.000 462.735 844.028
## bs(data, knots=(8, 15, 27), include_intercept=False)[3] 716.9594 101.208 7.084 0.000 518.595 915.323
## bs(data, knots=(8, 15, 27), include_intercept=False)[4] 392.5233 137.685 2.851 0.004 122.665 662.382
## bs(data, knots=(8, 15, 27), include_intercept=False)[5] 85.3464 152.831 0.558 0.577 -214.197 384.890
## ===========================================================================================================================
# build a grid of 100 evenly spaced experience values for plotting
x_lim = np.linspace(exper_x['exper'].min(), exper_x['exper'].max(), 100)
# find fitted value using x_lim
fit_reduce3 = model4.predict(dmatrix(formula_like = "bs(train, knots = (8, 15, 27), include_intercept = False)", data = {"train": x_lim}, return_type = 'dataframe'))
The spline curve on Fig 6. does explain the non-linear relationship between wage and experience.
# plot spline
plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(x_lim, fit_reduce3, color='r', linewidth = 1.5, label='Specifying 3 knots, knots = (8, 15, 27)')
plt.legend()
plt.suptitle('Fig 6. Cubic regression, knots = (8, 15, 27)', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
Calculating RMSE value:
rms4 = sqrt(mean_squared_error(wage, predictions4))
print(rms4)
## 381.091
Next, we select another three knots, 15, 30 and 45, which split the range of experience into roughly equal intervals, and fit the cubic regression spline again.
# cubic spline with 3 interior knots at 15, 30, 45
cubic_x2 = dmatrix("bs(data, knots = (15, 30, 45), include_intercept = False)", {"data": exper_x[['exper']]}, return_type = 'dataframe')
# model fitting
model5 = sm.GLM(wage, cubic_x2).fit()
# find fitted value
predictions5 = model5.predict(cubic_x2)
print(model5.summary())
## Generalized Linear Model Regression Results
## ==============================================================================
## Dep. Variable: wage No. Observations: 1997
## Model: GLM Df Residuals: 1990
## Model Family: Gaussian Df Model: 6
## Link Function: identity Scale: 1.4579e+05
## Method: IRLS Log-Likelihood: -14702.
## Date: Fri, 07 Dec 2018 Deviance: 2.9012e+08
## Time: 16:06:46 Pearson chi2: 2.90e+08
## No. Iterations: 3 Covariance Type: nonrobust
## ============================================================================================================================
## coef std err z P>|z| [0.025 0.975]
## ----------------------------------------------------------------------------------------------------------------------------
## Intercept 177.5977 54.356 3.267 0.001 71.062 284.134
## bs(data, knots=(15, 30, 45), include_intercept=False)[0] 233.2164 96.665 2.413 0.016 43.757 422.675
## bs(data, knots=(15, 30, 45), include_intercept=False)[1] 583.2001 56.933 10.244 0.000 471.613 694.787
## bs(data, knots=(15, 30, 45), include_intercept=False)[2] 618.7530 86.186 7.179 0.000 449.831 787.675
## bs(data, knots=(15, 30, 45), include_intercept=False)[3] 427.2639 88.511 4.827 0.000 253.785 600.743
## bs(data, knots=(15, 30, 45), include_intercept=False)[4] 58.4825 152.055 0.385 0.701 -239.539 356.504
## bs(data, knots=(15, 30, 45), include_intercept=False)[5] 151.6777 226.770 0.669 0.504 -292.783 596.139
## ============================================================================================================================
# build a grid of 100 evenly spaced experience values for plotting
x_lim = np.linspace(exper_x['exper'].min(), exper_x['exper'].max(), 100)
# find fitted value using x_lim
fit_reduce4 = model5.predict(dmatrix(formula_like = "bs(train, knots = (15, 30, 45), include_intercept = False)", data = {"train": x_lim}, return_type = 'dataframe'))
The spline curve in Fig 7 differs slightly from Fig 6 at higher experience values; this might be due to the smaller number of observations beyond the last knot.
plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(x_lim, fit_reduce4, color='y', linewidth = 1.5, label='Specifying 3 knots, knots = (15, 30, 45)')
plt.legend()
plt.suptitle('Fig 7. Cubic regression, knots = (15, 30, 45)', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
Calculating RMSE value:
rms5 = sqrt(mean_squared_error(wage, predictions5))
print(rms5)
## 381.1563127080508
Looking at Fig 8, we can see that the polynomial and spline curves largely overlap, and their RMSE values are similar, with the splines' RMSE slightly smaller.
# overlay three regression curve
plt.clf()
plt.scatter(exper_x[['exper']], wage, facecolor='None', edgecolor='k', alpha=0.1)
plt.plot(exper_x['exper'], predictions1, color = 'green', linewidth = 1.5, label = 'Simple Linear Regression')
plt.plot(x_lim_df['exper'], fit_reduce, color = 'blue', linewidth = 1.5, label='Polynomial Regression, experience degree = 2')
plt.plot(x_lim_df['exper'], fit_reduce2, color = 'purple', linewidth = 1.5, label='Polynomial Regression, experience up to degree = 4')
plt.plot(x_lim, fit_reduce3, color='r', linewidth = 1.5, label='Cubic Regression, knots = (8, 15, 27)')
plt.plot(x_lim, fit_reduce4, color='y', linewidth = 1.5, label='Cubic Regression, knots = (15, 30, 45)')
plt.legend()
plt.suptitle('Fig 8. Relationship between experience and wage', fontsize=12)
plt.ylim(0, 5000)
plt.xlabel('experience')
plt.ylabel('wage')
plt.show()
Comparing the RMSE values:
# compare mse
model = ['SLR', 'Polynomial (degree = 2)', 'Polynomial (degree = 4)', 'Spline(knots = 8, 15, 27)', 'Spline(knots = 15, 30, 45)']
RMSE = [rms1, rms2, rms3, rms4, rms5]
compare = pd.DataFrame({'Model':model, 'RMSE':RMSE})
print(compare)
## Model RMSE
## 0 SLR 404.744288
## 1 Polynomial (degree = 2) 381.382510
## 2 Polynomial (degree = 4) 381.196353
## 3 Spline(knots = 8, 15, 27) 381.091
## 4 Spline(knots = 15, 30, 45) 381.156313
In this tutorial, Stata is used to perform the analysis. To build the cubic regression spline model, a package called bspline is needed. More information about the package can be found here (https://www.stata-journal.com/sjpdf.html?articlenum=sg151_2).
ssc install bspline
ssc install markstat
ssc install whereis
We start by loading the data and presenting the first 4 observations.
import delimited uswages.csv
list in 1/4
This dataset explores how wages in the U.S. are affected by certain variables. For simplicity, we will regress wage on experience and look at the resulting relationship. First we visualize how these two variables are related. We notice that wage and experience have a non-linear relationship, and that there seem to be a few outliers, so we remove data points where wage is greater than 4000.
import delimited uswages.csv
summarize
summarize wage exper, detail
import delimited uswages.csv
twoway (scatter wage exper), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 1. Relationship between experience and wage") title("Wage vs Exper") name(Scatter)
import delimited uswages.csv
drop if wage > 4000
twoway (scatter wage exper), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 2. Relationship between experience and wage (remove outlier)") title("Wage vs Exper") name(Scatter2)
First we fit an OLS regression. From Fig 3, linear regression does not appear suitable for explaining the relationship between wage and experience. We note from the output that the RMSE value is 404.95.
import delimited uswages.csv
regress wage exper
predict linear_wage_exper
twoway (scatter wage exper)(line linear_wage_exper exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 3. Relationship between experience and wage: using simple linear regression") title("Linear") name(Linear)
Next we perform polynomial regression and fit a quadratic by setting “experience” to degree = 2. The polynomial seems to explain the non-linearity somewhat better than linear regression. We note from the output that the RMSE value is 381.67.
import delimited uswages.csv
regress wage c.exper##c.exper
predict quad_wage_exper
twoway (scatter wage exper)(line quad_wage_exper exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 4. Relationship between experience and wage: using polynomial regression") title("Polynomial: Degree 2") name(Degree2)
Next, we’ll try adding a higher degree of experience to the model.
import delimited uswages.csv
regress wage c.exper##c.exper##c.exper##c.exper
predict degfour_wage_exper
twoway (scatter wage exper)(line degfour_wage_exper exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 5. Relationship between experience and wage: using polynomial regression") title("Polynomial: Degree 4") name(Degree4)
The curve looks similar to Fig 4. Note that the RMSE is again 381.67.
For the cubic regression spline model, we select 8, 15 and 27 as our knots, which are the quartiles (25th, 50th and 75th percentiles) of the experience variable. Here we use the bspline package.
import delimited uswages.csv
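* generate cubic (power(3)) B-spline basis variables for exper with interior knots at 8, 15 and 27;
* the gen() stub produces the numbered variables matched by the cubic_wage_exper_spline* wildcard below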
bspline, xvar(exper) knots(8,15,27) gen(cubic_wage_exper_spline) power(3)
regress wage cubic_wage_exper_spline*
predict cubic_wage_exper_spline
twoway (scatter wage exper)(line cubic_wage_exper_spline exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 6. Relationship between experience and wage: using cubic regression. Knots at (8,15,27)") title("Cubic: Knots at Quartiles") name(Cubic)
The cubic regression spline also seems to model the non-linear relationship better than a linear fit, although it does not look significantly different from the polynomial regressions performed earlier. Note the RMSE value of 382.23.
One may notice that the coefficients produced from the cubic regression in Stata compared to Python and R look quite different. We attempt to explain this in the Discussion section below.
Next, we select evenly spaced knots at 15, 30 and 45 and perform the cubic spline regression again.
import delimited uswages.csv
bspline, xvar(exper) knots(15,30,45) gen(cubic_wage_exper_spline_even) power(3)
regress wage cubic_wage_exper_spline_even*
predict cubic_wage_exper_spline_even
twoway (scatter wage exper)(line cubic_wage_exper_spline_even exper, sort), ytitle(Wage) xtitle(Experience) legend(off) note("Fig 7. Relationship between experience and wage: using cubic regression. Knots at (15,30,45)") title("Cubic: Evenly Spaced Knots") name(Cubic2)
The spline curve from Fig 7. is slightly different from that of Fig 6. at higher values of experience. We suspect this might be due to lower sample size at the last knot. Note the RMSE value of 381.95.
Looking at a plot of all of the curves, we can see that the spline and polynomial curves largely overlap. In Python and R, the splines' RMSE is slightly smaller than the polynomials'; in Stata, however, it is slightly larger. We might still expect the cubic spline fit to be a little more accurate because of the local-influence property provided by the knots.
Model | RMSE |
---|---|
OLS | 404.95 |
Polynomial(Degree = 2) | 381.67 |
Polynomial(Degree = 4) | 381.67 |
Spline(knots = 8,15,27) | 382.23 |
Spline(knots = 15,30,45) | 381.95 |
The coefficients for the cubic spline regression in Stata look quite different from the coefficients from R and Python. We believe this could be due to how the spline basis functions are scaled in the bspline package: if \(y(t) = \sum_i x_i(t)\beta_i\), then equally \(y(t) = \sum_i \frac{x_i(t)}{C} \, (C\beta_i)\) for any constant \(C\), so rescaled basis functions produce rescaled coefficients while leaving the fitted values unchanged. We believe this is what is happening here.
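A small illustration of this point (an R sketch using the plain exper term rather than a spline basis, with the uswages data loaded and filtered as in the R section): rescaling a predictor column rescales its coefficient without changing the fitted values.
# dividing the predictor by C = 10 multiplies its coefficient by 10,
# but the fitted values (and hence the RMSE) are unchanged
fit_raw = lm(wage ~ exper, data = uswages)
fit_scaled = lm(wage ~ I(exper / 10), data = uswages)
all.equal(fitted(fit_raw), fitted(fit_scaled)) # TRUE
coef(fit_scaled)[2] / coef(fit_raw)[2] # 10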
We can try to reproduce the cubic spline regression results with knots at 8, 15 and 27 from scratch using a truncated power basis, and then compare summary statistics of the predictions from the bspline package with those computed from scratch. More information about the rescaling of the basis functions can be found here (https://www.researchgate.net/publication/4794073_BSPLINE_Stata_modules_to_compute_B-splines_parameterized_by_their_values_at_reference_points).
import delimited uswages.csv
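* build a truncated power basis by hand: exper, exper^2, exper^3, plus (exper - k)^3 for exper > k at each knot k = 8, 15, 27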
gen exper2 = exper^2
gen exper3 = exper^3
gen k8 = cond(exper > 8, (exper - 8)^3, 0)
gen k15 = cond(exper > 15, (exper - 15)^3, 0)
gen k27 = cond(exper > 27, (exper - 27)^3, 0)
regress wage exper exper2 exper3 k8 k15 k27
predict cubic_wage_exper
bspline, xvar(exper) knots(8,15,27) gen(cubic_wage_exper_spline) power(3)
regress wage cubic_wage_exper_spline*
predict cubic_wage_exper_spline
sum cubic_wage_exper cubic_wage_exper_spline