statsmodels¶

Stats 507, Fall 2021

James Henderson, PhD
October 12, 2021

Overview¶

  • statsmodels
  • regression
  • OLS
  • Example
  • Formulas
  • Interactions
  • Contrasts
  • Takeaways

statsmodels¶

statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

-- statsmodels documentation

statsmodels¶

  • For general use, the canonical way to import statsmodels is to import the api submodule with the handle sm.
  • You will usually want to import the formula API as well.
In [ ]:
# imports 
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from os.path import exists
from scipy.stats import t
sm.__version__

Regression¶

  • We will focus on using statsmodels for regression problems:
    • linear models using ordinary least squares,
    • generalized linear models.
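  • Generalized linear models use a similar interface; as a minimal sketch (not from these notes), a logistic regression fit on simulated, illustrative data:
In [ ]:
# sketch only: a GLM (logistic regression) fit on simulated data;
# the data frame `sim` and its variables are illustrative
rng = np.random.default_rng(42)
sim = pd.DataFrame({'x': rng.normal(size=100)})
sim['y'] = rng.binomial(1, 1 / (1 + np.exp(-sim['x'])))
smf.glm('y ~ x', data=sim, family=sm.families.Binomial()).fit().summary()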

Regression¶

  • A regression problem is one in which our focus is on the conditional mean of a dependent or endogenous variable ($Y$) ...
  • ... given some set of independent or exogenous variables ($X$).
  • statsmodels refers to these as endog and exog, respectively.
  • This has the form: $$ \mathbb{E}[Y|X = x] = f(x; \beta). $$

Linear Regression¶

  • In linear regression using ordinary least squares (OLS), $f(x; \beta)$ is linear in $\beta$ and $Y|X$ is (usually) assumed to be Gaussian with covariance $\Sigma = \sigma^2 I_{n \times n}$.
$$ \mathbb{E}[Y|X = x] = x\beta = \sum_{j=1}^{p} x_j \times \beta_j, \qquad Y | X \sim N(X\beta, \sigma^2 I). $$
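  • For reference, the standard OLS estimates (stated here without derivation) are: $$ \hat{\beta} = (X^TX)^{-1}X^TY, \qquad \widehat{\textrm{Var}}(\hat{\beta}) = \hat{\sigma}^2(X^TX)^{-1}. $$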

Design Matrix¶

  • The design matrix $X$ forms a basis for the conditional mean and is determined by our independent variables (IVs).
  • We assume $X \in \mathbb{R}^{n \times p}$ where $n$ is the number of samples (rows) and $p$ the dimension of the basis (columns).
  • The formula interface provided by statsmodels (using patsy) allows for a concise, flexible way to construct $X$ from a data frame.
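  • A quick illustration using patsy directly (the toy data frame below is illustrative, not part of these notes):
In [ ]:
# sketch: construct a design matrix with patsy's dmatrix();
# note the intercept and the treatment coding of the categorical column
from patsy import dmatrix
toy = pd.DataFrame({'a': [1., 2., 3.], 'g': ['x', 'y', 'x']})
dmatrix('a + g', data=toy)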

Example - ToothGrowth¶

  • Let's revisit the ToothGrowth data from the Resampling Methods notes.
In [ ]:
# tooth growth data
file = 'tooth_growth.feather'
if exists(file):
    tg_data = pd.read_feather(file)
else: 
    tooth_growth = sm.datasets.get_rdataset('ToothGrowth')
    #print(tooth_growth.__doc__)
    tg_data = tooth_growth.data
    tg_data.to_feather(file)

(tg_data
 .groupby(['dose', 'supp'])
 .apply(lambda gdf: gdf.iloc[0:2, :])
)

Example - endog, exog¶

  • To begin, let's compare supplementation methods within each dose separately.
  • To be concrete, we'll start with the dose of 0.5 mg/day of Vitamin C.
  • Here, we create the outcome and design matrices directly.
In [ ]:
tg_05 = tg_data.query('dose == 0.5')
Y = tg_05['len']
X = pd.get_dummies(tg_05['supp'])['OJ']
X = sm.add_constant(X) 
(Y, X)

Example - OLS¶

  • Now we can set up the model object using sm.OLS() ...
  • ... and then call its .fit() method to estimate the parameters.
  • The summary() method prints key information.
In [ ]:
mod0_05 = sm.OLS(Y, X)
res0_05 = mod0_05.fit()
res0_05.summary()

Example - OLS Attributes¶

  • The results object res0_05 contains much of the summary information in attributes (properties).
  • Here we explore those properties while reviewing the relationships in the coefficient table.
In [ ]:
print(type(res0_05))
print((res0_05.rsquared_adj, res0_05.aic))
(res0_05.params, res0_05.tvalues)
#res0_05.  # use tab complete to see methods
?res0_05.conf_int

Example - Coefficient table¶

  • Let's review the coefficient table from the summary results and construct it using attributes and methods from the results object.
In [ ]:
b = res0_05.params                     # estimated parameters
v = res0_05.cov_params()               # variance-covariance
se = np.sqrt(np.diag(v))               # std errors
t_stats = res0_05.params / se          # t statistics
assert all(t_stats == res0_05.tvalues) # same as stored in object     

df = res0_05.df_resid                  # degrees of freedom
p = 2 * (1 - t.cdf(np.abs(t_stats), df=df))  # two-sided p-values
assert all(np.round(p, 6) == np.round(res0_05.pvalues, 6))

tt = t.ppf(.975, df=df)                # multiplier
lwr, upr = b - tt * se, b + tt * se    # CI bounds
ci = res0_05.conf_int()
ci['lwr'], ci['upr'] = lwr, upr
ci

Example - OLS Methods¶

  • The results object res0_05 also has a number of useful methods.
  • On the previous slides, we used the .summary(), .cov_params(), and .conf_int() methods.
  • You can use the .save() method and the sm.load() function to store results on disk (as pickle).
In [ ]:
file_name = 'res0_05.pickle'
# notebook needs to be trusted
res0_05.save(file_name)
del res0_05
res0_05 = sm.load(file_name)
res0_05.summary2()

Example - Diagnostics¶

  • Let's use the .predict() method and then create some plots.
  • We'll also use the .add_gridspec() method of our figure object.
In [ ]:
y_hat, r = res0_05.predict(), res0_05.resid
b0, b1 = res0_05.params
col = [('darkgreen', 'orange')[x] for x in X['OJ'].values]

fig0 = plt.figure(tight_layout=True)
gs = fig0.add_gridspec(2, 2)

f0_ax0 = fig0.add_subplot(gs[0, 0])
_ = f0_ax0.set_title('Observed ~ Predicted')
_ = f0_ax0.scatter(x=y_hat, y=Y, color=col)

f0_ax1 = fig0.add_subplot(gs[0, 1])
_ = f0_ax1.set_title('Residual ~ Predicted')
_ = f0_ax1.scatter(x=y_hat, y=r, color=col)

f0_ax2 = fig0.add_subplot(gs[1, :])
_ = f0_ax2.set_title('QQ Plot')
_ = sm.qqplot(ax=f0_ax2, data=r, line='s')

Example¶

  • We can compare the ratio of geometric means by regressing the log outcome on supplement type: a difference in mean log values exponentiates to a ratio of geometric means.
  • This is almost identical to our t-test analysis from the Resampling Methods notes, except we use a pooled estimate of the (residual) variance.
In [ ]:
tg_data['log_len'] = tg_data[['len']].transform(np.log)
tg_data['OJ'] = pd.get_dummies(tg_data['supp'])['OJ']
doses = [0.5, 1.0, 2.0]
cis = []
for dose in doses:
    # subset dose
    dat = tg_data.query('dose == @dose')
    
    # fit OLS model
    Y, X = dat['log_len'], sm.add_constant(dat['OJ'])
    m = sm.OLS(Y, X)
    res = m.fit()
    ci = res.conf_int()
    
    # format confidence interval
    ci.columns = ('lwr', 'upr')
    ci['est'] = res.params
    ci.index.name = 'term'
    ci = ci.transform(np.exp)
    ci['CI'] = (
        ci.groupby('term').apply(lambda gdf: 
                         '{0:4.2f} ({1:4.2f}, {2:4.2f})'.format(
                             gdf['est'].values[0],
                             gdf['lwr'].values[0], 
                             gdf['upr'].values[0]
                         )
                        )
    )
    cis.append(ci.loc['OJ', 'CI'])
pd.DataFrame({'dose': doses, 'Ratio of Means (95% CI)': cis})

Formula Method¶

  • Formulas are a convenient way to construct the design matrix $X$ from a dataset.
  • Formulas are not only convenient, but also concise and expressive, allowing us to focus on the modeling and iterate quickly without being distracted by setup.
  • In a formula string, a tilde ~ separates the DV(s) from the IVs.
In [ ]:
mod1 = sm.OLS.from_formula('log_len ~ OJ + dose', data=tg_data)
res1 = mod1.fit()
res1.summary2()

Design Inspection¶

  • If you're ever unsure about how a formula is interpreted by statsmodels, take a look at (select) rows of the design matrix.
  • The model object has endog and exog attributes, along with the corresponding endog_names and exog_names.
In [ ]:
X = mod1.exog
print(mod1.exog_names)
X

Formula API¶

  • The formula API was imported as smf.
  • This allows us to call the previous model and others more concisely.
  • Use smf.ols() with a formula when fitting linear regression models.
In [ ]:
# smf.
mod1 = smf.ols('log_len ~ OJ + dose', data=tg_data)
res1 = mod1.fit()
res1.summary2()

Interactions¶

  • Interaction terms in the design matrix are formed by multiplying values from two (or more) variables together.
  • Here we specify an interaction between the indicator OJ and dose treated as a continuous variable.
In [ ]:
mod2 = sm.OLS.from_formula('log_len ~ OJ:dose', data=tg_data)
res2 = mod2.fit()
res2.summary2()

Interaction Inspection¶

  • What do you expect X to look like below? How many zeros will it have?
  • Note also how the interaction is named.
In [ ]:
X = mod2.exog
print(mod2.exog_names)
X

Main Effects & Interactions¶

  • We rarely use interactions without also including the associated main effects for the variables that make up the interaction.
  • a + b + a:b can be concisely written as a*b.
In [ ]:
mod3 = smf.ols('log_len ~ OJ*dose', data=tg_data)
res3 = mod3.fit()
res3.summary2()

Interaction Inspection¶

  • How many columns do you expect X to have?
  • What are their names?
In [ ]:
#mod3 = smf.ols('log_len ~ OJ*dose', data=tg_data)
X = mod3.exog
print(mod3.exog_names)
X

Categorical Interactions¶

  • Let's run an analysis with the same mean structure as our dose-by-dose analysis.
  • Having the same mean structure means the models have the same predictions or fitted values, but not (necessarily) the same variance structure.
  • We can let each supplement and dose combination have its own mean by making dose categorical and interacting this with supp (or OJ).
In [ ]:
tg_data['dose_cat'] = pd.Categorical(tg_data['dose'])
mod4 = smf.ols('log_len ~ OJ*dose_cat', data=tg_data)
res4 = mod4.fit()
res4.summary2()

Interaction Inspection¶

  • How many columns do you expect X to have?
  • What are their names?
In [ ]:
#mod4 = smf.ols('log_len ~ OJ*dose_cat', data=tg_data)
X = mod4.exog
print(mod4.exog_names)
X

AIC¶

  • The Akaike Information Criterion (AIC) is one way to compare competing models fit to the same data.
  • The AIC trades off model fit against model complexity: it is -2 times the log-likelihood (a.k.a. the deviance) plus 2 times the number of parameters (columns in $X$): $$ \textrm{AIC} = -2\ell(\hat{\beta}) + 2p. $$
  • The log-likelihood is a method of the model object.
In [ ]:
print(np.round(-2 * mod4.loglike(res4.params) + 2 * len(res4.params), 2))
[np.round(r.aic, 1) for r in [res1, res2, res3, res4]]

Contrasts¶

  • Linear combinations of regression coefficients are known as contrasts.
  • We often express contrasts using matrix notation -- each row of L below specifies a contrast.
In [ ]:
b = res4.params
v = res4.cov_params()
L = np.zeros((3, len(b)))
L[0, 3] = 1      # OJ
L[1, (3, 4)] = 1 # OJ + OJ:dose_cat[T.1.0]
L[2, (3, 5)] = 1 # OJ + OJ:dose_cat[T.2.0]
est_log = np.dot(L, b)                 # contrasts on the log scale
se = np.sqrt(np.diag(np.dot(np.dot(L, v), L.T)))
tt = t.ppf(.975, df=res4.df_resid)     # multiplier
# form the CI on the log scale, then exponentiate
est = np.exp(est_log)
lwr, upr = np.exp(est_log - tt * se), np.exp(est_log + tt * se)
np.round(np.array([est, lwr, upr]).T, 2)

Contrasts¶

  • Contrasts can also be estimated using the .t_test() method of the results object.
  • In models with interactions, we need contrasts to estimate effects (here, the OJ effect) within the non-reference groups.
In [ ]:
contrasts = res4.t_test(L)
print(np.round(np.exp(contrasts.effect), 2))
contrasts

Contrasts¶

  • Setting up the contrasts matrix L takes care.
  • The .t_test() method also accepts a string with contrasts specified using the parameter (variable) names.
In [ ]:
contrasts = res4.t_test(
    'OJ = 0, OJ + OJ:dose_cat[T.1.0] = 0, OJ + OJ:dose_cat[T.2.0] = 0'
)
#contrasts
#contrasts.summary()
est_ci4 = np.zeros((3, 3))
est_ci4[:, 0] = np.exp(contrasts.effect)
est_ci4[:, 1:3] = np.exp(contrasts.conf_int())
np.round(est_ci4, 2)

Takeaways¶

  • Regression is among the most important techniques in the data science toolkit.
  • OLS and many related (statistical) regression models can be fit using statsmodels.
  • The formula API allows for flexible and concise specification of the design matrix.
  • In a formula, : indicates an interaction and a*b = a + b + a:b.
  • Estimate contrasts using the .t_test() method.