Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     881.0
Date:                Sun, 10 Nov 2019   Prob (F-statistic):           1.26e-40
Time:                        16:32:41   Log-Likelihood:                -1.3475
No. Observations:                  50   AIC:                             10.69
Df Residuals:                      46   BIC:                             18.34
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9323      0.088     55.837      0.000       4.754       5.110
x1             0.5048      0.014     37.057      0.000       0.477       0.532
x2             0.5422      0.054     10.125      0.000       0.434       0.650
x3            -0.0197      0.001    -16.478      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        1.967   Durbin-Watson:                   2.100
Prob(Omnibus):                  0.374   Jarque-Bera (JB):                1.114
Skew:                          -0.286   Prob(JB):                        0.573
Kurtosis:                       3.456   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.4395271   4.93798172  5.39450724  5.7795518   6.07422864  6.27341917
  6.38661392  6.43635326  6.45452412  6.47712102  6.53833229  6.66492302
  6.87183764  7.15974457  7.51492643  7.91153375  8.31583195  8.69174388
  9.00677715  9.23736256  9.37272643  9.41666073  9.38690002  9.31220775
  9.22765049  9.16883657  9.16606578  9.23935117  9.39513096  9.62521077
  9.90810993 10.2125907  10.50279052 10.74411415 10.90891757 10.9810498
 10.95850646 10.8537591  10.69170616 10.50558218 10.33149755 10.20250558
 10.14317199 10.16553882 10.26714595 10.43143531 10.63047129 10.8295305
 10.99280676 11.08929387]

Create a new sample of explanatory variables Xnew, predict and plot

In [6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.08661593 10.9417432  10.67594049 10.33766755  9.99071449  9.69858371
  9.50894229  9.44195146  9.48533038  9.59736296]

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we don't want any expansion magic from using **2

In [9]:
res.params
Out[9]:
Intercept           4.932288
x1                  0.504841
np.sin(x1)          0.542245
I((x1 - 5) ** 2)   -0.019710
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    11.086616
1    10.941743
2    10.675940
3    10.337668
4     9.990714
5     9.698584
6     9.508942
7     9.441951
8     9.485330
9     9.597363
dtype: float64