Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.987
Method:                 Least Squares   F-statistic:                     1287.
Date:                Mon, 24 Feb 2020   Prob (F-statistic):           2.32e-44
Time:                        22:49:06   Log-Likelihood:                 10.116
No. Observations:                  50   AIC:                            -12.23
Df Residuals:                      46   BIC:                            -4.584
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0074      0.070     71.295      0.000       4.866       5.149
x1             0.4960      0.011     45.791      0.000       0.474       0.518
x2             0.5148      0.043     12.089      0.000       0.429       0.600
x3            -0.0202      0.001    -21.285      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        0.999   Durbin-Watson:                   2.043
Prob(Omnibus):                  0.607   Jarque-Bera (JB):                0.631
Skew:                           0.274   Prob(JB):                        0.730
Kurtosis:                       3.041   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.50134616  4.98737075  5.43308135  5.81042435  6.10147053  6.30136081
  6.41910454  6.47609935  6.50261552  6.53282271  6.59917592  6.72708331
  6.93073163  7.21075541  7.55413274  7.93632467  8.32530693  8.68683143
  8.99005311  9.21259769  9.34423772  9.38857291  9.36243871  9.29314028
  9.21396636  9.15872     9.15616482  9.22529923  9.37223503  9.58919331
  9.85578243 10.14234835 10.41484672 10.64043624 10.79287438 10.85682901
 10.83039738 10.72541873 10.56552919 10.38227835 10.20994584 10.07990921
 10.01548874 10.02811594 10.11545543 10.26178859 10.44059613 10.6189151
 10.76275462 10.84268049]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred =  olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.82514732 10.6718622  10.40301182 10.06459909  9.71718005  9.42103759
  9.22142214  9.13747276  9.15753126  9.2419965 ]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
../../../_images/examples_notebooks_generated_predict_12_0.png

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

[8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2

[9]:
res.params
[9]:
Intercept           5.007438
x1                  0.496007
np.sin(x1)          0.514754
I((x1 - 5) ** 2)   -0.020244
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.825147
1    10.671862
2    10.403012
3    10.064599
4     9.717180
5     9.421038
6     9.221422
7     9.137473
8     9.157531
9     9.241996
dtype: float64