Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1089.
Date:                Sat, 05 Feb 2022   Prob (F-statistic):           1.03e-42
Time:                        17:46:33   Log-Likelihood:                 5.4949
No. Observations:                  50   AIC:                            -2.990
Df Residuals:                      46   BIC:                             4.658
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0174      0.077     65.130      0.000       4.862       5.172
x1             0.4901      0.012     41.248      0.000       0.466       0.514
x2             0.4574      0.047      9.793      0.000       0.363       0.551
x3            -0.0192      0.001    -18.371      0.000      -0.021      -0.017
==============================================================================
Omnibus:                        3.229   Durbin-Watson:                   2.204
Prob(Omnibus):                  0.199   Jarque-Bera (JB):                2.302
Skew:                           0.313   Prob(JB):                        0.316
Kurtosis:                       3.845   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.53825052  4.99484664  5.41523057  5.77447594  6.05665219  6.25744189
  6.38485011  6.45788915  6.50345495  6.55190811  6.63208574  6.76656363
  6.96794708  7.23679988  7.56155172  7.92039913  8.28488773  8.62458721
  8.91209095  9.12751883  9.2617835   9.3180834   9.31137724  9.26592637
  9.21130868  9.17755905  9.19023467  9.26621612  9.41093391  9.61747625
  9.86772466 10.13533078 10.39004533 10.60268796 10.74994153 10.81818305
 10.80572187 10.72307747 10.59125113 10.4382755  10.29460888 10.18813103
 10.13956262 10.15906084 10.24455063 10.38206561 10.5480427  10.71319349
 10.84731622 10.92425784]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.91540335 10.78534125 10.55200792 10.25627814  9.95195747  9.69260882
  9.51843834  9.44645158  9.46628993  9.54276676]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f82a592eee0>
../../../_images/examples_notebooks_generated_predict_12_1.png

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2

[9]:
res.params
[9]:
Intercept           5.017353
x1                  0.490064
np.sin(x1)          0.457372
I((x1 - 5) ** 2)   -0.019164
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.915403
1    10.785341
2    10.552008
3    10.256278
4     9.951957
5     9.692609
6     9.518438
7     9.446452
8     9.466290
9     9.542767
dtype: float64