Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.980
Model: OLS Adj. R-squared: 0.979
Method: Least Squares F-statistic: 766.8
Date: Fri, 17 Dec 2021 Prob (F-statistic): 2.88e-39
Time: 22:32:59 Log-Likelihood: -4.0068
No. Observations: 50 AIC: 16.01
Df Residuals: 46 BIC: 23.66
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.9820 0.093 53.478 0.000 4.794 5.169
x1 0.5041 0.014 35.087 0.000 0.475 0.533
x2 0.4584 0.056 8.116 0.000 0.345 0.572
x3 -0.0203 0.001 -16.072 0.000 -0.023 -0.018
==============================================================================
Omnibus: 6.049 Durbin-Watson: 2.578
Prob(Omnibus): 0.049 Jarque-Bera (JB): 4.913
Skew: -0.676 Prob(JB): 0.0857
Kurtosis: 3.729 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.47510987 4.94218886 5.37261943 5.74141981 6.032624 6.24190495
6.37728554 6.45782037 6.51046519 6.56564812 6.65227042 6.79295807
7.00034441 7.2749945 7.60531228 7.96944579 8.33887738 8.68310904
8.97467321 9.19364555 9.33091849 9.38969759 9.38497485 9.34106555
9.28761337 9.25471999 9.26799946 9.34436988 9.48927385 9.6957842
9.94574217 10.21274078 10.46646347 10.67766501 10.8229764 10.88874433
10.87327441 10.7871095 10.6512976 10.49393371 10.34554386 10.23406983
10.18027836 10.19434924 10.27420279 10.40584158 10.56565011 10.72427483
10.85144681 10.92095485]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.90228018 10.7615946 10.5168744 10.20908523 9.89215235 9.61975778
9.4321971 9.34651369 9.35232588 9.41436881]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7ff47d4090d0>
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2
[9]:
res.params
[9]:
Intercept 4.981965
x1 0.504110
np.sin(x1) 0.458389
I((x1 - 5) ** 2) -0.020274
dtype: float64
Now we only have to pass the single variable and we get the transformed right-hand side variables automatically
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.902280
1 10.761595
2 10.516874
3 10.209085
4 9.892152
5 9.619758
6 9.432197
7 9.346514
8 9.352326
9 9.414369
dtype: float64