Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.986
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 1089.
Date: Sat, 05 Feb 2022 Prob (F-statistic): 1.03e-42
Time: 17:46:33 Log-Likelihood: 5.4949
No. Observations: 50 AIC: -2.990
Df Residuals: 46 BIC: 4.658
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.0174 0.077 65.130 0.000 4.862 5.172
x1 0.4901 0.012 41.248 0.000 0.466 0.514
x2 0.4574 0.047 9.793 0.000 0.363 0.551
x3 -0.0192 0.001 -18.371 0.000 -0.021 -0.017
==============================================================================
Omnibus: 3.229 Durbin-Watson: 2.204
Prob(Omnibus): 0.199 Jarque-Bera (JB): 2.302
Skew: 0.313 Prob(JB): 0.316
Kurtosis: 3.845 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.53825052 4.99484664 5.41523057 5.77447594 6.05665219 6.25744189
6.38485011 6.45788915 6.50345495 6.55190811 6.63208574 6.76656363
6.96794708 7.23679988 7.56155172 7.92039913 8.28488773 8.62458721
8.91209095 9.12751883 9.2617835 9.3180834 9.31137724 9.26592637
9.21130868 9.17755905 9.19023467 9.26621612 9.41093391 9.61747625
9.86772466 10.13533078 10.39004533 10.60268796 10.74994153 10.81818305
10.80572187 10.72307747 10.59125113 10.4382755 10.29460888 10.18813103
10.13956262 10.15906084 10.24455063 10.38206561 10.5480427 10.71319349
10.84731622 10.92425784]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.91540335 10.78534125 10.55200792 10.25627814 9.95195747 9.69260882
9.51843834 9.44645158 9.46628993 9.54276676]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f82a592eee0>
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2
[9]:
res.params
[9]:
Intercept 5.017353
x1 0.490064
np.sin(x1) 0.457372
I((x1 - 5) ** 2) -0.019164
dtype: float64
Now we only have to pass the single variable and we get the transformed right-hand side variables automatically
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.915403
1 10.785341
2 10.552008
3 10.256278
4 9.951957
5 9.692609
6 9.518438
7 9.446452
8 9.466290
9 9.542767
dtype: float64