Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.988
Model: OLS Adj. R-squared: 0.987
Method: Least Squares F-statistic: 1287.
Date: Mon, 24 Feb 2020 Prob (F-statistic): 2.32e-44
Time: 22:49:06 Log-Likelihood: 10.116
No. Observations: 50 AIC: -12.23
Df Residuals: 46 BIC: -4.584
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 5.0074 0.070 71.295 0.000 4.866 5.149
x1 0.4960 0.011 45.791 0.000 0.474 0.518
x2 0.5148 0.043 12.089 0.000 0.429 0.600
x3 -0.0202 0.001 -21.285 0.000 -0.022 -0.018
==============================================================================
Omnibus: 0.999 Durbin-Watson: 2.043
Prob(Omnibus): 0.607 Jarque-Bera (JB): 0.631
Skew: 0.274 Prob(JB): 0.730
Kurtosis: 3.041 Cond. No. 221.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.50134616 4.98737075 5.43308135 5.81042435 6.10147053 6.30136081
6.41910454 6.47609935 6.50261552 6.53282271 6.59917592 6.72708331
6.93073163 7.21075541 7.55413274 7.93632467 8.32530693 8.68683143
8.99005311 9.21259769 9.34423772 9.38857291 9.36243871 9.29314028
9.21396636 9.15872 9.15616482 9.22529923 9.37223503 9.58919331
9.85578243 10.14234835 10.41484672 10.64043624 10.79287438 10.85682901
10.83039738 10.72541873 10.56552919 10.38227835 10.20994584 10.07990921
10.01548874 10.02811594 10.11545543 10.26178859 10.44059613 10.6189151
10.76275462 10.84268049]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.82514732 10.6718622 10.40301182 10.06459909 9.71718005 9.42103759
9.22142214 9.13747276 9.15753126 9.2419965 ]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier
[8]:
from statsmodels.formula.api import ols
data = {"x1" : x1, "y" : y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2
[9]:
res.params
[9]:
Intercept 5.007438
x1 0.496007
np.sin(x1) 0.514754
I((x1 - 5) ** 2) -0.020244
dtype: float64
Now we only have to pass the single variable and we get the transformed right-hand side variables automatically
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.825147
1 10.671862
2 10.403012
3 10.064599
4 9.717180
5 9.421038
6 9.221422
7 9.137473
8 9.157531
9 9.241996
dtype: float64