Robust Linear Models¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
Estimation¶
Load data:
[3]:
data = sm.datasets.stackloss.load(as_pandas=False)
data.exog = sm.add_constant(data.exog)
Huber’s T norm with the (default) median absolute deviation scaling
[4]:
huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
hub_results = huber_t.fit()
print(hub_results.params)
print(hub_results.bse)
print(hub_results.summary(yname='y',
xname=['var_%d' % i for i in range(len(hub_results.params))]))
[-41.02649835 0.82938433 0.92606597 -0.12784672]
[9.79189854 0.11100521 0.30293016 0.12864961]
Robust linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 21
Model: RLM Df Residuals: 17
Method: IRLS Df Model: 3
Norm: HuberT
Scale Est.: mad
Cov Type: H1
Date: Mon, 24 Feb 2020
Time: 22:49:06
No. Iterations: 19
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
var_0 -41.0265 9.792 -4.190 0.000 -60.218 -21.835
var_1 0.8294 0.111 7.472 0.000 0.612 1.047
var_2 0.9261 0.303 3.057 0.002 0.332 1.520
var_3 -0.1278 0.129 -0.994 0.320 -0.380 0.124
==============================================================================
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .
Huber’s T norm with ‘H2’ covariance matrix
[5]:
hub_results2 = huber_t.fit(cov="H2")
print(hub_results2.params)
print(hub_results2.bse)
[-41.02649835 0.82938433 0.92606597 -0.12784672]
[9.08950419 0.11945975 0.32235497 0.11796313]
Andrew’s Wave norm with Huber’s Proposal 2 scaling and ‘H3’ covariance matrix
[6]:
andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
print('Parameters: ', andrew_results.params)
Parameters: [-40.8817957 0.79276138 1.04857556 -0.13360865]
See help(sm.RLM.fit) for more options and module sm.robust.scale for scale options
Comparing OLS and RLM¶
Artificial data with outliers:
[7]:
nsample = 50
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, (x1-5)**2))
X = sm.add_constant(X)
sig = 0.3 # smaller error variance makes OLS<->RLM contrast bigger
beta = [5, 0.5, -0.0]
y_true2 = np.dot(X, beta)
y2 = y_true2 + sig*1. * np.random.normal(size=nsample)
y2[[39,41,43,45,48]] -= 5 # add some outliers (10% of nsample)
Example 1: quadratic function with linear truth¶
Note that the quadratic term in OLS regression will capture outlier effects.
[8]:
res = sm.OLS(y2, X).fit()
print(res.params)
print(res.bse)
print(res.predict())
[ 4.98937229 0.53607174 -0.01435181]
[0.4821492 0.07443733 0.00658655]
[ 4.63057699 4.90556963 5.17578033 5.44120909 5.7018559 5.95772077
6.20880369 6.45510467 6.6966237 6.93336079 7.16531593 7.39248913
7.61488038 7.83248969 8.04531705 8.25336247 8.45662594 8.65510747
8.84880705 9.03772469 9.22186038 9.40121413 9.57578593 9.74557579
9.91058371 10.07080968 10.2262537 10.37691578 10.52279591 10.6638941
10.80021034 10.93174464 11.058497 11.18046741 11.29765587 11.41006239
11.51768697 11.6205296 11.71859028 11.81186902 11.90036582 11.98408067
12.06301357 12.13716453 12.20653355 12.27112062 12.33092574 12.38594892
12.43619016 12.48164945]
Estimate RLM:
[9]:
resrlm = sm.RLM(y2, X).fit()
print(resrlm.params)
print(resrlm.bse)
[ 4.93135487e+00 5.15380256e-01 -2.30780247e-03]
[0.12270896 0.01894461 0.00167631]
Draw a plot to compare OLS estimates to the robust estimates:
[10]:
fig = plt.figure(figsize=(12,8))
ax = fig.add_subplot(111)
ax.plot(x1, y2, 'o',label="data")
ax.plot(x1, y_true2, 'b-', label="True")
prstd, iv_l, iv_u = wls_prediction_std(res)
ax.plot(x1, res.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm.fittedvalues, 'g.-', label="RLM")
ax.legend(loc="best")
[10]:
<matplotlib.legend.Legend at 0x7f36320ea510>
Example 2: linear function with linear truth¶
Fit a new OLS model using only the linear term and the constant:
[11]:
X2 = X[:,[0,1]]
res2 = sm.OLS(y2, X2).fit()
print(res2.params)
print(res2.bse)
[5.56783819 0.39255362]
[0.41789611 0.03600762]
Estimate RLM:
[12]:
resrlm2 = sm.RLM(y2, X2).fit()
print(resrlm2.params)
print(resrlm2.bse)
[5.00775204 0.49586454]
[0.10336252 0.00890613]
Draw a plot to compare OLS estimates to the robust estimates:
[13]:
prstd, iv_l, iv_u = wls_prediction_std(res2)
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x1, y2, 'o', label="data")
ax.plot(x1, y_true2, 'b-', label="True")
ax.plot(x1, res2.fittedvalues, 'r-', label="OLS")
ax.plot(x1, iv_u, 'r--')
ax.plot(x1, iv_l, 'r--')
ax.plot(x1, resrlm2.fittedvalues, 'g.-', label="RLM")
legend = ax.legend(loc="best")