Linear regression models with robust parameter estimation

May 15th, 2010

There are situations in regression modelling where robust methods are worth considering to handle unusual observations that do not follow the general trend of the data. Various R packages provide robust statistical methods; these are summarised on the CRAN Robust Task View.

As an example of robust estimation in a linear regression framework, consider the CPUs data used in previous posts on linear regression and variable selection. For these data we fit a model with six explanatory variables, first by least squares and then with a fast MM-estimator from the robustbase package.

The first step is to make the functions and data available for analysis:

require(MASS)        # provides the cpus data set
require(robustbase)  # provides lmrob and lmrob.control
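
Before fitting anything, it is worth a quick look at the variables that enter the models. This check is an aside rather than part of the original analysis, and uses only base R:

# Inspect the response and the six explanatory variables
str(cpus[, c("perf", "syct", "mmin", "mmax", "cach", "chmin", "chmax")])
summary(cpus$perf)  # perf, the published performance measure, is strongly right-skewed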

The linear model using least squares is fitted as follows:

> cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)

The summary for this model:

> summary(cpu.mod1)
 
Call:
lm(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-195.841  -25.169    5.409   26.528  385.749 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.590e+01  8.045e+00  -6.948 4.99e-11 ***
syct         4.886e-02  1.752e-02   2.789  0.00579 ** 
mmin         1.529e-02  1.827e-03   8.371 9.42e-15 ***
mmax         5.571e-03  6.418e-04   8.680 1.33e-15 ***
cach         6.412e-01  1.396e-01   4.594 7.64e-06 ***
chmin       -2.701e-01  8.557e-01  -0.316  0.75263    
chmax        1.483e+00  2.201e-01   6.738 1.64e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Residual standard error: 59.99 on 202 degrees of freedom
Multiple R-squared: 0.8649,     Adjusted R-squared: 0.8609 
F-statistic: 215.5 on 6 and 202 DF,  p-value: < 2.2e-16
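
The least squares residuals already hint at why a robust fit may be worthwhile here: a few machines sit far from the bulk of the data. A minimal diagnostic sketch using base graphics (an aside, not shown in the original post):

# Residuals against fitted values for the least squares model; the extreme
# points are the unusual observations that can dominate the fit
plot(fitted(cpu.mod1), resid(cpu.mod1),
  xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)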

The linear model using an MM-estimator is fitted as follows:

cpu.robmod1 = lmrob(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, control = lmrob.control(max.it = 100))

Note that we need to increase the maximum number of iterations from the default of 50 so that the IRWLS routine converges to a solution. The summary for this model:
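
One way to confirm the default is to inspect the list of settings returned by lmrob.control:

> lmrob.control()$max.it
[1] 50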

> summary(cpu.robmod1)
 
Call:
lmrob(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus, control = lmrob.control(max.it = 100))
 
Weighted Residuals:
      Min        1Q    Median        3Q       Max 
-144.0045   -9.4554    0.7691   13.3757  759.6953 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.6634297  6.1697645  -0.594 0.553329    
syct         0.0063112  0.0043877   1.438 0.151868    
mmin         0.0098798  0.0034726   2.845 0.004897 ** 
mmax         0.0024463  0.0004525   5.407 1.80e-07 ***
cach         0.8702102  0.2551245   3.411 0.000782 ***
chmin        2.4078436  1.3319413   1.808 0.072130 .  
chmax        0.1016861  0.1494902   0.680 0.497145    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Robust residual standard error: 19.27 
Convergence in 65 IRWLS iterations
 
Robustness weights: 
 15 observations c(1,8,9,10,31,32,96,97,98,153,154,156,169,199,200) are outliers with |weight| = 0 ( < 0.00048); 
 14 weights are ~= 1. The remaining 180 ones are summarized as
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002472 0.831300 0.963700 0.849800 0.989500 0.998900 
Algorithmic parameters: 
tuning.chi         bb tuning.psi refine.tol    rel.tol 
 1.5476400  0.5000000  4.6850610  0.0000001  0.0000001 
 nResample     max.it     groups    n.group   best.r.s   k.fast.s      k.max  trace.lev compute.rd 
       500        100          5        400          2          1        200          0          0 
seed : int(0)
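
With both fits in hand, the coefficient estimates can be placed side by side for comparison; a small sketch built only from the two model objects above:

# Least squares versus MM coefficient estimates
round(cbind(LS = coef(cpu.mod1), MM = coef(cpu.robmod1)), 5)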

The two models differ in the variables that appear important: syct and chmax are highly significant under least squares but not under the MM-estimator, chmin moves in the opposite direction, and mmin, mmax and cach remain significant in both fits. The output from the lmrob function also summarises the weights allocated to the observations: a total of fifteen data points receive (essentially) zero weight from the fitting algorithm and are flagged as outliers.
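
Those fifteen observations can be pulled out of the fitted object directly. A minimal sketch, assuming the rweights component of the lmrob fit holds the robustness weights and using the name column of cpus to label the flagged machines:

# Identify the observations given (essentially) zero robustness weight
out = which(cpu.robmod1$rweights < 0.00048)
cpus[out, "name"]  # the machines treated as outliers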
