Using the update function during variable selection

May 9th, 2010

When fitting statistical models to data where there are multiple variables we are often interested in adding or removing terms from our model and in cases where there are a large number of terms it can be quicker to use the update function to start with a formula from a model that we have already fitted and to specify the terms that we want to add or remove as opposed to a copy and paste and manually editing the formula to our needs.

Consider the oil-bearing rocks data set that is available with the R software which is used extensively as an example by many authors. One model that can be used as a starting point is a linear model with additive terms for the three variables:

> rock.mod1 = lm(log(perm) ~ area + peri + shape, data = rock)
> summary(rock.mod1)
 
Call:
lm(formula = log(perm) ~ area + peri + shape, data = rock)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-1.8092 -0.5413  0.1735  0.6493  1.4788 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.333e+00  5.487e-01   9.720 1.59e-12 ***
area         4.850e-04  8.657e-05   5.602 1.29e-06 ***
peri        -1.527e-03  1.770e-04  -8.623 5.24e-11 ***
shape        1.757e+00  1.756e+00   1.000    0.323    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Residual standard error: 0.8521 on 44 degrees of freedom
Multiple R-squared: 0.7483,     Adjusted R-squared: 0.7311 
F-statistic:  43.6 on 3 and 44 DF,  p-value: 3.094e-13

Given this model, saved as an object rock.mod1, we might be interested in considering adding an interaction term between the area and perimeter measurements. The update function has various options and the simplest case is to specfiy a model object and a new formula. The new formula can use the period as short hand for keep everything on either the left or right hand side of the formula and the plus or minus sign used to add or remove terms to the model. In the case of adding an interaction term our call would be:

> rock.mod2 = update(rock.mod1, . ~ . + area:peri)

The first function argument is the name of the model we fitted previously and the periods indicate that we want to use the same response variable and to start with the whole formula but add an interaction term between area and perimeter – the colon is used to specify an interaction term by itself. This fitted model is now:

> summary(rock.mod2)
 
Call:
lm(formula = log(perm) ~ area + peri + shape + area:peri, data = rock)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-1.7255 -0.4760  0.1256  0.6539  1.4269 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.567e+00  8.533e-01   7.696 1.28e-09 ***
area         3.769e-04  1.025e-04   3.678  0.00065 ***
peri        -2.141e-03  3.734e-04  -5.733 8.94e-07 ***
shape        4.022e-01  1.859e+00   0.216  0.82974    
area:peri    6.641e-08  3.583e-08   1.854  0.07065 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Residual standard error: 0.8295 on 43 degrees of freedom
Multiple R-squared: 0.7669,     Adjusted R-squared: 0.7452 
F-statistic: 35.37 on 4 and 43 DF,  p-value: 4.404e-13

The update function can also be used to change other aspects of the linear model or in fact many other types of model are set up to repsond sensibly to using this function.

Related posts: