Manual variable selection using the dropterm function

May 12th, 2010

When fitting a multiple linear regression model to data, a natural question is whether the model can be simplified by excluding some of the variables. There are automatic procedures for making these decisions, but some people prefer a more manual approach to variable selection rather than pressing a button and taking what comes out.

When there are a large number of variables, it is awkward to go through each one in turn by hand to decide whether the model can be simplified to a more parsimonious one. In R, the dropterm function in the MASS package takes over part of this task by reporting the outcome of dropping each model term, one at a time.

To illustrate this, consider the cpus data set in the MASS package, which contains a relative performance measure and characteristics of 209 CPUs. We load the package first to make the data available:

library(MASS)

We first fit a linear model with six explanatory variables:

cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)

The function dropterm requires a fitted model, which we saved in the last command. Optionally, we can specify which test to use to compare the initial model with each of the alternative models that has one fewer variable. We can choose to perform an F test:

> dropterm(cpu.mod1, test = "F")
Single term deletions
 
Model:
perf ~ syct + mmin + mmax + cach + chmin + chmax
       Df Sum of Sq    RSS    AIC F Value     Pr(F)    
<none>              727002 1718.3                      
syct    1     27995 754997 1724.2   7.779  0.005793 ** 
mmin    1    252211 979213 1778.5  70.078 9.416e-15 ***
mmax    1    271147 998149 1782.5  75.339 1.326e-15 ***
cach    1     75962 802964 1737.0  21.106 7.640e-06 ***
chmin   1       358 727360 1716.4   0.100  0.752632    
chmax   1    163396 890398 1758.6  45.400 1.640e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The output indicates that we could exclude the chmin variable (F = 0.100, p = 0.75), then re-fit the model and repeat the same checking process.
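The re-fitting step can be sketched as follows. This is a minimal illustration rather than the only way to proceed: it uses update() to remove chmin from the fitted model and then calls dropterm again. The object names cpu.mod2 and drop2 are chosen for this sketch and are not part of the original analysis.

```r
library(MASS)  # provides dropterm() and the cpus data

# Full model from the post
cpu.mod1 <- lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)

# Remove chmin and re-fit (cpu.mod2 is a name assumed for this sketch)
cpu.mod2 <- update(cpu.mod1, . ~ . - chmin)

# Repeat the single term deletion check on the reduced model
drop2 <- dropterm(cpu.mod2, test = "F")
print(drop2)
```

The process stops when none of the remaining terms looks like a candidate for removal.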

Update:

The dropterm function considers each variable individually and reports the change in the residual sum of squares if that variable were excluded from the model. This F test is linked to the t test that appears in the model summary: the square of a t statistic follows an F distribution with 1 numerator degree of freedom and the same residual degrees of freedom, so the two tests give the same p-value. For this model we would have:

> summary(cpu.mod1)
 
Call:
lm(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-195.841  -25.169    5.409   26.528  385.749 
 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.590e+01  8.045e+00  -6.948 4.99e-11 ***
syct         4.886e-02  1.752e-02   2.789  0.00579 ** 
mmin         1.529e-02  1.827e-03   8.371 9.42e-15 ***
mmax         5.571e-03  6.418e-04   8.680 1.33e-15 ***
cach         6.412e-01  1.396e-01   4.594 7.64e-06 ***
chmin       -2.701e-01  8.557e-01  -0.316  0.75263    
chmax        1.483e+00  2.201e-01   6.738 1.64e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
 
Residual standard error: 59.99 on 202 degrees of freedom
Multiple R-squared: 0.8649,     Adjusted R-squared: 0.8609 
F-statistic: 215.5 on 6 and 202 DF,  p-value: < 2.2e-16

Let us consider the syct variable. The t statistic in the model summary is 2.789, and squaring this value gives 7.779, which is the F statistic produced by the dropterm function. The p-values agree as well (0.00579 in the summary and 0.005793 from dropterm).
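The relationship between the two tables can be checked for every term at once. The sketch below squares the t statistics from the summary and compares them with the F statistics from dropterm, then rebuilds the F statistic for syct by hand from the residual sums of squares of the full and reduced models. The object names (t.vals, f.vals, cpu.nosyct, f.syct) are assumptions made for this illustration.

```r
library(MASS)  # provides dropterm() and the cpus data

cpu.mod1 <- lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)

# t statistics from the model summary, dropping the intercept row
t.vals <- coef(summary(cpu.mod1))[-1, "t value"]

# F statistics from dropterm, dropping the <none> row
drop1.tab <- dropterm(cpu.mod1, test = "F")
f.vals <- drop1.tab[["F Value"]][-1]

# Squared t statistics should match the F statistics term by term
print(all.equal(unname(t.vals^2), f.vals))

# F statistic for syct by hand: increase in RSS from dropping syct,
# divided by the residual mean square of the full model
cpu.nosyct <- update(cpu.mod1, . ~ . - syct)
rss.full <- deviance(cpu.mod1)    # RSS of the full model
rss.red  <- deviance(cpu.nosyct)  # RSS without syct
f.syct <- (rss.red - rss.full) / (rss.full / df.residual(cpu.mod1))
print(f.syct)
```

The value of f.syct should agree with the F Value reported for syct in the dropterm table above.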

Posted by Ralph at 9:14 pm

10 responses to “Manual variable selection using the dropterm function”

Liviu says:

May 26, 2010 at 10:03 am

Hello
And thank you for this post. Could you give some details on the dropterm output? I suspect that this is not quite the same as the coefficients from summary(), and I have trouble understanding what the columns and rows actually represent.
Ralph says:

May 26, 2010 at 8:04 pm

Hi,

I’ve updated the post to hopefully address your comments. With dropterm we are comparing the goodness of fit of two models, whereas the model summary tests the significance (or not) of a particular variable in the model. Unsurprisingly, this amounts to answering the same question.

Hope this helps.
Gary says:

September 2, 2010 at 10:24 am

Hi,

Could you give some details on how I could go about using dropterm in a glm that uses many iterations? I have a huge dataset, and the only way to identify non-significant terms in a glm is to split the data into 10% random samples and then repeat the process. I am using a for loop for the iterations and would like to reduce the model using the dropterm function, but I am not sure how I can use the outputs other than by creating 10000 separate outputs, which is infeasible to analyse. Any tips would be brilliant.
Ralph says:

September 3, 2010 at 6:28 pm

A quick thought – there is a biglm package in R that might do the trick. By this I mean using all of the data at the same time rather than having to fit separate models to each subset.

Could you describe whether the data set is huge in the number of rows, the number of columns, or both? How many subsets are you having to divide the data into?
Martí says:

February 9, 2012 at 12:38 pm

Can the dropterm function be used with linear mixed models too? I fitted my model with the nlme library, and now I don’t know how to obtain my best model. One option is to follow the proposal of the Pinheiro paper (Model building using covariates in nonlinear mixed-effects models), using a forward stepwise approach.

What is your opinion?

Thanks in advance.

Martí Casals.
Ralph says:

July 22, 2012 at 10:09 am

I haven’t used the dropterm function with any models other than those produced by lm, so I can’t comment on whether it works correctly for them. My impression is that mixed effects models require more user intervention, as the theory isn’t as straightforward as for standard linear regression.
Martí says:

July 22, 2012 at 6:33 pm

Hi Ralph,
Thank you so much! In the end I used the proposal of the Pinheiro paper (Model building using covariates in nonlinear mixed-effects models), using a forward stepwise approach.

Martí
Ralph says:

July 22, 2012 at 8:24 pm

Good to hear that you managed to do what you wanted with nlme, Martí!

Did you use addterm/dropterm or go for a manual approach of considering each model change individually?
Martí says:

July 22, 2012 at 10:57 pm

In the end, I followed the strategy proposed by Pinheiro and Bates (2000) to select the final model. I used plots of the estimated random effects versus the candidate covariates to identify interesting patterns. A pattern at a given step indicates that the covariate should be included in the model. However, I also used the dropterm/addterm functions and, as I recall, the result was similar.

Martí
tvmanikandan says:

June 20, 2014 at 7:22 am

Please guide me on variable selection and variable reduction techniques in R.