Classification Trees using the rpart function

September 21st, 2010

In a previous post on classification trees we used the tree package to fit a classification tree to data divided into known classes. In this post we look at the alternative function rpart, from the rpart package, which is included with the standard R distribution as a recommended package.


A classification tree can be fitted with the rpart function using syntax similar to that of the tree function. For the ecoli data set discussed in the previous post we would use:

> require(rpart)
> ecoli.df = read.csv("ecoli.txt")

followed by

> ecoli.rpart1 = rpart(class ~ mcv + gvh + lip + chg + aac + alm1 + alm2, 
  data = ecoli.df)

We would then consider whether the tree could be simplified by pruning. The plotcp function helps here by plotting the cross-validation results against the complexity parameter (cp):

> plotcp(ecoli.rpart1)

The amount of pruning can be determined from this graph or from the output of the printcp function:

> printcp(ecoli.rpart1)
 
Classification tree:
rpart(formula = class ~ mcv + gvh + lip + chg + aac + alm1 + 
    alm2, data = ecoli.df)
 
Variables actually used in tree construction:
[1] aac  alm1 gvh  mcv 
 
Root node error: 193/336 = 0.5744
 
n= 336 
 
        CP nsplit rel error  xerror     xstd
1 0.388601      0   1.00000 1.00000 0.046959
2 0.207254      1   0.61140 0.61658 0.045423
3 0.062176      2   0.40415 0.45596 0.041758
4 0.051813      3   0.34197 0.38342 0.039359
5 0.031088      4   0.29016 0.36269 0.038571
6 0.015544      5   0.25907 0.30570 0.036136
7 0.010000      6   0.24352 0.31088 0.036375

The prune function is then used to simplify the tree, based on a cp threshold identified from the graph or the printed output.

> ecoli.rpart2 = prune(ecoli.rpart1, cp = 0.02)
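As a rough check on where cp = 0.02 comes from, we can re-enter the printed table by hand and examine it directly. The sketch below is not part of the original analysis: the names cp.table, root.error, best.row, one.se.row and chosen.row are made up for illustration, and it assumes that pruning at a given cp keeps the first row of the table whose CP value is at or below it. It converts the relative errors into absolute error rates by multiplying by the root node error, and applies the commonly used one-standard-error rule to the xerror column.

```r
# CP table re-entered from the printcp() output above
cp.table = data.frame(
  CP        = c(0.388601, 0.207254, 0.062176, 0.051813, 0.031088, 0.015544, 0.010000),
  nsplit    = 0:6,
  rel.error = c(1.00000, 0.61140, 0.40415, 0.34197, 0.29016, 0.25907, 0.24352),
  xerror    = c(1.00000, 0.61658, 0.45596, 0.38342, 0.36269, 0.30570, 0.31088),
  xstd      = c(0.046959, 0.045423, 0.041758, 0.039359, 0.038571, 0.036136, 0.036375)
)

# Root node error from the same output: 193 of 336 cases misclassified
root.error = 193 / 336

# Errors in the table are relative to the root node error,
# so multiplying gives absolute (cross-validated) error rates
cp.table$cv.error.rate = cp.table$xerror * root.error

# Row with the smallest cross-validated error
best.row = which.min(cp.table$xerror)

# One-standard-error rule: smallest tree whose xerror is within
# one xstd of the minimum
threshold = cp.table$xerror[best.row] + cp.table$xstd[best.row]
one.se.row = min(which(cp.table$xerror <= threshold))

# Row assumed to be selected by cp = 0.02 (first row with CP <= 0.02)
chosen.row = min(which(cp.table$CP <= 0.02))

print(cp.table[c(best.row, one.se.row, chosen.row), ])
```

With the values printed above, the minimum xerror and the one-standard-error rule both point to the row with 5 splits, which is also the row that cp = 0.02 falls on. The xerror column comes from random cross-validation, so refitting the tree may give slightly different numbers.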

The classification tree can be visualised with the plot function and then the text function adds labels to the graph:

> plot(ecoli.rpart2, uniform = TRUE)
> text(ecoli.rpart2, use.n = TRUE, cex = 0.75)
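For readers without the ecoli data to hand, the same steps can be tried on a data set that comes with R. The following is a minimal sketch rather than a recommendation: it uses the built-in iris data as a stand-in, the object names iris.rpart1 and iris.rpart2 are hypothetical, and picking the cp with the smallest cross-validated error is just one common choice. The seed is set because the cross-validation inside rpart is random.

```r
library(rpart)

set.seed(1)  # assumption: any fixed seed, only to make xerror reproducible

# Fit a classification tree to the built-in iris data (stand-in for ecoli)
iris.rpart1 = rpart(Species ~ ., data = iris, method = "class")

# Inspect the cross-validation results
printcp(iris.rpart1)

# Pick the cp value with the smallest cross-validated error
cp.table = iris.rpart1$cptable
best.cp = cp.table[which.min(cp.table[, "xerror"]), "CP"]

# Prune the tree at that cp
iris.rpart2 = prune(iris.rpart1, cp = best.cp)

# Draw the pruned tree and label it
plot(iris.rpart2, uniform = TRUE, margin = 0.1)
text(iris.rpart2, use.n = TRUE, cex = 0.75)
```

The margin argument only adds white space around the plot so that the labels are less likely to be clipped.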

Other useful resources are provided on the Supplementary Material page.

5 responses to “Classification Trees using the rpart function”

  1. YZK_R says:

    Hi,
    I am trying to teach myself R. I like your explanatory video. Thank you very much for posting it.
    Unfortunately, I have run into a problem. I wonder what I am doing wrong.

    I have R running on a PC. (R version 2.11.1 (2010-05-31))

    I got the following when I tried require(tree):
    > require(tree)
    Loading required package: tree
    Warning message:
    In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
    there is no package called ‘tree’
    >
    > ?tree
    No documentation for ‘tree’ in specified packages and libraries:
    you could try ‘??tree’
    >

    Cheers,
    YZK

  2. Ralph says:

    Based on the message I think that you might need to install the tree package first. If your computer is connected to the internet then you can use the Packages menu from within the R GUI to install additional packages.

  3. YZK_R says:

    Here is a follow-up to my previous post.

    I would also like to understand:

    1) What exactly is the cp parameter? (how is it defined? what does it measure? what is the “intuitive way to explain its meaning”?)

    2) What is the difference between rpart and tree functions? How would one decide whether to use one or the other?

    3) How is a tree fitted in rpart and in tree? In other words, how does “the algorithm” decide how to divide the space of predictors to form the splits, and how does it decide when to stop?

    Thank you very much once again!

    YZK_R

  4. Carol says:

    Where can I get the sample data for this tutorial, “ecoli.txt”? I want to go through the code but I don’t have my own data.

  5. Ralph says:

    Carol,

    Hopefully this link will help you get hold of the data: http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/09/ecoli.txt

    Ralph