Useful functions for data frames in R

February 17th, 2012

This post will consider some useful functions for dealing with data frames during data processing and validation.

Consider an artificial data set created using the expand.grid function. Because the value 2 is supplied twice for A and the value 5 is supplied twice for B, the resulting data frame contains duplicate rows.

> des = expand.grid(A = c(2,2,3,4), B = c(1,3,5,5,7))
> des
   A B
1  2 1
2  2 1
3  3 1
4  4 1
5  2 3
6  2 3
7  3 3
8  4 3
9  2 5
10 2 5
11 3 5
12 4 5
13 2 5
14 2 5
15 3 5
16 4 5
17 2 7
18 2 7
19 3 7
20 4 7

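If we want to identify rows that are duplicates then the duplicated function comes in handy. It returns TRUE for every row that repeats an earlier row, so the first occurrence of each row is marked FALSE: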

> duplicated(des)
 [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
[13]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE
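
As a rough sketch of how the logical vector might be summarised during validation, the snippet below counts the flagged rows, finds the first one with anyDuplicated, and uses the fromLast argument to flag earlier copies instead of later ones. The variable names n_dup, first_dup and dup_from_last are illustrative only and are not part of the original example.

```r
# Rebuild the example data frame so the snippet is self-contained
des <- expand.grid(A = c(2, 2, 3, 4), B = c(1, 3, 5, 5, 7))

# Number of rows that repeat an earlier row
n_dup <- sum(duplicated(des))

# Index of the first duplicated row, or 0 if there are none
first_dup <- anyDuplicated(des)

# fromLast = TRUE scans from the bottom, so the last copy of each
# row is treated as the original and earlier copies are flagged
dup_from_last <- duplicated(des, fromLast = TRUE)
```

A zero from anyDuplicated is a cheap way to confirm that a data frame has no repeated rows before moving on.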

We can pick out the unique rows of the data frame with the following code:

> des[!duplicated(des),]
   A B
1  2 1
3  3 1
4  4 1
5  2 3
7  3 3
8  4 3
9  2 5
11 3 5
12 4 5
17 2 7
19 3 7
20 4 7
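
Base R also provides the unique function, which has a method for data frames and should give the same result as the indexing approach above. A minimal sketch comparing the two (the names by_index and by_unique are made up for this illustration):

```r
# Rebuild the example data frame so the snippet is self-contained
des <- expand.grid(A = c(2, 2, 3, 4), B = c(1, 3, 5, 5, 7))

# Drop repeated rows by negating the duplicated() flags
by_index <- des[!duplicated(des), ]

# Same idea using the data frame method of unique()
by_unique <- unique(des)

# Check that both approaches agree and count the remaining rows
same_result <- isTRUE(all.equal(by_index, by_unique))
n_unique <- nrow(by_unique)
```

Both approaches keep the original row names, which makes it easy to trace each retained row back to its position in the full data frame.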

After loading a large file into a data frame we might want to check some of the data to ensure that it is as expected. Rather than printing the entirety of the data frame, we can use the head and tail functions to view the first or last few rows (six by default). An example using the rock data set that is available within R:

> head(rock)
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1
> tail(rock)
   area     peri    shape perm
43 5605 1145.690 0.464125 1300
44 8793 2280.490 0.420477 1300
45 3475 1174.110 0.200744  580
46 1651  597.808 0.262651  580
47 5514 1455.880 0.182453  580
48 9718 1485.580 0.200447  580
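
Both functions accept an n argument to control how many rows are returned, and a negative n drops rows from the opposite end. The sketch below shows this alongside dim and str, which are often used for the same kind of quick check. The variable names are illustrative only.

```r
# rock ships with base R in the datasets package
top3 <- head(rock, n = 3)            # first three rows
bottom3 <- tail(rock, n = 3)         # last three rows
all_but_last2 <- head(rock, n = -2)  # every row except the last two

# Number of rows and columns
rock_dim <- dim(rock)

# Compact summary of column types and leading values
str(rock)
```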
