Working with Subsets of Data

May 8th, 2009

We are often interested in only a subset of our complete data, and R provides simple mechanisms for viewing and editing particular subsets of a data frame or other object.

We might want to use one of the variables to select a particular subset. Square brackets placed after the name of an object indicate that we are interested in specific rows or columns of the data, and there are many ways to specify which ones. For example, if we consider the olive oil data set used in GGobi demonstrations, we could view the data for one of the areas using the following code:

olive.df[olive.df$Area == "North-Apulia",]

which would give the following output:

   Region         Area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
1       1 North-Apulia     1075          75     226  7823      672        36        60         29
2       1 North-Apulia     1088          73     224  7709      781        31        61         29
3       1 North-Apulia      911          54     246  8113      549        31        63         29
...

Within the square brackets in this example we test a condition, which returns a vector of TRUE and FALSE values, and R interprets our intention as viewing only those rows where the condition returned TRUE. The comma separates the row selection from the column selection, as the data frame has two dimensions. All columns are included because there is no expression after the comma in the example above.
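To see the logical vector for ourselves, we can evaluate the condition on its own before using it as a row index. The sketch below uses a small made-up data frame, toy.df, as a stand-in for olive.df; its values are invented for illustration and are not taken from the real data set.

```r
# Hypothetical stand-in for olive.df; the values are made up for illustration
toy.df <- data.frame(
  Area    = c("North-Apulia", "Calabria", "North-Apulia", "South-Apulia"),
  stearic = c(226, 240, 246, 280),
  oleic   = c(7823, 7300, 8113, 6715)
)

# The condition by itself is a logical vector with one element per row
keep <- toy.df$Area == "North-Apulia"
print(keep)

# Placed before the comma, it keeps only the rows where it is TRUE
north <- toy.df[keep, ]
print(north)
```

Storing the condition in a variable such as keep is optional, but it makes it easy to check how many rows match, for instance with sum(keep).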

It is possible to combine multiple conditions, so for the olive oil data we could select one of the other areas, South-Apulia, and keep only the data points where the stearic variable is greater than 250 units. The code used in this case would be:

olive.df[olive.df$Area == "South-Apulia" & olive.df$stearic > 250,]

to give output:

    Region         Area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
88       1 South-Apulia     1410         232     280  6715     1233        32        60         24
89       1 South-Apulia     1509         209     257  6647     1240        42        62         30
90       1 South-Apulia     1317         197     256  7036     1067        40        60         22
...
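The & operator used above combines the two conditions element by element, so the result is again one TRUE or FALSE per row. As a sketch, again on a made-up data frame toy.df rather than the real olive.df, the same selection can also be written with the base R function subset(), which lets us refer to columns by name:

```r
# Hypothetical stand-in for olive.df; the values are made up for illustration
toy.df <- data.frame(
  Area    = c("North-Apulia", "South-Apulia", "South-Apulia", "Calabria"),
  stearic = c(226, 240, 280, 260)
)

# Element-wise AND: TRUE only where both conditions hold for that row
cond <- toy.df$Area == "South-Apulia" & toy.df$stearic > 250
by.bracket <- toy.df[cond, ]

# Same selection with subset(); column names are looked up inside toy.df
by.subset <- subset(toy.df, Area == "South-Apulia" & stearic > 250)

print(by.bracket)
print(by.subset)
```

One difference worth knowing: if the condition contains missing values, the square bracket form returns rows filled with NA for them, whereas subset() drops those rows.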

If we were interested in particular columns of the data then we would specify the name of the column(s) after the comma in the square brackets. For example, we could view only the palmitic column with this code:

olive.df[,"palmitic"]

Multiple columns could be selected by providing a vector of column names, such as:

olive.df[,c("palmitic","oleic")]
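One detail that can surprise us here is that selecting a single column returns a plain vector, while selecting several columns returns a data frame. The sketch below, on a made-up data frame toy.df standing in for olive.df, shows the difference and the drop = FALSE argument that keeps a single column as a data frame:

```r
# Hypothetical stand-in for olive.df; the values are made up for illustration
toy.df <- data.frame(
  palmitic = c(1075, 1088, 911),
  stearic  = c(226, 224, 246),
  oleic    = c(7823, 7709, 8113)
)

# A single column name gives back a plain numeric vector
one.vec <- toy.df[, "palmitic"]

# drop = FALSE keeps the result as a one-column data frame
one.df <- toy.df[, "palmitic", drop = FALSE]

# A vector of names always gives a data frame
two.df <- toy.df[, c("palmitic", "oleic")]

print(is.data.frame(one.vec))
print(is.data.frame(one.df))
print(names(two.df))
```

Which form is preferable depends on what comes next: a vector is convenient for functions such as mean(), while a data frame is safer when later code expects columns.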

More complicated expressions are possible with a bit of imagination. For example, if we wanted to view only the even-numbered rows among the first ten, we could use the index vector seq(2,10,2) before the comma, which provides the row numbers 2, 4, 6, 8 and 10.
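The seq(2,10,2) example stops at row 10. To pick the even-numbered rows of a data frame of any length, one option is to let nrow() supply the upper limit. This is a minimal sketch on a made-up data frame toy.df, not the real olive.df:

```r
# Hypothetical stand-in for olive.df; the values are made up for illustration
toy.df <- data.frame(
  id      = 1:5,
  stearic = c(226, 224, 246, 280, 257)
)

# Row numbers 2, 4, ... up to the last row (assumes at least two rows)
even.idx <- seq(2, nrow(toy.df), by = 2)
even.rows <- toy.df[even.idx, ]

# Alternative: a logical test on the row positions, which also
# works when the data frame has fewer than two rows
even.rows2 <- toy.df[seq_len(nrow(toy.df)) %% 2 == 0, ]

print(even.rows)
print(even.rows2)
```

Both forms go before the comma, in the row position, exactly like the conditions used earlier.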
