Producing Data Summaries

May 11th, 2009

The first stage of most investigations is to produce summaries of the data, both to identify any unusual records and to get an overall feel for the contents. This initial data analysis usually involves tabulating and plotting the data, and R provides a variety of functions to generate the required summaries.


In this post we will consider numerical and tabular summaries of data of various types, such as numeric and categorical data. The function mean calculates the average value of a vector of numeric data. As an example we could run the following:

> mean(CO2$conc)
[1] 435

The $ indicates that we are interested in a specific column from the CO2 data frame. A trimmed mean can be calculated with the trim argument, which gives the fraction of observations to drop from each end (minimum and maximum) of the sorted data before averaging. For the previous example, trimming 5% from each end, i.e. 10% of the data in total, is done with this code:

> mean(CO2$conc, trim=0.05)
[1] 423.1579
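
To make the effect of trim concrete, here is a small sketch that reproduces the trimmed mean by hand. The variable names (x, n, k, manual, builtin) are only illustrative, and the rule of dropping floor(n * trim) values from each end is how I understand mean to behave for a plain numeric vector.

```r
# CO2 ships with R in the datasets package, so no extra packages are needed
x <- sort(CO2$conc)
n <- length(x)

# Number of observations dropped from EACH end when trim = 0.05
k <- floor(n * 0.05)

# Average what is left after removing the k smallest and k largest values
manual <- mean(x[(k + 1):(n - k)])

# The built-in trimmed mean, for comparison
builtin <- mean(CO2$conc, trim = 0.05)

print(c(dropped_per_end = k, manual = manual, builtin = builtin))
```

With the 84 concentration values in CO2, this drops four values from each end, and the two results should agree.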

If the data contain missing values then mean will return NA. To get around this we can instruct the mean function to ignore missing data with the na.rm argument:

> mean(CO2$conc, na.rm = TRUE)
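
The CO2 concentrations have no missing values, so the call above gives the same answer as before. A small made-up vector (not taken from CO2) shows the difference more clearly; the names x, with_na, without_na and n_missing are purely illustrative.

```r
# A made-up vector with one missing value, for illustration only
x <- c(95, 175, NA, 350)

# Without na.rm the missing value propagates to the result
with_na <- mean(x)

# With na.rm = TRUE the NA is dropped before averaging
without_na <- mean(x, na.rm = TRUE)

# It is worth counting the missing values before deciding to ignore them
n_missing <- sum(is.na(x))

print(with_na)
print(without_na)
print(n_missing)
```

The min, max and range functions accept the same na.rm argument.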

Other functions of interest are min, max and range which compute the values that we would expect based on the names of these functions:

> min(CO2$conc)
[1] 95
> max(CO2$conc)
[1] 1000
> range(CO2$conc)
[1]   95 1000

The range function returns a vector with two elements corresponding to the minimum and maximum values of the data respectively.
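
Because range returns an ordinary vector, its elements can be picked out by position or passed to other functions. The sketch below uses illustrative variable names (r, lowest, highest, spread) and the base function diff to get the width of the range.

```r
# Store the two-element result of range
r <- range(CO2$conc)

# The first element is the minimum and the second is the maximum
lowest  <- r[1]
highest <- r[2]

# diff gives the difference between the two elements, i.e. max minus min
spread <- diff(r)

print(c(lowest = lowest, highest = highest, spread = spread))
```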

The variance and standard deviation are other useful statistics for a numeric variable. To get the standard deviation we apply the square root function to the value returned by the variance function:

> var(CO2$conc)
[1] 87571.08
> sqrt(var(CO2$conc))
[1] 295.9241
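
As a cross-check, the sketch below computes the sample variance by hand with the n - 1 denominator that var uses, and compares the square-root approach with sd. The sd function is not used in the post above, but it is part of R's stats package; the other names (x, n, manual_var, s1, s2) are illustrative.

```r
x <- CO2$conc
n <- length(x)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation via the square root of var, as in the text
s1 <- sqrt(var(x))

# Standard deviation via sd, which should give the same value
s2 <- sd(x)

print(c(manual_var = manual_var, var = var(x)))
print(c(sqrt_var = s1, sd = s2))
```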
