The first stage of most investigations is to produce summaries of the data, both to identify any unusual records and to get an overall feel for the contents. This initial data analysis usually involves tabulating and plotting the data, and R provides a variety of functions to generate the required summaries.


In this post we will consider numerical and tabular summaries of data of various types, such as numeric and categorical. The function **mean** calculates the average value of a vector of numeric data. As an example we could run the following:

    > mean(CO2$conc)
    [1] 435

The **$** operator extracts a single column, here conc, from the CO2 data frame. A trimmed mean is calculated with the **trim** argument, which gives the fraction of observations to discard from each end of the sorted data before averaging. Setting trim to 0.05 drops 5% of the observations from each end, so 10% of the data is trimmed in total:

    > mean(CO2$conc, trim = 0.05)
    [1] 423.1579
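
To make the effect of **trim** concrete, the following is a minimal sketch that reproduces the trimmed mean by hand. The variable names (x, n, k, manual, builtin) are illustrative only, and the sketch assumes the data contain no missing values.

```r
# Sort the concentrations so the extremes sit at the two ends
x <- sort(CO2$conc)
n <- length(x)
trim <- 0.05

# Number of observations discarded from EACH end (rounded down)
k <- floor(n * trim)

# Average what is left after dropping k values from each end
manual <- mean(x[(k + 1):(n - k)])

# The built-in calculation for comparison
builtin <- mean(CO2$conc, trim = trim)

all.equal(manual, builtin)
```

With 84 observations and trim set to 0.05, four values are dropped from each end, which is why the result is pulled down from the untrimmed mean: the four largest values are further from the centre than the four smallest.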

If the data contain missing values then **mean** returns **NA**. To get around this we can instruct the function to ignore missing values with the **na.rm** argument:

    > mean(CO2$conc, na.rm = TRUE)
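
The conc column has no missing values, so the difference is easier to see on a small made-up vector. The sketch below uses a hypothetical vector y that is not part of the CO2 data.

```r
# Hypothetical vector with one missing value (not taken from CO2)
y <- c(95, 175, NA, 350)

# Without na.rm the missing value propagates to the result
with_na <- mean(y)

# With na.rm = TRUE the NA is dropped before averaging,
# so this is the mean of 95, 175 and 350
without_na <- mean(y, na.rm = TRUE)

# Count how many values were missing
n_missing <- sum(is.na(y))

with_na
without_na
n_missing
```

Checking the number of missing values before removing them is a useful habit, since na.rm silently changes how many observations contribute to the summary.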

Other functions of interest are **min**, **max** and **range** which compute the values that we would expect based on the names of these functions:

    > min(CO2$conc)
    [1] 95
    > max(CO2$conc)
    [1] 1000
    > range(CO2$conc)
    [1] 95 1000

The **range** function returns a vector with two elements corresponding to the minimum and maximum values of the data respectively.
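
Because the result is an ordinary numeric vector, its elements can be picked out by position. A short sketch, with r, lowest, highest and spread as illustrative names:

```r
# Store the two-element result of range()
r <- range(CO2$conc)

lowest  <- r[1]   # first element is the minimum
highest <- r[2]   # second element is the maximum

# The width of the data range is the difference between the two
spread <- diff(r)

lowest
highest
spread
```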

The variance and standard deviation are other useful statistics for a numeric variable. The function **var** returns the sample variance, and taking its square root with **sqrt** gives the standard deviation:

    > var(CO2$conc)
    [1] 87571.08
    > sqrt(var(CO2$conc))
    [1] 295.9241
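
R also has a dedicated **sd** function for the standard deviation, which is not used in the example above. The sketch below checks that it agrees with the square root of the variance; s1 and s2 are illustrative names.

```r
# Standard deviation via the variance
s1 <- sqrt(var(CO2$conc))

# Standard deviation computed directly
s2 <- sd(CO2$conc)

# The two approaches should agree up to floating point tolerance
all.equal(s1, s2)
```

Both **var** and **sd** accept the na.rm argument in the same way as **mean**, so missing values can be handled consistently across these summaries.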