One and Two Sample Hypothesis Testing

June 26th, 2009

The t-test is regularly used in classical statistics to test a hypothesis about the mean of one sample of data, or about the difference in means between two samples. The variants of the t-test are all handled by the same function in R, t.test.

The simplest case is where there is a single set of data and we are interested in testing whether its mean is equal to a particular value. The three possible alternative hypotheses (not equal to, greater than, or less than that value) are all available by setting the alternative argument. Consider the rock dataset from the base R system, which is a series of 48 measurements on rock samples taken from a petroleum reservoir. If we wanted to test whether the mean perimeter is 2,500 pixels then we would perform a one sample t-test:

t.test(rock$peri, mu = 2500)

This produces the following output to the console:

 
        One Sample t-test
 
data:  rock$peri 
t = 0.8818, df = 47, p-value = 0.3824
alternative hypothesis: true mean is not equal to 2500 
95 percent confidence interval:
 2266.501 3097.923 
sample estimates:
mean of x 
 2682.212

The t statistic, 0.8818, can be seen in the output, and the p-value of 0.3824 is not small, so there is no evidence that the mean perimeter departs from 2500 pixels. By default a 95% confidence interval for the mean of the data is also shown in the output.
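To make the output above less of a black box, here is a minimal sketch that recomputes the t statistic and the two-sided p-value by hand, and then requests a one-sided test through the alternative argument. The variable names (x, mu0, t_stat, p_two, res_greater) are ours, chosen for illustration only.

```r
x   <- rock$peri   # perimeter measurements from the built-in rock data
mu0 <- 2500        # hypothesised mean
n   <- length(x)

# t statistic: distance of the sample mean from mu0 in standard-error units
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_two <- 2 * pt(-abs(t_stat), df = n - 1)

# One-sided test of H1: mean > 2500, requested via the alternative argument
res_greater <- t.test(x, mu = mu0, alternative = "greater")

t_stat               # should match the t value reported by t.test
p_two                # should match the two-sided p-value
res_greater$p.value  # half the two-sided p-value here, because t is positive
```

Setting alternative = "less" instead would test whether the mean is below 2500 pixels.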

If we have two independent samples of data then a two sample t-test can be performed in R. As an example, consider the commonly demonstrated olive oil fatty acid data, held here in a data frame called olive.df. The data can be divided by Area and we can test for a significant difference in a given fatty acid between two Areas in the data set. Before performing the test we would first check that the data in each group are approximately Normally distributed, which is one of the assumptions underlying the test, and then consider whether the variances of the two groups are similar. The function var.test can be used to compare the variances:

> var(olive.df$oleic[olive.df$Area == "East-Liguria"])
[1] 24697.96
> var(olive.df$oleic[olive.df$Area == "North-Apulia"])
[1] 24568.5
> var.test(olive.df$oleic[olive.df$Area == "East-Liguria"], olive.df$oleic[olive.df$Area == "North-Apulia"])
 
        F test to compare two variances
 
data:  olive.df$oleic[olive.df$Area == "East-Liguria"] and olive.df$oleic[olive.df$Area == "North-Apulia"] 
F = 1.0053, num df = 49, denom df = 24, p-value = 0.9796
alternative hypothesis: true ratio of variances is not equal to 1 
95 percent confidence interval:
 0.4764333 1.9476684 
sample estimates:
ratio of variances 
          1.005269

First we calculated the variances of oleic acid for the East Liguria and North Apulia Areas, and then we ran the formal test for equal variances. The p-value is large (0.9796) and the confidence interval for the ratio of variances includes the value one, so in this case we proceed under the assumption that the variances are equal for these Areas. To run the test:

> t.test(olive.df$oleic[olive.df$Area == "East-Liguria"], olive.df$oleic[olive.df$Area == "North-Apulia"],
+ var.equal = TRUE)
 
        Two Sample t-test
 
data:  olive.df$oleic[olive.df$Area == "East-Liguria"] and olive.df$oleic[olive.df$Area == "North-Apulia"] 
t = -1.9344, df = 73, p-value = 0.05694
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -151.054614    2.254614 
sample estimates:
mean of x mean of y 
   7746.0    7820.4

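The argument var.equal is used to specify whether the variances of the two samples are assumed to be equal. Its default is FALSE, in which case t.test uses the Welch approximation to the degrees of freedom rather than pooling the variances. Here the p-value of 0.05694 is just above the conventional 5% level and the confidence interval for the difference in means includes zero, so the evidence for a difference in mean oleic acid between the two Areas is weak.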
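The whole workflow (check Normality, compare variances, run the test) can be sketched on data that ships with R. The example below is only an illustration under our own assumptions: it uses the built-in ToothGrowth data as a stand-in for the olive oil data, shapiro.test as one possible Normality check, and variable names (oj, vc, pooled, welch) that we made up.

```r
# Stand-in data: tooth length for the two supplement groups in the
# built-in ToothGrowth data set (this is NOT the olive oil data)
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Rough Normality check within each group (a small p-value suggests
# a departure from Normality)
shapiro.test(oj)$p.value
shapiro.test(vc)$p.value

# Formal comparison of the two variances
var.test(oj, vc)$p.value

# Pooled-variance test (var.equal = TRUE) and Welch test (the default)
pooled <- t.test(oj, vc, var.equal = TRUE)
welch  <- t.test(oj, vc)

pooled$parameter  # degrees of freedom: n1 + n2 - 2
welch$parameter   # Welch degrees of freedom, generally not a whole number
```

Comparing the two sets of degrees of freedom shows what var.equal changes: the pooled test always uses n1 + n2 - 2, while the Welch test uses a smaller, data-dependent value.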

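The t.test function also handles paired data: when two vectors of equal length are supplied, setting the paired argument to TRUE tests whether the mean of the within-pair differences is zero.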
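As a small sketch of the paired case, the built-in sleep data (extra hours of sleep for the same 10 patients under two drugs) can be used. The data set and the variable names are our choice for illustration; the point is that a paired test gives the same result as a one sample t-test on the differences.

```r
# Built-in sleep data: rows are ordered by patient ID within each group,
# so the two vectors line up patient by patient
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Paired t-test on the two vectors
paired_res <- t.test(drug1, drug2, paired = TRUE)

# Equivalent one sample t-test on the within-patient differences
diff_res <- t.test(drug1 - drug2, mu = 0)

paired_res$statistic  # same t value from both approaches
diff_res$statistic
```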
