<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Software for Exploratory Data Analysis and Statistical Modelling &#187; Analysis of Variance</title>
	<atom:link href="http://www.wekaleamstudios.co.uk/topics/statistical-analysis/analysis-of-variance-statistical-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wekaleamstudios.co.uk</link>
	<description>Statistical Modelling with R</description>
	<lastBuildDate>Wed, 01 Feb 2012 19:44:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Analysis of Covariance &#8211; Extending Simple Linear Regression</title>
		<link>http://www.wekaleamstudios.co.uk/posts/analysis-of-covariance-extending-simple-linear-regression/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/analysis-of-covariance-extending-simple-linear-regression/#comments</comments>
		<pubDate>Wed, 28 Apr 2010 19:25:58 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Analysis of Variance]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[analysis of variance]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[covariate]]></category>
		<category><![CDATA[fitted]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[panel]]></category>
		<category><![CDATA[panel.lmline]]></category>
		<category><![CDATA[panel.xyplot]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[resid]]></category>
		<category><![CDATA[residual]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[xyplot]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=989</guid>
		<description><![CDATA[The simple linear regression model considers the relationship between two variables and in many cases more information will be available that can be used to extend the model. For example, there might be a categorical variable (sometimes known as a covariate) that can be used to divide the data set to fit a separate linear [...]]]></description>
			<content:encoded><![CDATA[<p>The simple linear regression model considers the relationship between two variables and in many cases more information will be available that can be used to extend the model. For example, there might be a categorical variable (sometimes known as a covariate) that can be used to divide the data set to fit a separate linear regression to each of the subsets. We will consider how to handle this extension using one of the data sets available within the <strong>R</strong> software package.<span id="more-989"></span></p>
<p>There is a set of data relating trunk circumference (in mm) to the age of Orange trees where data was recorded for five trees. This data is available in the data frame <strong>Orange</strong> and we make a copy of this data set so that we can remove the ordering that is recorded for the <strong>Tree</strong> identifier variable. We create a new, unordered factor after converting the old factor to its numeric codes:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.df = Orange
orange.df$Tree = factor(as.numeric(orange.df$Tree))</pre></div></div>
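
<p>As a rough check of what this recoding does, the sketch below compares the original and recoded factors. It relies only on the <strong>Orange</strong> data frame that ships with R. Note that <strong>as.numeric</strong> applied to a factor returns the internal level codes rather than the printed labels, so the new labels 1 to 5 follow the level order of the original factor and need not match the original tree numbers:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.df = Orange
orange.df$Tree = factor(as.numeric(orange.df$Tree))
# The original identifier is an ordered factor, the copy is not
is.ordered(Orange$Tree)
is.ordered(orange.df$Tree)
# Cross-tabulate old and new labels to see how the trees were renumbered
table(Original = Orange$Tree, Recoded = orange.df$Tree)</pre></div></div>

<p>Removing the ordering matters because, with the default contrast settings, <strong>lm</strong> codes an ordered factor with polynomial contrasts, whereas an unordered factor gets treatment contrasts that compare each tree with the first.</p>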

<p>The purpose of this step is to set up the variable for use in the linear model. The simplest model assumes that the relationship between circumference and age is the same for all five trees and we fit this model as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.mod1 = lm(circumference ~ age, data = orange.df)</pre></div></div>

<p>The summary of the fitted model is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(orange.mod1)
&nbsp;
Call:
lm(formula = circumference ~ age, data = orange.df)
&nbsp;
Residuals:
      Min        1Q    Median        3Q       Max 
-46.31030 -14.94610  -0.07649  19.69727  45.11146 
&nbsp;
Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 17.399650   8.622660   2.018   0.0518 .  
age          0.106770   0.008277  12.900 1.93e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 23.74 on 33 degrees of freedom
Multiple R-squared: 0.8345,     Adjusted R-squared: 0.8295 
F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14</pre></div></div>
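
<p>To make the fitted line concrete, a minimal sketch of extracting the coefficients and predicting the circumference at a given age is shown below. The age of 1000 days is an arbitrary value chosen for illustration and the object name <strong>new.age</strong> is hypothetical:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.df = Orange
orange.mod1 = lm(circumference ~ age, data = orange.df)
# Intercept and slope of the common regression line
coef(orange.mod1)
# Predicted circumference at an illustrative age of 1000 days
new.age = data.frame(age = 1000)
predict(orange.mod1, newdata = new.age)
# The same value computed directly from the coefficients
sum(coef(orange.mod1) * c(1, 1000))</pre></div></div>

<p>With the estimates reported in the summary this is roughly 17.4 + 0.1068 * 1000, or about 124 mm.</p>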

<p>The test on the <strong>age</strong> parameter provides very strong evidence of an increase in circumference with age, as would be expected. The next stage is to consider how this model can be extended &#8211; one idea is to have a separate intercept for each of the five trees. This new model assumes that the rate of increase in circumference is the same for all trees but that each tree starts from a different baseline circumference. We fit this model and get the summary as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; orange.mod2 = lm(circumference ~ age + Tree, data = orange.df)
&gt; summary(orange.mod2)
&nbsp;
Call:
lm(formula = circumference ~ age + Tree, data = orange.df)
&nbsp;
Residuals:
    Min      1Q  Median      3Q     Max 
-30.505  -8.790   3.738   7.650  21.859 
&nbsp;
Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) -4.457493   7.572732  -0.589   0.5607    
age          0.106770   0.005321  20.066  &lt; 2e-16 ***
Tree2        5.571429   8.157252   0.683   0.5000    
Tree3       17.142857   8.157252   2.102   0.0444 *  
Tree4       41.285714   8.157252   5.061 2.14e-05 ***
Tree5       45.285714   8.157252   5.552 5.48e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 15.26 on 29 degrees of freedom
Multiple R-squared: 0.9399,     Adjusted R-squared: 0.9295 
F-statistic:  90.7 on 5 and 29 DF,  p-value: &lt; 2.2e-16</pre></div></div>
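
<p>The <strong>Tree2</strong> to <strong>Tree5</strong> coefficients in the summary above are offsets from the intercept of the first tree. A short sketch of turning them into one intercept per tree follows; the object names <strong>offsets</strong> and <strong>intercepts</strong> are illustrative only:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.df = Orange
orange.df$Tree = factor(as.numeric(orange.df$Tree))
orange.mod2 = lm(circumference ~ age + Tree, data = orange.df)
b = coef(orange.mod2)
# Tree 1 is the baseline, so its offset from the intercept is zero
offsets = c(Tree1 = 0, b[paste(&quot;Tree&quot;, 2:5, sep = &quot;&quot;)])
# Intercept for each tree: common intercept plus the tree offset
intercepts = unname(b[&quot;(Intercept)&quot;]) + offsets
intercepts</pre></div></div>

<p>For example, the intercept for tree 4 is roughly -4.46 + 41.29 = 36.83 mm.</p>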

<p>The additional term is appended to the simple model using the <strong>+</strong> in the formula part of the call to <strong>lm</strong>. The first tree is used as the baseline to compare the other four trees against and the model summary shows that tree 2 is similar to tree 1 (no real need for a different offset) but that there is evidence that the offset for the other three trees is significantly larger than tree 1 (and tree 2). We can compare the two models using an F-test for nested models using the <strong>anova</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; anova(orange.mod1, orange.mod2)
Analysis of Variance Table
&nbsp;
Model 1: circumference ~ age
Model 2: circumference ~ age + Tree
  Res.Df     RSS Df Sum of Sq      F    Pr(&gt;F)    
1     33 18594.7                                  
2     29  6753.9  4     11841 12.711 4.289e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>

<p>Here there are four degrees of freedom used up by the more complicated model (four parameters for the different trees) and the test comparing the two models is highly significant. There is very strong evidence of a difference in starting circumference (for the data that was collected) between the trees.</p>
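
<p>The F statistic in the table can be reproduced by hand from the two residual sums of squares, which may help to show what the nested model comparison is doing. The numbers below are copied from the table and the object names are illustrative:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Residual sums of squares and degrees of freedom from the anova table
rss1 = 18594.7
df1 = 33
rss2 = 6753.9
df2 = 29
# Reduction in RSS per extra parameter, relative to the residual
# mean square of the larger model
f.stat = ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
f.stat
# Upper tail probability from the F distribution
pf(f.stat, df1 - df2, df2, lower.tail = FALSE)</pre></div></div>

<p>This gives an F value of about 12.71 on 4 and 29 degrees of freedom, in agreement with the table.</p>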
<p>We can extend this model further by allowing the rate of increase in circumference to vary between the five trees. This additional term can be included in the linear model as an interaction term, assuming that tree 1 is the baseline. An interaction term is included in the model formula with a <strong>:</strong> between the names of two variables. For the Orange tree data the new model is fitted thus:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; orange.mod3 = lm(circumference ~ age + Tree + age:Tree, data = orange.df)
&gt; summary(orange.mod3)
&nbsp;
Call:
lm(formula = circumference ~ age + Tree + age:Tree, data = orange.df)
&nbsp;
Residuals:
    Min      1Q  Median      3Q     Max 
-18.061  -6.639  -1.482   8.069  16.649 
&nbsp;
Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  1.920e+01  8.458e+00   2.270  0.03206 *  
age          8.111e-02  8.119e-03   9.991 3.27e-10 ***
Tree2        5.234e+00  1.196e+01   0.438  0.66544    
Tree3       -1.045e+01  1.196e+01  -0.873  0.39086    
Tree4        7.574e-01  1.196e+01   0.063  0.95002    
Tree5       -4.566e+00  1.196e+01  -0.382  0.70590    
age:Tree2    3.656e-04  1.148e-02   0.032  0.97485    
age:Tree3    2.992e-02  1.148e-02   2.606  0.01523 *  
age:Tree4    4.395e-02  1.148e-02   3.828  0.00077 ***
age:Tree5    5.406e-02  1.148e-02   4.708 7.93e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 10.41 on 25 degrees of freedom
Multiple R-squared: 0.9759,     Adjusted R-squared: 0.9672 
F-statistic: 112.4 on 9 and 25 DF,  p-value: &lt; 2.2e-16</pre></div></div>
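
<p>The <strong>age:Tree</strong> coefficients in the summary above are differences in slope relative to the first tree. The sketch below, with illustrative object names, adds them to the baseline slope and checks one of the results against a separate regression fitted to a single tree:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.df = Orange
orange.df$Tree = factor(as.numeric(orange.df$Tree))
orange.mod3 = lm(circumference ~ age + Tree + age:Tree, data = orange.df)
b = coef(orange.mod3)
# Tree 1 is the baseline, so its slope is the age coefficient itself
slope.diffs = c(Tree1 = 0, b[paste(&quot;age:Tree&quot;, 2:5, sep = &quot;&quot;)])
slopes = unname(b[&quot;age&quot;]) + slope.diffs
slopes
# A separate regression for tree 5 alone gives the same slope
coef(lm(circumference ~ age, data = orange.df, subset = Tree == 5))</pre></div></div>

<p>Because every tree has its own intercept and slope, the fitted lines match those from five separate regressions; the combined model differs in pooling the residual variance across trees.</p>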

<p>Interestingly, we see that there is strong evidence of a difference in the rate of change in circumference for the five trees. The previously observed difference in intercepts is no longer as strong but the <strong>Tree</strong> term is kept in the model &#8211; there are plenty of books/websites that discuss this marginality restriction on statistical models. The fitted model described above can be displayed using <strong>lattice</strong> graphics with a custom panel function making use of available panel functions for fitting and drawing a linear regression line for each panel of a Trellis display. The function call is shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
xyplot(circumference ~ age | Tree, data = orange.df,
  panel = function(x, y, ...)
  {
    panel.xyplot(x, y, ...)
    panel.lmline(x, y, ...)
  }
)</pre></div></div>

<p>The <strong>panel.xyplot</strong> and <strong>panel.lmline</strong> functions are part of the lattice package along with many other panel functions and can be built up to create a display that differs from the standard. The graph that is produced:</p>
<div id="attachment_992" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/orange-fittedmodel.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/orange-fittedmodel-300x300.png" alt="Orange Tree Fitted Model" title="Orange Tree Fitted Model" width="300" height="300" class="size-medium wp-image-992" /></a><p class="wp-caption-text">Analysis of Covariance Model fitted to the Orange Tree data</p></div>
<p>This graph clearly shows the different relationships between circumference and age for the five trees. The residuals from the model can be plotted against fitted values, divided by tree, to investigate the model assumptions:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xyplot(resid(orange.mod3) ~ fitted(orange.mod3) | orange.df$Tree,
  xlab = &quot;Fitted Values&quot;,
  ylab = &quot;Residuals&quot;,
  main = &quot;Residual Diagnostic Plot&quot;,
  panel = function(x, y, ...)
  {
    panel.grid(h = -1, v = -1)
    panel.abline(h = 0)
    panel.xyplot(x, y, ...)
  }
)</pre></div></div>

<p>The residual diagnostic plot is:</p>
<div id="attachment_994" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/orange-residualplot.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/orange-residualplot-300x300.png" alt="Orange Tree Model Residual Plot" title="Orange Tree Model Residual Plot" width="300" height="300" class="size-medium wp-image-994" /></a><p class="wp-caption-text">Residual diagnostic plot for the analysis of covariance model fitted to the Orange Tree data</p></div>
<p>There are no obvious problematic patterns in this graph so we conclude that this model is a reasonable representation of the relationship between circumference and age.</p>
<p>Additional: The analysis of variance table comparing the second and third models shows an improvement by moving to the more complicated model with different slopes:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; anova(orange.mod2, orange.mod3)
Analysis of Variance Table
&nbsp;
Model 1: circumference ~ age + Tree
Model 2: circumference ~ age + Tree + age:Tree
  Res.Df    RSS Df Sum of Sq      F    Pr(&gt;F)    
1     29 6753.9                                  
2     25 2711.0  4    4042.9 9.3206 9.402e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>
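
<p>The three nested models can also be passed to <strong>anova</strong> in a single call to get a sequential comparison. This is a sketch only and the object name <strong>orange.aov</strong> is illustrative:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.df = Orange
orange.df$Tree = factor(as.numeric(orange.df$Tree))
orange.mod1 = lm(circumference ~ age, data = orange.df)
orange.mod2 = lm(circumference ~ age + Tree, data = orange.df)
orange.mod3 = lm(circumference ~ age + Tree + age:Tree, data = orange.df)
# Sequential comparison of the three nested models in one table
orange.aov = anova(orange.mod1, orange.mod2, orange.mod3)
orange.aov</pre></div></div>

<p>When more than two models are supplied, the F statistics appear to use the residual mean square of the largest model as the denominator, so the row for the first comparison need not reproduce the value from the earlier two-model table.</p>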

]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/analysis-of-covariance-extending-simple-linear-regression/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Two-way Analysis of Variance (ANOVA)</title>
		<link>http://www.wekaleamstudios.co.uk/posts/two-way-analysis-of-variance-anova/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/two-way-analysis-of-variance-anova/#comments</comments>
		<pubDate>Mon, 15 Feb 2010 21:45:02 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Analysis of Variance]]></category>
		<category><![CDATA[Design of Experiments]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[analysis of variance]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[aov]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[Tukey HSD]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[two way]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=660</guid>
		<description><![CDATA[The analysis of variance (ANOVA) model can be extended from making a comparison between multiple groups to take into account additional factors in an experiment. The simplest extension is from one-way to two-way ANOVA where a second factor is included in the model as well as a potential interaction between the two factors. As an [...]]]></description>
			<content:encoded><![CDATA[<p>The analysis of variance (<strong>ANOVA</strong>) model can be extended from making a comparison between multiple groups to take into account additional factors in an experiment. The simplest extension is from one-way to two-way <strong>ANOVA</strong> where a second factor is included in the model as well as a potential interaction between the two factors.<span id="more-660"></span></p>
<p>As an example consider a company that regularly has to ship parcels between its various (five for this example) sub-offices and has the option of using three competing parcel delivery services, all of which charge roughly similar amounts for each delivery. To determine which service to use, the company decides to run an experiment shipping three packages with each service from its head office to each of the five sub-offices. The delivery time for each package is recorded and the data loaded into <strong>R</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.df = data.frame(
  Service = c(rep(&quot;Carrier 1&quot;, 15), rep(&quot;Carrier 2&quot;, 15),
    rep(&quot;Carrier 3&quot;, 15)),
  Destination = c(rep(c(&quot;Office 1&quot;, &quot;Office 2&quot;, &quot;Office 3&quot;,
    &quot;Office 4&quot;, &quot;Office 5&quot;), 9)),
  Time = c(15.23, 14.32, 14.77, 15.12, 14.05,
  15.48, 14.13, 14.46, 15.62, 14.23, 15.19, 14.67, 14.48, 15.34, 14.22,
  16.66, 16.27, 16.35, 16.93, 15.05, 16.98, 16.43, 15.95, 16.73, 15.62,
  16.53, 16.26, 15.69, 16.97, 15.37, 17.12, 16.65, 15.73, 17.77, 15.52,
  16.15, 16.86, 15.18, 17.96, 15.26, 16.36, 16.44, 14.82, 17.62, 15.04)
)</pre></div></div>

<p>The data is then displayed using a dot plot for an initial visual investigation of any trends in delivery time between the three services and across the five sub-offices. The colour aesthetic is used to distinguish between the three services in the plot.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)
ggplot(delivery.df, aes(Time, Destination, colour = Service)) + geom_point()</pre></div></div>

<p>This code produces the following graph:</p>
<div id="attachment_792" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-data.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-data-300x300.png" alt="Service Delivery Time by Destination" title="Delivery Time" width="300" height="300" class="size-medium wp-image-792" /></a><p class="wp-caption-text">Graph of the delivery time for different services and destinations</p></div>
<p>The graph shows a general pattern of service carrier 1 having shorter delivery times than the other two services. There is also an indication that the differences between the services vary for the five sub-offices and we might expect the interaction term to be significant in the two-way <strong>ANOVA</strong> model. To fit the two-way <strong>ANOVA</strong> model we use this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.mod1 = aov(Time ~ Destination*Service, data = delivery.df)</pre></div></div>
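
<p>As a quick way to see which terms the formula in the call above contains, the formula can be expanded with <strong>terms</strong> without fitting anything. The object names <strong>full</strong> and <strong>long</strong> are illustrative:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Expand the shorthand formula into its individual terms
full = terms(Time ~ Destination * Service)
attr(full, &quot;term.labels&quot;)
# The long-hand formula gives the same set of terms
long = terms(Time ~ Destination + Service + Destination:Service)
identical(attr(full, &quot;term.labels&quot;), attr(long, &quot;term.labels&quot;))</pre></div></div>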

<p>The <strong>*</strong> symbol instructs <strong>R</strong> to create a formula that includes main effects for both Destination and Service as well as the two-way interaction between these two factors. We save the fitted model to an object which we can summarise as follows to test for importance of the various model terms:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(delivery.mod1)
                    Df  Sum Sq Mean Sq  F value    Pr(&gt;F)    
Destination          4 17.5415  4.3854  61.1553 5.408e-14 ***
Service              2 23.1706 11.5853 161.5599 &lt; 2.2e-16 ***
Destination:Service  8  4.1888  0.5236   7.3018 2.360e-05 ***
Residuals           30  2.1513  0.0717                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>

<p>We have strong evidence here that there are differences between the three delivery services, between the five sub-office destinations and that there is an interaction between destination and service in line with what we saw in the original plot of the data. Now that we have fitted the model and identified the important factors we need to investigate the model diagnostics to ensure that the various assumptions are broadly valid.</p>
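
<p>The entries in the analysis of variance table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, and each F value is a mean square divided by the residual mean square. A sketch using the numbers copied from the table, with illustrative object names:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Sums of squares and degrees of freedom copied from the summary table
ss = c(Destination = 17.5415, Service = 23.1706, Interaction = 4.1888)
dof = c(Destination = 4, Service = 2, Interaction = 8)
# Residual mean square, the common denominator for the three tests
ms.resid = 2.1513 / 30
# Mean squares, F statistics and upper tail p-values
ms = ss / dof
f.values = ms / ms.resid
p.values = pf(f.values, dof, 30, lower.tail = FALSE)
cbind(ms, f.values, p.values)</pre></div></div>

<p>Up to rounding of the inputs this reproduces the F values of about 61.2, 161.6 and 7.3 shown in the table.</p>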
<p>We can plot the model residuals against fitted values to look for obvious trends that are not consistent with the model assumptions about independence and common variance. The first step is to create a data frame with the fitted values and residuals from the above model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.res = delivery.df
delivery.res$M1.Fit = fitted(delivery.mod1)
delivery.res$M1.Resid = resid(delivery.mod1)</pre></div></div>

<p>Then a scatter plot is used to display the fitted values and residuals where the colour aesthetic highlights which points correspond to the three competing delivery services:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(M1.Fit, M1.Resid, colour = Service)) + geom_point() +
  xlab(&quot;Fitted Values&quot;) + ylab(&quot;Residuals&quot;)</pre></div></div>

<p>The <strong>xlab()</strong> and <strong>ylab()</strong> are used to change the text on the axis labels. The residual diagnostic plot is:</p>
<div id="attachment_798" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid1.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid1-300x300.png" alt="Model Residual Plot" title="Model Residual Plot" width="300" height="300" class="size-medium wp-image-798" /></a><p class="wp-caption-text">Diagnostic Residual Plot for Delivery Time Model</p></div>
<p>There are no obvious patterns in this plot that suggest problems with the two-way <strong>ANOVA</strong> model that we fitted to the data.</p>
<p>As an alternative display we could separate the residuals into destination sub-offices, where the <strong>facet_wrap()</strong> function instructs <strong>ggplot</strong> to create a separate display (panel) for each of the destinations.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(M1.Fit, M1.Resid, colour = Service)) +
  geom_point() + xlab(&quot;Fitted Values&quot;) + ylab(&quot;Residuals&quot;) +
  facet_wrap( ~ Destination)</pre></div></div>

<p>To produce the following alternative residual plot:</p>
<div id="attachment_799" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid2.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid2-300x300.png" alt="Model Residual Plot" title="Model Residual Plot" width="300" height="300" class="size-medium wp-image-799" /></a><p class="wp-caption-text">Diagnostic Residual Plot for Delivery Time Model by Destination</p></div>
<p>No obvious problems in this diagnostic plot.</p>
<p>We could also consider dividing the data by delivery service to get a different view of the residuals:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(M1.Fit, M1.Resid, colour = Destination)) +
  geom_point() + xlab(&quot;Fitted Values&quot;) + ylab(&quot;Residuals&quot;) +
  facet_wrap( ~ Service)</pre></div></div>

<p>This creates the following graph:</p>
<div id="attachment_800" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid3.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid3-300x300.png" alt="Model Residual Plot" title="Model Residual Plot" width="300" height="300" class="size-medium wp-image-800" /></a><p class="wp-caption-text">Diagnostic Residual Plot for Delivery Time Model by Service</p></div>
<p>Again there is nothing substantial here to lead us to consider an alternative analysis.</p>
<p>Lastly we consider the normal probability plot of the model residuals, using the <strong>stat_qq()</strong> option:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(sample = M1.Resid)) + stat_qq()</pre></div></div>

<p>The quantile plot is:</p>
<div id="attachment_806" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-qq.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-qq-300x300.png" alt="Quantile Plot" title="Quantile Plot" width="300" height="300" class="size-medium wp-image-806" /></a><p class="wp-caption-text">Normal Probability Plot for Delivery Time Model</p></div>
<p>This plot is very close to the straight line we would expect to observe if the residuals were approximately normally distributed. To round off the analysis we look at the Tukey HSD multiple comparisons to confirm that the differences are between delivery service 1 and the other two competing services:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; TukeyHSD(delivery.mod1, which = &quot;Service&quot;)
  Tukey multiple comparisons of means
    95% family-wise confidence level
&nbsp;
Fit: aov(formula = Time ~ Destination * Service, data = delivery.df)
&nbsp;
$Service
                        diff        lwr       upr     p adj
Carrier 2-Carrier 1 1.498667  1.2576092 1.7397241 0.0000000
Carrier 3-Carrier 1 1.544667  1.3036092 1.7857241 0.0000000
Carrier 3-Carrier 2 0.046000 -0.1950575 0.2870575 0.8856246</pre></div></div>

<p>Even with the multiple comparison post-hoc adjustment there is very strong evidence for the differences that we have consistently observed throughout the analysis.</p>
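
<p>The width of these intervals can be reconstructed from the residual mean square and the studentized range distribution. The sketch below assumes the balanced design used here, with 15 deliveries per carrier, and takes its inputs from the tables above; the object names are illustrative:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Residual mean square and degrees of freedom from the ANOVA table
ms.resid = 2.1513 / 30
df.resid = 30
# Each carrier mean is based on 15 deliveries
n = 15
# Critical value of the studentized range for three means
q.crit = qtukey(0.95, nmeans = 3, df = df.resid)
# Half-width of the simultaneous 95% intervals
half.width = q.crit / sqrt(2) * sqrt(ms.resid * (1 / n + 1 / n))
half.width
# Interval for Carrier 2 - Carrier 1
1.498667 + c(-1, 1) * half.width</pre></div></div>

<p>The half-width comes out at about 0.241, which matches the distance between <strong>diff</strong> and the <strong>lwr</strong> and <strong>upr</strong> columns in the output.</p>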
<p>We can use <strong>ggplot</strong> to visualise the difference in mean delivery time for the services and the 95% confidence intervals on these differences. We create a data frame from the <strong>TukeyHSD</strong> output by extracting the component relating to the delivery service comparison and add the text labels by extracting the row names from the data frame.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.hsd = data.frame(TukeyHSD(delivery.mod1, which = &quot;Service&quot;)$Service)
delivery.hsd$Comparison = row.names(delivery.hsd)</pre></div></div>

<p>We then use the <strong>geom_pointrange()</strong> to specify lower, middle and upper values based on the three pairwise comparisons of interest.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.hsd, aes(Comparison, y = diff, ymin = lwr, ymax = upr)) +
  geom_pointrange() + ylab(&quot;Difference in Mean Delivery Time by Service&quot;) +
  coord_flip()</pre></div></div>

<p>The <strong>coord_flip()</strong> is used to make the confidence intervals horizontal rather than vertical on the graph. This can be confusing for creating the axis labels as we specify the label where it would appear prior to the flip of coordinates. In the example above we add text to the y axis but this now appears on the x axis in the final graph:</p>
<div id="attachment_811" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-tukeyHSD.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-tukeyHSD-300x300.png" alt="Tukey HSD" title="Tukey HSD" width="300" height="300" class="size-medium wp-image-811" /></a><p class="wp-caption-text">Plot of Confidence Intervals for Mean Differences using Tukey HSD</p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/two-way-analysis-of-variance-anova/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>One-way ANOVA (cont.)</title>
		<link>http://www.wekaleamstudios.co.uk/posts/one-way-anova-cont/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/one-way-anova-cont/#comments</comments>
		<pubDate>Fri, 12 Feb 2010 13:45:34 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Analysis of Variance]]></category>
		<category><![CDATA[Design of Experiments]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[aov]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[multiple comparisons]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=767</guid>
		<description><![CDATA[In a previous post we considered using R to fit one-way ANOVA models to data. In this post we consider a few additional ways that we can look at the analysis. In the analysis we made use of the linear model function lm and the analysis could [...]]]></description>
			<content:encoded><![CDATA[<p>In a previous <a href="http://www.wekaleamstudios.co.uk/?p=658">post</a> we considered using <strong>R</strong> to fit one-way ANOVA models to data. In this post we consider a few additional ways that we can look at the analysis.<span id="more-767"></span></p>
<p><!--[Fast Tube]--><span id="PBE-llEkiHk" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/one-way-anova-cont/#PBE-llEkiHk"><img src="http://i.ytimg.com/vi/PBE-llEkiHk/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p><!--[Fast Tube]--><span id="r_uSH0Xaau8" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/one-way-anova-cont/#r_uSH0Xaau8"><img src="http://i.ytimg.com/vi/r_uSH0Xaau8/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the analysis we made use of the linear model function <strong>lm</strong> and the same analysis could also be conducted using the <strong>aov</strong> function. The code used to fit the model is very similar:</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"><span style="color: #080;">&gt;</span> plant.<span style="">mod2</span> <span style="color: #080;">=</span> <span style="color: #0000FF; font-weight: bold;">aov</span><span style="color: #080;">&#40;</span>weight ~ group, <span style="color: #0000FF; font-weight: bold;">data</span> <span style="color: #080;">=</span> plant.<span style="">df</span><span style="color: #080;">&#41;</span>
<span style="color: #080;">&gt;</span> <span style="color: #0000FF; font-weight: bold;">summary</span><span style="color: #080;">&#40;</span>plant.<span style="">mod2</span><span style="color: #080;">&#41;</span>
            Df  Sum Sq Mean Sq <span style="color: #0000FF; font-weight: bold;">F</span> value  Pr<span style="color: #080;">&#40;</span><span style="color: #080;">&gt;</span>F<span style="color: #080;">&#41;</span>  
group        <span style="color: #ff0000;">2</span>  <span style="color: #ff0000;">3.7663</span>  <span style="color: #ff0000;">1.8832</span>  <span style="color: #ff0000;">4.8461</span> <span style="color: #ff0000;">0.01591</span> <span style="color: #080;">*</span>
Residuals   <span style="color: #ff0000;">27</span> <span style="color: #ff0000;">10.4921</span>  <span style="color: #ff0000;">0.3886</span>                  
<span style="color: #080;">---</span>
Signif. <span style="color: #0000FF; font-weight: bold;">codes</span><span style="color: #080;">:</span>  <span style="color: #ff0000;">0</span> ‘<span style="color: #080;">***</span>’ <span style="color: #ff0000;">0.001</span> ‘<span style="color: #080;">**</span>’ <span style="color: #ff0000;">0.01</span> ‘<span style="color: #080;">*</span>’ <span style="color: #ff0000;">0.05</span> ‘.’ <span style="color: #ff0000;">0.1</span> ‘ ’ <span style="color: #ff0000;">1</span></pre></div></div>
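
<p>To illustrate how closely the two functions are related, the sketch below fits the same one-way model both ways. It assumes that the built-in <strong>PlantGrowth</strong> data frame can stand in for <strong>plant.df</strong>, which is created in the earlier post and not shown here; the group labels will therefore differ from those in the output above:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Assumption: PlantGrowth stands in for the plant.df data frame
plant.lm = lm(weight ~ group, data = PlantGrowth)
plant.aov = aov(weight ~ group, data = PlantGrowth)
# Both routes lead to the same analysis of variance table
anova(plant.lm)
summary(plant.aov)
# The fitted coefficients are identical as well
all.equal(coef(plant.lm), coef(plant.aov))</pre></div></div>

<p>The main practical difference is in the default print and summary methods, with <strong>aov</strong> objects reporting the analysis of variance table rather than the coefficient table.</p>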

<p>The output from using the <strong>summary</strong> function of the fitted model object shows the analysis of variance table with the p-value showing evidence of differences between the three groups. In <strong>R</strong> we can investigate the particular groups where there are differences using Tukey&#8217;s multiple comparisons:</p>

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"><span style="color: #080;">&gt;</span> <span style="color: #0000FF; font-weight: bold;">TukeyHSD</span><span style="color: #080;">&#40;</span>plant.<span style="">mod2</span><span style="color: #080;">&#41;</span>
  Tukey multiple comparisons of means
    <span style="color: #ff0000;">95</span><span style="color: #080;">%</span> family<span style="color: #080;">-</span>wise confidence level
&nbsp;
Fit<span style="color: #080;">:</span> <span style="color: #0000FF; font-weight: bold;">aov</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">formula</span> <span style="color: #080;">=</span> weight ~ group, <span style="color: #0000FF; font-weight: bold;">data</span> <span style="color: #080;">=</span> plant.<span style="">df</span><span style="color: #080;">&#41;</span>
&nbsp;
$group
                          <span style="color: #0000FF; font-weight: bold;">diff</span>        lwr       upr     p adj
Treatment <span style="color: #ff0000;">1</span><span style="color: #080;">-</span>Control     <span style="color: #080;">-</span><span style="color: #ff0000;">0.371</span> <span style="color: #080;">-</span><span style="color: #ff0000;">1.0622161</span> <span style="color: #ff0000;">0.3202161</span> <span style="color: #ff0000;">0.3908711</span>
Treatment <span style="color: #ff0000;">2</span><span style="color: #080;">-</span>Control      <span style="color: #ff0000;">0.494</span> <span style="color: #080;">-</span><span style="color: #ff0000;">0.1972161</span> <span style="color: #ff0000;">1.1852161</span> <span style="color: #ff0000;">0.1979960</span>
Treatment <span style="color: #ff0000;">2</span><span style="color: #080;">-</span>Treatment <span style="color: #ff0000;">1</span>  <span style="color: #ff0000;">0.865</span>  <span style="color: #ff0000;">0.1737839</span> <span style="color: #ff0000;">1.5562161</span> <span style="color: #ff0000;">0.0120064</span></pre></div></div>

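<p>The multiple comparison tests show that the only significant difference is between treatments 1 and 2 (adjusted p-value 0.012, with a confidence interval that excludes zero); neither treatment differs significantly from the control group. The 95% confidence intervals for the differences shown above can be plotted:</p>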

<div class="wp_syntax"><div class="code"><pre class="rsplus" style="font-family:monospace;"><span style="color: #0000FF; font-weight: bold;">plot</span><span style="color: #080;">&#40;</span><span style="color: #0000FF; font-weight: bold;">TukeyHSD</span><span style="color: #080;">&#40;</span>plant.<span style="">mod2</span><span style="color: #080;">&#41;</span><span style="color: #080;">&#41;</span></pre></div></div>

<p>which gives</p>
<p><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-oneway-tukeyHSD.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-oneway-tukeyHSD-300x300.png" alt="Tukey HSD Plot" title="Plant Growth Tukey Multiple Comparison" width="300" height="300" class="aligncenter size-medium wp-image-771" /></a></p>
<p>Post-hoc adjustments of this kind are recommended because the comparisons are being tested after looking at the data rather than as part of a pre-planned analysis.</p>
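<p>The width of the Tukey intervals above can be reproduced by hand. The sketch below assumes that <strong>plant.mod2</strong> was fitted with <strong>aov</strong> as shown in the Fit line of the output, and that the design is balanced with the same number of plants in every group; the object names <strong>mse</strong>, <strong>n.per.group</strong> and <strong>half.width</strong> are introduced here purely for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Rebuild the data and the aov fit so that the example stands alone
plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,
  labels = c(&quot;Control&quot;, &quot;Treatment 1&quot;, &quot;Treatment 2&quot;))
plant.mod2 = aov(weight ~ group, data = plant.df)
# Residual mean square from the analysis of variance table
mse = sum(residuals(plant.mod2)^2) / df.residual(plant.mod2)
# Number of groups and (assuming balance) number of plants per group
n.groups = nlevels(plant.df$group)
n.per.group = nrow(plant.df) / n.groups
# Half-width of each 95% family-wise interval, based on the studentized range
half.width = qtukey(0.95, nmeans = n.groups,
  df = df.residual(plant.mod2)) * sqrt(mse / n.per.group)
half.width</pre></div></div>

<p>Adding and subtracting this half-width from each difference in means should give the <strong>lwr</strong> and <strong>upr</strong> columns reported by <strong>TukeyHSD</strong>.</p>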
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/one-way-anova-cont/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>One-way Analysis of Variance (ANOVA)</title>
		<link>http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/#comments</comments>
		<pubDate>Wed, 03 Feb 2010 21:01:24 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Analysis of Variance]]></category>
		<category><![CDATA[Design of Experiments]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[analysis of variance]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[factor]]></category>
		<category><![CDATA[fitted values]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[one way]]></category>
		<category><![CDATA[residuals]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=658</guid>
		<description><![CDATA[Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. The base case is the one-way ANOVA, which extends the two-sample t-test for independent groups to situations where more than two groups are being compared. [...]]]></description>
			<content:encoded><![CDATA[<p>Analysis of Variance (<strong>ANOVA</strong>) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. The base case is the one-way <strong>ANOVA</strong>, which extends the two-sample t-test for independent groups to situations where more than two groups are being compared.<span id="more-658"></span></p>
<p><!--[Fast Tube]--><span id="PBE-llEkiHk" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/#PBE-llEkiHk"><img src="http://i.ytimg.com/vi/PBE-llEkiHk/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p><!--[Fast Tube]--><span id="r_uSH0Xaau8" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/#r_uSH0Xaau8"><img src="http://i.ytimg.com/vi/r_uSH0Xaau8/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In one-way <strong>ANOVA</strong> the data is sub-divided into groups based on a single classification factor. The standard term for the factor levels is <strong>treatments</strong>, even though this might not always have meaning for the particular application. There is variation in the measurements taken on the individual components of the data set and ANOVA investigates whether this variation can be explained by the grouping introduced by the classification factor.</p>
<p>As an example we consider the <strong>PlantGrowth</strong> data set, available with R, from an experiment into plant growth. The purpose of the experiment was to compare the yields of the plants for a control group and two treatments of interest. The response variable was the dried weight of the plants.</p>
<p>The first step in the investigation is to take a copy of the data frame so that we can make adjustments as necessary while leaving the original data alone. We use the <strong>factor</strong> function to re-define the labels of the <strong>group</strong> variable that will appear in the output and graphs:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,
  labels = c(&quot;Control&quot;, &quot;Treatment 1&quot;, &quot;Treatment 2&quot;))</pre></div></div>

<p>The <strong>labels</strong> argument is a character vector of names corresponding, in order, to the levels of the <strong>group</strong> factor variable.</p>
<p>A boxplot of the distributions of the dried weights for the three competing groups is created using the <strong>ggplot2</strong> package:</p>
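<p>As a quick check on the relabelling, the following sketch compares the original and new level names and counts the plants in each group. It assumes only the built-in <strong>PlantGrowth</strong> data; because the labels are matched to the existing levels by position, the order of the names matters:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Copy the data and relabel the factor, as in the code above
plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,
  labels = c(&quot;Control&quot;, &quot;Treatment 1&quot;, &quot;Treatment 2&quot;))
# Level names before and after relabelling
levels(PlantGrowth$group)
levels(plant.df$group)
# Number of plants in each relabelled group
table(plant.df$group)</pre></div></div>

<p>Each of the three groups should contain ten plants.</p>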

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">require(ggplot2)
&nbsp;
ggplot(plant.df, aes(x = group, y = weight)) +
  geom_boxplot(fill = &quot;grey80&quot;, colour = &quot;blue&quot;) +
  scale_x_discrete() + xlab(&quot;Treatment Group&quot;) +
  ylab(&quot;Dried weight of plants&quot;)</pre></div></div>

<p>The <strong>fill</strong> and <strong>colour</strong> arguments of <strong>geom_boxplot()</strong> set the background and outline colours of the boxes. The axis labels are created with <strong>xlab()</strong> and <strong>ylab()</strong>. The plot that is produced looks like this:</p>
<p><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/01/anova-oneway-data.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/01/anova-oneway-data-300x300.png" alt="Boxplot of Plant Growth by Treatment Group" title="Plant Growth Data Summary" width="300" height="300" class="aligncenter size-medium wp-image-754" /></a></p>
<p>Initial inspection of the data suggests that there are differences in the dried weight between the two treatments, but it is less clear whether either treatment differs from the control group. To investigate these differences we fit the one-way ANOVA model using the <strong>lm</strong> function and look at the parameter estimates and standard errors for the treatment effects. The function call is:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plant.mod1 = lm(weight ~ group, data = plant.df)</pre></div></div>

<p>We save the fitted model in an object so that we can study the goodness of fit to the data and check the other model assumptions. The standard summary of an <strong>lm</strong> object is used to produce the following output:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(plant.mod1)
&nbsp;
Call:
lm(formula = weight ~ group, data = plant.df)
&nbsp;
Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 
&nbsp;
Coefficients:
                 Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)        5.0320     0.1971  25.527   &lt;2e-16 ***
groupTreatment 1  -0.3710     0.2788  -1.331   0.1944    
groupTreatment 2   0.4940     0.2788   1.772   0.0877 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,     Adjusted R-squared: 0.2096 
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591</pre></div></div>
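<p>One way to read the coefficient table above, assuming that R's default treatment contrasts are in use, is that the intercept is the mean of the control group and each treatment coefficient is the difference between that treatment's mean and the control mean. The sketch below checks this directly; the name <strong>group.means</strong> is introduced here for illustration only:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Rebuild the data and the model so that the example stands alone
plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,
  labels = c(&quot;Control&quot;, &quot;Treatment 1&quot;, &quot;Treatment 2&quot;))
plant.mod1 = lm(weight ~ group, data = plant.df)
# Mean dried weight in each group
group.means = tapply(plant.df$weight, plant.df$group, mean)
group.means
# Differences from the control mean, to compare with the treatment coefficients
group.means[-1] - group.means[&quot;Control&quot;]
coef(plant.mod1)</pre></div></div>

<p>The two differences should agree with the estimates of -0.371 and 0.494 in the table.</p>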

<p>The coefficient table shows only weak evidence of a difference in average dried weight between the 2nd treatment and the control group (p = 0.088), and no evidence of a difference between the 1st treatment and the control group (p = 0.194). An analysis of variance table for this model can be produced via the <strong>anova</strong> command:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; anova(plant.mod1)
Analysis of Variance Table
&nbsp;
Response: weight
          Df  Sum Sq Mean Sq F value  Pr(&gt;F)  
group      2  3.7663  1.8832  4.8461 0.01591 *
Residuals 27 10.4921  0.3886                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>
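<p>The F value in the analysis of variance table above is the ratio of the group mean square to the residual mean square, and the p-value is the upper tail area of the corresponding F distribution. As a minimal sketch, with <strong>plant.aov</strong>, <strong>f.value</strong> and <strong>p.value</strong> being names introduced here for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Rebuild the data and the model so that the example stands alone
plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,
  labels = c(&quot;Control&quot;, &quot;Treatment 1&quot;, &quot;Treatment 2&quot;))
plant.mod1 = lm(weight ~ group, data = plant.df)
# The anova table is a data frame, so its columns can be extracted by name
plant.aov = anova(plant.mod1)
# F value: group mean square divided by residual mean square
f.value = plant.aov[[&quot;Mean Sq&quot;]][1] / plant.aov[[&quot;Mean Sq&quot;]][2]
# p-value: upper tail of the F distribution on 2 and 27 degrees of freedom
p.value = pf(f.value, df1 = plant.aov$Df[1], df2 = plant.aov$Df[2],
  lower.tail = FALSE)
c(F = f.value, p = p.value)</pre></div></div>

<p>The result should match the F value and p-value printed in the table, that is 1.8832 / 0.3886 up to rounding.</p>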

<p>The F-test in the analysis of variance table (p = 0.016) indicates that there are differences between the groups overall, and it matches the F-statistic reported at the foot of the model summary. The function <strong>confint</strong> is used to calculate confidence intervals for the model parameters, by default 95% confidence intervals:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; confint(plant.mod1)
                       2.5 %    97.5 %
(Intercept)       4.62752600 5.4364740
groupTreatment 1 -0.94301261 0.2010126
groupTreatment 2 -0.07801261 1.0660126</pre></div></div>
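<p>For a linear model these intervals are the estimate plus or minus a t quantile multiplied by the standard error. The following sketch reproduces them from the coefficient table; the names <strong>est</strong>, <strong>se</strong>, <strong>t.crit</strong> and <strong>manual.ci</strong> are introduced here for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Rebuild the data and the model so that the example stands alone
plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,
  labels = c(&quot;Control&quot;, &quot;Treatment 1&quot;, &quot;Treatment 2&quot;))
plant.mod1 = lm(weight ~ group, data = plant.df)
# Estimates and standard errors from the coefficient table
est = coef(summary(plant.mod1))[, &quot;Estimate&quot;]
se = coef(summary(plant.mod1))[, &quot;Std. Error&quot;]
# Two-sided 95% critical value on the residual degrees of freedom
t.crit = qt(0.975, df = df.residual(plant.mod1))
# Lower and upper limits, to compare with confint(plant.mod1)
manual.ci = cbind(lower = est - t.crit * se, upper = est + t.crit * se)
manual.ci</pre></div></div>

<p>Both treatment intervals contain zero, which is consistent with the p-values above 0.05 in the coefficient table.</p>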

<p>The model residuals can be plotted against the fitted values to investigate the model assumptions. First we create a data frame with the fitted values, residuals and treatment identifiers:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plant.mod = data.frame(Fitted = fitted(plant.mod1),
  Residuals = resid(plant.mod1), Treatment = plant.df$group)</pre></div></div>

<p>and then produce the plot:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(plant.mod, aes(Fitted, Residuals, colour = Treatment)) + geom_point()</pre></div></div>

<p>which produces this graph:<br />
<a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-oneway-residualplot.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-oneway-residualplot-300x300.png" alt="Residual diagnostic plot" title="Plant Growth Residual Plot" width="300" height="300" class="aligncenter size-medium wp-image-762" /></a><br />
The diagnostic plot shows no major problems, although there is some evidence that the spread of the residuals differs between the three treatment groups.</p>
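<p>The visual impression of the spread can be backed up with a simple numerical summary. As a rough sketch, the standard deviation of the residuals within each group can be calculated with <strong>tapply</strong>; the name <strong>resid.sd</strong> is introduced here for illustration, and this is a descriptive check rather than a formal test:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Rebuild the data and the model so that the example stands alone
plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,
  labels = c(&quot;Control&quot;, &quot;Treatment 1&quot;, &quot;Treatment 2&quot;))
plant.mod1 = lm(weight ~ group, data = plant.df)
# Standard deviation of the residuals within each treatment group
resid.sd = tapply(resid(plant.mod1), plant.df$group, sd)
resid.sd</pre></div></div>

<p>Markedly different values across the groups would support the impression from the plot that the variability is not the same in every group.</p>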
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

