<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Software for Exploratory Data Analysis and Statistical Modelling &#187; Statistical Modelling</title>
	<atom:link href="http://www.wekaleamstudios.co.uk/topics/statistical-analysis/statistical-modelling/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wekaleamstudios.co.uk</link>
	<description>Statistical Modelling with R</description>
	<lastBuildDate>Wed, 01 Feb 2012 19:44:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Generalized Linear Models &#8211; Poisson Regression</title>
		<link>http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 09:28:50 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Linear Models]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[Generalized Linear Model]]></category>
		<category><![CDATA[glm]]></category>
		<category><![CDATA[Poisson]]></category>
		<category><![CDATA[update]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1547</guid>
		<description><![CDATA[The Generalized Linear Model (GLM) allows us to model responses with distributions other than the Normal distribution, which is one of the assumptions underlying linear regression as used in many cases. When the data are counts of events (or items), a discrete distribution is usually more appropriate than approximating with a continuous [...]]]></description>
			<content:encoded><![CDATA[<p>The Generalized Linear Model (GLM) allows us to model responses with distributions other than the Normal distribution, which is one of the assumptions underlying linear regression as used in many cases. When the data are counts of events (or items), a discrete distribution is usually more appropriate than approximating with a continuous distribution, especially as our counts should be bounded below at zero: negative counts do not make sense.<span id="more-1547"></span></p>
<p><!--[Fast Tube]--><span id="Z1qE9-Vqw50" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/#Z1qE9-Vqw50"><img src="http://i.ytimg.com/vi/Z1qE9-Vqw50/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>To investigate using Poisson regression via the GLM framework consider a small data set on failure modes (<a href="http://www.sci.usq.edu.au/staff/dunn/Datasets/tech-glms.html">here</a>).</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; failure.df = read.table(&quot;twomodes.dat&quot;, header = TRUE)
&gt; failure.df
  Mode1 Mode2 Failures
1  33.3  25.3       15
2  52.2  14.4        9
3  64.7  32.5       14
4 137.0  20.5       24
5 125.9  97.6       27
6 116.3  53.6       27
7 131.7  56.6       23
8  85.0  87.3       18
9  91.9  47.8       22</pre></div></div>

<p>The machinery is run in two modes and the objective of the analysis is to determine whether the number of failures depends on how long the machine is run in mode 1 or mode 2, and whether an interaction between the times spent in each mode increases or decreases the number of failures.</p>
<p>The response for this set of data is the number of failures (count) so a Poisson regression model is considered.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; fmod1 = glm(Failures ~ Mode1 * Mode2, data = failure.df, family = poisson)
&gt; summary(fmod1)
&nbsp;
Call:
glm(formula = Failures ~ Mode1 * Mode2, family = poisson, data = failure.df)
&nbsp;
Deviance Residuals: 
       1         2         3         4         5         6         7         8         9  
 0.91003  -1.15601  -0.28328  -0.10398   0.03526   0.84825  -0.49211  -0.57298   0.64821  
&nbsp;
Coefficients:
              Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept)  2.105e+00  4.481e-01   4.698 2.63e-06 ***
Mode1        7.687e-03  4.285e-03   1.794   0.0729 .  
Mode2        4.703e-03  1.163e-02   0.405   0.6858    
Mode1:Mode2 -1.978e-05  1.037e-04  -0.191   0.8487    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
(Dispersion parameter for poisson family taken to be 1)
&nbsp;
    Null deviance: 16.996  on 8  degrees of freedom
Residual deviance:  3.967  on 5  degrees of freedom
AIC: 55.024
&nbsp;
Number of Fisher Scoring iterations: 4</pre></div></div>

<p>The model output does not provide any support for an interaction between the amounts of time spent in the two different modes of operation. If we remove the interaction term and re-fit the model, using the update function, we get:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; fmod2 = update(fmod1, . ~ . - Mode1:Mode2)
&gt; summary(fmod2)
&nbsp;
Call:
glm(formula = Failures ~ Mode1 + Mode2, family = poisson, data = failure.df)
&nbsp;
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.21984  -0.44735  -0.05893   0.68351   0.87510  
&nbsp;
Coefficients:
            Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept) 2.175168   0.255456   8.515  &lt; 2e-16 ***
Mode1       0.007015   0.002429   2.888  0.00387 ** 
Mode2       0.002549   0.002835   0.899  0.36852    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
(Dispersion parameter for poisson family taken to be 1)
&nbsp;
    Null deviance: 16.9964  on 8  degrees of freedom
Residual deviance:  4.0033  on 6  degrees of freedom
AIC: 53.06
&nbsp;
Number of Fisher Scoring iterations: 4</pre></div></div>

<p>This output suggests that the time of operation in mode 1 is important for determining the number of failures but the time of operation in mode 2 is not. One last step gives us:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; fmod3 = update(fmod2, . ~ . - Mode2)
&gt; summary(fmod3)
&nbsp;
Call:
glm(formula = Failures ~ Mode1, family = poisson, data = failure.df)
&nbsp;
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.43194  -0.56958  -0.00745   0.66742   0.82231  
&nbsp;
Coefficients:
            Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept) 2.237196   0.243053   9.205  &lt; 2e-16 ***
Mode1       0.007705   0.002264   3.403 0.000667 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
(Dispersion parameter for poisson family taken to be 1)
&nbsp;
    Null deviance: 16.9964  on 8  degrees of freedom
Residual deviance:  4.8078  on 7  degrees of freedom
AIC: 51.865
&nbsp;
Number of Fisher Scoring iterations: 4</pre></div></div>
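
<p>As a rough check on the simplification, the two dropped terms can be tested together with an analysis of deviance, and the remaining slope can be read as a rate ratio. The sketch below is not part of the original analysis: it rebuilds the data frame from the values printed earlier so that it runs without the data file, and the name <strong>rate.ratio</strong> is introduced here purely for illustration.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Rebuild the data printed earlier in the post (assumed to match twomodes.dat)
failure.df = data.frame(
  Mode1 = c(33.3, 52.2, 64.7, 137.0, 125.9, 116.3, 131.7, 85.0, 91.9),
  Mode2 = c(25.3, 14.4, 32.5, 20.5, 97.6, 53.6, 56.6, 87.3, 47.8),
  Failures = c(15, 9, 14, 24, 27, 27, 23, 18, 22))
&nbsp;
# Full model with interaction and the final model with Mode1 only
fmod1 = glm(Failures ~ Mode1 * Mode2, data = failure.df, family = poisson)
fmod3 = glm(Failures ~ Mode1, data = failure.df, family = poisson)
&nbsp;
# Analysis of deviance: do Mode2 and the interaction improve the fit?
anova(fmod3, fmod1, test = 'Chisq')
&nbsp;
# Exponentiated coefficients act multiplicatively on the expected count
rate.ratio = exp(coef(fmod3))
rate.ratio</pre></div></div>

<p>With the slope of 0.007705 reported above, exp(0.007705) is roughly 1.008, so each extra unit of time in mode 1 is associated with an expected number of failures about 0.8% higher.</p>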

<p>The diagnostic plots shown below do not indicate any major problems with the final model, especially given the small number of data points.</p>
<div id="attachment_1644" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2011/06/Poisson-Regression.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2011/06/Poisson-Regression-300x300.png" alt="Residual Plots for Poisson Regression model" title="Residual Plots for Poisson Regression model" width="300" height="300" class="size-medium wp-image-1644" /></a><p class="wp-caption-text">Four diagnostic plots for a Poisson regression model based on total failures</p></div>
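<p>The post does not show how the plots were drawn. One plausible way, assuming the standard <strong>plot</strong> method for fitted models was used, is sketched here together with a simple dispersion check; the data frame is rebuilt from the printed values and the name <strong>dispersion</strong> is illustrative.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Rebuild the data printed earlier in the post (assumed to match twomodes.dat)
failure.df = data.frame(
  Mode1 = c(33.3, 52.2, 64.7, 137.0, 125.9, 116.3, 131.7, 85.0, 91.9),
  Mode2 = c(25.3, 14.4, 32.5, 20.5, 97.6, 53.6, 56.6, 87.3, 47.8),
  Failures = c(15, 9, 14, 24, 27, 27, 23, 18, 22))
fmod3 = glm(Failures ~ Mode1, data = failure.df, family = poisson)
&nbsp;
# Standard diagnostic plots for a fitted model, arranged in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(fmod3)
&nbsp;
# Pearson chi-square divided by the residual degrees of freedom;
# values well above 1 would hint at overdispersion
dispersion = sum(residuals(fmod3, type = 'pearson')^2) / df.residual(fmod3)
dispersion</pre></div></div>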
<p>Other useful resources are provided on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Classification Trees using the rpart function</title>
		<link>http://www.wekaleamstudios.co.uk/posts/classification-trees-using-the-rpart-function/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/classification-trees-using-the-rpart-function/#comments</comments>
		<pubDate>Tue, 21 Sep 2010 19:22:50 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[CART]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[misclassification]]></category>
		<category><![CDATA[plotcp]]></category>
		<category><![CDATA[printcp]]></category>
		<category><![CDATA[prune]]></category>
		<category><![CDATA[rpart]]></category>
		<category><![CDATA[tree]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1399</guid>
		<description><![CDATA[In a previous post on classification trees we considered using the tree package to fit a classification tree to data divided into known classes. In this post we will look at the alternative function rpart, from the rpart package that ships as a recommended package with the standard R distribution. A classification tree can be fitted using the [...]]]></description>
			<content:encoded><![CDATA[<p>In a previous <a href="http://www.wekaleamstudios.co.uk/posts/classification-trees/">post</a> on classification trees we considered using the <strong>tree</strong> package to fit a classification tree to data divided into known classes. In this post we will look at the alternative function <strong>rpart</strong>, from the <strong>rpart</strong> package that ships as a recommended package with the standard <strong>R</strong> distribution.<span id="more-1399"></span></p>
<p><!--[Fast Tube]--><span id="m3mLNpeke0I" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/classification-trees-using-the-rpart-function/#m3mLNpeke0I"><img src="http://i.ytimg.com/vi/m3mLNpeke0I/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>A classification tree can be fitted using the <strong>rpart</strong> function with a syntax similar to that of the <strong>tree</strong> function. For the ecoli data set discussed in the previous <a href="http://www.wekaleamstudios.co.uk/posts/classification-trees/">post</a> we would use:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; require(rpart)
&gt; ecoli.df = read.csv(&quot;ecoli.txt&quot;)</pre></div></div>

<p>followed by</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; ecoli.rpart1 = rpart(class ~ mcv + gvh + lip + chg + aac + alm1 + alm2, 
  data = ecoli.df)</pre></div></div>

<p>We would then consider whether the tree could be simplified by pruning and make use of the <strong>plotcp</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; plotcp(ecoli.rpart1)</pre></div></div>

<p>The amount of pruning can be determined from this graph or by looking at the output from the <strong>printcp</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; printcp(ecoli.rpart1)
&nbsp;
Classification tree:
rpart(formula = class ~ mcv + gvh + lip + chg + aac + alm1 + 
    alm2, data = ecoli.df)
&nbsp;
Variables actually used in tree construction:
[1] aac  alm1 gvh  mcv 
&nbsp;
Root node error: 193/336 = 0.5744
&nbsp;
n= 336 
&nbsp;
        CP nsplit rel error  xerror     xstd
1 0.388601      0   1.00000 1.00000 0.046959
2 0.207254      1   0.61140 0.61658 0.045423
3 0.062176      2   0.40415 0.45596 0.041758
4 0.051813      3   0.34197 0.38342 0.039359
5 0.031088      4   0.29016 0.36269 0.038571
6 0.015544      5   0.25907 0.30570 0.036136
7 0.010000      6   0.24352 0.31088 0.036375</pre></div></div>

<p>The <strong>prune</strong> function is used to simplify the tree based on a <em>cp</em> threshold identified from the graph or the printed output.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; ecoli.rpart2 = prune(ecoli.rpart1, cp = 0.02)</pre></div></div>
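
<p>Rather than reading the threshold off the graph, the <em>cp</em> value can also be taken from the table that <strong>printcp</strong> displays. The following is a minimal sketch only: it uses the built-in iris data as a stand-in for the ecoli file so that it runs as is, the object names are illustrative, and choosing the row with the smallest cross-validated error is just one possible rule.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(rpart)
&nbsp;
set.seed(1)   # the cross-validated errors depend on random folds
# iris is a stand-in for the ecoli data used in the post
fit = rpart(Species ~ ., data = iris)
&nbsp;
# The matrix behind printcp: columns CP, nsplit, rel error, xerror, xstd
cp.table = fit$cptable
&nbsp;
# Row with the smallest cross-validated error
best = which.min(cp.table[, 'xerror'])
&nbsp;
# Prune back to the tree described by that row
pruned = prune(fit, cp = cp.table[best, 'CP'])
printcp(pruned)</pre></div></div>

<p>In the <strong>printcp</strong> output above the smallest xerror (0.30570) is in row 6, which has five splits; a <em>cp</em> of 0.02 lies between the CP values of rows 6 and 5 and so appears to select that same tree.</p>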

<p>The classification tree can be visualised with the plot function and then the text function adds labels to the graph:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; plot(ecoli.rpart2, uniform = TRUE)
&gt; text(ecoli.rpart2, use.n = TRUE, cex = 0.75)</pre></div></div>
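
<p>To see how well a fitted tree separates the classes, the predicted classes can be cross-tabulated against the observed ones. This sketch again assumes the built-in iris data in place of the ecoli file, and the names <strong>confusion</strong> and <strong>error.rate</strong> are illustrative. Because the predictions are made on the same data used to fit the tree, the resulting error rate is optimistic.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(rpart)
&nbsp;
# iris is a stand-in for the ecoli data used in the post
fit = rpart(Species ~ ., data = iris)
&nbsp;
# Predicted class for each observation in the training data
predicted = predict(fit, type = 'class')
&nbsp;
# Observed classes in rows, predicted classes in columns
confusion = xtabs(~ actual + predicted,
  data = data.frame(actual = iris$Species, predicted = predicted))
confusion
&nbsp;
# Proportion of observations that fall off the diagonal
error.rate = 1 - sum(diag(confusion)) / sum(confusion)
error.rate</pre></div></div>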

<p>Other useful resources are provided on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/classification-trees-using-the-rpart-function/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Classification Trees</title>
		<link>http://www.wekaleamstudios.co.uk/posts/classification-trees/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/classification-trees/#comments</comments>
		<pubDate>Sat, 18 Sep 2010 09:23:21 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[CART]]></category>
		<category><![CDATA[classification]]></category>
		<category><![CDATA[cross-validation]]></category>
		<category><![CDATA[misclassification]]></category>
		<category><![CDATA[rpart]]></category>
		<category><![CDATA[tree]]></category>
		<category><![CDATA[xtabs]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1387</guid>
		<description><![CDATA[Decision trees are applied to situations where the data are divided into groups, rather than to investigating a numerical response and its relationship to a set of descriptor variables. There are various implementations of classification trees in R and two commonly used functions are rpart and tree. To illustrate the use of [...]]]></description>
			<content:encoded><![CDATA[<p>Decision trees are applied to situations where the data are divided into groups, rather than to investigating a numerical response and its relationship to a set of descriptor variables. There are various implementations of classification trees in <strong>R</strong> and two commonly used functions are <strong>rpart</strong> and <strong>tree</strong>.<span id="more-1387"></span></p>
<p><!--[Fast Tube]--><span id="9XNhqO1bu0A" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/classification-trees/#9XNhqO1bu0A"><img src="http://i.ytimg.com/vi/9XNhqO1bu0A/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>To illustrate the use of the <strong>tree</strong> function we will use a set of data from the UCI <a href="http://archive.ics.uci.edu/ml/">Machine Learning Repository</a> where the objective of the study using this data was to <em>predict the cellular localization sites of proteins</em>.</p>
<p>The data provided on the website is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; ecoli.df = read.csv(&quot;ecoli.txt&quot;)
&gt; head(ecoli.df)
    Sequence  mcv  gvh  lip chg  aac alm1 alm2 class
1  AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35    cp
2 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44    cp
3 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46    cp
4 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36    cp
5  ADI_ECOLI 0.23 0.32 0.48 0.5 0.55 0.25 0.35    cp
6 ALKH_ECOLI 0.67 0.39 0.48 0.5 0.36 0.38 0.46    cp</pre></div></div>

<p>We can use the <strong>xtabs</strong> function to summarise the number of cases in each class.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; xtabs( ~ class, data = ecoli.df)
class
 cp  im imL imS imU  om omL  pp 
143  77   2   2  35  20   5  52</pre></div></div>

<p>As noted in the comments the package that I used was the <strong>tree</strong> package:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; require(tree)</pre></div></div>

<p>The complete classification tree using all variables is fitted to the data initially and then we will try to <em>prune</em> the tree to make it smaller.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; ecoli.tree1 = tree(class ~ mcv + gvh + lip + chg + aac + alm1 + alm2,
  data = ecoli.df)
&gt; summary(ecoli.tree1)
&nbsp;
Classification tree:
tree(formula = class ~ mcv + gvh + lip + chg + aac + alm1 + alm2, 
    data = ecoli.df)
Variables actually used in tree construction:
[1] &quot;alm1&quot; &quot;mcv&quot;  &quot;gvh&quot;  &quot;aac&quot;  &quot;alm2&quot;
Number of terminal nodes:  10 
Residual mean deviance:  0.7547 = 246 / 326 
Misclassification error rate: 0.122 = 41 / 336</pre></div></div>

<p>The <strong>tree</strong> function is used in a similar way to other modelling functions in <strong>R</strong>. The misclassification rate is shown as part of the summary of the tree. This tree can be plotted and annotated with these commands:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; plot(ecoli.tree1)
&gt; text(ecoli.tree1, all = TRUE)</pre></div></div>

<p>To prune the tree we use cross-validation to identify the point to <em>prune</em>.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; cv.tree(ecoli.tree1)
$size
 [1] 10  9  8  7  6  5  4  3  2  1
&nbsp;
$dev
 [1]  463.6820  457.4463  447.9824  441.8617  455.8318  478.9234  533.5856  586.2820  713.2992 1040.3878
&nbsp;
$k
 [1]      -Inf  12.16500  15.60004  19.21572  34.29868  41.10627  50.57044  64.05494 180.78800 355.67747
&nbsp;
$method
[1] &quot;deviance&quot;
&nbsp;
attr(,&quot;class&quot;)
[1] &quot;prune&quot;         &quot;tree.sequence&quot;</pre></div></div>

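<p>The cross-validated deviance is lowest for a tree of size 7 and only a little higher for size 6. Taking the smaller of the two, a tree size of 6, we can prune the original tree:</p>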

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; ecoli.tree2 = prune.misclass(ecoli.tree1, best = 6)
&gt; summary(ecoli.tree2)
&nbsp;
Classification tree:
snip.tree(tree = ecoli.tree1, nodes = c(4, 20, 7))
Variables actually used in tree construction:
[1] &quot;alm1&quot; &quot;mcv&quot;  &quot;aac&quot;  &quot;gvh&quot; 
Number of terminal nodes:  6 
Residual mean deviance:  0.9918 = 327.3 / 330 
Misclassification error rate: 0.1548 = 52 / 336</pre></div></div>

<p>The misclassification rate has increased but not substantially with the <em>pruning</em> of the tree.</p>
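<p>The cross-validation above scores candidate trees by deviance, whereas the pruning step works on misclassifications. A variation that uses the same criterion for both is sketched below; it assumes the <strong>tree</strong> package is installed, uses the built-in iris data as a stand-in for the ecoli file, and all object names are illustrative.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(tree)
&nbsp;
set.seed(1)   # the cross-validation folds are chosen at random
# iris is a stand-in for the ecoli data used in the post
iris.tree = tree(Species ~ ., data = iris)
&nbsp;
# Cross-validate on the number of misclassifications rather than deviance
iris.cv = cv.tree(iris.tree, FUN = prune.misclass)
&nbsp;
# Tree size with the fewest cross-validated misclassifications
best.size = iris.cv$size[which.min(iris.cv$dev)]
&nbsp;
# Prune the full tree back to that size
iris.pruned = prune.misclass(iris.tree, best = best.size)
summary(iris.pruned)</pre></div></div>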
<p>Other useful resources are provided on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
<p>Data used in this post: <a href='http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/09/ecoli.txt'>Ecoli Data Set</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/classification-trees/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Variable selection using automatic methods</title>
		<link>http://www.wekaleamstudios.co.uk/posts/variable-selection-using-automatic-methods/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/variable-selection-using-automatic-methods/#comments</comments>
		<pubDate>Sat, 22 May 2010 11:15:03 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Linear Models]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[backwards elimination]]></category>
		<category><![CDATA[best subset]]></category>
		<category><![CDATA[forward]]></category>
		<category><![CDATA[leaps]]></category>
		<category><![CDATA[regsubset]]></category>
		<category><![CDATA[stepwise]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1094</guid>
		<description><![CDATA[When we have a set of data with a small number of variables we can easily use a manual approach to identifying a good set of variables and the form they take in our statistical model. In other situations we may have a large number of potentially important variables and it soon becomes a time [...]]]></description>
			<content:encoded><![CDATA[<p>When we have a set of data with a small number of variables we can easily use a manual approach to identifying a good set of variables and the form they take in our statistical model. In other situations we may have a large number of potentially important variables and it soon becomes a time consuming effort to follow a manual variable selection process. In this case we may consider using automatic subset selection tools to remove some of the burden of the task.<span id="more-1094"></span></p>
<p>It should be noted that there is some disagreement about whether it is desirable to use an automated method but this post will focus on the mechanics of doing it rather than the debate about whether to be doing it at all.</p>
<p>The <strong>R</strong> package <strong>leaps</strong> has a function <strong>regsubsets</strong> that can be used for best subsets, forward selection and backwards elimination depending on which approach is considered most appropriate for the application under consideration.</p>
<p>In a previous <a href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/">post</a> we considered using data on CPU performance to illustrate the variable selection process. We load the required packages:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; require(leaps)
&gt; require(MASS)</pre></div></div>

<p>First up we consider selecting the <em>best</em> subset of each size up to a maximum, say four variables for illustrative purposes (<strong>nvmax</strong> argument), and we specify the largest possible model, which in this example has six variables:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, nvmax = 4)</pre></div></div>

<p>A summary for the output from this function is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(reg1)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, nvmax = 4)
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
         syct mmin mmax cach chmin chmax
1  ( 1 ) &quot; &quot;  &quot; &quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;   &quot; &quot;  
2  ( 1 ) &quot; &quot;  &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;   &quot; &quot;  
3  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;   &quot;*&quot;  
4  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;   &quot;*&quot;</pre></div></div>

<p>The function <strong>regsubsets</strong> identifies the variables <strong>mmin</strong>, <strong>mmax</strong>, <strong>cach</strong> and <strong>chmax</strong> as the <em>best</em> four.</p>
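
<p>The summary object also holds numerical criteria for each subset size, which can help when deciding how many variables to keep. A short sketch, assuming the <strong>leaps</strong> and <strong>MASS</strong> packages are installed; the name <strong>reg1.summary</strong> is illustrative.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(MASS)    # provides the cpus data
library(leaps)
&nbsp;
reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, nvmax = 4)
reg1.summary = summary(reg1)
&nbsp;
# One value per subset size, from one to four variables
reg1.summary$adjr2   # adjusted R-squared, larger is better
reg1.summary$bic     # BIC, smaller is better
&nbsp;
# Coefficient estimates for the best four-variable model
coef(reg1, 4)</pre></div></div>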
<p>Alternatively we could perform a backwards elimination and the function will indicate the <em>best</em> subset of a particular size, from one to six variables in this example:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; reg2 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = &quot;backward&quot;)
&gt; summary(reg2)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, method = &quot;backward&quot;)
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: backward
         syct mmin mmax cach chmin chmax
1  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;  &quot; &quot;   &quot; &quot;  
2  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;  &quot; &quot;   &quot;*&quot;  
3  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;   &quot;*&quot;  
4  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;   &quot;*&quot;  
5  ( 1 ) &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;   &quot;*&quot;  
6  ( 1 ) &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;   &quot;*&quot;</pre></div></div>

<p>The subset of four variables is the same for this example as the <em>best</em> subsets approach. The third approach is forward selection:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; reg3 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = &quot;forward&quot;)
&gt; summary(reg3)</pre></div></div>

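<p>For this data set, with only six variables, the four-variable subset is the same under best subsets and backwards elimination, although the output above shows that the one- and two-variable subsets differ between those two methods.</p>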
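<p>To check how far the methods agree, the selected subsets can be compared directly from the summary objects. The following is a sketch, assuming the <strong>leaps</strong> and <strong>MASS</strong> packages are installed; the names <strong>methods</strong>, <strong>fits</strong> and <strong>four</strong> are illustrative only.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(MASS)    # provides the cpus data
library(leaps)
&nbsp;
methods = c('exhaustive', 'backward', 'forward')
&nbsp;
# Fit the same model formula with each selection method
fits = lapply(methods, function(m)
  regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
    data = cpus, method = m))
&nbsp;
# Row 4 of the 'which' matrix flags the variables in the four-variable model
four = sapply(fits, function(f) summary(f)$which[4, ])
colnames(four) = methods
four</pre></div></div>

<p>Each column of the resulting table corresponds to one method, so any disagreement shows up as a row whose entries are not all the same.</p>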
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/variable-selection-using-automatic-methods/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Linear regression models with robust parameter estimation</title>
		<link>http://www.wekaleamstudios.co.uk/posts/linear-regression-models-with-robust-parameter-estimation/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/linear-regression-models-with-robust-parameter-estimation/#comments</comments>
		<pubDate>Sat, 15 May 2010 10:54:51 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Linear Models]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[lmrob]]></category>
		<category><![CDATA[parameter estimation]]></category>
		<category><![CDATA[robust]]></category>
		<category><![CDATA[robustbase]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1069</guid>
		<description><![CDATA[There are situations in regression modelling where robust methods could be considered to handle unusual observations that do not follow the general trend of the data set. There are various packages in R that provide robust statistical methods which are summarised on the CRAN Robust Task View. As an example of using robust statistical estimation [...]]]></description>
			<content:encoded><![CDATA[<p>There are situations in regression modelling where robust methods could be considered to handle unusual observations that do not follow the general trend of the data set. There are various packages in <strong>R</strong> that provide robust statistical methods which are summarised on the <a href="http://www.r-project.org/">CRAN</a> Robust Task View.<span id="more-1069"></span></p>
<p>As an example of using robust statistical estimation in a linear regression framework consider the CPUs <a href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/">data</a> that was used in previous posts on linear regression and variable selection. For this data we could fit a model with six variables using least squares and also with a fast MM-estimator from the <strong>robustbase</strong> package.</p>
<p>The first step is to make the functions and data available for analysis:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">require(MASS)
require(robustbase)</pre></div></div>

<p>The linear model using least squares is fitted as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)</pre></div></div>

<p>The summary for this model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(cpu.mod1)
&nbsp;
Call:
lm(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus)
&nbsp;
Residuals:
     Min       1Q   Median       3Q      Max 
-195.841  -25.169    5.409   26.528  385.749 
&nbsp;
Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) -5.590e+01  8.045e+00  -6.948 4.99e-11 ***
syct         4.886e-02  1.752e-02   2.789  0.00579 ** 
mmin         1.529e-02  1.827e-03   8.371 9.42e-15 ***
mmax         5.571e-03  6.418e-04   8.680 1.33e-15 ***
cach         6.412e-01  1.396e-01   4.594 7.64e-06 ***
chmin       -2.701e-01  8.557e-01  -0.316  0.75263    
chmax        1.483e+00  2.201e-01   6.738 1.64e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 59.99 on 202 degrees of freedom
Multiple R-squared: 0.8649,     Adjusted R-squared: 0.8609 
F-statistic: 215.5 on 6 and 202 DF,  p-value: &lt; 2.2e-16</pre></div></div>

<p>The linear model using an MM-estimator is fitted as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">cpu.robmod1 = lmrob(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, control = lmrob.control(max.it = 100))</pre></div></div>

<p>Note that we need to increase the default number of iterations (50) to allow the routine to converge to a solution. The summary for this model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(cpu.robmod1)
&nbsp;
Call:
lmrob(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus, control = lmrob.control(max.it = 100))
&nbsp;
Weighted Residuals:
      Min        1Q    Median        3Q       Max 
-144.0045   -9.4554    0.7691   13.3757  759.6953 
&nbsp;
Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) -3.6634297  6.1697645  -0.594 0.553329    
syct         0.0063112  0.0043877   1.438 0.151868    
mmin         0.0098798  0.0034726   2.845 0.004897 ** 
mmax         0.0024463  0.0004525   5.407 1.80e-07 ***
cach         0.8702102  0.2551245   3.411 0.000782 ***
chmin        2.4078436  1.3319413   1.808 0.072130 .  
chmax        0.1016861  0.1494902   0.680 0.497145    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Robust residual standard error: 19.27 
Convergence in 65 IRWLS iterations
&nbsp;
Robustness weights: 
 15 observations c(1,8,9,10,31,32,96,97,98,153,154,156,169,199,200) are outliers with |weight| = 0 ( &lt; 0.00048); 
 14 weights are ~= 1. The remaining 180 ones are summarized as
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002472 0.831300 0.963700 0.849800 0.989500 0.998900 
Algorithmic parameters: 
tuning.chi         bb tuning.psi refine.tol    rel.tol 
 1.5476400  0.5000000  4.6850610  0.0000001  0.0000001 
 nResample     max.it     groups    n.group   best.r.s   k.fast.s      k.max  trace.lev compute.rd 
       500        100          5        400          2          1        200          0          0 
seed : int(0)</pre></div></div>

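<p>The two models differ in the variables that are considered important: <strong>syct</strong> and <strong>chmax</strong> are significant in the least squares fit but not in the robust fit, while <strong>chmin</strong> moves from clearly non-significant to borderline. The output from the <strong>lmrob</strong> function also provides a summary of the weights that have been allocated to the data. Fifteen of the data points have been given weights that are effectively zero by the fitting algorithm, so they have almost no influence on the robust estimates.</p>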
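<p>As a rough sketch of how the two fits could be compared, assuming the <strong>robustbase</strong> package supplies <strong>lmrob</strong> and re-fitting both models so the example stands alone, the coefficients can be placed side by side and the robustness weights extracted with the <strong>weights</strong> method. The object name <strong>cpu.wts</strong> is only for illustration, and because <strong>lmrob</strong> uses random subsampling the exact numbers may differ slightly between runs:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(MASS)        # cpus data
library(robustbase)  # lmrob (assumed source of the MM-estimator)
&nbsp;
cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)
cpu.robmod1 = lmrob(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, control = lmrob.control(max.it = 100))
&nbsp;
# Least squares and MM estimates side by side
cbind(lm = coef(cpu.mod1), lmrob = coef(cpu.robmod1))
&nbsp;
# Robustness weights, one per observation
cpu.wts = weights(cpu.robmod1, type = &quot;robustness&quot;)
&nbsp;
# Row indices of the fifteen smallest weights
order(cpu.wts)[1:15]</pre></div></div>

<p>Observations with weights close to zero are the ones that the robust fit is effectively ignoring, so they are natural candidates for a closer look.</p>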
<p><em>Related posts:</em></p>
<ul>
<li>The <a href="http://www.wekaleamstudios.co.uk/posts/using-the-update-function-during-variable-selection/">update</a> function for simplifying model selection.</li>
<li>Data analysis using <a href="http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/">simple linear regression</a> models.</li>
<li>The <a href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/">dropterm</a> function for simplifying models.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/linear-regression-models-with-robust-parameter-estimation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Manual variable selection using the dropterm function</title>
		<link>http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/#comments</comments>
		<pubDate>Wed, 12 May 2010 20:14:06 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[cpus]]></category>
		<category><![CDATA[dropterm]]></category>
		<category><![CDATA[F test]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[MASS]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[t-test]]></category>
		<category><![CDATA[variable selection]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1057</guid>
		<description><![CDATA[When fitting a multiple linear regression model to data a natural question is whether a model can be simplified by excluding variables from the model. There are automatic procedures for undertaking these tests but some people prefer to follow a more manual approach to variable selection rather than pressing a button and taking what comes [...]]]></description>
			<content:encoded><![CDATA[<p>When fitting a multiple linear regression model to data a natural question is whether a model can be simplified by excluding variables from the model. There are automatic procedures for undertaking these tests but some people prefer to follow a more manual approach to variable selection rather than pressing a button and taking what comes out.<span id="more-1057"></span></p>
<p><!--[Fast Tube]--><span id="BneY21nS5is" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/#BneY21nS5is"><img src="http://i.ytimg.com/vi/BneY21nS5is/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>When there are a large number of variables it is awkward to manually go through each one in turn to make a decision about simplification to a more parsimonious model. In <strong>R</strong> there is a function <strong>dropterm</strong>, in the <strong>MASS</strong> package, that takes on some of this work by considering the outcome of dropping each model term one at a time.</p>
<p>To illustrate this consider the <strong>cpus</strong> data set in the <strong>MASS</strong> package which contains information about a relative performance measure and characteristics of 209 CPUs. We load the package first to make the data available:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(MASS)</pre></div></div>

<p>We first fit a linear model with six explanatory variables:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)</pre></div></div>

<p>The function <strong>dropterm</strong> requires a fitted model, which we saved in the last command, and optionally we could specify what test to use to compare the initial model and each of the possible alternative models with one less variable. We can choose to perform an <strong>F</strong> test:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; dropterm(cpu.mod1, test = &quot;F&quot;)
Single term deletions
&nbsp;
Model:
perf ~ syct + mmin + mmax + cach + chmin + chmax
       Df Sum of Sq    RSS    AIC F Value     Pr(F)    
&lt;none&gt;              727002 1718.3                      
syct    1     27995 754997 1724.2   7.779  0.005793 ** 
mmin    1    252211 979213 1778.5  70.078 9.416e-15 ***
mmax    1    271147 998149 1782.5  75.339 1.326e-15 ***
cach    1     75962 802964 1737.0  21.106 7.640e-06 ***
chmin   1       358 727360 1716.4   0.100  0.752632    
chmax   1    163396 890398 1758.6  45.400 1.640e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>

<p>The output from the function call indicates that we could exclude the <strong>chmin</strong> variable, then re-fit the model and repeat the same checking process.</p>
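<p>One way this next step might look, as a minimal sketch, is to use the <strong>update</strong> function to remove <strong>chmin</strong> and then call <strong>dropterm</strong> again on the reduced model. The name <strong>cpu.mod2</strong> is just an illustrative choice:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(MASS)
&nbsp;
cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)
&nbsp;
# Drop chmin, keeping the response and all of the other terms
cpu.mod2 = update(cpu.mod1, . ~ . - chmin)
&nbsp;
# Repeat the single term deletion check on the reduced model
dropterm(cpu.mod2, test = &quot;F&quot;)</pre></div></div>

<p>The process stops when none of the remaining terms can be dropped without a noticeable loss of fit.</p>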
<p><em>Update:</em></p>
<p>The <strong>dropterm</strong> function considers each variable individually and reports what the change in residual sum of squares would be if that variable were excluded from the model. There is a link between this F test and the t test that appears as part of the model summary. The square of a t-distributed statistic follows an F distribution with one numerator degree of freedom and the same residual degrees of freedom, so for a term with a single degree of freedom the two tests are equivalent. For this model we would have:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(cpu.mod1)
&nbsp;
Call:
lm(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus)
&nbsp;
Residuals:
     Min       1Q   Median       3Q      Max 
-195.841  -25.169    5.409   26.528  385.749 
&nbsp;
Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) -5.590e+01  8.045e+00  -6.948 4.99e-11 ***
syct         4.886e-02  1.752e-02   2.789  0.00579 ** 
mmin         1.529e-02  1.827e-03   8.371 9.42e-15 ***
mmax         5.571e-03  6.418e-04   8.680 1.33e-15 ***
cach         6.412e-01  1.396e-01   4.594 7.64e-06 ***
chmin       -2.701e-01  8.557e-01  -0.316  0.75263    
chmax        1.483e+00  2.201e-01   6.738 1.64e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 59.99 on 202 degrees of freedom
Multiple R-squared: 0.8649,     Adjusted R-squared: 0.8609 
F-statistic: 215.5 on 6 and 202 DF,  p-value: &lt; 2.2e-16</pre></div></div>

<p>Let us consider the <strong>syct</strong> variable. The t statistic in the model summary is 2.789 and if we square this value we get 7.779 which is the F statistic produced by the <strong>dropterm</strong> function.</p>
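<p>The same check can be made for every variable at once. The following is a small sketch that squares the t statistics from the model summary and compares them with the F statistics from <strong>dropterm</strong>; the names <strong>t.vals</strong> and <strong>f.vals</strong> are illustrative, and the column label <strong>F Value</strong> is taken from the output shown above:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(MASS)
&nbsp;
cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)
&nbsp;
# t statistics from the summary, leaving out the intercept
t.vals = coef(summary(cpu.mod1))[-1, &quot;t value&quot;]
&nbsp;
# F statistics from dropterm, leaving out the &lt;none&gt; row
f.vals = dropterm(cpu.mod1, test = &quot;F&quot;)[-1, &quot;F Value&quot;]
&nbsp;
# Squared t statistics next to the F statistics
cbind(t.squared = t.vals^2, F = f.vals)
&nbsp;
# Check that the two agree up to numerical precision
all.equal(unname(t.vals^2), f.vals)</pre></div></div>

<p>The agreement only holds for terms with a single degree of freedom; a factor with several levels would have one F statistic but several t statistics.</p>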
<p><em>Related posts:</em></p>
<ul>
<li>The <a href="http://www.wekaleamstudios.co.uk/posts/using-the-update-function-during-variable-selection/">update</a> function for simplifying model selection.</li>
<li>Data analysis using <a href="http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/">simple linear regression</a> models.</li>
<li>Including factors in a regression model via <a href="http://www.wekaleamstudios.co.uk/posts/analysis-of-covariance-extending-simple-linear-regression/">analysis of covariance</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Using the update function during variable selection</title>
		<link>http://www.wekaleamstudios.co.uk/posts/using-the-update-function-during-variable-selection/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/using-the-update-function-during-variable-selection/#comments</comments>
		<pubDate>Sun, 09 May 2010 18:31:13 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[update]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1044</guid>
		<description><![CDATA[When fitting statistical models to data where there are multiple variables we are often interested in adding or removing terms from our model and in cases where there are a large number of terms it can be quicker to use the update function to start with a formula from a model that we have already [...]]]></description>
			<content:encoded><![CDATA[<p>When fitting statistical models to data where there are multiple variables we are often interested in adding or removing terms from our model and in cases where there are a large number of terms it can be quicker to use the <strong>update</strong> function to start with a formula from a model that we have already fitted and to specify the terms that we want to add or remove as opposed to a copy and paste and manually editing the formula to our needs.<span id="more-1044"></span></p>
<p>Consider the oil-bearing rocks data set that is available with the <strong>R</strong> software which is used extensively as an example by many authors. One model that can be used as a starting point is a linear model with additive terms for the three variables:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; rock.mod1 = lm(log(perm) ~ area + peri + shape, data = rock)
&gt; summary(rock.mod1)
&nbsp;
Call:
lm(formula = log(perm) ~ area + peri + shape, data = rock)
&nbsp;
Residuals:
    Min      1Q  Median      3Q     Max 
-1.8092 -0.5413  0.1735  0.6493  1.4788 
&nbsp;
Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  5.333e+00  5.487e-01   9.720 1.59e-12 ***
area         4.850e-04  8.657e-05   5.602 1.29e-06 ***
peri        -1.527e-03  1.770e-04  -8.623 5.24e-11 ***
shape        1.757e+00  1.756e+00   1.000    0.323    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 0.8521 on 44 degrees of freedom
Multiple R-squared: 0.7483,     Adjusted R-squared: 0.7311 
F-statistic:  43.6 on 3 and 44 DF,  p-value: 3.094e-13</pre></div></div>

<p>Given this model, saved as an object <strong>rock.mod1</strong>, we might be interested in adding an interaction term between the area and perimeter measurements. The <strong>update</strong> function has various options and the simplest case is to specify a model object and a new formula. In the new formula a period is shorthand for keeping everything on the corresponding left or right hand side of the original formula, and a plus or minus sign adds or removes terms. In the case of adding an interaction term our call would be:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; rock.mod2 = update(rock.mod1, . ~ . + area:peri)</pre></div></div>

<p>The first function argument is the name of the model we fitted previously and the periods indicate that we want to use the same response variable and to start with the whole formula but add an interaction term between area and perimeter &#8211; the colon is used to specify an interaction term by itself. This fitted model is now:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(rock.mod2)
&nbsp;
Call:
lm(formula = log(perm) ~ area + peri + shape + area:peri, data = rock)
&nbsp;
Residuals:
    Min      1Q  Median      3Q     Max 
-1.7255 -0.4760  0.1256  0.6539  1.4269 
&nbsp;
Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  6.567e+00  8.533e-01   7.696 1.28e-09 ***
area         3.769e-04  1.025e-04   3.678  0.00065 ***
peri        -2.141e-03  3.734e-04  -5.733 8.94e-07 ***
shape        4.022e-01  1.859e+00   0.216  0.82974    
area:peri    6.641e-08  3.583e-08   1.854  0.07065 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 0.8295 on 43 degrees of freedom
Multiple R-squared: 0.7669,     Adjusted R-squared: 0.7452 
F-statistic: 35.37 on 4 and 43 DF,  p-value: 4.404e-13</pre></div></div>

<p>The <strong>update</strong> function can also be used to change other aspects of the linear model, such as the data or other arguments of the original call, and many other types of model are set up to respond sensibly to this function.</p>
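<p>A few further uses are sketched below, assuming the same <strong>rock</strong> data. The model names <strong>rock.mod3</strong> to <strong>rock.mod5</strong> are purely illustrative and these are not models that we are recommending, only examples of the kinds of change that <strong>update</strong> accepts:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># rock is supplied with R in the datasets package
rock.mod1 = lm(log(perm) ~ area + peri + shape, data = rock)
&nbsp;
# Remove the shape term, keeping the same response
rock.mod3 = update(rock.mod1, . ~ . - shape)
&nbsp;
# Change the response to untransformed permeability, keeping the same terms
rock.mod4 = update(rock.mod1, perm ~ .)
&nbsp;
# Keep the formula but refit without the first observation
rock.mod5 = update(rock.mod1, data = rock[-1, ])
&nbsp;
formula(rock.mod3)
formula(rock.mod4)
nobs(rock.mod5)</pre></div></div>

<p>In each case only the part of the original call that is named in <strong>update</strong> changes, which keeps a sequence of related models easy to read.</p>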
<p><em>Related posts:</em></p>
<ul>
<li>Manual variable selection with the <a href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/">dropterm</a> function.</li>
<li>Data analysis using <a href="http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/">simple linear regression</a> models.</li>
<li>Including factors in a regression model via <a href="http://www.wekaleamstudios.co.uk/posts/analysis-of-covariance-extending-simple-linear-regression/">analysis of covariance</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/using-the-update-function-during-variable-selection/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analysis of Covariance &#8211; Extending Simple Linear Regression</title>
		<link>http://www.wekaleamstudios.co.uk/posts/analysis-of-covariance-extending-simple-linear-regression/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/analysis-of-covariance-extending-simple-linear-regression/#comments</comments>
		<pubDate>Wed, 28 Apr 2010 19:25:58 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Analysis of Variance]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[analysis of variance]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[covariate]]></category>
		<category><![CDATA[fitted]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[panel]]></category>
		<category><![CDATA[panel.lmline]]></category>
		<category><![CDATA[panel.xyplot]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[resid]]></category>
		<category><![CDATA[residual]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[xyplot]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=989</guid>
		<description><![CDATA[The simple linear regression model considers the relationship between two variables and in many cases more information will be available that can be used to extend the model. For example, there might be a categorical variable (sometimes known as a covariate) that can be used to divide the data set to fit a separate linear [...]]]></description>
			<content:encoded><![CDATA[<p>The simple linear regression model considers the relationship between two variables and in many cases more information will be available that can be used to extend the model. For example, there might be a categorical variable (sometimes known as a covariate) that can be used to divide the data set to fit a separate linear regression to each of the subsets. We will consider how to handle this extension using one of the data sets available within the <strong>R</strong> software package.<span id="more-989"></span></p>
<p>There is a set of data relating trunk circumference (in mm) to the age of Orange trees where data was recorded for five trees. This data is available in the data frame <strong>Orange</strong> and we make a copy of this data set so that we can remove the ordering that is recorded for the <strong>Tree</strong> identifier variable. We create a new, unordered factor after converting the old factor to a numeric vector:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.df = Orange
orange.df$Tree = factor(as.numeric(orange.df$Tree))</pre></div></div>
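<p>To see what this conversion does, a quick check along the following lines may help. Note that <strong>as.numeric</strong> applied to a factor returns the internal level codes, so the new labels 1 to 5 are positions in the original level ordering and need not match the original tree identifiers:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Orange is supplied with R in the datasets package
is.ordered(Orange$Tree)
levels(Orange$Tree)
&nbsp;
orange.df = Orange
orange.df$Tree = factor(as.numeric(orange.df$Tree))
&nbsp;
# The new factor is no longer ordered
is.ordered(orange.df$Tree)
&nbsp;
# Cross-tabulate the original labels against the new ones
table(original = Orange$Tree, new = orange.df$Tree)</pre></div></div>
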

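<p>The purpose of this step is to set up the variable for use in the linear model: an ordered factor would be coded with polynomial contrasts by default, whereas an unordered factor is coded so that each tree is compared against a baseline tree. The simplest model assumes that the relationship between circumference and age is the same for all five trees and we fit this model as follows:</p>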

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.mod1 = lm(circumference ~ age, data = orange.df)</pre></div></div>

<p>The summary of the fitted model is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(orange.mod1)
&nbsp;
Call:
lm(formula = circumference ~ age, data = orange.df)
&nbsp;
Residuals:
      Min        1Q    Median        3Q       Max 
-46.31030 -14.94610  -0.07649  19.69727  45.11146 
&nbsp;
Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 17.399650   8.622660   2.018   0.0518 .  
age          0.106770   0.008277  12.900 1.93e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 23.74 on 33 degrees of freedom
Multiple R-squared: 0.8345,     Adjusted R-squared: 0.8295 
F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14</pre></div></div>

<p>The test on the <strong>age</strong> parameter provides very strong evidence of an increase in circumference with age, as would be expected. The next stage is to consider how this model can be extended &#8211; one idea is to have a separate intercept for each of the five trees. This new model assumes that the rate of increase in circumference is the same for all trees but that each tree starts from a different level. We fit this model and get the summary as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; orange.mod2 = lm(circumference ~ age + Tree, data = orange.df)
&gt; summary(orange.mod2)
&nbsp;
Call:
lm(formula = circumference ~ age + Tree, data = orange.df)
&nbsp;
Residuals:
    Min      1Q  Median      3Q     Max 
-30.505  -8.790   3.738   7.650  21.859 
&nbsp;
Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) -4.457493   7.572732  -0.589   0.5607    
age          0.106770   0.005321  20.066  &lt; 2e-16 ***
Tree2        5.571429   8.157252   0.683   0.5000    
Tree3       17.142857   8.157252   2.102   0.0444 *  
Tree4       41.285714   8.157252   5.061 2.14e-05 ***
Tree5       45.285714   8.157252   5.552 5.48e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 15.26 on 29 degrees of freedom
Multiple R-squared: 0.9399,     Adjusted R-squared: 0.9295 
F-statistic:  90.7 on 5 and 29 DF,  p-value: &lt; 2.2e-16</pre></div></div>

<p>The additional term is appended to the simple model using the <strong>+</strong> in the formula part of the call to <strong>lm</strong>. The first tree is used as the baseline to compare the other four trees against and the model summary shows that tree 2 is similar to tree 1 (no real need for a different offset) but that there is evidence that the offset for the other three trees is significantly larger than tree 1 (and tree 2). We can compare the two models using an F-test for nested models using the <strong>anova</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; anova(orange.mod1, orange.mod2)
Analysis of Variance Table
&nbsp;
Model 1: circumference ~ age
Model 2: circumference ~ age + Tree
  Res.Df     RSS Df Sum of Sq      F    Pr(&gt;F)    
1     33 18594.7                                  
2     29  6753.9  4     11841 12.711 4.289e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>

<p>Here there are four degrees of freedom used up by the more complicated model (four parameters for the different trees) and the test comparing the two models is highly significant. There is very strong evidence of a difference in starting circumference (for the data that was collected) between the trees.</p>
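<p>The F statistic in this table can be reconstructed by hand from the residual sums of squares of the two models. The sketch below, with illustrative object names, divides the reduction in residual sum of squares per extra parameter by the residual mean square of the larger model and should reproduce the values reported by <strong>anova</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.df = Orange
orange.df$Tree = factor(as.numeric(orange.df$Tree))
orange.mod1 = lm(circumference ~ age, data = orange.df)
orange.mod2 = lm(circumference ~ age + Tree, data = orange.df)
&nbsp;
# Residual sums of squares and residual degrees of freedom
rss1 = deviance(orange.mod1)
rss2 = deviance(orange.mod2)
df1 = df.residual(orange.mod1)
df2 = df.residual(orange.mod2)
&nbsp;
# Reduction in RSS per extra parameter over the residual mean square
f.stat = ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
&nbsp;
# Upper tail probability from the F distribution
p.val = pf(f.stat, df1 - df2, df2, lower.tail = FALSE)
&nbsp;
c(F = f.stat, p = p.val)</pre></div></div>
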
<p>We can extend this model further by allowing the rate of increase in circumference to vary between the five trees. This additional term can be included in the linear model as an interaction term, assuming that tree 1 is the baseline. An interaction term is included in the model formula with a <strong>:</strong> between the names of two variables. For the Orange tree data the new model is fitted thus:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; orange.mod3 = lm(circumference ~ age + Tree + age:Tree, data = orange.df)
&gt; summary(orange.mod3)
&nbsp;
Call:
lm(formula = circumference ~ age + Tree + age:Tree, data = orange.df)
&nbsp;
Residuals:
    Min      1Q  Median      3Q     Max 
-18.061  -6.639  -1.482   8.069  16.649 
&nbsp;
Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  1.920e+01  8.458e+00   2.270  0.03206 *  
age          8.111e-02  8.119e-03   9.991 3.27e-10 ***
Tree2        5.234e+00  1.196e+01   0.438  0.66544    
Tree3       -1.045e+01  1.196e+01  -0.873  0.39086    
Tree4        7.574e-01  1.196e+01   0.063  0.95002    
Tree5       -4.566e+00  1.196e+01  -0.382  0.70590    
age:Tree2    3.656e-04  1.148e-02   0.032  0.97485    
age:Tree3    2.992e-02  1.148e-02   2.606  0.01523 *  
age:Tree4    4.395e-02  1.148e-02   3.828  0.00077 ***
age:Tree5    5.406e-02  1.148e-02   4.708 7.93e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 10.41 on 25 degrees of freedom
Multiple R-squared: 0.9759,     Adjusted R-squared: 0.9672 
F-statistic: 112.4 on 9 and 25 DF,  p-value: &lt; 2.2e-16</pre></div></div>
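<p>The interaction coefficients are differences in slope relative to tree 1, so the slope for each tree is the <strong>age</strong> coefficient plus the relevant interaction term. As a small sketch with illustrative names, these slopes can be recovered from the fitted model and checked against a separate regression for each tree:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">orange.df = Orange
orange.df$Tree = factor(as.numeric(orange.df$Tree))
orange.mod3 = lm(circumference ~ age + Tree + age:Tree, data = orange.df)
&nbsp;
b = coef(orange.mod3)
&nbsp;
# Tree 1 uses the age coefficient; the other trees add their interaction term
slopes = c(b[&quot;age&quot;],
  b[&quot;age&quot;] + b[c(&quot;age:Tree2&quot;, &quot;age:Tree3&quot;, &quot;age:Tree4&quot;, &quot;age:Tree5&quot;)])
names(slopes) = levels(orange.df$Tree)
slopes
&nbsp;
# The same slopes from a separate simple regression for each tree
tree.slopes = sapply(split(orange.df, orange.df$Tree),
  function(d) unname(coef(lm(circumference ~ age, data = d))[&quot;age&quot;]))
tree.slopes</pre></div></div>

<p>The two sets of slopes agree because the model with a separate intercept and slope for every tree gives the same fitted lines as five individual regressions; only the residual variance is pooled.</p>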

<p>Interestingly, we see that there is strong evidence of a difference in the rate of change in circumference for the five trees. The previously observed difference in intercepts is no longer as strong but these terms are kept in the model &#8211; there are plenty of books/websites that discuss this marginality restriction on statistical models. The fitted model described above can be displayed using <strong>lattice</strong> graphics with a custom panel function making use of available panel functions for fitting and drawing a linear regression line for each panel of a Trellis display. The function call is shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
xyplot(circumference ~ age | Tree, data = orange.df,
  panel = function(x, y, ...)
  {
    panel.xyplot(x, y, ...)
    panel.lmline(x, y, ...)
  }
)</pre></div></div>

<p>The <strong>panel.xyplot</strong> and <strong>panel.lmline</strong> functions are part of the lattice package along with many other panel functions and can be built up to create a display that differs from the standard. The graph that is produced:</p>
<div id="attachment_992" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/orange-fittedmodel.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/orange-fittedmodel-300x300.png" alt="Orange Tree Fitted Model" title="Orange Tree Fitted Model" width="300" height="300" class="size-medium wp-image-992" /></a><p class="wp-caption-text">Analysis of Covariance Model fitted to the Orange Tree data</p></div>
<p>This graph clearly shows the different relationships between circumference and age for the five trees. The residuals from the model can be plotted against fitted values, divided by tree, to investigate the model assumptions:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xyplot(resid(orange.mod3) ~ fitted(orange.mod3) | orange.df$Tree,
  xlab = &quot;Fitted Values&quot;,
  ylab = &quot;Residuals&quot;,
  main = &quot;Residual Diagnostic Plot&quot;,
  panel = function(x, y, ...)
  {
    panel.grid(h = -1, v = -1)
    panel.abline(h = 0)
    panel.xyplot(x, y, ...)
  }
)</pre></div></div>

<p>The residual diagnostic plot is:</p>
<div id="attachment_994" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/orange-residualplot.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/orange-residualplot-300x300.png" alt="Orange Tree Model Residual Plot" title="Orange Tree Model Residual Plot" width="300" height="300" class="size-medium wp-image-994" /></a><p class="wp-caption-text">Residual diagnostic plot for the analysis of covariance model fitted to the Orange Tree data</p></div>
<p>There are no obvious problematic patterns in this graph so we conclude that this model is a reasonable representation of the relationship between circumference and age.</p>
<p>Additional: The analysis of variance table comparing the second and third models shows an improvement by moving to the more complicated model with different slopes:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; anova(orange.mod2, orange.mod3)
Analysis of Variance Table
&nbsp;
Model 1: circumference ~ age + Tree
Model 2: circumference ~ age + Tree + age:Tree
  Res.Df    RSS Df Sum of Sq      F    Pr(&gt;F)    
1     29 6753.9                                  
2     25 2711.0  4    4042.9 9.3206 9.402e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/analysis-of-covariance-extending-simple-linear-regression/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Simple Linear Regression</title>
		<link>http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/#comments</comments>
		<pubDate>Fri, 23 Apr 2010 08:51:57 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Linear Models]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[explanatory variable]]></category>
		<category><![CDATA[fitted]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[linear]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[modelling]]></category>
		<category><![CDATA[one variable]]></category>
		<category><![CDATA[predictor]]></category>
		<category><![CDATA[qqmath]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[resid]]></category>
		<category><![CDATA[residual]]></category>
		<category><![CDATA[response]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[xyplot]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=907</guid>
		<description><![CDATA[One of the most frequently used techniques in statistics is linear regression where we investigate the potential relationship between a variable of interest (often called the response variable but there are many other names in use) and a set of one or more variables (known as the independent variables or some other term). Unsurprisingly there [...]]]></description>
			<content:encoded><![CDATA[<p>One of the most frequently used techniques in statistics is linear regression where we investigate the potential relationship between a variable of interest (often called the response variable but there are many other names in use) and a set of one or more variables (known as the independent variables or some other term). Unsurprisingly there are flexible facilities in <strong>R</strong> for fitting a range of linear models from the simple case of a single variable to more complex relationships.<span id="more-907"></span></p>
<p>In this post we will consider the case of simple linear regression with one response variable and a single independent variable. For this example we will use some data from the book Mathematical Statistics with Applications by Mendenhall, Wackerly and Scheaffer (Fourth Edition &#8211; Duxbury 1990). This data is for a study in central Florida where 15 alligators were captured and two measurements were made on each of the alligators. The weight (in pounds) was recorded along with the snout vent length (in inches &#8211; this is the distance from the back of the head to the end of the nose).</p>
<p>The purpose of using this data is to determine whether there is a relationship, described by a simple linear regression model, between the weight and snout vent length. The authors analysed the data on the log scale (natural logarithms) and we will follow their approach for consistency. We first create a data frame for this study:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">alligator = data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
    3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
    3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)</pre></div></div>

<p>As with most analyses the first step is to perform some <a href="http://www.wekaleamstudios.co.uk/exploratory-data-analysis/">exploratory data analysis</a> to get a visual impression of whether there is a relationship between weight and snout vent length and what form it is likely to take. We create a <a href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/">scatter plot</a> of the data as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
xyplot(lnWeight ~ lnLength, data = alligator,
  xlab = &quot;Snout vent length (inches) on log scale&quot;,
  ylab = &quot;Weight (pounds) on log scale&quot;,
  main = &quot;Alligators in Central Florida&quot;
)</pre></div></div>

<p>The scatter plot is shown here:</p>
<div id="attachment_946" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-Data.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-Data-300x300.jpg" alt="Plot of the weight and snout vent length" title="Alligator Data Plot" width="300" height="300" class="size-medium wp-image-946" /></a><p class="wp-caption-text">Scatter plot of the weight and snout vent length for alligators caught in central Florida</p></div>
<p>The graph suggests that weight (on the log scale) increases linearly with snout vent length (again on the log scale) so we will fit a simple linear regression model to the data and save the fitted model to an object for further analysis:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)</pre></div></div>

<p>The function <strong>lm</strong> fits a linear model to data, and we specify the model using a formula where the response variable is on the left hand side, separated by a ~ from the explanatory variables. The formula provides a flexible way to specify different functional forms for the relationship. The <strong>data</strong> argument is used to tell <strong>R</strong> where to look for the variables used in the formula.</p>
<p>Now that the model is saved as an object we can use some of the general purpose functions for extracting information from this object about the linear model, e.g. the parameters or residuals. The big plus with <strong>R</strong> is that generic functions such as summary share the same name across different types of model, and the system works out which version we intended to use based on the class of the object saved. To create a summary of the fitted model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(alli.mod1)
&nbsp;
Call:
lm(formula = lnWeight ~ lnLength, data = alligator)
&nbsp;
Residuals:
     Min       1Q   Median       3Q      Max 
-0.24348 -0.03186  0.03740  0.07727  0.12669 
&nbsp;
Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  -8.4761     0.5007  -16.93 3.08e-10 ***
lnLength      3.4311     0.1330   25.80 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 0.1229 on 13 degrees of freedom
Multiple R-squared: 0.9808,     Adjusted R-squared: 0.9794 
F-statistic: 665.8 on 1 and 13 DF,  p-value: 1.495e-12</pre></div></div>

<p>We get a lot of useful information here without being too overwhelmed by pages of output.</p>
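<p>The way <strong>R</strong> chooses a version of summary based on the type of object can be checked directly. The following is a minimal sketch that refits the model so that it runs on its own; the object names <strong>generic.tab</strong> and <strong>direct.tab</strong> are illustrative only.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">alligator = data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
    3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
    3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)
&nbsp;
# The class of the object is what summary() uses to pick a method
class(alli.mod1)
&nbsp;
# The generic and the lm-specific method give the same coefficient table
generic.tab = coef(summary(alli.mod1))
direct.tab = coef(summary.lm(alli.mod1))
identical(generic.tab, direct.tab)</pre></div></div>

<p>Calling <strong>summary</strong> is the usual approach; the direct call to <strong>summary.lm</strong> is shown only to illustrate what happens behind the scenes.</p>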
<p>The estimate for the model intercept is -8.4761 and the coefficient measuring the <strong>slope</strong> of the relationship with snout vent length is 3.4311. The standard errors of these estimates are also provided in the Coefficients table, along with a test of significance for each coefficient. There is strong evidence that the slope is different from zero: as the snout vent length increases so does the weight.</p>
<p>Rather than stopping here we use residual diagnostics to check whether the assumptions that underpin linear regression are reasonable for our data. The diagnostics can also indicate whether additional variables, or some other alteration to the model, would give a better description of how weight changes.</p>
<p>A plot of the residuals against fitted values is used to determine whether there are any systematic patterns, such as overestimation for most of the large values or increasing spread as the model fitted values increase. To create this plot we could use the following code:</p>
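<p>Because both variables are on the log scale, the fitted line lnWeight = a + b * lnLength corresponds to the power relationship Weight = exp(a) * Length^b on the original scale. The sketch below extracts the estimates and confidence intervals and back-transforms a prediction. The length of 40 inches and the names <strong>length.in</strong>, <strong>by.hand</strong> and <strong>by.predict</strong> are assumptions made for illustration, and the simple back-transformation ignores any retransformation bias.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">alligator = data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
    3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
    3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)
&nbsp;
# Point estimates and 95% confidence intervals (the default level)
b = coef(alli.mod1)
confint(alli.mod1)
&nbsp;
# Back-transform by hand: Weight = exp(a) * Length^b
length.in = 40  # illustrative snout vent length in inches (an assumption)
by.hand = exp(b[1]) * length.in^b[2]
&nbsp;
# The same prediction using predict() on the log scale
by.predict = exp(predict(alli.mod1,
  newdata = data.frame(lnLength = log(length.in))))
c(by.hand, by.predict)</pre></div></div>

<p>The two values agree, which confirms that the model on the log scale is a power law on the original scale.</p>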

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xyplot(resid(alli.mod1) ~ fitted(alli.mod1),
  xlab = &quot;Fitted Values&quot;,
  ylab = &quot;Residuals&quot;,
  main = &quot;Residual Diagnostic Plot&quot;,
  panel = function(x, y, ...)
  {
    panel.grid(h = -1, v = -1)
    panel.abline(h = 0)
    panel.xyplot(x, y, ...)
  }
)</pre></div></div>

<p>We create our own custom panel function using the building blocks provided by the <strong>lattice</strong> package. We start by creating a set of grid lines as the base layer and the <strong>h=-1</strong> and <strong>v=-1</strong> tell <strong>lattice</strong> to align these with the labels on the axes. We then create a solid horizontal line to help distinguish between positive and negative residuals. Finally we get the points plotted on the top layer.</p>
<p>The residual diagnostic plot is shown below:</p>
<div id="attachment_951" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-ResidualPlot.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-ResidualPlot-300x300.jpg" alt="Residual Diagnostic Plot for Linear Model" title="Alligator Residual Plot" width="300" height="300" class="size-medium wp-image-951" /></a><p class="wp-caption-text">Residual Diagnostics Plot for the Linear Regression Model</p></div>
<p>The plot is probably ok, but there are more positive residuals than negative ones, and when we consider a normal probability plot we see that there are some deficiencies with the model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">qqmath( ~ resid(alli.mod1),
  xlab = &quot;Theoretical Quantiles&quot;,
  ylab = &quot;Residuals&quot;
)</pre></div></div>

<p>The function <strong>resid</strong> extracts the model residuals from the fitted model object. The plot is shown here:</p>
<div id="attachment_952" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-QQ.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-QQ-300x300.jpg" alt="Quantile-Quantile Plot for Linear Model" title="Alligator Quantile-Quantile Plot" width="300" height="300" class="size-medium wp-image-952" /></a><p class="wp-caption-text">Quantile-Quantile Plot for the Linear Regression Model</p></div>
<p>We would hope that this plot showed something approaching a straight line to support the model assumption about the distribution of the residuals. This and the other plots suggest that either further tweaking is required to improve the model or a decision needs to be made about whether to report the model as is, with some caveats about its usage. I am interested in thoughts, comments and suggestions on how other people would proceed when faced with this situation. Feel free to add them in the comments.</p>
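<p>When judging how close a normal probability plot is to a straight line, a reference line can help. The following is a minimal sketch, assuming the <strong>lattice</strong> package is installed, that adds a line through the quartiles with <strong>panel.qqmathline</strong>; the object name <strong>qq.plot</strong> is illustrative only.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
&nbsp;
alligator = data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
    3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
    3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)
&nbsp;
qq.plot = qqmath( ~ resid(alli.mod1),
  xlab = 'Theoretical Quantiles',
  ylab = 'Residuals',
  panel = function(x, ...)
  {
    # Reference line through the first and third quartiles
    panel.qqmathline(x, ...)
    # Points plotted on top of the line
    panel.qqmath(x, ...)
  }
)
print(qq.plot)</pre></div></div>

<p>Points that drift away from the line at either end indicate tails that are heavier or lighter than the normal distribution.</p>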
<p><em>Related posts:</em></p>
<ul>
<li>Manual variable selection with the <a href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/">dropterm</a> function.</li>
<li>The <a href="http://www.wekaleamstudios.co.uk/posts/using-the-update-function-during-variable-selection/">update</a> function for simplifying model selection.</li>
<li>Including factors in a regression model via <a href="http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/">analysis of covariance</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Two-way Analysis of Variance (ANOVA)</title>
		<link>http://www.wekaleamstudios.co.uk/posts/two-way-analysis-of-variance-anova/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/two-way-analysis-of-variance-anova/#comments</comments>
		<pubDate>Mon, 15 Feb 2010 21:45:02 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Analysis of Variance]]></category>
		<category><![CDATA[Design of Experiments]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[analysis of variance]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[aov]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[Tukey HSD]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[two way]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=660</guid>
		<description><![CDATA[The analysis of variance (ANOVA) model can be extended from making a comparison between multiple groups to take into account additional factors in an experiment. The simplest extension is from one-way to two-way ANOVA where a second factor is included in the model as well as a potential interaction between the two factors. As an [...]]]></description>
			<content:encoded><![CDATA[<p>The analysis of variance (<strong>ANOVA</strong>) model can be extended from making a comparison between multiple groups to take into account additional factors in an experiment. The simplest extension is from one-way to two-way <strong>ANOVA</strong> where a second factor is included in the model as well as a potential interaction between the two factors.<span id="more-660"></span></p>
<p>As an example consider a company that regularly has to ship parcels between its various (five for this example) sub-offices and has the option of using three competing parcel delivery services, all of which charge roughly similar amounts for each delivery. To determine which service to use, the company decides to run an experiment shipping three packages with each service from its head office to each of the five sub-offices, giving 45 deliveries in total. The delivery time for each package is recorded and the data loaded into <strong>R</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.df = data.frame(
  Service = c(rep(&quot;Carrier 1&quot;, 15), rep(&quot;Carrier 2&quot;, 15),
    rep(&quot;Carrier 3&quot;, 15)),
  Destination = c(rep(c(&quot;Office 1&quot;, &quot;Office 2&quot;, &quot;Office 3&quot;,
    &quot;Office 4&quot;, &quot;Office 5&quot;), 9)),
  Time = c(15.23, 14.32, 14.77, 15.12, 14.05,
  15.48, 14.13, 14.46, 15.62, 14.23, 15.19, 14.67, 14.48, 15.34, 14.22,
  16.66, 16.27, 16.35, 16.93, 15.05, 16.98, 16.43, 15.95, 16.73, 15.62,
  16.53, 16.26, 15.69, 16.97, 15.37, 17.12, 16.65, 15.73, 17.77, 15.52,
  16.15, 16.86, 15.18, 17.96, 15.26, 16.36, 16.44, 14.82, 17.62, 15.04),
  stringsAsFactors = TRUE  # needed from R 4.0 onwards so TukeyHSD sees factors
)</pre></div></div>

<p>The data is then displayed using a dot plot for an initial visual investigation of any trends in delivery time between the three services and across the five sub-offices. The colour aesthetic is used to distinguish between the three services in the plot.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)  # provides ggplot, aes and the geoms used below
ggplot(delivery.df, aes(Time, Destination, colour = Service)) + geom_point()</pre></div></div>

<p>This code produces the following graph:</p>
<div id="attachment_792" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-data.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-data-300x300.png" alt="Service Delivery Time by Destination" title="Delivery Time" width="300" height="300" class="size-medium wp-image-792" /></a><p class="wp-caption-text">Graph of the delivery time for different services and destinations</p></div>
<p>The graph shows a general pattern of service carrier 1 having shorter delivery times than the other two services. There is also an indication that the differences between the services vary across the five sub-offices, so we might expect the interaction term to be significant in the two-way <strong>ANOVA</strong> model. To fit the two-way <strong>ANOVA</strong> model we use this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.mod1 = aov(Time ~ Destination*Service, data = delivery.df)</pre></div></div>

<p>The <strong>*</strong> symbol instructs <strong>R</strong> to create a formula that includes main effects for both Destination and Service as well as the two-way interaction between these two factors. We save the fitted model to an object which we can summarise as follows to test for importance of the various model terms:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(delivery.mod1)
                    Df  Sum Sq Mean Sq  F value    Pr(&gt;F)    
Destination          4 17.5415  4.3854  61.1553 5.408e-14 ***
Service              2 23.1706 11.5853 161.5599 &lt; 2.2e-16 ***
Destination:Service  8  4.1888  0.5236   7.3018 2.360e-05 ***
Residuals           30  2.1513  0.0717                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>
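<p>The expansion performed by the <strong>*</strong> symbol in the model formula can be checked without fitting anything. This short sketch compares the terms generated by the shorthand with those of the formula written out in full; the names <strong>star.terms</strong> and <strong>long.terms</strong> are illustrative only.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Terms generated by the shorthand used in the model above
star.terms = attr(terms(Time ~ Destination*Service), 'term.labels')
&nbsp;
# Terms from the same model with main effects and interaction written out
long.terms = attr(terms(Time ~ Destination + Service + Destination:Service),
  'term.labels')
&nbsp;
star.terms
identical(star.terms, long.terms)</pre></div></div>

<p>Both formulas give the same three terms, which match the three rows above Residuals in the summary table.</p>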

<p>We have strong evidence here that there are differences between the three delivery services, between the five sub-office destinations and that there is an interaction between destination and service in line with what we saw in the original plot of the data. Now that we have fitted the model and identified the important factors we need to investigate the model diagnostics to ensure that the various assumptions are broadly valid.</p>
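<p>One way to see the interaction numerically is to tabulate the mean delivery time for each combination of destination and service. The following is a minimal sketch that rebuilds the data so that it runs on its own; the name <strong>cell.means</strong> is illustrative only.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.df = data.frame(
  Service = rep(c('Carrier 1', 'Carrier 2', 'Carrier 3'), each = 15),
  Destination = rep(c('Office 1', 'Office 2', 'Office 3',
    'Office 4', 'Office 5'), 9),
  Time = c(15.23, 14.32, 14.77, 15.12, 14.05,
  15.48, 14.13, 14.46, 15.62, 14.23, 15.19, 14.67, 14.48, 15.34, 14.22,
  16.66, 16.27, 16.35, 16.93, 15.05, 16.98, 16.43, 15.95, 16.73, 15.62,
  16.53, 16.26, 15.69, 16.97, 15.37, 17.12, 16.65, 15.73, 17.77, 15.52,
  16.15, 16.86, 15.18, 17.96, 15.26, 16.36, 16.44, 14.82, 17.62, 15.04),
  stringsAsFactors = TRUE
)
&nbsp;
# Mean delivery time for each destination (rows) and service (columns)
cell.means = with(delivery.df,
  tapply(Time, list(Destination, Service), mean))
round(cell.means, 2)
&nbsp;
# Gap between the slowest and fastest service at each destination
apply(cell.means, 1, function(m) max(m) - min(m))</pre></div></div>

<p>If there were no interaction the gaps between services would be roughly the same for every destination, so differences between the rows of this table reflect the interaction term.</p>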
<p>We can plot the model residuals against fitted values to look for obvious trends that are not consistent with the model assumptions about independence and common variance. The first step is to create a data frame with the fitted values and residuals from the above model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.res = delivery.df
delivery.res$M1.Fit = fitted(delivery.mod1)
delivery.res$M1.Resid = resid(delivery.mod1)</pre></div></div>

<p>Then a scatter plot is used to display the fitted values and residuals where the colour aesthetic highlights which points correspond to the three competing delivery services:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(M1.Fit, M1.Resid, colour = Service)) + geom_point() +
  xlab(&quot;Fitted Values&quot;) + ylab(&quot;Residuals&quot;)</pre></div></div>

<p>The <strong>xlab()</strong> and <strong>ylab()</strong> functions are used to change the text on the axis labels. The residual diagnostic plot is:</p>
<div id="attachment_798" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid1.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid1-300x300.png" alt="Model Residual Plot" title="Model Residual Plot" width="300" height="300" class="size-medium wp-image-798" /></a><p class="wp-caption-text">Diagnostic Residual Plot for Delivery Time Model</p></div>
<p>There are no obvious patterns in this plot that suggest problems with the two-way <strong>ANOVA</strong> model that we fitted to the data.</p>
<p>As an alternative display we could separate the residuals into destination sub-offices, where the <strong>facet_wrap()</strong> function instructs <strong>ggplot</strong> to create a separate display (panel) for each of the destinations.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(M1.Fit, M1.Resid, colour = Service)) +
  geom_point() + xlab(&quot;Fitted Values&quot;) + ylab(&quot;Residuals&quot;) +
  facet_wrap( ~ Destination)</pre></div></div>

<p>This produces the following alternative residual plot:</p>
<div id="attachment_799" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid2.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid2-300x300.png" alt="Model Residual Plot" title="Model Residual Plot" width="300" height="300" class="size-medium wp-image-799" /></a><p class="wp-caption-text">Diagnostic Residual Plot for Delivery Time Model by Destination</p></div>
<p>There are no obvious problems in this diagnostic plot.</p>
<p>We could also consider dividing the data by delivery service to get a different view of the residuals:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(M1.Fit, M1.Resid, colour = Destination)) +
  geom_point() + xlab(&quot;Fitted Values&quot;) + ylab(&quot;Residuals&quot;) +
  facet_wrap( ~ Service)</pre></div></div>

<p>This creates the following graph:</p>
<div id="attachment_800" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid3.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid3-300x300.png" alt="Model Residual Plot" title="Model Residual Plot" width="300" height="300" class="size-medium wp-image-800" /></a><p class="wp-caption-text">Diagnostic Residual Plot for Delivery Time Model by Service</p></div>
<p>Again there is nothing substantial here to lead us to consider an alternative analysis.</p>
<p>Lastly we consider the normal probability plot of the model residuals, using the <strong>stat_qq()</strong> option:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(sample = M1.Resid)) + stat_qq()</pre></div></div>

<p>The quantile plot is:</p>
<div id="attachment_806" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-qq.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-qq-300x300.png" alt="Quantile Plot" title="Quantile Plot" width="300" height="300" class="size-medium wp-image-806" /></a><p class="wp-caption-text">Normal Probability Plot for Delivery Time Model</p></div>
<p>This plot is very close to the straight line we would expect to observe if the residuals were approximately normally distributed. To round off the analysis we look at the Tukey HSD multiple comparisons to confirm that the differences are between delivery service 1 and the other two competing services:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; TukeyHSD(delivery.mod1, which = &quot;Service&quot;)
  Tukey multiple comparisons of means
    95% family-wise confidence level
&nbsp;
Fit: aov(formula = Time ~ Destination * Service, data = delivery.df)
&nbsp;
$Service
                        diff        lwr       upr     p adj
Carrier 2-Carrier 1 1.498667  1.2576092 1.7397241 0.0000000
Carrier 3-Carrier 1 1.544667  1.3036092 1.7857241 0.0000000
Carrier 3-Carrier 2 0.046000 -0.1950575 0.2870575 0.8856246</pre></div></div>

<p>Even with the multiple comparison post-hoc adjustment there is very strong evidence for the differences that we have consistently observed throughout the analysis.</p>
<p>We can use <strong>ggplot</strong> to visualise the difference in mean delivery time for the services and the 95% confidence intervals on these differences. We create a data frame from the <strong>TukeyHSD</strong> output by extracting the component relating to the delivery service comparison and add the text labels by extracting the row names from the data frame.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.hsd = data.frame(TukeyHSD(delivery.mod1, which = &quot;Service&quot;)$Service)
delivery.hsd$Comparison = row.names(delivery.hsd)</pre></div></div>

<p>We then use the <strong>geom_pointrange()</strong> to specify lower, middle and upper values based on the three pairwise comparisons of interest.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.hsd, aes(Comparison, y = diff, ymin = lwr, ymax = upr)) +
  geom_pointrange() + ylab(&quot;Difference in Mean Delivery Time by Service&quot;) +
  coord_flip()</pre></div></div>

<p>The <strong>coord_flip()</strong> is used to make the confidence intervals horizontal rather than vertical on the graph. This can be confusing when creating the axis labels as we specify the label where it would appear prior to the flip of coordinates. In the example above we add text to the y axis but this now appears on the x axis in the final graph:</p>
<div id="attachment_811" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-tukeyHSD.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-tukeyHSD-300x300.png" alt="Tukey HSD" title="Tukey HSD" width="300" height="300" class="size-medium wp-image-811" /></a><p class="wp-caption-text">Plot of Confidence Intervals for Mean Differences using Tukey HSD</p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/two-way-analysis-of-variance-anova/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

