<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Software for Exploratory Data Analysis and Statistical Modelling &#187; Linear Models</title>
	<atom:link href="http://www.wekaleamstudios.co.uk/topics/r-environment/linear-models/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wekaleamstudios.co.uk</link>
	<description>Statistical Modelling with R</description>
	<lastBuildDate>Wed, 01 Feb 2012 19:44:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Generalized Linear Models &#8211; Poisson Regression</title>
		<link>http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 09:28:50 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Linear Models]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[Generalized Linear Model]]></category>
		<category><![CDATA[glm]]></category>
		<category><![CDATA[Poisson]]></category>
		<category><![CDATA[update]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1547</guid>
		<description><![CDATA[The Generalized Linear Model (GLM) allows us to model responses with distributions other than the Normal distribution, which is one of the assumptions underlying linear regression as used in many cases. When data is counts of events (or items) then a discrete distribution is usually more appropriate than approximating with a continuous [...]]]></description>
			<content:encoded><![CDATA[<p>The Generalized Linear Model (GLM) allows us to model responses with distributions other than the Normal distribution, which is one of the assumptions underlying linear regression in many applications. When the data are counts of events (or items), a discrete distribution is usually more appropriate than a continuous approximation, especially as counts are bounded below at zero: negative counts do not make sense.<span id="more-1547"></span></p>
<p><!--[Fast Tube]--><span id="Z1qE9-Vqw50" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/#Z1qE9-Vqw50"><img src="http://i.ytimg.com/vi/Z1qE9-Vqw50/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>To investigate Poisson regression via the GLM framework, consider a small data set on failure modes (<a href="http://www.sci.usq.edu.au/staff/dunn/Datasets/tech-glms.html">here</a>).</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; failure.df = read.table(&quot;twomodes.dat&quot;, header = TRUE)
&gt; failure.df
  Mode1 Mode2 Failures
1  33.3  25.3       15
2  52.2  14.4        9
3  64.7  32.5       14
4 137.0  20.5       24
5 125.9  97.6       27
6 116.3  53.6       27
7 131.7  56.6       23
8  85.0  87.3       18
9  91.9  47.8       22</pre></div></div>

<p>The machinery is run in two modes and the objective of the analysis is to determine whether the number of failures depends on how long the machine is run in mode 1 or mode 2, and whether there is an interaction between the times spent in each mode that increases or decreases the number of failures.</p>
<p>The response for this set of data is the number of failures (count) so a Poisson regression model is considered.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; fmod1 = glm(Failures ~ Mode1 * Mode2, data = failure.df, family = poisson)
&gt; summary(fmod1)
&nbsp;
Call:
glm(formula = Failures ~ Mode1 * Mode2, family = poisson, data = failure.df)
&nbsp;
Deviance Residuals: 
       1         2         3         4         5         6         7         8         9  
 0.91003  -1.15601  -0.28328  -0.10398   0.03526   0.84825  -0.49211  -0.57298   0.64821  
&nbsp;
Coefficients:
              Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept)  2.105e+00  4.481e-01   4.698 2.63e-06 ***
Mode1        7.687e-03  4.285e-03   1.794   0.0729 .  
Mode2        4.703e-03  1.163e-02   0.405   0.6858    
Mode1:Mode2 -1.978e-05  1.037e-04  -0.191   0.8487    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
(Dispersion parameter for poisson family taken to be 1)
&nbsp;
    Null deviance: 16.996  on 8  degrees of freedom
Residual deviance:  3.967  on 5  degrees of freedom
AIC: 55.024
&nbsp;
Number of Fisher Scoring iterations: 4</pre></div></div>

<p>The model output does not provide any support for an interaction between the times spent in the two different modes of operation. If we remove the interaction term and re-fit the model, using the update function, we get:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; fmod2 = update(fmod1, . ~ . - Mode1:Mode2)
&gt; summary(fmod2)
&nbsp;
Call:
glm(formula = Failures ~ Mode1 + Mode2, family = poisson, data = failure.df)
&nbsp;
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.21984  -0.44735  -0.05893   0.68351   0.87510  
&nbsp;
Coefficients:
            Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept) 2.175168   0.255456   8.515  &lt; 2e-16 ***
Mode1       0.007015   0.002429   2.888  0.00387 ** 
Mode2       0.002549   0.002835   0.899  0.36852    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
(Dispersion parameter for poisson family taken to be 1)
&nbsp;
    Null deviance: 16.9964  on 8  degrees of freedom
Residual deviance:  4.0033  on 6  degrees of freedom
AIC: 53.06
&nbsp;
Number of Fisher Scoring iterations: 4</pre></div></div>

<p>This output suggests that the time of operation in mode 1 is important for determining the number of failures (p = 0.004) but the time of operation in mode 2 is not (p = 0.369). One last step gives us:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; fmod3 = update(fmod2, . ~ . - Mode2)
&gt; summary(fmod3)
&nbsp;
Call:
glm(formula = Failures ~ Mode1, family = poisson, data = failure.df)
&nbsp;
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.43194  -0.56958  -0.00745   0.66742   0.82231  
&nbsp;
Coefficients:
            Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept) 2.237196   0.243053   9.205  &lt; 2e-16 ***
Mode1       0.007705   0.002264   3.403 0.000667 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
(Dispersion parameter for poisson family taken to be 1)
&nbsp;
    Null deviance: 16.9964  on 8  degrees of freedom
Residual deviance:  4.8078  on 7  degrees of freedom
AIC: 51.865
&nbsp;
Number of Fisher Scoring iterations: 4</pre></div></div>

<p>The diagnostic plots shown below do not indicate any major problems with the final model, especially given the small number of data points.</p>
<div id="attachment_1644" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2011/06/Poisson-Regression.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2011/06/Poisson-Regression-300x300.png" alt="Residual Plots for Poisson Regression model" title="Residual Plots for Poisson Regression model" width="300" height="300" class="size-medium wp-image-1644" /></a><p class="wp-caption-text">Four diagnostic plots for a Poisson regression model based on total failures</p></div>
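<p>As a minimal sketch (not part of the original analysis), the three nested models can be refitted and compared with likelihood ratio tests without the data file. The data frame below is re-entered from the listing printed earlier, the name <strong>lrt</strong> is introduced here for illustration, and the use of the base <strong>plot</strong> method to draw the four diagnostic panels is an assumption about how the figure above was produced.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Data re-entered from the listing above, so twomodes.dat is not required
failure.df = data.frame(
  Mode1 = c(33.3, 52.2, 64.7, 137.0, 125.9, 116.3, 131.7, 85.0, 91.9),
  Mode2 = c(25.3, 14.4, 32.5, 20.5, 97.6, 53.6, 56.6, 87.3, 47.8),
  Failures = c(15, 9, 14, 24, 27, 27, 23, 18, 22)
)
# The same three models as in the post
fmod1 = glm(Failures ~ Mode1 * Mode2, data = failure.df, family = poisson)
fmod2 = update(fmod1, . ~ . - Mode1:Mode2)
fmod3 = update(fmod2, . ~ . - Mode2)
# Sequential likelihood ratio tests, smallest model first
lrt = anova(fmod3, fmod2, fmod1, test = 'Chisq')
print(lrt)
# Coefficients on the count scale: exp(slope) is the multiplicative change
# in the expected number of failures per extra unit of time in mode 1
exp(coef(fmod3))
# Four standard diagnostic plots for the final model in a 2 x 2 grid
op = par(mfrow = c(2, 2))
plot(fmod3)
par(op)</pre></div></div>

<p>The drops in residual deviance between successive models (4.8078 to 4.0033, then 4.0033 to 3.967 in the summaries above) are each well below 3.84, the 5% critical value of a chi-squared distribution with one degree of freedom, which is consistent with removing both the interaction and the Mode2 term.</p>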
<p>Other useful resources are provided on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Variable selection using automatic methods</title>
		<link>http://www.wekaleamstudios.co.uk/posts/variable-selection-using-automatic-methods/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/variable-selection-using-automatic-methods/#comments</comments>
		<pubDate>Sat, 22 May 2010 11:15:03 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Linear Models]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[backwards elimination]]></category>
		<category><![CDATA[best subset]]></category>
		<category><![CDATA[forward]]></category>
		<category><![CDATA[leaps]]></category>
		<category><![CDATA[regsubset]]></category>
		<category><![CDATA[stepwise]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1094</guid>
		<description><![CDATA[When we have a set of data with a small number of variables we can easily use a manual approach to identifying a good set of variables and the form they take in our statistical model. In other situations we may have a large number of potentially important variables and it soon becomes a time [...]]]></description>
			<content:encoded><![CDATA[<p>When we have a set of data with a small number of variables we can easily use a manual approach to identifying a good set of variables and the form they take in our statistical model. In other situations we may have a large number of potentially important variables and it soon becomes a time consuming effort to follow a manual variable selection process. In this case we may consider using automatic subset selection tools to remove some of the burden of the task.<span id="more-1094"></span></p>
<p>It should be noted that there is some disagreement about whether it is desirable to use an automated method but this post will focus on the mechanics of doing it rather than the debate about whether to be doing it at all.</p>
<p>The <strong>R</strong> package <strong>leaps</strong> has a function <strong>regsubsets</strong> that can be used for best subsets, forward selection and backwards elimination depending on which approach is considered most appropriate for the application under consideration.</p>
<p>In a previous <a href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/">post</a> we considered using data on CPU performance to illustrate the variable selection process. We load the required packages:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; require(leaps)
&gt; require(MASS)</pre></div></div>

<p>First up we consider selecting the <em>best</em> subset of each size up to a maximum, say four variables for illustrative purposes (<strong>nvmax</strong> argument), and we specify the largest possible model, which in this example has six variables:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, nvmax = 4)</pre></div></div>

<p>A summary for the output from this function is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(reg1)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, nvmax = 4)
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
         syct mmin mmax cach chmin chmax
1  ( 1 ) &quot; &quot;  &quot; &quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;   &quot; &quot;  
2  ( 1 ) &quot; &quot;  &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;   &quot; &quot;  
3  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;   &quot;*&quot;  
4  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;   &quot;*&quot;</pre></div></div>

<p>The function <strong>regsubsets</strong> identifies the variables <strong>mmin</strong>, <strong>mmax</strong>, <strong>cach</strong> and <strong>chmax</strong> as the <em>best</em> four.</p>
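<p>The chosen subsets can also be extracted programmatically rather than read from the printed table. The sketch below is illustrative only: the object name <strong>reg1.sum</strong> is introduced here, and it assumes the <strong>leaps</strong> and <strong>MASS</strong> packages are installed.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">require(leaps)
require(MASS)
# Best subsets of each size up to four variables, as above
reg1 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, nvmax = 4)
reg1.sum = summary(reg1)
# Logical matrix: one row per subset size, one column per term
reg1.sum$which
# Criteria that can help to compare subsets of different sizes
data.frame(size = 1:4, adjr2 = reg1.sum$adjr2,
  cp = reg1.sum$cp, bic = reg1.sum$bic)
# Coefficient estimates for the best four variable model
coef(reg1, 4)</pre></div></div>

<p>Looking at the adjusted R-squared, Mallows' Cp or BIC across sizes is one way to decide how many variables to keep, since <strong>regsubsets</strong> itself only reports the best subset for each size.</p>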
<p>Alternatively we could perform a backwards elimination and the function will indicate the <em>best</em> subset of a particular size, from one to six variables in this example:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; reg2 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = &quot;backward&quot;)
&gt; summary(reg2)
Subset selection object
Call: regsubsets.formula(perf ~ syct + mmin + mmax + cach + chmin + 
    chmax, data = cpus, method = &quot;backward&quot;)
6 Variables  (and intercept)
      Forced in Forced out
syct      FALSE      FALSE
mmin      FALSE      FALSE
mmax      FALSE      FALSE
cach      FALSE      FALSE
chmin     FALSE      FALSE
chmax     FALSE      FALSE
1 subsets of each size up to 6
Selection Algorithm: backward
         syct mmin mmax cach chmin chmax
1  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;  &quot; &quot;   &quot; &quot;  
2  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;  &quot; &quot;   &quot;*&quot;  
3  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;  &quot; &quot;   &quot;*&quot;  
4  ( 1 ) &quot; &quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;   &quot;*&quot;  
5  ( 1 ) &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot; &quot;   &quot;*&quot;  
6  ( 1 ) &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;  &quot;*&quot;   &quot;*&quot;</pre></div></div>

<p>The subset of four variables is the same for this example as the <em>best</em> subsets approach. The third approach is forward selection:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; reg3 = regsubsets(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, method = &quot;forward&quot;)
&gt; summary(reg3)</pre></div></div>

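<p>For this data set, which has only six variables, best subsets and backwards elimination agree on the subsets of three and four variables, although they differ for the smaller subsets: the best single variable is <strong>mmax</strong>, whereas backwards elimination retains <strong>mmin</strong>.</p>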
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/variable-selection-using-automatic-methods/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Linear regression models with robust parameter estimation</title>
		<link>http://www.wekaleamstudios.co.uk/posts/linear-regression-models-with-robust-parameter-estimation/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/linear-regression-models-with-robust-parameter-estimation/#comments</comments>
		<pubDate>Sat, 15 May 2010 10:54:51 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Linear Models]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[lmrob]]></category>
		<category><![CDATA[parameter estimation]]></category>
		<category><![CDATA[robust]]></category>
		<category><![CDATA[robustbase]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1069</guid>
		<description><![CDATA[There are situations in regression modelling where robust methods could be considered to handle unusual observations that do not follow the general trend of the data set. There are various packages in R that provide robust statistical methods which are summarised on the CRAN Robust Task View. As an example of using robust statistical estimation [...]]]></description>
			<content:encoded><![CDATA[<p>There are situations in regression modelling where robust methods could be considered to handle unusual observations that do not follow the general trend of the data set. There are various packages in <strong>R</strong> that provide robust statistical methods which are summarised on the <a href="http://www.r-project.org/">CRAN</a> Robust Task View.<span id="more-1069"></span></p>
<p>As an example of using robust statistical estimation in a linear regression framework consider the CPUs <a href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/">data</a> that was used in previous posts on linear regression and variable selection. For this data we could fit a model with six variables using least squares and also with a fast MM-estimator from the <strong>robustbase</strong> package.</p>
<p>The first step is to make the functions and data available for analysis:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">require(MASS)
require(robustbase)</pre></div></div>

<p>The linear model using least squares is fitted as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)</pre></div></div>

<p>The summary for this model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(cpu.mod1)
&nbsp;
Call:
lm(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus)
&nbsp;
Residuals:
     Min       1Q   Median       3Q      Max 
-195.841  -25.169    5.409   26.528  385.749 
&nbsp;
Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) -5.590e+01  8.045e+00  -6.948 4.99e-11 ***
syct         4.886e-02  1.752e-02   2.789  0.00579 ** 
mmin         1.529e-02  1.827e-03   8.371 9.42e-15 ***
mmax         5.571e-03  6.418e-04   8.680 1.33e-15 ***
cach         6.412e-01  1.396e-01   4.594 7.64e-06 ***
chmin       -2.701e-01  8.557e-01  -0.316  0.75263    
chmax        1.483e+00  2.201e-01   6.738 1.64e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 59.99 on 202 degrees of freedom
Multiple R-squared: 0.8649,     Adjusted R-squared: 0.8609 
F-statistic: 215.5 on 6 and 202 DF,  p-value: &lt; 2.2e-16</pre></div></div>

<p>The linear model using an MM-estimator is fitted as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">cpu.robmod1 = lmrob(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, control = lmrob.control(max.it = 100))</pre></div></div>

<p>Note that we need to increase the default number of iterations (50) to allow the routine to converge to a solution. The summary for this model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(cpu.robmod1)
&nbsp;
Call:
lmrob(formula = perf ~ syct + mmin + mmax + cach + chmin + chmax, 
    data = cpus, control = lmrob.control(max.it = 100))
&nbsp;
Weighted Residuals:
      Min        1Q    Median        3Q       Max 
-144.0045   -9.4554    0.7691   13.3757  759.6953 
&nbsp;
Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) -3.6634297  6.1697645  -0.594 0.553329    
syct         0.0063112  0.0043877   1.438 0.151868    
mmin         0.0098798  0.0034726   2.845 0.004897 ** 
mmax         0.0024463  0.0004525   5.407 1.80e-07 ***
cach         0.8702102  0.2551245   3.411 0.000782 ***
chmin        2.4078436  1.3319413   1.808 0.072130 .  
chmax        0.1016861  0.1494902   0.680 0.497145    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Robust residual standard error: 19.27 
Convergence in 65 IRWLS iterations
&nbsp;
Robustness weights: 
 15 observations c(1,8,9,10,31,32,96,97,98,153,154,156,169,199,200) are outliers with |weight| = 0 ( &lt; 0.00048); 
 14 weights are ~= 1. The remaining 180 ones are summarized as
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002472 0.831300 0.963700 0.849800 0.989500 0.998900 
Algorithmic parameters: 
tuning.chi         bb tuning.psi refine.tol    rel.tol 
 1.5476400  0.5000000  4.6850610  0.0000001  0.0000001 
 nResample     max.it     groups    n.group   best.r.s   k.fast.s      k.max  trace.lev compute.rd 
       500        100          5        400          2          1        200          0          0 
seed : int(0)</pre></div></div>

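<p>The two models differ in the variables that are considered important: under least squares every variable except <strong>chmin</strong> is significant at the 1% level, whereas under the MM-estimator <strong>syct</strong> and <strong>chmax</strong> are no longer significant. The output from the <strong>lmrob</strong> function also provides a summary of the weights that have been allocated to the data. Fifteen of the data points have been flagged as outliers and given weights of effectively zero by the fitting algorithm.</p>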
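<p>The comparison can be made more directly by placing the two sets of estimates side by side and extracting the robustness weights. This is a minimal sketch rather than part of the original analysis: the name <strong>rob.wts</strong> and the seed value are introduced here for illustration, and the estimates may differ slightly from those printed above depending on the version of <strong>robustbase</strong>.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">require(MASS)
require(robustbase)
# Least squares fit
cpu.mod1 = lm(perf ~ syct + mmin + mmax + cach + chmin + chmax, data = cpus)
# The initial estimate in lmrob uses random resampling, so fix the seed
set.seed(1)
cpu.robmod1 = lmrob(perf ~ syct + mmin + mmax + cach + chmin + chmax,
  data = cpus, control = lmrob.control(max.it = 100))
# Coefficient estimates from the two fits, side by side
cbind(LS = coef(cpu.mod1), MM = coef(cpu.robmod1))
# Robustness weights, one per observation (values near 0 are downweighted)
rob.wts = weights(cpu.robmod1, type = 'robustness')
# The fifteen smallest weights
head(sort(rob.wts), 15)</pre></div></div>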
<p><em>Related posts:</em></p>
<ul>
<li>The <a href="http://www.wekaleamstudios.co.uk/posts/using-the-update-function-during-variable-selection/">update</a> function for simplifying model selection.</li>
<li>Data analysis using <a href="http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/">simple linear regression</a> models.</li>
<li>The <a href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/">dropterm</a> function for simplifying models.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/linear-regression-models-with-robust-parameter-estimation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Simple Linear Regression</title>
		<link>http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/#comments</comments>
		<pubDate>Fri, 23 Apr 2010 08:51:57 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Linear Models]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[explanatory variable]]></category>
		<category><![CDATA[fitted]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[linear]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[modelling]]></category>
		<category><![CDATA[one variable]]></category>
		<category><![CDATA[predictor]]></category>
		<category><![CDATA[qqmath]]></category>
		<category><![CDATA[regression]]></category>
		<category><![CDATA[resid]]></category>
		<category><![CDATA[residual]]></category>
		<category><![CDATA[response]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[xyplot]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=907</guid>
		<description><![CDATA[One of the most frequently used techniques in statistics is linear regression where we investigate the potential relationship between a variable of interest (often called the response variable but there are many other names in use) and a set of one or more variables (known as the independent variables or some other term). Unsurprisingly there [...]]]></description>
			<content:encoded><![CDATA[<p>One of the most frequently used techniques in statistics is linear regression where we investigate the potential relationship between a variable of interest (often called the response variable but there are many other names in use) and a set of one or more variables (known as the independent variables or some other term). Unsurprisingly there are flexible facilities in <strong>R</strong> for fitting a range of linear models from the simple case of a single variable to more complex relationships.<span id="more-907"></span></p>
<p>In this post we will consider the case of simple linear regression with one response variable and a single independent variable. For this example we will use some data from the book Mathematical Statistics with Applications by Mendenhall, Wackerly and Scheaffer (Fourth Edition &#8211; Duxbury 1990). This data is for a study in central Florida where 15 alligators were captured and two measurements were made on each of the alligators. The weight (in pounds) was recorded along with the snout vent length (in inches &#8211; this is the distance from the tip of the snout to the vent).</p>
<p>The purpose of using this data is to determine whether there is a relationship, described by a simple linear regression model, between the weight and snout vent length. The authors analysed the data on the log scale (natural logarithms) and we will follow their approach for consistency. We first create a data frame for this study:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">alligator = data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
    3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
    3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)</pre></div></div>

<p>As with most analyses the first step is to perform some <a href="http://www.wekaleamstudios.co.uk/exploratory-data-analysis/">exploratory data analysis</a> to get a visual impression of whether there is a relationship between weight and snout vent length and what form it is likely to take. We create a <a href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/">scatter plot</a> of the data as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">require(lattice)
xyplot(lnWeight ~ lnLength, data = alligator,
  xlab = &quot;Snout vent length (inches) on log scale&quot;,
  ylab = &quot;Weight (pounds) on log scale&quot;,
  main = &quot;Alligators in Central Florida&quot;
)</pre></div></div>

<p>The scatter plot is shown here:</p>
<div id="attachment_946" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-Data.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-Data-300x300.jpg" alt="Plot of the weight and snout vent length" title="Alligator Data Plot" width="300" height="300" class="size-medium wp-image-946" /></a><p class="wp-caption-text">Scatter plot of the weight and snout vent length for alligators caught in central Florida</p></div>
<p>The graph suggests that weight (on the log scale) increases linearly with snout vent length (again on the log scale) so we will fit a simple linear regression model to the data and save the fitted model to an object for further analysis:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)</pre></div></div>

<p>The function <strong>lm</strong> fits a linear model to data and we specify the model using a formula where the response variable is on the left hand side separated by a ~ from the explanatory variables. The formula provides a flexible way to specify various different functional forms for the relationship. The <strong>data</strong> argument is used to tell <strong>R</strong> where to look for the variables used in the formula.</p>
<p>Now that the model is saved as an object we can use some of the general purpose functions for extracting information from this object about the linear model, e.g. the parameters or residuals. The big plus with <strong>R</strong> is that there are functions defined for different types of model, using the same name such as summary, and the system works out what function we intended to use based on the type of object saved. To create a summary of the fitted model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(alli.mod1)
&nbsp;
Call:
lm(formula = lnWeight ~ lnLength, data = alligator)
&nbsp;
Residuals:
     Min       1Q   Median       3Q      Max 
-0.24348 -0.03186  0.03740  0.07727  0.12669 
&nbsp;
Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  -8.4761     0.5007  -16.93 3.08e-10 ***
lnLength      3.4311     0.1330   25.80 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 0.1229 on 13 degrees of freedom
Multiple R-squared: 0.9808,     Adjusted R-squared: 0.9794 
F-statistic: 665.8 on 1 and 13 DF,  p-value: 1.495e-12</pre></div></div>

<p>We get a lot of useful information here without being too overwhelmed by pages of output.</p>
<p>The estimate of the model intercept is -8.4761 and the coefficient measuring the <strong>slope</strong> of the relationship with snout vent length is 3.4311; the standard errors of these estimates are also provided in the Coefficients table. The tests of significance summarised in that table give strong evidence that the slope is different from zero &#8211; as the snout vent length increases so does the weight.</p>
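<p>Because both variables are on the log scale, the fitted line corresponds to the power relationship weight = exp(-8.4761) &times; length<sup>3.4311</sup> on the original scale. As a minimal sketch of how the estimates might be used, the code below extracts confidence intervals and back-transforms a prediction. The snout vent length of 50 inches is a hypothetical value chosen for illustration from within the observed range, and <strong>new.dat</strong> is a name introduced here.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Data and model as defined earlier in the post
alligator = data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
    3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
    3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)
# 95% confidence intervals for the intercept and slope
confint(alli.mod1)
# Prediction on the log scale for a hypothetical 50 inch alligator,
# back-transformed to pounds with exp
new.dat = data.frame(lnLength = log(50))
exp(predict(alli.mod1, newdata = new.dat))</pre></div></div>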
<p>Rather than stopping here we use residual diagnostics to investigate whether the assumptions that underpin linear regression are reasonable for our data. The diagnostics may also suggest that additional variables, or some other alteration to the model, are needed to better describe how weight changes.</p>
<p>A plot of the residuals against fitted values is used to determine whether there are any systematic patterns, such as over estimation for most of the large values or increasing spread as the model fitted values increase. To create this plot we could use the following code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xyplot(resid(alli.mod1) ~ fitted(alli.mod1),
  xlab = &quot;Fitted Values&quot;,
  ylab = &quot;Residuals&quot;,
  main = &quot;Residual Diagnostic Plot&quot;,
  panel = function(x, y, ...)
  {
    panel.grid(h = -1, v = -1)
    panel.abline(h = 0)
    panel.xyplot(x, y, ...)
  }
)</pre></div></div>

<p>We create our own custom panel function using the building blocks provided by the <strong>lattice</strong> package. We start by creating a set of grid lines as the base layer and the <strong>h=-1</strong> and <strong>v=-1</strong> tell <strong>lattice</strong> to align these with the labels on the axes. We then create a solid horizontal line to help distinguish between positive and negative residuals. Finally we get the points plotted on the top layer.</p>
<p>The residual diagnostic plot is shown below:</p>
<div id="attachment_951" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-ResidualPlot.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-ResidualPlot-300x300.jpg" alt="Residual Diagnostic Plot for Linear Model" title="Alligator Residual Plot" width="300" height="300" class="size-medium wp-image-951" /></a><p class="wp-caption-text">Residual Diagnostics Plot for the Linear Regression Model</p></div>
<p>The plot is probably ok but there are more cases of positive residuals and when we consider a normal probability plot we see that there are some deficiencies with the model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">qqmath( ~ resid(alli.mod1),
  xlab = &quot;Theoretical Quantiles&quot;,
  ylab = &quot;Residuals&quot;
)</pre></div></div>

<p>The function <strong>resid</strong> extracts the model residuals from the fitted model object. The plot is shown here:</p>
<div id="attachment_952" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-QQ.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/Alligator-QQ-300x300.jpg" alt="Quantile-Quantile Plot for Linear Model" title="Alligator Quantile-Quantile Plot" width="300" height="300" class="size-medium wp-image-952" /></a><p class="wp-caption-text">Quantile-Quantile Plot for the Linear Regression Model</p></div>
<p>We would hope that this plot showed something approaching a straight line to support the model assumption about the distribution of the residuals. This and the other plots suggest that further tweaking is required to improve the model, or that a decision would need to be made about whether to report the model as is with some caveats about its usage. I am interested in thoughts, comments and suggestions on how other people would proceed when faced with this situation &#8211; feel free to add them in the comments.</p>
<p><em>Related posts:</em></p>
<ul>
<li>Manual variable selection with the <a href="http://www.wekaleamstudios.co.uk/posts/manual-variable-selection-using-the-dropterm-function/">dropterm</a> function.</li>
<li>The <a href="http://www.wekaleamstudios.co.uk/posts/using-the-update-function-during-variable-selection/">update</a> function for simplifying model selection.</li>
<li>Including factors in a regression model via <a href="http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/">analysis of covariance</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/simple-linear-regression/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

