<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Software for Exploratory Data Analysis and Statistical Modelling &#187; Grammar of Graphics</title>
	<atom:link href="http://www.wekaleamstudios.co.uk/topics/r-environment/ggplot2-r-environment/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wekaleamstudios.co.uk</link>
	<description>Statistical Modelling with R</description>
	<lastBuildDate>Wed, 01 Feb 2012 19:44:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Generalized Linear Models &#8211; Poisson Regression</title>
		<link>http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/#comments</comments>
		<pubDate>Sun, 26 Jun 2011 09:28:50 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Linear Models]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[Generalized Linear Model]]></category>
		<category><![CDATA[glm]]></category>
		<category><![CDATA[Poisson]]></category>
		<category><![CDATA[update]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1547</guid>
		<description><![CDATA[The Generalized Linear Model (GLM) allows us to model responses with distributions other than the Normal distribution, which is one of the assumptions underlying linear regression as used in many cases. When the data are counts of events (or items), a discrete distribution is usually more appropriate than approximating with a continuous [...]]]></description>
			<content:encoded><![CDATA[<p>The Generalized Linear Model (GLM) allows us to model responses with distributions other than the Normal distribution, which is one of the assumptions underlying ordinary linear regression. When the data are counts of events (or items), a discrete distribution is usually more appropriate than a continuous approximation, especially as counts are bounded below at zero: negative counts do not make sense.<span id="more-1547"></span></p>
<p><!--[Fast Tube]--><span id="Z1qE9-Vqw50" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/#Z1qE9-Vqw50"><img src="http://i.ytimg.com/vi/Z1qE9-Vqw50/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>To investigate Poisson regression via the GLM framework, consider a small data set on failure modes (<a href="http://www.sci.usq.edu.au/staff/dunn/Datasets/tech-glms.html">here</a>).</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; failure.df = read.table(&quot;twomodes.dat&quot;, header = TRUE)
&gt; failure.df
  Mode1 Mode2 Failures
1  33.3  25.3       15
2  52.2  14.4        9
3  64.7  32.5       14
4 137.0  20.5       24
5 125.9  97.6       27
6 116.3  53.6       27
7 131.7  56.6       23
8  85.0  87.3       18
9  91.9  47.8       22</pre></div></div>

<p>The machinery is run in two modes and the objective of the analysis is to determine whether the number of failures depends on how long the machine is run in mode 1 or mode 2, and whether there is an interaction between the times spent in each mode that increases or decreases the number of failures.</p>
<p>The response for this set of data is the number of failures (count) so a Poisson regression model is considered.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; fmod1 = glm(Failures ~ Mode1 * Mode2, data = failure.df, family = poisson)
&gt; summary(fmod1)
&nbsp;
Call:
glm(formula = Failures ~ Mode1 * Mode2, family = poisson, data = failure.df)
&nbsp;
Deviance Residuals: 
       1         2         3         4         5         6         7         8         9  
 0.91003  -1.15601  -0.28328  -0.10398   0.03526   0.84825  -0.49211  -0.57298   0.64821  
&nbsp;
Coefficients:
              Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept)  2.105e+00  4.481e-01   4.698 2.63e-06 ***
Mode1        7.687e-03  4.285e-03   1.794   0.0729 .  
Mode2        4.703e-03  1.163e-02   0.405   0.6858    
Mode1:Mode2 -1.978e-05  1.037e-04  -0.191   0.8487    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
(Dispersion parameter for poisson family taken to be 1)
&nbsp;
    Null deviance: 16.996  on 8  degrees of freedom
Residual deviance:  3.967  on 5  degrees of freedom
AIC: 55.024
&nbsp;
Number of Fisher Scoring iterations: 4</pre></div></div>

<p>The model output does not provide any support for an interaction between the amounts of time spent in the two modes of operation (the p-value for the Mode1:Mode2 term is 0.85). If we remove the interaction term and re-fit the model, using the update function, we get:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; fmod2 = update(fmod1, . ~ . - Mode1:Mode2)
&gt; summary(fmod2)
&nbsp;
Call:
glm(formula = Failures ~ Mode1 + Mode2, family = poisson, data = failure.df)
&nbsp;
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.21984  -0.44735  -0.05893   0.68351   0.87510  
&nbsp;
Coefficients:
            Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept) 2.175168   0.255456   8.515  &lt; 2e-16 ***
Mode1       0.007015   0.002429   2.888  0.00387 ** 
Mode2       0.002549   0.002835   0.899  0.36852    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
(Dispersion parameter for poisson family taken to be 1)
&nbsp;
    Null deviance: 16.9964  on 8  degrees of freedom
Residual deviance:  4.0033  on 6  degrees of freedom
AIC: 53.06
&nbsp;
Number of Fisher Scoring iterations: 4</pre></div></div>

<p>This output suggests that the time of operation in mode 1 is important for determining the number of failures but the time of operation in mode 2 is not (p-value 0.37). Removing the Mode2 term as a last step gives us:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; fmod3 = update(fmod2, . ~ . - Mode2)
&gt; summary(fmod3)
&nbsp;
Call:
glm(formula = Failures ~ Mode1, family = poisson, data = failure.df)
&nbsp;
Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.43194  -0.56958  -0.00745   0.66742   0.82231  
&nbsp;
Coefficients:
            Estimate Std. Error z value Pr(&gt;|z|)    
(Intercept) 2.237196   0.243053   9.205  &lt; 2e-16 ***
Mode1       0.007705   0.002264   3.403 0.000667 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
(Dispersion parameter for poisson family taken to be 1)
&nbsp;
    Null deviance: 16.9964  on 8  degrees of freedom
Residual deviance:  4.8078  on 7  degrees of freedom
AIC: 51.865
&nbsp;
Number of Fisher Scoring iterations: 4</pre></div></div>
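
<p>The three fits above form a nested sequence, so the term-by-term decisions can also be checked with likelihood ratio tests on the change in residual deviance. The following is a minimal sketch rather than part of the original analysis: it re-enters the nine rows printed earlier so that the data file is not needed, compares the models with the <strong>anova</strong> function, and then exponentiates the coefficients of the final model so that they read as multiplicative effects on the expected number of failures.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Data re-entered from the listing printed earlier in the post
failure.df = data.frame(
  Mode1 = c(33.3, 52.2, 64.7, 137.0, 125.9, 116.3, 131.7, 85.0, 91.9),
  Mode2 = c(25.3, 14.4, 32.5, 20.5, 97.6, 53.6, 56.6, 87.3, 47.8),
  Failures = c(15, 9, 14, 24, 27, 27, 23, 18, 22))
# The three nested models fitted in the post
fmod1 = glm(Failures ~ Mode1 * Mode2, data = failure.df, family = poisson)
fmod2 = update(fmod1, . ~ . - Mode1:Mode2)
fmod3 = update(fmod2, . ~ . - Mode2)
# Likelihood ratio (chi-squared) tests between successive models
fmod.cmp = anova(fmod3, fmod2, fmod1, test = &quot;Chisq&quot;)
fmod.cmp
# Coefficients of the final model on the multiplicative (rate ratio) scale
exp(coef(fmod3))</pre></div></div>

<p>Using the estimate printed above, exp(0.007705) is about 1.0077, so each additional unit of time in mode 1 is associated with an expected failure count that is roughly 0.8% higher.</p>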

<p>The diagnostic plots shown below do not indicate any major problems with the final model, especially given the small number of data points.</p>
<div id="attachment_1644" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2011/06/Poisson-Regression.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2011/06/Poisson-Regression-300x300.png" alt="Residual Plots for Poisson Regression model" title="Residual Plots for Poisson Regression model" width="300" height="300" class="size-medium wp-image-1644" /></a><p class="wp-caption-text">Four diagnostic plots for a Poisson regression model based on total failures</p></div>
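<p>The code behind the diagnostic figure is not shown in the post. One plausible way to produce four plots of this kind, offered as an assumption rather than as the author's method, is the <strong>plot</strong> method for fitted model objects drawn on a 2 x 2 layout. The sketch also computes the ratio of residual deviance to residual degrees of freedom as a rough check for overdispersion; the summary above reports 4.8078 on 7 degrees of freedom, a ratio of about 0.69.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Data re-entered from the listing printed earlier in the post
failure.df = data.frame(
  Mode1 = c(33.3, 52.2, 64.7, 137.0, 125.9, 116.3, 131.7, 85.0, 91.9),
  Mode2 = c(25.3, 14.4, 32.5, 20.5, 97.6, 53.6, 56.6, 87.3, 47.8),
  Failures = c(15, 9, 14, 24, 27, 27, 23, 18, 22))
fmod3 = glm(Failures ~ Mode1, data = failure.df, family = poisson)
# Arrange the four standard diagnostic plots on one page
old.par = par(mfrow = c(2, 2))
plot(fmod3)
par(old.par)
# Values well above 1 would suggest more variation than the Poisson allows
disp.ratio = deviance(fmod3) / df.residual(fmod3)
disp.ratio</pre></div></div>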
<p>Other useful resources are provided on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/generalized-linear-models-poisson-regression/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Plotting Time Series data using ggplot2</title>
		<link>http://www.wekaleamstudios.co.uk/posts/plotting-time-series-data-using-ggplot2/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/plotting-time-series-data-using-ggplot2/#comments</comments>
		<pubDate>Thu, 30 Sep 2010 21:05:18 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[aes]]></category>
		<category><![CDATA[date]]></category>
		<category><![CDATA[geom_line]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[line]]></category>
		<category><![CDATA[plot]]></category>
		<category><![CDATA[scale_x_date]]></category>
		<category><![CDATA[time series]]></category>
		<category><![CDATA[xlab]]></category>
		<category><![CDATA[ylab]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1404</guid>
		<description><![CDATA[There are various ways to plot data that is represented by a time series in R. The ggplot2 package has scales that can handle dates reasonably easily. As an example consider a data set on the number of views of the YouTube channel ramstatvid. A short snippet of the data [...]]]></description>
			<content:encoded><![CDATA[<p>There are various ways to plot data that is represented by a time series in <strong>R</strong>. The <strong><a href="http://had.co.nz/ggplot2/">ggplot2</a></strong> package has scales that can handle dates reasonably easily.<span id="more-1404"></span></p>
<p><!--[Fast Tube]--><span id="irtSRkhGbXg" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/plotting-time-series-data-using-ggplot2/#irtSRkhGbXg"><img src="http://i.ytimg.com/vi/irtSRkhGbXg/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>As an example consider a data set on the number of views of the YouTube channel <a href="http://www.youtube.com/user/ramstatvid?feature=mhum">ramstatvid</a>. A short snippet of the data is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; head(yt.views)
        Date Views
1 2010-05-17    13
2 2010-05-18    11
3 2010-05-19     4
4 2010-05-20     2
5 2010-05-21    23
6 2010-05-22    26</pre></div></div>

<p>The <strong>ggplot</strong> function takes a data frame as its first argument, and the <strong>aes</strong> function maps <strong>Date</strong> to the x-axis and the number of <strong>Views</strong> to the y-axis.</p>

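<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Current ggplot2 releases take the label format via date_labels;
# the format argument used by older releases is no longer accepted.
ggplot(yt.views, aes(Date, Views)) + geom_line() +
  scale_x_date(date_labels = &quot;%b-%Y&quot;) + xlab(&quot;&quot;) + ylab(&quot;Daily Views&quot;)</pre></div></div>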

<p>The axis labels for the <strong>Date</strong> variable are created with the <strong>scale_x_date</strong> function where the format is specified as a Month/Year combination with the <strong>%b</strong> and <strong>%Y</strong> formatting strings. The graph that is produced is shown here:</p>
<div id="attachment_1403" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/09/ts-example1.jpg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/09/ts-example1-300x300.jpg" alt="Time Series Example" title="Time Series Example" width="300" height="300" class="size-medium wp-image-1403" /></a><p class="wp-caption-text">Time Series Plot Example with ggplot2 package</p></div>
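<p>The date scale relies on the <strong>Date</strong> column being stored with class Date rather than as text or as a factor. The post does not show how <strong>yt.views</strong> was read in, so the following is only a sketch under that assumption: it re-enters the six rows printed above as character strings and converts them with the <strong>as.Date</strong> function.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Six rows re-entered from the snippet above, with dates held as text
yt.views = data.frame(
  Date = c(&quot;2010-05-17&quot;, &quot;2010-05-18&quot;, &quot;2010-05-19&quot;,
           &quot;2010-05-20&quot;, &quot;2010-05-21&quot;, &quot;2010-05-22&quot;),
  Views = c(13, 11, 4, 2, 23, 26),
  stringsAsFactors = FALSE)
# Convert the text to Date objects so that scale_x_date can be applied
yt.views$Date = as.Date(yt.views$Date, format = &quot;%Y-%m-%d&quot;)
class(yt.views$Date)
# Preview the axis label format (the month abbreviation depends on locale)
format(yt.views$Date, &quot;%b-%Y&quot;)</pre></div></div>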
<p>Other useful resources are provided on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/plotting-time-series-data-using-ggplot2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Charting the performance of cricket all-rounders &#8211; IT Botham</title>
		<link>http://www.wekaleamstudios.co.uk/posts/charting-the-performance-of-cricket-all-rounders-it-botham/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/charting-the-performance-of-cricket-all-rounders-it-botham/#comments</comments>
		<pubDate>Mon, 16 Aug 2010 19:59:54 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[all-rounder]]></category>
		<category><![CDATA[botham]]></category>
		<category><![CDATA[catches]]></category>
		<category><![CDATA[cricket]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[histogram]]></category>
		<category><![CDATA[runs]]></category>
		<category><![CDATA[wickets]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1321</guid>
		<description><![CDATA[Cricket is a sport that generates a large volume of performance data and corresponding debate about the relative qualities of various players over their careers and in relation to their contemporaries. The cricinfo website has an extensive database of statistics for professional cricketers that can be searched to access the information in various formats. As [...]]]></description>
			<content:encoded><![CDATA[<p>Cricket is a sport that generates a large volume of performance data and corresponding debate about the relative qualities of various players over their careers and in relation to their contemporaries. The <a href="http://www.cricinfo.com/">cricinfo</a> website has an extensive database of statistics for professional cricketers that can be searched to access the information in various formats.<span id="more-1321"></span></p>
<p>As an initial example we will consider the English legend Sir Ian Botham, who played 102 test matches for England between his debut in 1977 and his final game in 1992.</p>
<p>The first obvious breakdown is to consider how Botham performed against the six countries that he played against during his test career. A summary of his statistics is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"> Opposition Matches Bat Inns Runs NO Bowl Inns Wicket Catch
  Australia      36       49 1673  2       66     148    57
      India      14       16 1201  0       23      59    14
New Zealand      15       22  846  2       28      64    14
   Pakistan      14       20  647  1       18      40    14
  Sri Lanka       3        3   41  0        6      11     2
West Indies      20       37  792  1       27      61    19</pre></div></div>
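
<p>One way to hold this table in <strong>R</strong>, given here as a sketch, is to enter the columns directly. The column names <strong>BatInns</strong> and <strong>BowlInns</strong> are assumptions standing in for the printed headings Bat Inns and Bowl Inns; the other names follow the table.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Values copied from the summary table above
itb.opp = data.frame(
  Opposition = c(&quot;Australia&quot;, &quot;India&quot;, &quot;New Zealand&quot;,
                 &quot;Pakistan&quot;, &quot;Sri Lanka&quot;, &quot;West Indies&quot;),
  Matches = c(36, 14, 15, 14, 3, 20),
  BatInns = c(49, 16, 22, 20, 3, 37),
  Runs = c(1673, 1201, 846, 647, 41, 792),
  NO = c(2, 0, 2, 1, 0, 1),
  BowlInns = c(66, 23, 28, 18, 6, 27),
  Wicket = c(148, 59, 64, 40, 11, 61),
  Catch = c(57, 14, 14, 14, 2, 19))
# Column totals as a quick check on the data entry
colSums(itb.opp[, c(&quot;Matches&quot;, &quot;Wicket&quot;, &quot;Catch&quot;)])</pre></div></div>

<p>The match and catch totals should agree with the 102 tests and 120 catches quoted elsewhere in the text.</p>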

<p>Botham only played three matches against Sri Lanka so it is difficult to properly assess his performance against them. If the above table is stored in a data frame <strong>itb.opp</strong> then we can create a bar chart of the total runs (or wickets) by opposition country:</p>

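<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># stat = &quot;identity&quot; draws bars at the Runs values instead of counting rows
ggplot(itb.opp, aes(Opposition, Runs)) + geom_bar(stat = &quot;identity&quot;) +
  xlab(&quot;Country&quot;) + ylab(&quot;Total Runs&quot;)</pre></div></div>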

<p>This code produces the following graph:</p>
<div id="attachment_1355" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Runs.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Runs-300x300.png" alt="IT Botham Total Runs by Opposition" title="IT Botham Total Runs" width="300" height="300" class="size-medium wp-image-1355" /></a><p class="wp-caption-text">IT Botham Total Runs by Opposition</p></div>
<p>The total wickets graph is produced by the next code:</p>

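<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># stat = &quot;identity&quot; draws bars at the Wicket values instead of counting rows
ggplot(itb.opp, aes(Opposition, Wicket)) + geom_bar(stat = &quot;identity&quot;) +
  xlab(&quot;Country&quot;) + ylab(&quot;Total Wickets&quot;)</pre></div></div>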

<div id="attachment_1356" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Wickets.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Wickets-300x300.png" alt="IT Botham Total Wickets by Opposition" title="IT Botham Total Wickets" width="300" height="300" class="size-medium wp-image-1356" /></a><p class="wp-caption-text">IT Botham Total Wickets by Opposition</p></div>
<p>We may now want to delve deeper into the performance against different nations to take into account the number of games or innings where Botham batted or bowled. The traditional way to assess performance is to calculate batting and bowling averages (runs scored per dismissal and runs conceded per wicket taken). Doing this by opposition gives the following data frame:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; itb.opp.sum
 Opposition Discipline  Average
  Australia    Batting 29.35088
      India    Batting 70.64706
New Zealand    Batting 42.30000
   Pakistan    Batting 32.35000
  Sri Lanka    Batting 13.66667
West Indies    Batting 21.40541
  Australia    Bowling 27.65541
      India    Bowling 26.40678
New Zealand    Bowling 23.43750
   Pakistan    Bowling 31.77500
  Sri Lanka    Bowling 28.18182
West Indies    Bowling 35.18033</pre></div></div>

<p>This can be converted into a dot plot so we can see whether Botham had a higher batting average than bowling average, which is often taken to be one of the signs of an all-rounder.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(itb.opp.sum, aes(Average, Opposition, colour = Discipline)) +
  geom_point()+ xlab(&quot;Average&quot;) + ylab(&quot;&quot;)</pre></div></div>

<p>The graph is shown here:</p>
<div id="attachment_1362" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country-300x300.png" alt="IT Botham Batting and Bowling Averages by Opposition" title="IT Botham Batting and Bowling Averages" width="300" height="300" class="size-medium wp-image-1362" /></a><p class="wp-caption-text">IT Botham Batting and Bowling Averages by Opposition</p></div>
<p>We can see the differences in performance based on the opposition. Botham&#8217;s performance against the West Indies, by far the strongest team during most of his international career, was worse than against the other countries. However, his averages were far from embarrassing when compared to other players at the time. The graph also shows that Botham enjoyed batting and bowling against India.</p>
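<p>The comparison that the dot plot makes by eye can also be tabulated. As a sketch, the code below re-enters the averages printed above, places the batting and bowling values side by side with the <strong>merge</strong> function and subtracts one from the other; the names <strong>itb.diff</strong> and <strong>Difference</strong> are introduced here for illustration only.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Averages copied from the itb.opp.sum listing above
countries = c(&quot;Australia&quot;, &quot;India&quot;, &quot;New Zealand&quot;,
              &quot;Pakistan&quot;, &quot;Sri Lanka&quot;, &quot;West Indies&quot;)
itb.opp.sum = data.frame(
  Opposition = rep(countries, 2),
  Discipline = rep(c(&quot;Batting&quot;, &quot;Bowling&quot;), each = 6),
  Average = c(29.35088, 70.64706, 42.30000, 32.35000, 13.66667, 21.40541,
              27.65541, 26.40678, 23.43750, 31.77500, 28.18182, 35.18033))
# Split by discipline and join on opposition to get one row per country
bat = subset(itb.opp.sum, Discipline == &quot;Batting&quot;)
bowl = subset(itb.opp.sum, Discipline == &quot;Bowling&quot;)
itb.diff = merge(bat, bowl, by = &quot;Opposition&quot;, suffixes = c(&quot;.bat&quot;, &quot;.bowl&quot;))
# Positive values mean the batting average exceeds the bowling average
itb.diff$Difference = itb.diff$Average.bat - itb.diff$Average.bowl
itb.diff[, c(&quot;Opposition&quot;, &quot;Difference&quot;)]</pre></div></div>

<p>On these figures the batting average is the higher of the two against Australia, India, New Zealand and Pakistan, and the lower against Sri Lanka and the West Indies.</p>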
<p>We can divide the data further according to whether the matches were played in England (home) or elsewhere (away):</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; itb.opp.ha.sum
  Opposition Venue Discipline  Average
   Australia  Away    Batting 30.22581
       India  Away    Batting 61.55556
 New Zealand  Away    Batting 50.44444
    Pakistan  Away    Batting 16.00000
   Sri Lanka  Away    Batting 13.00000
 West Indies  Away    Batting 14.17647
   Australia  Home    Batting 28.30769
       India  Home    Batting 80.87500
 New Zealand  Home    Batting 35.63636
    Pakistan  Home    Batting 34.16667
   Sri Lanka  Home    Batting 14.00000
 West Indies  Home    Batting 27.55000
   Australia  Away    Bowling 28.44928
       India  Away    Bowling 25.53333
 New Zealand  Away    Bowling 27.44444
    Pakistan  Away    Bowling 45.00000
   Sri Lanka  Away    Bowling 21.66667
 West Indies  Away    Bowling 39.50000
   Australia  Home    Bowling 26.96203
       India  Home    Bowling 27.31034
 New Zealand  Home    Bowling 20.51351
    Pakistan  Home    Bowling 31.07895
   Sri Lanka  Home    Bowling 30.62500
 West Indies  Home    Bowling 31.97143</pre></div></div>

<p>A dot plot is created from this data with a separate panel for each of the six opposition countries and the averages divided into batting and bowling performances. The coloured dots in the graph indicate whether the average is for matches at home or away.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># The x-axis carries both batting and bowling averages, so label it generically
ggplot(itb.opp.ha.sum, aes(Average, Discipline, colour = Venue)) +
  geom_point() + facet_wrap( ~ Opposition) +
  xlab(&quot;Average&quot;) + ylab(&quot;&quot;)</pre></div></div>

<p>This graph is shown below:</p>
<div id="attachment_1366" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country-HomeAway.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country-HomeAway-300x300.png" alt="IT Botham Batting and Bowling Averages by Country and Home/Away" title="IT Botham Batting and Bowling Averages" width="300" height="300" class="size-medium wp-image-1366" /></a><p class="wp-caption-text">IT Botham Batting and Bowling Averages by Country and Home/Away</p></div>
<p>We can see that the difference between home and away performance is, in general, not very large for bowling averages but in some cases there is a noticeable difference in batting averages. When looking at Botham&#8217;s performances against the West Indies, his statistics at home are much better than his away statistics, suggesting that his main struggles against the strong West Indies team were in the Caribbean. This might be due to his swing bowling being more suitable to English conditions compared to pitches in the West Indies.</p>
<p>To round off this brief look at the career of IT Botham let us consider some other important statistics, in particular games where he performed with the bat and ball.</p>
<ul>
<li>Overall Botham scored 14 hundreds and 22 fifties in 161 innings, so he reached fifty runs every five innings or so.</li>
<li>He also took 27 five-wicket hauls and 17 four-wicket hauls, so he took four or more wickets every four innings or so.</li>
<li>He took 120 catches.</li>
</ul>
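<p>The rates quoted in the first two items follow from simple division. As a quick check, the sketch below uses the 161 batting innings from the list and takes the number of bowling innings to be the sum of the Bowl Inns column in the opposition table, which is an assumption about how the second figure was derived.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">bat.innings = 161
fifty.plus = 14 + 22
# Assumed: bowling innings summed over the six opposition rows
bowl.innings = 66 + 23 + 28 + 18 + 6 + 27
four.plus = 27 + 17
# Batting innings per score of fifty or more
bat.rate = bat.innings / fifty.plus
# Bowling innings per haul of four or more wickets
bowl.rate = bowl.innings / four.plus
round(c(bat.rate, bowl.rate), 1)</pre></div></div>

<p>These come to roughly 4.5 and 3.8 innings respectively, in line with the rounded figures in the list.</p>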
<p>Individual matches of excellence include five games with a century and at least five wickets:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">Year  Opposition       Ground Venue Runs Wicket
1978 New Zealand Christchurch  Away  133      8
1978    Pakistan       Lord's  Home  108      8
1980       India       Mumbai  Away  114     13
1981   Australia        Leeds  Home  199      7
1984 New Zealand   Wellington  Away  138      6</pre></div></div>

<p>These performances and others show why Botham was considered such a great player as he produced some sustained periods of excellent all-round cricket rather than having one discipline more dominant for a long period of time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/charting-the-performance-of-cricket-all-rounders-it-botham/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Displaying data using level plots</title>
		<link>http://www.wekaleamstudios.co.uk/posts/displaying-data-using-level-plots/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/displaying-data-using-level-plots/#comments</comments>
		<pubDate>Mon, 03 May 2010 10:17:08 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[box]]></category>
		<category><![CDATA[expand.grid]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[image]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[levelplot]]></category>
		<category><![CDATA[loess]]></category>
		<category><![CDATA[predict]]></category>
		<category><![CDATA[surface]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1008</guid>
		<description><![CDATA[A level plot is a type of graph that is used to display a surface in two rather than three dimensions &#8211; the surface is viewed from above as if we were looking straight down and is an alternative to a contour plot &#8211; geographic data is an example of where this type of graph [...]]]></description>
			<content:encoded><![CDATA[<p>A level plot is a type of graph that displays a surface in two rather than three dimensions: the surface is viewed from above, as if we were looking straight down. It is an alternative to a contour plot, and geographic data is a typical example of where this type of graph would be used. A contour plot uses lines to identify regions of different heights and the level plot uses coloured regions to produce a similar effect.<span id="more-1008"></span></p>
<p>To illustrate this type of graph we will consider some surface elevation data that is available in the <strong>geoR</strong> package. The data set in this package is called <strong>elevation</strong> and stores the elevation height in feet (as multiples of ten feet) for a grid region of x and y coordinates (recorded as multiples of 50 feet). To access this data we load the <strong>geoR</strong> package and then use the <strong>data</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">require(geoR)
data(elevation)</pre></div></div>

<p>For some packages we need the call to the <strong>data</strong> function to make a set of data available for our use. The <strong>elevation</strong> object is not a data frame so our first step is to create our own data frame to be used to create the level plots using the different graphics packages.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">elevation.df = data.frame(x = 50 * elevation$coords[,&quot;x&quot;],
  y = 50 * elevation$coords[,&quot;y&quot;], z = 10 * elevation$data)</pre></div></div>

<p>We extract the x and y grid coordinates and the height values, multiplying them by 50 and 10 respectively to convert to feet for the graphs. Rather than trying to plot the individual values we need to create a surface to cover the whole grid region as the points themselves are too sparse. We make use of the <strong>loess</strong> function to fit a local polynomial trend surface (using weighted least squares) to approximate the elevation across the whole region. The function call for a local quadratic surface is shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">elevation.loess = loess(z ~ x*y, data = elevation.df,
  degree = 2, span = 0.25)</pre></div></div>

<p>The next stage is to extract heights from this fitted surface at regular intervals across the whole grid region of interest, which runs from 10 to 300 feet in both the x and y directions. The <strong>expand.grid</strong> function creates a data frame of all combinations of the x and y values that we specify in a list. We choose a value every foot from 10 to 300 feet to create a fine grid:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">elevation.fit = expand.grid(list(x = seq(10, 300, 1), y = seq(10, 300, 1)))</pre></div></div>

<p>The <strong>predict</strong> function is then used to estimate the surface height at all of these combinations of x and y coordinates covering our grid region. This is saved as an object <strong>z</strong> which will be used by the <strong>base</strong> graphics function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">z = predict(elevation.loess, newdata = elevation.fit)</pre></div></div>

<p>The <strong>lattice</strong> and <strong>ggplot2</strong> packages expect the data in a different format, so we make use of the <strong>as.numeric</strong> function to convert the table of heights to a single column and append it to the data frame holding all combinations of x and y coordinates:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">elevation.fit$Height = as.numeric(z)</pre></div></div>
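
<p>The conversion above relies on <strong>as.numeric</strong> unrolling the table of heights column by column, which matches the row order produced by <strong>expand.grid</strong> (the first variable varies fastest). The following sketch checks this on a small simulated surface, so that it runs without the <strong>geoR</strong> package; the objects prefixed with <strong>toy</strong> are invented for illustration and are not part of the elevation analysis.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Simulated scattered heights on a 10 by 10 region (illustrative data only)
set.seed(1)
toy.df = data.frame(x = runif(60, 0, 10), y = runif(60, 0, 10))
toy.df$z = with(toy.df, 5 + x + 2 * y + rnorm(60, sd = 0.1))
toy.loess = loess(z ~ x * y, data = toy.df, degree = 2, span = 0.75)
# Regular grid: 7 values of x and 4 values of y
toy.fit = expand.grid(list(x = seq(2, 8, 1), y = seq(2, 8, 2)))
# With expand.grid input, predict returns a table with one row per x value
toy.z = predict(toy.loess, newdata = toy.fit)
dim(toy.z)
toy.fit$Height = as.numeric(toy.z)
# Predict again from a plain data frame, which gives one value per grid row
toy.direct = predict(toy.loess,
  newdata = data.frame(x = toy.fit$x, y = toy.fit$y))
all.equal(toy.fit$Height, as.numeric(toy.direct))</pre></div></div>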

<p>The data is now in a format that can be used to create the level plots in the various packages.</p>
<p><strong>Base Graphics</strong></p>
<p>The function <strong>image</strong> in the <strong>base</strong> graphics package is the function we use to create a level plot. This function requires a list of x and y values that cover the grid of vertical values that will be used to create the surface. These heights are specified as a table of values, which in our case was saved as the object <strong>z</strong> during the calculations on the local trend surface.</p>
<p>The text on the axis labels is specified by the <strong>xlab</strong> and <strong>ylab</strong> function arguments and the <strong>main</strong> argument determines the overall title for the graph. The function call below creates the level plot:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">image(seq(10, 300, 1), seq(10, 300, 1), z,
  xlab = &quot;X Coordinate (feet)&quot;, ylab = &quot;Y Coordinate (feet)&quot;,
  main = &quot;Surface elevation data&quot;)
box()</pre></div></div>

<p>After the <strong>image</strong> function is used we call the <strong>box</strong> function mainly for aesthetic purposes to ensure there is a line surrounding the level plot. The graph that is created is shown below:</p>
<div id="attachment_1012" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-base.jpg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-base-300x300.jpg" alt="Base Graphics Level Plot" title="Level plot Example" width="300" height="300" class="size-medium wp-image-1012" /></a><p class="wp-caption-text">Base Graphics Level Plot</p></div>
<p>The default colour scheme used by the <strong>base</strong> graphics produces an attractive level plot graph where we can easily see the variation in height across the grid region. It is basically a fancy version of a contour plot where the regions between the contour lines are coloured with different shades indicating the height in those regions.</p>
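<p>The relationship to a contour plot can be made explicit by drawing contour lines on top of the image. The sketch below is an addition to the post and uses the <strong>volcano</strong> matrix that ships with <strong>R</strong> instead of the elevation surface, so that it runs without any extra packages; the axis labels are placeholders.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># volcano is a built-in matrix of heights on a regular grid
x = seq_len(nrow(volcano))
y = seq_len(ncol(volcano))
image(x, y, volcano,
  xlab = &quot;Row index&quot;, ylab = &quot;Column index&quot;,
  main = &quot;Level plot with contour lines&quot;)
# add = TRUE overlays the contour lines on the existing image
contour(x, y, volcano, add = TRUE)
box()</pre></div></div>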
<p><strong>Lattice Graphics</strong></p>
<p>The <strong>lattice</strong> graphics package provides a function <strong>levelplot</strong> for this type of graphical display. We use the data stored in the object <strong>elevation.fit</strong> to create the graph with <strong>lattice</strong> graphics.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">levelplot(Height ~ x*y, data = elevation.fit,
  xlab = &quot;X Coordinate (feet)&quot;, ylab = &quot;Y Coordinate (feet)&quot;,
  main = &quot;Surface elevation data&quot;,
  col.regions = terrain.colors(100)
)</pre></div></div>

<p>The formula specifies which variables to use for the three axes, and the <strong>data</strong> argument names the data frame where the values are stored &#8211; as there are three dimensions it is the z axis variable that is specified on the left hand side of the formula. The axis labels and title are specified in the same way as for the <strong>base</strong> graphics.</p>
<p>The range of colours used in the <strong>lattice</strong> level plot can be specified as a vector of colours passed to the <strong>col.regions</strong> argument of the function. We make use of the <strong>terrain.colors</strong> function to create a vector of 100 colours, which are less striking than those used above with the <strong>base</strong> graphics. The level plot that we create is shown here:</p>
<div id="attachment_1014" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-lattice-300x300.jpg" alt="Lattice Graphics Level Plot" title="Level plot Example" width="300" height="300" class="size-medium wp-image-1014" /></a><p class="wp-caption-text">Lattice Graphics Level Plot</p></div>
<p>This is broadly similar to the <strong>base</strong> graphics display, but the plot region has a different shape, which makes the surface look slightly different.</p>
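<p>Unlike <strong>image</strong>, which takes a matrix, <strong>levelplot</strong> with a formula works from a data frame with one row per grid point. The following sketch shows one way such a data frame could be built with <strong>expand.grid</strong>. The data frame <strong>grid.df</strong> and the surface formula are hypothetical stand-ins for <strong>elevation.fit</strong>, not the data used above:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)

# One row for every combination of x and y on a coarse grid
grid.df = expand.grid(x = seq(10, 300, 10), y = seq(10, 300, 10))

# Hypothetical surface standing in for the fitted elevation values
grid.df$Height = 8500 + 500 * sin(grid.df$x / 60) + 300 * cos(grid.df$y / 80)

# terrain.colors(100) returns a character vector of 100 colour codes
cols = terrain.colors(100)

p = levelplot(Height ~ x * y, data = grid.df, col.regions = cols)
print(p)</pre></div></div>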
<p><strong>ggplot2</strong></p>
<p>The <strong>ggplot2</strong> package also provides facilities for creating a level plot making use of the tile geom to create the desired graph. The function <strong>ggplot</strong> forms the basis of the graph and various other options are used to customise the graph:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(elevation.fit, aes(x, y, fill = Height)) + geom_tile() +
  xlab(&quot;X Coordinate (feet)&quot;) + ylab(&quot;Y Coordinate (feet)&quot;) +
  ggtitle(&quot;Surface elevation data&quot;) +
  scale_fill_gradient(limits = c(7000, 10000), low = &quot;black&quot;, high = &quot;white&quot;) +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0))</pre></div></div>

<p>The options added to the graph change various settings. The colours used for the heights are selected by the <strong>scale_fill_gradient</strong> function, with colours ranging from black to white. The <strong>expand</strong> settings in <strong>scale_x_continuous</strong> and <strong>scale_y_continuous</strong> remove the padding around the data so that the tiles cover the whole plot region, hiding the default gray background &#8211; this makes the graph more visually appealing. The graph that is produced is shown here:</p>
<div id="attachment_1013" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-ggplot2-300x300.jpg" alt="ggplot2 Level Plot" title="Level plot Example" width="300" height="300" class="size-medium wp-image-1013" /></a><p class="wp-caption-text">ggplot2 Level Plot</p></div>
<p>The graph from <strong>ggplot2</strong> is as visually impressive as the other graphs &#8211; because of the continuous colour gradient that was selected, the transitions between colours are smoother and the band edges visible on the other graphs are blurred.</p>
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/displaying-data-using-level-plots/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Summarising data using box and whisker plots</title>
		<link>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/#comments</comments>
		<pubDate>Sun, 25 Apr 2010 07:37:10 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[Box and Whisker]]></category>
		<category><![CDATA[boxplot]]></category>
		<category><![CDATA[bwplot]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=960</guid>
		<description><![CDATA[A box and whisker plot is a type of graphical display that can be used to summarise a set of data based on the five number summary of this data. The summary statistics used to create a box and whisker plot are the median of the data, the lower and upper quartiles (25% and 75%) [...]]]></description>
			<content:encoded><![CDATA[<p>A box and whisker plot is a type of graphical display that can be used to summarise a set of data based on the five number summary of this data. The summary statistics used to create a box and whisker plot are the median of the data, the lower and upper quartiles (25% and 75%) and the minimum and maximum values.<span id="more-960"></span></p>
<p>The box and whisker plot is an effective way to investigate the distribution of a set of data. For example, skewness can be identified from the box and whisker plot, as the display does not make any assumptions about the underlying distribution of the data. The extreme values at either end of the scale are sometimes included on the display to show how far they extend beyond the majority of the data.</p>
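<p>The five number summary itself can be computed with the <strong>fivenum</strong> function in <strong>R</strong>. The short sketch below uses a made-up vector of twelve temperatures (not taken from the Southampton data) purely for illustration. Note that <strong>fivenum</strong> returns Tukey&#039;s hinges, which can differ slightly from the quartiles returned by <strong>quantile</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Hypothetical temperatures, for illustration only
temps = c(7.7, 10.3, 13.0, 13.6, 17.9, 21.4, 22.1, 21.0, 18.2, 14.1, 10.5, 8.3)

# Minimum, lower hinge, median, upper hinge, maximum
five = fivenum(temps)
names(five) = c(&quot;min&quot;, &quot;lower&quot;, &quot;median&quot;, &quot;upper&quot;, &quot;max&quot;)
print(five)</pre></div></div>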
<p>To illustrate creating box and whisker plots we consider publicly available UK meteorological data collected on a monthly basis at Southampton, UK between 1950 and 1999. This data is available from the <a href="http://www.metoffice.gov.uk/">UK Met Office</a> and we will compare the range of temperatures recorded in each month of the year over this period by creating box and whisker plots with the different packages.</p>
<p>The data is assumed to have been imported into <strong>R</strong> and stored in a data frame called <strong>soton.df</strong>. An extract of the data is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">    Year Month Max.Temp Min.Temp Frost  Rain
1   1950   Jan      7.7      2.8     7  20.1
2   1950   Feb     10.3        4     4 127.0
3   1950   Mar     13.0      4.5     2  39.4
4   1950   Apr     13.6      4.7     0  62.0
5   1950   May     17.9      7.8     0  32.2</pre></div></div>

<p><strong>Base Graphics</strong></p>
<p><!--[Fast Tube]--><span id="Pe-48TAtBho" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/#Pe-48TAtBho"><img src="http://i.ytimg.com/vi/Pe-48TAtBho/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>base</strong> graphics approach makes use of the <strong>boxplot</strong> function to create box and whisker plots. Here the function is used with a formula rather than two separate vectors of data, and the <strong>data</strong> argument names the data frame that holds the variables in the formula. For the temperature data we use this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">boxplot(Max.Temp ~ Month, data = soton.df,
  xlab = &quot;Month&quot;, ylab = &quot;Maximum Temperature&quot;,
  main = &quot;Temperature at Southampton Weather Station (1950-1999)&quot;
)</pre></div></div>

<p>The horizontal and vertical axis labels are specified using the <strong>xlab</strong> and <strong>ylab</strong> arguments respectively and the title of the plot is created using the <strong>main</strong> argument. The box and whisker plot is shown here:</p>
<div id="attachment_962" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-base.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-base-300x300.jpg" alt="Base Graphics Box and Whisker Plot" title="Box and Whisker plot Example" width="300" height="300" class="size-medium wp-image-962" /></a><p class="wp-caption-text">Base Graphics Box and Whisker Plot</p></div>
<p>The function <strong>boxplot</strong> makes it easy to create a reasonably attractive box and whisker plot. The variation in the distribution of temperatures across the year can be seen from the graph.</p>
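<p>One practical detail when grouping by month: if the month is stored as text, <strong>R</strong> orders the groups alphabetically (Apr, Aug, Dec and so on) rather than by calendar. How <strong>Month</strong> is stored in <strong>soton.df</strong> is not stated here, so the following is only a sketch, using a small simulated data frame <strong>demo.df</strong> (hypothetical) and the built-in constant <strong>month.abb</strong> to fix the order of the factor levels:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">set.seed(42)

# Hypothetical stand-in for soton.df: three years of monthly maxima
demo.df = data.frame(
  Month = rep(month.abb, times = 3),
  Max.Temp = rep(c(8, 8, 11, 13, 17, 20, 22, 22, 19, 15, 11, 9), times = 3) + rnorm(36)
)

# Put the months in calendar order rather than alphabetical order
demo.df$Month = factor(demo.df$Month, levels = month.abb)

# plot = FALSE returns the summary statistics without drawing the graph
stats = boxplot(Max.Temp ~ Month, data = demo.df, plot = FALSE)
stats$names</pre></div></div>

<p>The <strong>stats</strong> component of the returned list holds the five values used to draw each box, one column per month.</p>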
<p><strong>Lattice Graphics</strong></p>
<p><!--[Fast Tube]--><span id="RJcZ_7EOzv8" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/#RJcZ_7EOzv8"><img src="http://i.ytimg.com/vi/RJcZ_7EOzv8/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>lattice</strong> graphics package there is a function <strong>bwplot</strong> which is used to create box and whisker plots. The function call also uses a formula to specify the <strong>x</strong> and <strong>y</strong> variables to use on the graph. The arguments are identical to those used with the <strong>boxplot</strong> function in <strong>base</strong> graphics:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">bwplot(Max.Temp ~ Month, data = soton.df,
  xlab = &quot;Month&quot;, ylab = &quot;Maximum Temperature&quot;,
  main = &quot;Temperature at Southampton Weather Station (1950-1999)&quot;
)</pre></div></div>

<p>The variable <strong>Month</strong> is categorical so a separate box and whisker summary is created for each month. The <strong>lattice</strong> version of the graph is shown here:</p>
<div id="attachment_963" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-lattice-300x300.jpg" alt="Lattice Graphics Box and Whisker Plot" title="Box and Whisker plot Example" width="300" height="300" class="size-medium wp-image-963" /></a><p class="wp-caption-text">Lattice Graphics Box and Whisker Plot</p></div>
<p>This is very similar to the box and whisker plot created by <strong>base</strong> graphics with a similar level of effort required. The main difference is the use of a circle rather than a line to identify the location of the median of the data.</p>
<p><strong>ggplot2</strong></p>
<p><!--[Fast Tube]--><span id="WJQdYId2TUA" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/#WJQdYId2TUA"><img src="http://i.ytimg.com/vi/WJQdYId2TUA/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>ggplot2</strong> package there is a general function <strong>ggplot</strong> that is used to create graphs of any type. We make use of the boxplot geom to create a box and whisker plot following the standard approach. The first step is to specify a data frame to use to create the graph and then map the columns of this data frame, via the <strong>aes</strong> argument, to the different axes or other aesthetics (such as colour or symbol shape). The particular geom is used to specify the type of plot that we want to create. Our final step is to add on the various axis labels and an overall title to the graph.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(soton.df, aes(Month, Max.Temp)) + geom_boxplot() +
  ylab(&quot;Maximum Temperature&quot;) +
  ggtitle(&quot;Temperature at Southampton Weather Station (1950-1999)&quot;)</pre></div></div>

<p>The <strong>ggplot2</strong> version of box and whisker plots is shown here:</p>
<div id="attachment_964" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-ggplot2-300x300.jpg" alt="ggplot2 Graphics Box and Whisker Plot" title="Box and Whisker plot Example" width="300" height="300" class="size-medium wp-image-964" /></a><p class="wp-caption-text">ggplot2 Graphics Box and Whisker Plot</p></div>
<p>The distinctive gray background used by <strong>ggplot2</strong> is an obvious visual difference compared to the default clear background used in the other two approaches. The boxes themselves have a cleaner look in this graph than the other two methods and the overall look is slick.</p>
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Summarising data using scatter plots</title>
		<link>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/#comments</comments>
		<pubDate>Sun, 18 Apr 2010 18:56:06 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[plot]]></category>
		<category><![CDATA[scatter plot]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[xyplot]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=912</guid>
		<description><![CDATA[A scatter plot is a graph used to investigate the relationship between two variables in a data set. The x and y axes are used for the values of the two variables and a symbol on the graph represents the combination for each pair of values in the data set. This type of graph is [...]]]></description>
			<content:encoded><![CDATA[<p>A scatter plot is a graph used to investigate the relationship between two variables in a data set. The x and y axes are used for the values of the two variables and a symbol on the graph represents the combination for each pair of values in the data set. This type of graph is used in many common situations and can convey a lot of useful information.<span id="more-912"></span></p>
<p>To illustrate creating a scatter plot we will use a simple data set for the population of the UK between 1992 and 2009. This data is saved in a data frame <strong>uk.df</strong> using the following command:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">uk.df = data.frame(Year = 1992:2009,
  Population = c(57770, 57933, 58096, 58258, 58418, 58577,
  58743, 58925, 59131, 59363, 59618, 59894, 60186, 60489,
  60804, 61129, 61461, 61796)
)</pre></div></div>

<p>For this example the data is recorded in thousands to make the graph easier to read and there is no benefit or noticeable improvement to be seen by using greater detail.</p>
<p><strong>Base Graphics</strong></p>
<p><!--[Fast Tube]--><span id="aqXuiQR4bnY" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/#aqXuiQR4bnY"><img src="http://i.ytimg.com/vi/aqXuiQR4bnY/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>base</strong> graphics system the general purpose <strong>plot</strong> function can be used to create a scatter plot for the UK population data set that we created. The first two arguments to the <strong>plot</strong> function are the x and y variables respectively. The following code will create a scatter plot, including various labels:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plot(uk.df$Year, uk.df$Population,
  xlab = &quot;Year&quot;, ylab = &quot;Total Population (Thousands)&quot;,
  main = &quot;UK Population (1992-2009)&quot;, pch = 16)</pre></div></div>

<p>The labels for the x and y axes are specified via the <strong>xlab</strong> and <strong>ylab</strong> arguments to the plot function and the <strong>main</strong> argument specifies the title for the plot.</p>
<div id="attachment_919" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-base.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-base-300x300.jpg" alt="Base Graphics Scatter Plot" title="Scatter plot Example" width="300" height="300" class="size-medium wp-image-919" /></a><p class="wp-caption-text">Base Graphics Scatter Plot</p></div>
<p>The graph itself is plain and functional, with solid circles indicating the population (in thousands) for each of the years covered by the data.</p>
<p><strong>Lattice Graphics</strong></p>
<p><!--[Fast Tube]--><span id="NMTCIViCLOU" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/#NMTCIViCLOU"><img src="http://i.ytimg.com/vi/NMTCIViCLOU/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>lattice</strong> graphics package provides a function <strong>xyplot</strong> specifically to create scatter plots and the function is used in a similar way to the <strong>base</strong> graphics approach. The first argument to the function is a formula describing the relationship to be plotted on the graph, with the y variable preceding the x variable as we are used to when writing a mathematical formula such as y=a+bx. The data frame is specified with the <strong>data</strong> argument to simplify the expression in the formula. The code used is as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xyplot(Population ~ Year, data = uk.df,
  xlab = &quot;Year&quot;, ylab = &quot;Total Population (Thousands)&quot;,
  main = &quot;UK Population (1992-2009)&quot;,
  scales = list(x = list(at = seq(1992, 2009, 2)))
)</pre></div></div>

<p>The axis labels and the overall title for the graph are specified in the same way as the <strong>base</strong> graphics system. We indulge in some fine tuning of the labels on the x axis via the <strong>scales</strong> argument &#8211; here we indicate that a label should appear at every second year starting in 1992; because the sequence stops at or before 2009, the last label is 2008. The <strong>lattice</strong> graph is shown here for comparison with the graphs created using the other two packages:</p>
<div id="attachment_921" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-lattice-300x300.jpg" alt="Lattice Graphics Scatter Plot" title="Scatter plot Example" width="300" height="300" class="size-medium wp-image-921" /></a><p class="wp-caption-text">Lattice Graphics Scatter Plot</p></div>
<p>There are very few visual differences between the <strong>lattice</strong> and <strong>base</strong> graphics. In <strong>lattice</strong> graphics an object is created that can be edited to add or remove components and then printed to the screen. This approach is more flexible than the base graphics where the components are painted on top of each other and the use of themes in <strong>lattice</strong> will make it easier to keep a consistent look to all graphs in a document.</p>
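<p>The idea of creating an object, editing it and then printing it can be sketched as follows. This is an illustrative example rather than part of the original code: it rebuilds <strong>uk.df</strong> from the values given earlier, stores the result of <strong>xyplot</strong> in an object <strong>p</strong>, and uses the <strong>update</strong> function from <strong>lattice</strong> to add the title and axis labels afterwards. The object names <strong>p</strong>, <strong>p2</strong> and <strong>ticks</strong> are chosen here for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)

uk.df = data.frame(Year = 1992:2009,
  Population = c(57770, 57933, 58096, 58258, 58418, 58577,
  58743, 58925, 59131, 59363, 59618, 59894, 60186, 60489,
  60804, 61129, 61461, 61796)
)

# Tick positions: every second year from 1992, so the last value is 2008
ticks = seq(1992, 2009, 2)

# xyplot returns a trellis object; nothing is drawn until it is printed
p = xyplot(Population ~ Year, data = uk.df,
  scales = list(x = list(at = ticks)))

# update returns a modified copy, leaving the original object unchanged
p2 = update(p, main = &quot;UK Population (1992-2009)&quot;,
  xlab = &quot;Year&quot;, ylab = &quot;Total Population (Thousands)&quot;)

print(p2)</pre></div></div>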
<p><strong>ggplot2</strong></p>
<p><!--[Fast Tube]--><span id="TagaAeIHKks" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/#TagaAeIHKks"><img src="http://i.ytimg.com/vi/TagaAeIHKks/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>ggplot2</strong> package the <strong>ggplot</strong> function is used to create graphs of all types rather than having a separate function defined for each type of graph. The first argument is a data frame with the data to be plotted and the <strong>aes</strong> argument specifies the aesthetics associated with the graph, such as the axis variables, point symbol, size or colour. In this case the <strong>Year</strong> variable appears on the x axis and the <strong>Population</strong> variable on the y axis. The code to create the scatter plot is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(uk.df, aes(Year, Population)) + geom_point() +
  xlab(&quot;Year&quot;) + ylab(&quot;Total Population (Thousands)&quot;) +
  ggtitle(&quot;UK Population (1992-2009)&quot;)</pre></div></div>

<p>The <strong>geom_point</strong> function specifies the type of graph to create, a scatter plot in this situation. This highlights the flexibility of the <strong>ggplot2</strong> package, as changing the geom will create a new type of graph. The labels for the graph are created by adding them to the graph with the <strong>xlab</strong>, <strong>ylab</strong> and <strong>ggtitle</strong> functions. The graph is shown below:</p>
<div id="attachment_920" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-ggplot2-300x300.jpg" alt="ggplot2 Scatter plot" title="Scatter plot Example" width="300" height="300" class="size-medium wp-image-920" /></a><p class="wp-caption-text">ggplot2 Scatter plot</p></div>
<p>This graph is not greatly different to the scatter plot created using the <strong>base</strong> and <strong>lattice</strong> packages. The default theme in the <strong>ggplot2</strong> package has a gray background with white grid lines that allows easy visual recognition of graphs created using this package.</p>
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Summarising data using histograms</title>
		<link>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/#comments</comments>
		<pubDate>Sun, 11 Apr 2010 08:53:16 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[hist]]></category>
		<category><![CDATA[histogram]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=870</guid>
		<description><![CDATA[The histogram is a standard type of graphic used to summarise univariate data where the range of values in the data set is divided into regions and a bar (usually vertical) is plotted in each of these regions with height proportional to the frequency of observations in that region. In some cases the proportion of [...]]]></description>
			<content:encoded><![CDATA[<p>The histogram is a standard type of graphic used to summarise univariate data where the range of values in the data set is divided into regions and a bar (usually vertical) is plotted in each of these regions with height proportional to the frequency of observations in that region. In some cases the proportion of data points in each region is shown instead of counts.<span id="more-870"></span></p>
<p>The shape of the histogram is determined by the width and number of regions that divide up the data. A histogram provides an indication of the following features of a set of data: the general shape, symmetry or skewness of the data and modality (uni-, bi- or multi-modal). There are some situations where a different type of graph would be preferable but histograms are useful for describing the general features of the distribution of a set of data.</p>
<p>To illustrate creating a histogram we consider data from the AFL sports league in Australia and the total number of points scored by the home team in each fixture. If we assume that the data is in a comma separated text file, called <strong>afl_2003_2007.csv</strong>, then we would import that data using the following command saving the results in a data frame:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">afl.df = read.csv(&quot;afl_2003_2007.csv&quot;)</pre></div></div>

<p>Edit: The data is available as <a href='http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/12/afl_2003_2007.txt'>AFL Data Set</a>. Change the file extension manually to <strong>csv</strong> or change the command to reflect the different file name.</p>
<p><strong>Base Graphics</strong></p>
<p><!--[Fast Tube]--><span id="4Q9vPuj4w8c" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/#4Q9vPuj4w8c"><img src="http://i.ytimg.com/vi/4Q9vPuj4w8c/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In <strong>base</strong> graphics the function <strong>hist</strong> is used to create a histogram with the first argument being the name of the vector that contains the data to be plotted. The <strong>x-axis</strong> is given a label using the <strong>xlab</strong> argument and the <strong>main</strong> argument is used to add a title to the graph. Code to create a histogram of home points is shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">hist(afl.df$Home.Total, xlab = &quot;Home Points&quot;,
  main = &quot;Histogram of Points Scored at Home\nAFL 2003-2007&quot;)</pre></div></div>

<p>The default option is to display bars representing the frequency of data values in each of the ranges and the overall look of the graph is basic as shown here:</p>
<div id="attachment_877" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-base.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-base-300x300.jpg" alt="Base Graphics Histogram" title="Histogram Example" width="300" height="300" class="size-medium wp-image-877" /></a><p class="wp-caption-text">Base Graphics Histogram</p></div>
<p>The default algorithm for selecting the number of bins to use for the histogram usually makes a sensible selection but this can be specified if required.</p>
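<p>As a rough sketch of how the bins might be specified, the <strong>breaks</strong> argument of <strong>hist</strong> accepts a vector of break points. A single number can also be given, but <strong>hist</strong> treats it only as a suggestion. The data below is simulated and the vector name <strong>home.points</strong> is hypothetical; it stands in for <strong>afl.df$Home.Total</strong>, and the bin width of 10 is an arbitrary choice:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">set.seed(1)

# Simulated scores standing in for the real home points
home.points = round(rnorm(200, mean = 95, sd = 25))

# Break points every 10 points, wide enough to cover all of the data
lo = 10 * floor(min(home.points) / 10)
hi = 10 * ceiling(max(home.points) / 10)

# plot = FALSE returns the bins and counts without drawing the graph
h.default = hist(home.points, plot = FALSE)
h.custom = hist(home.points, breaks = seq(lo, hi, by = 10), plot = FALSE)

# Compare the number of bins chosen by default with the custom choice
length(h.default$counts)
length(h.custom$counts)</pre></div></div>

<p>Dropping <strong>plot = FALSE</strong> from either call draws the corresponding histogram.</p>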
<p><strong>Lattice Graphics</strong></p>
<p><!--[Fast Tube]--><span id="hxQmEhzgWks" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/#hxQmEhzgWks"><img src="http://i.ytimg.com/vi/hxQmEhzgWks/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>lattice</strong> graphics package there is a function <strong>histogram</strong> and we make use of the formula to specify a single variable for the number of points scored by the home team. The specification for the axis labels and graph title is the same as for the <strong>base</strong> graphics package. The equivalent graph is created using the following code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">histogram( ~ Home.Total, data = afl.df, xlab = &quot;Home Points&quot;,
  main = &quot;Histogram of Points Scored at Home\nAFL 2003-2007&quot;)</pre></div></div>

<p>Here the default option is to work with percentages of the total number of data points rather than counts, so the vertical axis differs from the <strong>base</strong> graphics plot; rescaling alone does not alter the shape, so any difference in shape comes from the choice of bins. The <strong>lattice</strong> version is shown below:</p>
<div id="attachment_880" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-lattice-300x300.jpg" alt="Lattice Graphics Histogram" title="Histogram Example" width="300" height="300" class="size-medium wp-image-880" /></a><p class="wp-caption-text">Lattice Graphics Histogram</p></div>
<p>The other main difference is the colour of the bars in the histogram, which can be adjusted by changing the global theme for <strong>lattice</strong>.</p>
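<p>If counts are preferred, the <strong>type</strong> argument of <strong>histogram</strong> can be set to <strong>&quot;count&quot;</strong> (the other accepted values are <strong>&quot;percent&quot;</strong> and <strong>&quot;density&quot;</strong>). The sketch below uses a simulated data frame <strong>sim.df</strong>, a hypothetical stand-in for <strong>afl.df</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)

set.seed(1)

# Simulated scores standing in for the real home points
sim.df = data.frame(Home.Total = round(rnorm(200, mean = 95, sd = 25)))

# Default vertical scale (percent of total) and the count version
p.percent = histogram( ~ Home.Total, data = sim.df, xlab = &quot;Home Points&quot;)
p.count = histogram( ~ Home.Total, data = sim.df, xlab = &quot;Home Points&quot;,
  type = &quot;count&quot;)

print(p.count)</pre></div></div>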
<p><strong>ggplot2</strong></p>
<p><!--[Fast Tube]--><span id="47kWynt3b6M" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/#47kWynt3b6M"><img src="http://i.ytimg.com/vi/47kWynt3b6M/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>ggplot2</strong> library uses a general purpose graphics function called <strong>ggplot</strong> to create graphs of all types and the geom specifies the type of display to create, in this case a histogram. Components that make up the graph are added sequentially to build up the whole plot and in the example below we add axis labels and a main title.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(afl.df, aes(Home.Total)) + geom_histogram() +
  xlab(&quot;Home Points&quot;) + ylab(&quot;Frequency&quot;) +
  ggtitle(&quot;Histogram of Points Scored at Home\nAFL 2003-2007&quot;)</pre></div></div>

<p>The default theme for <strong>ggplot2</strong> is distinctive and the histogram is shown in the graph below:</p>
<div id="attachment_881" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-ggplot2-300x300.jpg" alt="ggplot2 Histogram" title="Histogram Example" width="300" height="300" class="size-medium wp-image-881" /></a><p class="wp-caption-text">ggplot2 Histogram</p></div>
<p>By default <strong>ggplot2</strong> uses 30 bins, which is more than <strong>base</strong> and <strong>lattice</strong> graphics choose here, and this gives a more jagged picture of the distribution in this particular case. The online <a href="http://had.co.nz/ggplot2/">ggplot2</a> manual is a good source of information about customising graphs created using this approach.</p>
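<p>As a rough illustration of how the number of bins changes the display, the sketch below uses simulated scores in a hypothetical data frame <strong>scores.df</strong> (an assumption made for this example, not the AFL data) and compares the default binning with a coarser choice made through the <strong>bins</strong> argument of <strong>geom_histogram</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)
# Simulated data used only for illustration (hypothetical, not the AFL data)
set.seed(1)
scores.df = data.frame(Points = rnorm(500, mean = 95, sd = 25))
# Default binning: 30 bins, with a message suggesting that a better value is picked
p.default = ggplot(scores.df, aes(Points)) + geom_histogram()
# Coarser binning set explicitly
p.coarse = ggplot(scores.df, aes(Points)) + geom_histogram(bins = 10)
# ggplot_build() returns the computed bins, one row per bar
n.default = nrow(ggplot_build(p.default)$data[[1]])
n.coarse = nrow(ggplot_build(p.coarse)$data[[1]])
c(default = n.default, coarse = n.coarse)</pre></div></div>

<p>The <strong>binwidth</strong> argument can be used instead of <strong>bins</strong> to fix the width of each bar in the units of the data.</p>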
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/?page_id=282">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Summarising data using dot plots</title>
		<link>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/#comments</comments>
		<pubDate>Fri, 26 Mar 2010 10:53:00 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[Cleveland]]></category>
		<category><![CDATA[dot plot]]></category>
		<category><![CDATA[dotplot]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[plot]]></category>
		<category><![CDATA[points]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=847</guid>
		<description><![CDATA[A dot plot is a type of display that compares counts, frequencies, totals or other summary measures for a series of categories. The dot plot can be arranged with the categories either on the vertical or horizontal axis of the display to allow comparison between the different categories as well as comparison within categories where [...]]]></description>
			<content:encoded><![CDATA[<p>A dot plot is a type of display that compares counts, frequencies, totals or other summary measures for a series of categories. The dot plot can be arranged with the categories on either the vertical or the horizontal axis of the display to allow comparison between the different categories, as well as comparison within categories where multiple symbols are used to denote, say, different years.<span id="more-847"></span></p>
<p>In this post we will consider creating a dot plot using the <strong>base</strong> graphics, <strong>lattice</strong> graphics and <strong>ggplot2</strong> approaches. To illustrate creating a dot plot we use data from the <a href="http://faostat.fao.org">FAO website</a> on the total irrigation area for Africa, Latin America, North America and Europe. We create a data frame using the following code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">irrigation.df = data.frame(
  # factor() is needed from R 4.0.0, where strings are no longer converted automatically
  Region = factor(rep(c(&quot;Africa&quot;, &quot;Latin America&quot;, &quot;North America&quot;, &quot;Europe&quot;), 4)),
  Year = factor(c(rep(1980, 4), rep(1990, 4), rep(2000, 4), rep(2007, 4))),
  Area = c(9.3, 12.7, 21.2, 18.8, 11.0, 15.5, 21.6, 25.3,
    13.2, 17.3, 23.3, 26.7, 13.6, 17.3, 23.8, 26.3)
)</pre></div></div>
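
<p>A quick way to check the data frame is to cross-tabulate it. The sketch below repeats the data frame definition so that it runs on its own, stores <strong>Region</strong> as a factor, and uses <strong>xtabs</strong> to lay the areas out with one row per region and one column per year; the name <strong>irrigation.tab</strong> is introduced only for this example:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">irrigation.df = data.frame(
  Region = factor(rep(c(&quot;Africa&quot;, &quot;Latin America&quot;, &quot;North America&quot;, &quot;Europe&quot;), 4)),
  Year = factor(c(rep(1980, 4), rep(1990, 4), rep(2000, 4), rep(2007, 4))),
  Area = c(9.3, 12.7, 21.2, 18.8, 11.0, 15.5, 21.6, 25.3,
    13.2, 17.3, 23.3, 26.7, 13.6, 17.3, 23.8, 26.3)
)
# One row per region, one column per year; each cell holds a single Area value
irrigation.tab = xtabs(Area ~ Region + Year, data = irrigation.df)
irrigation.tab</pre></div></div>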

<p><strong>Base Graphics</strong></p>
<p><!--[Fast Tube]--><span id="5izUzQKL1yw" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/#5izUzQKL1yw"><img src="http://i.ytimg.com/vi/5izUzQKL1yw/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>base</strong> graphics system we build up the dot plot with a series of commands. The first function call creates the graph region based on the data set, but setting the <strong>type = &quot;n&quot;</strong> argument means that no data are plotted at this stage. The axis labels for the horizontal and vertical scales are set along with the title in the initial function call:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plot(irrigation.df$Area, irrigation.df$Region, xlab = &quot;Area&quot;,
  ylab = &quot;Region&quot;, main = &quot;Irrigation Area by Region&quot;, type = &quot;n&quot;)</pre></div></div>

<p>To add the points with separate colours for each of the four years we use the <strong>points</strong> function and subset to the particular year by testing a condition on the year. The <strong>col</strong> argument is used with a text string to specify the colour for the symbols for the given year:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">points(irrigation.df$Area[irrigation.df$Year == 1980],
  irrigation.df$Region[irrigation.df$Year == 1980], col = &quot;black&quot;, pch = 16)
points(irrigation.df$Area[irrigation.df$Year == 1990],
  irrigation.df$Region[irrigation.df$Year == 1990], col = &quot;blue&quot;, pch = 16)
points(irrigation.df$Area[irrigation.df$Year == 2000],
  irrigation.df$Region[irrigation.df$Year == 2000], col = &quot;red&quot;, pch = 16)
points(irrigation.df$Area[irrigation.df$Year == 2007],
  irrigation.df$Region[irrigation.df$Year == 2007], col = &quot;green&quot;, pch = 16)</pre></div></div>

<p>The code is rather long-winded compared to using the other two graphics packages. We can add a legend to the graph so that the years can be identified:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">legend(10, 4, legend = c(&quot;1980&quot;, &quot;1990&quot;, &quot;2000&quot;, &quot;2007&quot;),
  col = c(&quot;black&quot;, &quot;blue&quot;, &quot;red&quot;, &quot;green&quot;), pch = 16)</pre></div></div>

<p>The placement of the legend uses the <strong>x</strong> and <strong>y</strong> coordinates within the graph to position the box. All the code above produces the following graph:</p>
<div id="attachment_856" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-base.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-base-300x300.jpg" alt="Base Graphics Dot Plot" title="Dot Plot Example" width="300" height="300" class="size-medium wp-image-856" /></a><p class="wp-caption-text">Base Graphics Dot Plot</p></div>
<p>The graph is basic but we can consider the changes over time for the four regions. One downside is that the regions have been labelled with numbers rather than text strings.</p>
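<p>One way round the numeric labels, sketched here on the assumption that <strong>Region</strong> is stored as a factor, is to suppress the default vertical axis with <strong>yaxt = &quot;n&quot;</strong> and draw it again with the <strong>axis</strong> function, using the factor levels as labels. A loop over a named vector of colours (<strong>year.cols</strong> and <strong>region.levels</strong> are names introduced for this sketch) also replaces the four separate calls to <strong>points</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">irrigation.df = data.frame(
  Region = factor(rep(c(&quot;Africa&quot;, &quot;Latin America&quot;, &quot;North America&quot;, &quot;Europe&quot;), 4)),
  Year = factor(c(rep(1980, 4), rep(1990, 4), rep(2000, 4), rep(2007, 4))),
  Area = c(9.3, 12.7, 21.2, 18.8, 11.0, 15.5, 21.6, 25.3,
    13.2, 17.3, 23.3, 26.7, 13.6, 17.3, 23.8, 26.3)
)
# Colours keyed by year; year.cols is a helper name used only in this sketch
year.cols = c(&quot;1980&quot; = &quot;black&quot;, &quot;1990&quot; = &quot;blue&quot;, &quot;2000&quot; = &quot;red&quot;, &quot;2007&quot; = &quot;green&quot;)
region.levels = levels(irrigation.df$Region)
# Empty plot without the default (numeric) vertical axis
plot(irrigation.df$Area, as.numeric(irrigation.df$Region), xlab = &quot;Area&quot;,
  ylab = &quot;&quot;, main = &quot;Irrigation Area by Region&quot;, type = &quot;n&quot;, yaxt = &quot;n&quot;)
# Redraw the vertical axis with the region names as horizontal labels
axis(2, at = seq_along(region.levels), labels = region.levels, las = 1, cex.axis = 0.7)
# Add the points one year at a time
for (yr in names(year.cols)) {
  sel = irrigation.df$Year == yr
  points(irrigation.df$Area[sel], as.numeric(irrigation.df$Region[sel]),
    col = year.cols[yr], pch = 16)
}
legend(&quot;bottomright&quot;, legend = names(year.cols), col = year.cols, pch = 16)</pre></div></div>

<p>Depending on the size of the graphics device, the left margin may need widening with <strong>par(mar = ...)</strong> before the region names fit.</p>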
<p><strong>Lattice Graphics</strong></p>
<p><!--[Fast Tube]--><span id="-FGU6PMaSRY" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/#-FGU6PMaSRY"><img src="http://i.ytimg.com/vi/-FGU6PMaSRY/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>lattice</strong> graphics package has a function <strong>dotplot</strong> that is used to create dot plots. The first argument to the function is a formula describing the variables to use for the vertical and horizontal axes. We also specify the data frame to use for the graph and, through the <strong>groups</strong> argument, the column that determines the symbols and/or colours used to highlight groupings within the plot:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
dotplot(Region ~ Area, data = irrigation.df, groups = Year,
  main = &quot;Irrigation Area by Region&quot;)</pre></div></div>

<p>The lattice variant of the graph is shown here:</p>
<div id="attachment_857" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-lattice-300x300.jpg" alt="Lattice Graphics Dot Plot" title="Dot Plot Example" width="300" height="300" class="size-medium wp-image-857" /></a><p class="wp-caption-text">Lattice Graphics Dot Plot</p></div>
<p>The graph is simple and very similar to the one produced using the base graphics with the advantage that the R code is not as complicated.</p>
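<p>The <strong>dotplot</strong> call above does not draw a legend, so the years cannot be identified from the graph alone. A minimal sketch of one way to add one is the <strong>auto.key</strong> argument; placing the key on the right and giving it a title are just one possible choice, and <strong>irrigation.plot</strong> is a name introduced for this example:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
irrigation.df = data.frame(
  Region = factor(rep(c(&quot;Africa&quot;, &quot;Latin America&quot;, &quot;North America&quot;, &quot;Europe&quot;), 4)),
  Year = factor(c(rep(1980, 4), rep(1990, 4), rep(2000, 4), rep(2007, 4))),
  Area = c(9.3, 12.7, 21.2, 18.8, 11.0, 15.5, 21.6, 25.3,
    13.2, 17.3, 23.3, 26.7, 13.6, 17.3, 23.8, 26.3)
)
# auto.key builds a legend from the groups variable
irrigation.plot = dotplot(Region ~ Area, data = irrigation.df, groups = Year,
  main = &quot;Irrigation Area by Region&quot;,
  auto.key = list(space = &quot;right&quot;, title = &quot;Year&quot;))
# lattice functions return an object; printing it draws the graph
print(irrigation.plot)</pre></div></div>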
<p><strong>ggplot2</strong></p>
<p><!--[Fast Tube]--><span id="y1CsT-jAWZQ" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/#y1CsT-jAWZQ"><img src="http://i.ytimg.com/vi/y1CsT-jAWZQ/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>ggplot</strong> function is used to create the dot plot. We first specify the name of the data frame with the information to be displayed and then use the <strong>aes</strong> function to list the variables to plot on the horizontal and vertical axes. The colour aesthetic names the variable, usually a categorical one, that is used to assign colours to the points.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)
ggplot(irrigation.df, aes(x = Area, y = Region, colour = Year)) +
  geom_point() + ggtitle(&quot;Irrigation Area by Region&quot;)</pre></div></div>

<p>The <strong>ggplot2</strong> version of the dot plot is shown below:</p>
<div id="attachment_858" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-ggplot2-300x300.jpg" alt="ggplot2 Dot Plot" title="Dot Plot Example" width="300" height="300" class="size-medium wp-image-858" /></a><p class="wp-caption-text">ggplot2 Dot Plot</p></div>
<p>This graph is very similar to the ones produced using the other graphics packages but has the distinctive background and legend style that is used as the default option in <strong>ggplot2</strong>.</p>
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/?page_id=282">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Two-way Analysis of Variance (ANOVA)</title>
		<link>http://www.wekaleamstudios.co.uk/posts/two-way-analysis-of-variance-anova/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/two-way-analysis-of-variance-anova/#comments</comments>
		<pubDate>Mon, 15 Feb 2010 21:45:02 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Analysis of Variance]]></category>
		<category><![CDATA[Design of Experiments]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[analysis of variance]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[aov]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[Tukey HSD]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[two way]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=660</guid>
		<description><![CDATA[The analysis of variance (ANOVA) model can be extended from making a comparison between multiple groups to take into account additional factors in an experiment. The simplest extension is from one-way to two-way ANOVA where a second factor is included in the model as well as a potential interaction between the two factors. As an [...]]]></description>
			<content:encoded><![CDATA[<p>The analysis of variance (<strong>ANOVA</strong>) model can be extended from making a comparison between multiple groups to take into account additional factors in an experiment. The simplest extension is from one-way to two-way <strong>ANOVA</strong> where a second factor is included in the model as well as a potential interaction between the two factors.<span id="more-660"></span></p>
<p>As an example consider a company that regularly has to ship parcels between its various (five for this example) sub-offices and has the option of using three competing parcel delivery services, all of which charge roughly similar amounts for each delivery. To determine which service to use, the company decides to run an experiment shipping three packages from its head office to each of the five sub-offices. The delivery time for each package is recorded and the data loaded into <strong>R</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.df = data.frame(
  # factor() keeps both columns as factors in R 4.0.0 and later
  Service = factor(c(rep(&quot;Carrier 1&quot;, 15), rep(&quot;Carrier 2&quot;, 15),
    rep(&quot;Carrier 3&quot;, 15))),
  Destination = factor(rep(c(&quot;Office 1&quot;, &quot;Office 2&quot;, &quot;Office 3&quot;,
    &quot;Office 4&quot;, &quot;Office 5&quot;), 9)),
  Time = c(15.23, 14.32, 14.77, 15.12, 14.05,
  15.48, 14.13, 14.46, 15.62, 14.23, 15.19, 14.67, 14.48, 15.34, 14.22,
  16.66, 16.27, 16.35, 16.93, 15.05, 16.98, 16.43, 15.95, 16.73, 15.62,
  16.53, 16.26, 15.69, 16.97, 15.37, 17.12, 16.65, 15.73, 17.77, 15.52,
  16.15, 16.86, 15.18, 17.96, 15.26, 16.36, 16.44, 14.82, 17.62, 15.04)
)</pre></div></div>

<p>The data is then displayed using a dot plot for an initial visual investigation of any trends in delivery time between the three services and across the five sub-offices. The colour aesthetic is used to distinguish between the three services in the plot.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)
ggplot(delivery.df, aes(Time, Destination, colour = Service)) + geom_point()</pre></div></div>

<p>This code produces the following graph:</p>
<div id="attachment_792" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-data.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-data-300x300.png" alt="Service Delivery Time by Destination" title="Delivery Time" width="300" height="300" class="size-medium wp-image-792" /></a><p class="wp-caption-text">Graph of the delivery time for different services and destinations</p></div>
<p>The graph shows a general pattern of service carrier 1 having shorter delivery times than the other two services. There is also an indication that the differences between the services vary across the five sub-offices, so we might expect the interaction term to be significant in the two-way <strong>ANOVA</strong> model. To fit the two-way <strong>ANOVA</strong> model we use this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.mod1 = aov(Time ~ Destination*Service, data = delivery.df)</pre></div></div>

<p>The <strong>*</strong> symbol instructs <strong>R</strong> to create a formula that includes main effects for both Destination and Service as well as the two-way interaction between these two factors. We save the fitted model to an object which we can summarise as follows to test the significance of the various model terms:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(delivery.mod1)
                    Df  Sum Sq Mean Sq  F value    Pr(&gt;F)    
Destination          4 17.5415  4.3854  61.1553 5.408e-14 ***
Service              2 23.1706 11.5853 161.5599 &lt; 2.2e-16 ***
Destination:Service  8  4.1888  0.5236   7.3018 2.360e-05 ***
Residuals           30  2.1513  0.0717                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>
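
<p>The three rows above the residuals in this table correspond to the terms that the <strong>*</strong> operator generates. As a small check that needs no data, the sketch below expands the formula with <strong>terms</strong> and compares it with the formula written out in full (the names <strong>full.terms</strong> and <strong>long.terms</strong> are introduced only for this example):</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Expand the formulae without fitting a model
full.terms = attr(terms(Time ~ Destination*Service), &quot;term.labels&quot;)
long.terms = attr(terms(Time ~ Destination + Service + Destination:Service),
  &quot;term.labels&quot;)
# The terms created by the * shorthand
full.terms
# TRUE when both ways of writing the formula give the same terms
identical(full.terms, long.terms)</pre></div></div>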

<p>We have strong evidence here that there are differences between the three delivery services, between the five sub-office destinations and that there is an interaction between destination and service in line with what we saw in the original plot of the data. Now that we have fitted the model and identified the important factors we need to investigate the model diagnostics to ensure that the various assumptions are broadly valid.</p>
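<p>The F values and p-values in the summary table can be reproduced by hand, which may help in reading the output. The sketch below copies the degrees of freedom and sums of squares from the table (so the results agree only up to rounding) and stores them in a hypothetical data frame called <strong>anova.tab</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Values copied from the summary(delivery.mod1) output
anova.tab = data.frame(
  Term = c(&quot;Destination&quot;, &quot;Service&quot;, &quot;Destination:Service&quot;),
  Df = c(4, 2, 8),
  SumSq = c(17.5415, 23.1706, 4.1888)
)
resid.df = 30
resid.ms = 2.1513 / resid.df
# Mean square = sum of squares / degrees of freedom
anova.tab$MeanSq = anova.tab$SumSq / anova.tab$Df
# F value = mean square of the term / residual mean square
anova.tab$Fvalue = anova.tab$MeanSq / resid.ms
# The upper tail probability of the F distribution gives the p-value
anova.tab$P = pf(anova.tab$Fvalue, anova.tab$Df, resid.df, lower.tail = FALSE)
anova.tab</pre></div></div>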
<p>We can plot the model residuals against fitted values to look for obvious trends that are not consistent with the model assumptions about independence and common variance. The first step is to create a data frame with the fitted values and residuals from the above model:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.res = delivery.df
delivery.res$M1.Fit = fitted(delivery.mod1)
delivery.res$M1.Resid = resid(delivery.mod1)</pre></div></div>

<p>Then a scatter plot is used to display the fitted values and residuals, where the colour aesthetic highlights which points correspond to the three competing delivery services:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(M1.Fit, M1.Resid, colour = Service)) + geom_point() +
  xlab(&quot;Fitted Values&quot;) + ylab(&quot;Residuals&quot;)</pre></div></div>

<p>The <strong>xlab()</strong> and <strong>ylab()</strong> functions are used to change the text on the axis labels. The residual diagnostic plot is:</p>
<div id="attachment_798" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid1.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid1-300x300.png" alt="Model Residual Plot" title="Model Residual Plot" width="300" height="300" class="size-medium wp-image-798" /></a><p class="wp-caption-text">Diagnostic Residual Plot for Delivery Time Model</p></div>
<p>There are no obvious patterns in this plot that suggest problems with the two-way <strong>ANOVA</strong> model that we fitted to the data.</p>
<p>As an alternative display we could separate the residuals into destination sub-offices, where the <strong>facet_wrap()</strong> function instructs <strong>ggplot</strong> to create a separate display (panel) for each of the destinations.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(M1.Fit, M1.Resid, colour = Service)) +
  geom_point() + xlab(&quot;Fitted Values&quot;) + ylab(&quot;Residuals&quot;) +
  facet_wrap( ~ Destination)</pre></div></div>

<p>This produces the following alternative residual plot:</p>
<div id="attachment_799" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid2.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid2-300x300.png" alt="Model Residual Plot" title="Model Residual Plot" width="300" height="300" class="size-medium wp-image-799" /></a><p class="wp-caption-text">Diagnostic Residual Plot for Delivery Time Model by Destination</p></div>
<p>There are no obvious problems in this diagnostic plot.</p>
<p>We could also consider dividing the data by delivery service to get a different view of the residuals:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(M1.Fit, M1.Resid, colour = Destination)) +
  geom_point() + xlab(&quot;Fitted Values&quot;) + ylab(&quot;Residuals&quot;) +
  facet_wrap( ~ Service)</pre></div></div>

<p>This creates the following graph:</p>
<div id="attachment_800" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid3.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-resid3-300x300.png" alt="Model Residual Plot" title="Model Residual Plot" width="300" height="300" class="size-medium wp-image-800" /></a><p class="wp-caption-text">Diagnostic Residual Plot for Delivery Time Model by Service</p></div>
<p>Again there is nothing substantial here to lead us to consider an alternative analysis.</p>
<p>Lastly we consider the normal probability plot of the model residuals, using the <strong>stat_qq()</strong> option:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.res, aes(sample = M1.Resid)) + stat_qq()</pre></div></div>

<p>The quantile plot is:</p>
<div id="attachment_806" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-qq.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-qq-300x300.png" alt="Quantile Plot" title="Quantile Plot" width="300" height="300" class="size-medium wp-image-806" /></a><p class="wp-caption-text">Normal Probability Plot for Delivery Time Model</p></div>
<p>This plot is very close to the straight line we would expect to observe if the residuals were approximately normally distributed. To round off the analysis we look at the Tukey HSD multiple comparisons to confirm that the differences are between delivery service 1 and the other two competing services:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; TukeyHSD(delivery.mod1, which = &quot;Service&quot;)
  Tukey multiple comparisons of means
    95% family-wise confidence level
&nbsp;
Fit: aov(formula = Time ~ Destination * Service, data = delivery.df)
&nbsp;
$Service
                        diff        lwr       upr     p adj
Carrier 2-Carrier 1 1.498667  1.2576092 1.7397241 0.0000000
Carrier 3-Carrier 1 1.544667  1.3036092 1.7857241 0.0000000
Carrier 3-Carrier 2 0.046000 -0.1950575 0.2870575 0.8856246</pre></div></div>

<p>Even with the multiple comparison post-hoc adjustment there is very strong evidence for the differences that we have consistently observed throughout the analysis.</p>
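<p>All three intervals in the output have the same width because each service has the same number of deliveries. As a sketch of where the limits come from, the half-width can be reproduced from the residual mean square in the ANOVA table and the studentised range distribution via <strong>qtukey</strong>. The values are copied from the output above, so agreement is only up to rounding, and the variable names are introduced for this example:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Residual mean square and degrees of freedom from the ANOVA table
resid.df = 30
resid.ms = 2.1513 / resid.df
# Three services, each with 15 deliveries (3 packages to each of 5 offices)
n.services = 3
n.per.service = 15
# Half-width of a 95% Tukey interval for a difference between two service means
half.width = qtukey(0.95, nmeans = n.services, df = resid.df) / sqrt(2) *
  sqrt(resid.ms * 2 / n.per.service)
half.width
# Interval for Carrier 2 - Carrier 1, whose estimated difference is 1.498667
c(lwr = 1.498667 - half.width, upr = 1.498667 + half.width)</pre></div></div>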
<p>We can use <strong>ggplot</strong> to visualise the difference in mean delivery time for the services and the 95% confidence intervals on these differences. We create a data frame from the <strong>TukeyHSD</strong> output by extracting the component relating to the delivery service comparison and add the text labels by extracting the row names from the data frame.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">delivery.hsd = data.frame(TukeyHSD(delivery.mod1, which = &quot;Service&quot;)$Service)
delivery.hsd$Comparison = row.names(delivery.hsd)</pre></div></div>

<p>We then use <strong>geom_pointrange()</strong> to draw the lower, middle and upper values for each of the three pairwise comparisons of interest.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(delivery.hsd, aes(Comparison, y = diff, ymin = lwr, ymax = upr)) +
  geom_pointrange() + ylab(&quot;Difference in Mean Delivery Time by Service&quot;) +
  coord_flip()</pre></div></div>

<p>The <strong>coord_flip()</strong> function is used to make the confidence intervals horizontal rather than vertical on the graph. This can be confusing when creating the axis labels, as we specify each label where it would appear prior to the flip of coordinates. In the example above we add text to the y axis but this now appears on the x axis in the final graph:</p>
<div id="attachment_811" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-tukeyHSD.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-twoway-tukeyHSD-300x300.png" alt="Tukey HSD" title="Tukey HSD" width="300" height="300" class="size-medium wp-image-811" /></a><p class="wp-caption-text">Plot of Confidence Intervals for Mean Differences using Tukey HSD</p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/two-way-analysis-of-variance-anova/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>One-way Analysis of Variance (ANOVA)</title>
		<link>http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/#comments</comments>
		<pubDate>Wed, 03 Feb 2010 21:01:24 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Analysis of Variance]]></category>
		<category><![CDATA[Design of Experiments]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Statistical Modelling]]></category>
		<category><![CDATA[analysis of variance]]></category>
		<category><![CDATA[ANOVA]]></category>
		<category><![CDATA[factor]]></category>
		<category><![CDATA[fitted values]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[lm]]></category>
		<category><![CDATA[one way]]></category>
		<category><![CDATA[residuals]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=658</guid>
		<description><![CDATA[Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. The base case is the one-way ANOVA which is an extension of the two-sample t-test for independent groups covering situations where there are more than two groups being compared. [...]]]></description>
			<content:encoded><![CDATA[<p>Analysis of Variance (<strong>ANOVA</strong>) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. The base case is the one-way <strong>ANOVA</strong>, which is an extension of the two-sample t-test for independent groups covering situations where there are more than two groups being compared.<span id="more-658"></span></p>
<p><!--[Fast Tube]--><span id="PBE-llEkiHk" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/#PBE-llEkiHk"><img src="http://i.ytimg.com/vi/PBE-llEkiHk/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p><!--[Fast Tube]--><span id="r_uSH0Xaau8" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/#r_uSH0Xaau8"><img src="http://i.ytimg.com/vi/r_uSH0Xaau8/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In one-way <strong>ANOVA</strong> the data is sub-divided into groups based on a single classification factor and the standard terminology used to describe the set of factor levels is <strong>treatment</strong> even though this might not always have meaning for the particular application. There is variation in the measurements taken on the individual components of the data set and ANOVA investigates whether this variation can be explained by the grouping introduced by the classification factor.</p>
<p>As an example we consider one of the data sets available with R relating to an experiment into plant growth. The purpose of the experiment was to compare the yields of the plants for a control group and two treatments of interest. The response variable was the dried weight of the plants.</p>
<p>The first step in the investigation is to take a copy of the data frame so that we can make some adjustments as necessary while leaving the original data alone. We use the <strong>factor</strong> function to re-define the labels of the <strong>group</strong> variable that will appear in the output and graphs:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,
  labels = c(&quot;Control&quot;, &quot;Treatment 1&quot;, &quot;Treatment 2&quot;))</pre></div></div>

<p>The <strong>labels</strong> argument is a character vector of names that are matched, in order, to the levels of the <strong>group</strong> factor variable.</p>
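<p>Because the labels are matched to the existing levels by position, it is worth checking the mapping. The sketch below cross-tabulates the original labels against the new ones; <strong>label.check</strong> is a name introduced for this example:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Levels of the group factor as supplied with R
levels(PlantGrowth$group)
plant.df = PlantGrowth
plant.df$group = factor(plant.df$group,
  labels = c(&quot;Control&quot;, &quot;Treatment 1&quot;, &quot;Treatment 2&quot;))
# Cross-tabulate old labels against new ones; only the diagonal should be non-zero
label.check = table(PlantGrowth$group, plant.df$group)
label.check</pre></div></div>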
<p>A boxplot of the distributions of the dried weights for the three competing groups is created using the <strong>ggplot2</strong> package:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">require(ggplot2)
ggplot(plant.df, aes(x = group, y = weight)) +
  geom_boxplot(fill = &quot;grey80&quot;, colour = &quot;blue&quot;) +
  scale_x_discrete() + xlab(&quot;Treatment Group&quot;) +
  ylab(&quot;Dried weight of plants&quot;)</pre></div></div>

<p>The <strong>fill</strong> and <strong>colour</strong> arguments of <strong>geom_boxplot()</strong> specify the background and outline colours of the boxes. The axis labels are created with the <strong>xlab()</strong> and <strong>ylab()</strong> functions. The plot that is produced looks like this:</p>
<p><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/01/anova-oneway-data.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/01/anova-oneway-data-300x300.png" alt="Boxplot of Plant Growth by Treatment Group" title="Plant Growth Data Summary" width="300" height="300" class="aligncenter size-medium wp-image-754" /></a></p>
<p>Initial inspection of the data suggests that there are differences in the dried weight for the two treatments, but it is less clear whether the treatments differ from the control group. To investigate these differences we fit the one-way ANOVA model using the <strong>lm</strong> function and look at the parameter estimates and standard errors for the treatment effects. The function call is:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plant.mod1 = lm(weight ~ group, data = plant.df)</pre></div></div>

<p>We save the model fitted to the data in an object so that we can undertake various actions to study the goodness of the fit to the data and other model assumptions. The standard summary of an <strong>lm</strong> object is used to produce the following output:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; summary(plant.mod1)
&nbsp;
Call:
lm(formula = weight ~ group, data = plant.df)
&nbsp;
Residuals:
    Min      1Q  Median      3Q     Max 
-1.0710 -0.4180 -0.0060  0.2627  1.3690 
&nbsp;
Coefficients:
                 Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)        5.0320     0.1971  25.527   &lt;2e-16 ***
groupTreatment 1  -0.3710     0.2788  -1.331   0.1944    
groupTreatment 2   0.4940     0.2788   1.772   0.0877 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
&nbsp;
Residual standard error: 0.6234 on 27 degrees of freedom
Multiple R-squared: 0.2641,     Adjusted R-squared: 0.2096 
F-statistic: 4.846 on 2 and 27 DF,  p-value: 0.01591</pre></div></div>
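
<p>To help with reading the coefficient table, here is a minimal sketch of how the estimates relate to the group means under the default treatment contrasts used by R. It assumes that <strong>plant.df</strong> is the built-in <strong>PlantGrowth</strong> data frame with relabelled group levels; the names <strong>group.means</strong> and <strong>treatment.effects</strong> are introduced here purely for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Assumption: plant.df is the built-in PlantGrowth data with relabelled groups
plant.df = PlantGrowth
levels(plant.df$group) = c("Control", "Treatment 1", "Treatment 2")
plant.mod1 = lm(weight ~ group, data = plant.df)
# Mean dried weight in each group (illustrative helper objects)
group.means = sapply(split(plant.df$weight, plant.df$group), mean)
# Each treatment mean minus the control mean
treatment.effects = group.means[-1] - group.means["Control"]
# The intercept should match the control mean and the remaining
# coefficients should match the differences from the control mean
group.means["Control"]
treatment.effects
coef(plant.mod1)</pre></div></div>

<p>Under these assumptions the intercept is the mean of the control group, and each treatment coefficient is the difference between that treatment mean and the control mean.</p>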

<p>The model output indicates only weak evidence of a difference in the average growth for the 2nd treatment compared to the control group (p = 0.088), and no clear evidence for the 1st treatment (p = 0.194). An analysis of variance table for this model can be produced via the <strong>anova</strong> command:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; anova(plant.mod1)
Analysis of Variance Table
&nbsp;
Response: weight
          Df  Sum Sq Mean Sq F value  Pr(&gt;F)  
group      2  3.7663  1.8832  4.8461 0.01591 *
Residuals 27 10.4921  0.3886                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</pre></div></div>
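
<p>The entries of this table can also be worked out directly from the data, which may help to show where the F statistic comes from. The sketch below again assumes that <strong>plant.df</strong> is the built-in <strong>PlantGrowth</strong> data with relabelled groups; all of the object names are illustrative:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Assumption: plant.df is the built-in PlantGrowth data with relabelled groups
plant.df = PlantGrowth
levels(plant.df$group) = c("Control", "Treatment 1", "Treatment 2")
# Overall mean, group means and group sizes
grand.mean = mean(plant.df$weight)
group.means = sapply(split(plant.df$weight, plant.df$group), mean)
group.sizes = sapply(split(plant.df$weight, plant.df$group), length)
# Between-group and within-group (residual) sums of squares
ss.group = sum(group.sizes * (group.means - grand.mean)^2)
ss.resid = sum((plant.df$weight - ave(plant.df$weight, plant.df$group))^2)
# Degrees of freedom: (number of groups - 1) and (observations - number of groups)
df.group = nlevels(plant.df$group) - 1
df.resid = nrow(plant.df) - nlevels(plant.df$group)
# F statistic as the ratio of mean squares, with its upper-tail p-value
f.value = (ss.group / df.group) / (ss.resid / df.resid)
p.value = pf(f.value, df.group, df.resid, lower.tail = FALSE)
c(F = f.value, p = p.value)</pre></div></div>

<p>If the assumption about the data holds, the sums of squares, F value and p-value should agree with the <strong>group</strong> and <strong>Residuals</strong> rows of the table above.</p>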

<p>This table confirms that there are differences between the groups (F = 4.85 on 2 and 27 degrees of freedom, p = 0.016), in agreement with the F-statistic reported at the foot of the model summary. The function <strong>confint</strong> calculates confidence intervals for the model parameters, with 95% intervals produced by default:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; confint(plant.mod1)
                       2.5 %    97.5 %
(Intercept)       4.62752600 5.4364740
groupTreatment 1 -0.94301261 0.2010126
groupTreatment 2 -0.07801261 1.0660126</pre></div></div>
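
<p>For a linear model these intervals are the estimate plus or minus a t quantile times the standard error. As a rough sketch (assuming once more that <strong>plant.df</strong> is the built-in <strong>PlantGrowth</strong> data with relabelled groups, and using illustrative object names), the same limits can be computed by hand:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Assumption: plant.df is the built-in PlantGrowth data with relabelled groups
plant.df = PlantGrowth
levels(plant.df$group) = c("Control", "Treatment 1", "Treatment 2")
plant.mod1 = lm(weight ~ group, data = plant.df)
# Estimates and standard errors from the coefficient table
coef.table = coef(summary(plant.mod1))
est = coef.table[, "Estimate"]
se = coef.table[, "Std. Error"]
# Two-sided 95% critical value on the residual degrees of freedom
t.crit = qt(0.975, df = df.residual(plant.mod1))
# Lower and upper limits, to be compared with confint(plant.mod1)
manual.ci = cbind(lower = est - t.crit * se, upper = est + t.crit * se)
manual.ci</pre></div></div>

<p>Both treatment intervals include zero, which is consistent with the individual t-tests against the control group in the model summary.</p>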

<p>The model residuals can be plotted against the fitted values to investigate the model assumptions. First we create a data frame with the fitted values, residuals and treatment identifiers:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plant.mod = data.frame(Fitted = fitted(plant.mod1),
  Residuals = resid(plant.mod1), Treatment = plant.df$group)</pre></div></div>

<p>and then produce the plot:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(plant.mod, aes(Fitted, Residuals, colour = Treatment)) + geom_point()</pre></div></div>

<p>which produces this graph:<br />
<a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-oneway-residualplot.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/02/anova-oneway-residualplot-300x300.png" alt="Residual diagnostic plot" title="Plant Growth Residual Plot" width="300" height="300" class="aligncenter size-medium wp-image-762" /></a><br />
The diagnostic plot shows no major problems, although there is some evidence that the spread of the residuals differs between the three treatment groups.</p>
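
<p>One simple way to put numbers on the spread seen in the residual plot is to compute the standard deviation of the residuals within each group. This is only a sketch: it assumes that <strong>plant.df</strong> is the built-in <strong>PlantGrowth</strong> data with relabelled groups, and <strong>resid.sd</strong> is an illustrative name:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Assumption: plant.df is the built-in PlantGrowth data with relabelled groups
plant.df = PlantGrowth
levels(plant.df$group) = c("Control", "Treatment 1", "Treatment 2")
plant.mod1 = lm(weight ~ group, data = plant.df)
# Same diagnostic data frame as used for the residual plot
plant.mod = data.frame(Fitted = fitted(plant.mod1),
  Residuals = resid(plant.mod1), Treatment = plant.df$group)
# Standard deviation of the residuals within each treatment group
resid.sd = sapply(split(plant.mod$Residuals, plant.mod$Treatment), sd)
resid.sd</pre></div></div>

<p>Markedly different values across the three groups would support the visual impression of unequal variability, although with only ten observations per group such differences should be interpreted cautiously.</p>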
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/one-way-analysis-of-variance-anova/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>

