<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Software for Exploratory Data Analysis and Statistical Modelling &#187; Exploratory Data Analysis</title>
	<atom:link href="http://www.wekaleamstudios.co.uk/topics/statistical-analysis/exploratory-data-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wekaleamstudios.co.uk</link>
	<description>Statistical Modelling with R</description>
	<lastBuildDate>Wed, 01 Feb 2012 19:44:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Plotting Time Series data using ggplot2</title>
		<link>http://www.wekaleamstudios.co.uk/posts/plotting-time-series-data-using-ggplot2/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/plotting-time-series-data-using-ggplot2/#comments</comments>
		<pubDate>Thu, 30 Sep 2010 21:05:18 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[aes]]></category>
		<category><![CDATA[date]]></category>
		<category><![CDATA[geom_line]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[line]]></category>
		<category><![CDATA[plot]]></category>
		<category><![CDATA[scale_x_date]]></category>
		<category><![CDATA[time series]]></category>
		<category><![CDATA[xlab]]></category>
		<category><![CDATA[ylab]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1404</guid>
		<description><![CDATA[There are various ways to plot data that is represented by a time series in R. The ggplot2 package has scales that can handle dates reasonably easily. As an example consider a data set on the number of views of the YouTube channel ramstatvid. A short snippet of the data [...]]]></description>
			<content:encoded><![CDATA[<p>There are various ways to plot data that is represented by a time series in <strong>R</strong>. The <strong><a href="http://had.co.nz/ggplot2/">ggplot2</a></strong> package has scales that can handle dates reasonably easily.<span id="more-1404"></span></p>
<p><!--[Fast Tube]--><span id="irtSRkhGbXg" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/plotting-time-series-data-using-ggplot2/#irtSRkhGbXg"><img src="http://i.ytimg.com/vi/irtSRkhGbXg/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>As an example consider a data set on the number of views of the YouTube channel <a href="http://www.youtube.com/user/ramstatvid?feature=mhum">ramstatvid</a>. A short snippet of the data is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; head(yt.views)
        Date Views
1 2010-05-17    13
2 2010-05-18    11
3 2010-05-19     4
4 2010-05-20     2
5 2010-05-21    23
6 2010-05-22    26</pre></div></div>
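
<p>The date scale relies on the <strong>Date</strong> column being stored as class <strong>Date</strong> rather than as text. A minimal sketch of building the six rows shown above by hand, assuming the dates start out as character strings; the object name <strong>yt.sample</strong> is made up for this illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Hypothetical reconstruction of the six rows printed above
yt.sample = data.frame(
  Date = as.Date(c(&quot;2010-05-17&quot;, &quot;2010-05-18&quot;, &quot;2010-05-19&quot;,
                   &quot;2010-05-20&quot;, &quot;2010-05-21&quot;, &quot;2010-05-22&quot;)),
  Views = c(13, 11, 4, 2, 23, 26)
)
# Check the column classes before plotting
str(yt.sample)</pre></div></div>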

<p>The <strong>ggplot</strong> function is given the data frame, and <strong>aes</strong> maps <strong>Date</strong> to the x-axis and the number of <strong>Views</strong> to the y-axis.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)
# date_labels is the argument in current ggplot2; the 2010 release used format = here
ggplot(yt.views, aes(Date, Views)) + geom_line() +
  scale_x_date(date_labels = &quot;%b-%Y&quot;) + xlab(&quot;&quot;) + ylab(&quot;Daily Views&quot;)</pre></div></div>

<p>The axis labels for the <strong>Date</strong> variable are created with the <strong>scale_x_date</strong> function where the format is specified as a Month/Year combination with the <strong>%b</strong> and <strong>%Y</strong> formatting strings. The graph that is produced is shown here:</p>
<div id="attachment_1403" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/09/ts-example1.jpg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/09/ts-example1-300x300.jpg" alt="Time Series Example" title="Time Series Example" width="300" height="300" class="size-medium wp-image-1403" /></a><p class="wp-caption-text">Time Series Plot Example with ggplot2 package</p></div>
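<p>The same formatting strings can be tried outside of a plot with the <strong>format</strong> function, which may help when choosing axis labels. A small sketch on a single date; the abbreviated month name produced by <strong>%b</strong> depends on the locale, so the first call only gives May-2010 in an English locale:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">d = as.Date(&quot;2010-05-17&quot;)
format(d, &quot;%b-%Y&quot;)   # abbreviated month name and four digit year
format(d, &quot;%m/%y&quot;)   # numeric alternative: 05/10</pre></div></div>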
<p>Other useful resources are provided on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/plotting-time-series-data-using-ggplot2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Charting the performance of cricket all-rounders &#8211; IT Botham</title>
		<link>http://www.wekaleamstudios.co.uk/posts/charting-the-performance-of-cricket-all-rounders-it-botham/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/charting-the-performance-of-cricket-all-rounders-it-botham/#comments</comments>
		<pubDate>Mon, 16 Aug 2010 19:59:54 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[all-rounder]]></category>
		<category><![CDATA[botham]]></category>
		<category><![CDATA[catches]]></category>
		<category><![CDATA[cricket]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[histogram]]></category>
		<category><![CDATA[runs]]></category>
		<category><![CDATA[wickets]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1321</guid>
		<description><![CDATA[Cricket is a sport that generates a large volume of performance data and corresponding debate about the relative qualities of various players over their careers and in relation to their contemporaries. The cricinfo website has an extensive database of statistics for professional cricketers that can be searched to access the information in various formats. As [...]]]></description>
			<content:encoded><![CDATA[<p>Cricket is a sport that generates a large volume of performance data and corresponding debate about the relative qualities of various players over their careers and in relation to their contemporaries. The <a href="http://www.cricinfo.com/">cricinfo</a> website has an extensive database of statistics for professional cricketers that can be searched to access the information in various formats.<span id="more-1321"></span></p>
<p>As an initial example we will consider the English legend Sir Ian Botham who played 102 test matches for England between his debut in 1977 and his final game in 1992.</p>
<p>The first obvious breakdown is to consider how Botham performed against the six countries that he played against during his test career. A summary of his statistics is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"> Opposition Matches Bat Inns Runs NO Bowl Inns Wicket Catch
  Australia      36       49 1673  2       66     148    57
      India      14       16 1201  0       23      59    14
New Zealand      15       22  846  2       28      64    14
   Pakistan      14       20  647  1       18      40    14
  Sri Lanka       3        3   41  0        6      11     2
West Indies      20       37  792  1       27      61    19</pre></div></div>
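
<p>One way to hold this table in <strong>R</strong> is sketched below. The column names <strong>Bat.Inns</strong> and <strong>Bowl.Inns</strong> are an assumption made here to avoid spaces in the names; the values are copied from the table:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">itb.opp = data.frame(
  Opposition = c(&quot;Australia&quot;, &quot;India&quot;, &quot;New Zealand&quot;,
                 &quot;Pakistan&quot;, &quot;Sri Lanka&quot;, &quot;West Indies&quot;),
  Matches    = c(36, 14, 15, 14, 3, 20),
  Bat.Inns   = c(49, 16, 22, 20, 3, 37),
  Runs       = c(1673, 1201, 846, 647, 41, 792),
  NO         = c(2, 0, 2, 1, 0, 1),
  Bowl.Inns  = c(66, 23, 28, 18, 6, 27),
  Wicket     = c(148, 59, 64, 40, 11, 61),
  Catch      = c(57, 14, 14, 14, 2, 19)
)
# The match count should add up to the 102 tests mentioned earlier
sum(itb.opp$Matches)</pre></div></div>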

<p>Botham only played three matches against Sri Lanka so it is difficult to properly assess his performance against them. If the above table is stored in a data frame <strong>itb.opp</strong> then we can create a bar chart of the total runs (or wickets) by opposition country:</p>

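<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># stat = &quot;identity&quot; plots the Runs values themselves rather than counting rows
ggplot(itb.opp, aes(Opposition, Runs)) + geom_bar(stat = &quot;identity&quot;) + xlab(&quot;Country&quot;) +
  ylab(&quot;Total Runs&quot;)</pre></div></div>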

<p>This code produces the following graph:</p>
<div id="attachment_1355" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Runs.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Runs-300x300.png" alt="IT Botham Total Runs by Opposition" title="IT Botham Total Runs" width="300" height="300" class="size-medium wp-image-1355" /></a><p class="wp-caption-text">IT Botham Total Runs by Opposition</p></div>
<p>The total wickets graph is produced by the next code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(itb.opp, aes(Opposition, Wicket)) + geom_bar(stat = &quot;identity&quot;) + xlab(&quot;Country&quot;) +
  ylab(&quot;Total Wickets&quot;)</pre></div></div>

<div id="attachment_1356" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Wickets.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Wickets-300x300.png" alt="IT Botham Total Wickets by Opposition" title="IT Botham Total Wickets" width="300" height="300" class="size-medium wp-image-1356" /></a><p class="wp-caption-text">IT Botham Total Wickets by Opposition</p></div>
<p>We may now want to delve deeper into the performance against different nations to take into account the number of games or innings where Botham batted or bowled. The traditional way to assess performance is to calculate batting and bowling averages and we can do this by opposition which provides the following data frame:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; itb.opp.sum
 Opposition Discipline  Average
  Australia    Batting 29.35088
      India    Batting 70.64706
New Zealand    Batting 42.30000
   Pakistan    Batting 32.35000
  Sri Lanka    Batting 13.66667
West Indies    Batting 21.40541
  Australia    Bowling 27.65541
      India    Bowling 26.40678
New Zealand    Bowling 23.43750
   Pakistan    Bowling 31.77500
  Sri Lanka    Bowling 28.18182
West Indies    Bowling 35.18033</pre></div></div>
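
<p>The batting average is runs scored divided by the number of dismissals (innings minus not outs), and the bowling average is runs conceded divided by wickets taken. A minimal sketch with two helper functions whose names are made up for this illustration; runs conceded are not part of the summary table shown earlier, so only the batting calculation is run, using the New Zealand and Sri Lanka rows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Batting average: runs scored per dismissal (innings minus not outs)
batting.average = function(runs, innings, not.outs) runs / (innings - not.outs)
# Bowling average: runs conceded per wicket taken
bowling.average = function(runs.conceded, wickets) runs.conceded / wickets

# Two rows taken from the opposition summary table
batting.average(runs = 846, innings = 22, not.outs = 2)   # New Zealand: 42.3
batting.average(runs = 41, innings = 3, not.outs = 0)     # Sri Lanka: 13.66667</pre></div></div>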

<p>The <strong>itb.opp.sum</strong> data frame can be converted into a dot plot so we can see whether Botham had a higher batting average than bowling average, which is often taken to be one of the signs of an all-rounder.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(itb.opp.sum, aes(Average, Opposition, colour = Discipline)) +
  geom_point() + xlab(&quot;Average&quot;) + ylab(&quot;&quot;)</pre></div></div>

<p>The graph is shown here:</p>
<div id="attachment_1362" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country-300x300.png" alt="IT Botham Batting and Bowling Averages by Opposition" title="IT Botham Batting and Bowling Averages" width="300" height="300" class="size-medium wp-image-1362" /></a><p class="wp-caption-text">IT Botham Batting and Bowling Averages by Opposition</p></div>
<p>We can see the differences in performance based on the opposition. Botham&#8217;s performances against the West Indies, by far the strongest team during most of his international career, were worse than against the other countries. However, his averages were far from embarrassing when compared to other players at the time. The graph also shows that Botham enjoyed batting and bowling against India.</p>
<p>We can divide this data further based on whether the matches were played in England or outside of England and this data is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; itb.opp.ha.sum
  Opposition Venue Discipline  Average
   Australia  Away    Batting 30.22581
       India  Away    Batting 61.55556
 New Zealand  Away    Batting 50.44444
    Pakistan  Away    Batting 16.00000
   Sri Lanka  Away    Batting 13.00000
 West Indies  Away    Batting 14.17647
   Australia  Home    Batting 28.30769
       India  Home    Batting 80.87500
 New Zealand  Home    Batting 35.63636
    Pakistan  Home    Batting 34.16667
   Sri Lanka  Home    Batting 14.00000
 West Indies  Home    Batting 27.55000
   Australia  Away    Bowling 28.44928
       India  Away    Bowling 25.53333
 New Zealand  Away    Bowling 27.44444
    Pakistan  Away    Bowling 45.00000
   Sri Lanka  Away    Bowling 21.66667
 West Indies  Away    Bowling 39.50000
   Australia  Home    Bowling 26.96203
       India  Home    Bowling 27.31034
 New Zealand  Home    Bowling 20.51351
    Pakistan  Home    Bowling 31.07895
   Sri Lanka  Home    Bowling 30.62500
 West Indies  Home    Bowling 31.97143</pre></div></div>

<p>A dot plot is created from this data with a separate panel for each of the six opposition countries and the averages divided into batting and bowling performances. The coloured dots in the graph indicate whether the average is for matches at home or away.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(itb.opp.ha.sum, aes(Average, Discipline, colour = Venue)) +
  geom_point() + facet_wrap( ~ Opposition) +
  xlab(&quot;Average&quot;) + ylab(&quot;&quot;)</pre></div></div>

<p>This graph is shown below:</p>
<div id="attachment_1366" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country-HomeAway.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country-HomeAway-300x300.png" alt="IT Botham Batting and Bowling Averages by Country and Home/Away" title="IT Botham Batting and Bowling Averages" width="300" height="300" class="size-medium wp-image-1366" /></a><p class="wp-caption-text">IT Botham Batting and Bowling Averages by Country and Home/Away</p></div>
<p>We can see that the difference between home and away performance is, in general, not very large for bowling averages but in some cases there is a noticeable difference in batting averages. When looking at Botham&#8217;s performances against the West Indies his statistics at home are much better than his away performance, suggesting that his main struggles against the strong West Indies team were in the Caribbean. This might be due to his swing bowling being more suitable to English conditions compared to pitches in the West Indies.</p>
<p>To round off this brief look at the career of IT Botham let us consider some other important statistics, in particular games where he performed with the bat and ball.</p>
<ul>
<li>Overall Botham scored 14 hundreds and 22 fifties out of 161 innings so he reached fifty runs once every four or five innings.</li>
<li>He also took 27 five wicket hauls and 17 four wicket hauls so he took four or more wickets every four innings or so.</li>
<li>He took 120 catches.</li>
</ul>
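<p>The rates quoted in the first two bullet points can be checked with a little arithmetic. In this sketch the 161 batting innings come from the text, while the number of bowling innings is assumed to be the sum of the Bowl Inns column in the opposition table:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">fifty.plus = 14 + 22                     # hundreds plus fifties
161 / fifty.plus                         # innings per score of fifty or more, about 4.5
bowl.inns = 66 + 23 + 28 + 18 + 6 + 27   # assumed total of bowling innings
four.plus = 27 + 17                      # five wicket hauls plus four wicket hauls
bowl.inns / four.plus                    # bowling innings per haul of four or more, about 3.8</pre></div></div>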
<p>Individual matches of excellence include five games with a century and at least five wickets:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">Year  Opposition       Ground Venue Runs Wicket
1978 New Zealand Christchurch  Away  133      8
1978    Pakistan       Lord's  Home  108      8
1980       India       Mumbai  Away  114     13
1981   Australia        Leeds  Home  199      7
1984 New Zealand   Wellington  Away  138      6</pre></div></div>

<p>These performances and others show why Botham was considered such a great player as he produced some sustained periods of excellent all-round cricket rather than having one discipline more dominant for a long period of time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/charting-the-performance-of-cricket-all-rounders-it-botham/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Displaying data using level plots</title>
		<link>http://www.wekaleamstudios.co.uk/posts/displaying-data-using-level-plots/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/displaying-data-using-level-plots/#comments</comments>
		<pubDate>Mon, 03 May 2010 10:17:08 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[box]]></category>
		<category><![CDATA[expand.grid]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[image]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[levelplot]]></category>
		<category><![CDATA[loess]]></category>
		<category><![CDATA[predict]]></category>
		<category><![CDATA[surface]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1008</guid>
		<description><![CDATA[A level plot is a type of graph that is used to display a surface in two rather than three dimensions &#8211; the surface is viewed from above as if we were looking straight down and is an alternative to a contour plot &#8211; geographic data is an example of where this type of graph [...]]]></description>
			<content:encoded><![CDATA[<p>A level plot is a type of graph that is used to display a surface in two rather than three dimensions &#8211; the surface is viewed from above as if we were looking straight down and is an alternative to a contour plot &#8211; geographic data is an example of where this type of graph would be used. A contour plot uses lines to identify regions of different heights and the level plot uses coloured regions to produce a similar effect.<span id="more-1008"></span></p>
<p>To illustrate this type of graph we will consider some surface elevation data that is available in the <strong>geoR</strong> package. The data set in this package is called <strong>elevation</strong> and stores the elevation height in feet (as multiples of ten feet) for a grid region of x and y coordinates (recorded as multiples of 50 feet). To access this data we load the <strong>geoR</strong> package and then use the <strong>data</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">require(geoR)
data(elevation)</pre></div></div>

<p>For some packages we need the call to the <strong>data</strong> function to make a set of data available for our use. The <strong>elevation</strong> object is not a data frame so our first step is to create our own data frame to be used to create the level plots using the different graphics packages.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">elevation.df = data.frame(x = 50 * elevation$coords[,&quot;x&quot;],
  y = 50 * elevation$coords[,&quot;y&quot;], z = 10 * elevation$data)</pre></div></div>

<p>We extract the x and y grid coordinates and the height values, multiplying them by 50 and 10 respectively to convert to feet for the graphs. Rather than trying to plot the individual values we need to create a surface to cover the whole grid region as the points themselves are too sparse. We make use of the <strong>loess</strong> function to fit a local polynomial trend surface (using weighted least squares) to approximate the elevation across the whole region. The function call for a local quadratic surface is shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">elevation.loess = loess(z ~ x*y, data = elevation.df,
  degree = 2, span = 0.25)</pre></div></div>

<p>The next stage is to extract heights from this fitted surface at regular intervals across the whole grid region of interest &#8211; which runs from 10 to 300 feet in both the x and y directions. The <strong>expand.grid</strong> function creates a data frame of all combinations of the x and y values that we specify in a list. We choose a range every foot from 10 to 300 feet to create a fine grid:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">elevation.fit = expand.grid(list(x = seq(10, 300, 1), y = seq(10, 300, 1)))</pre></div></div>
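
<p>A small illustration of what <strong>expand.grid</strong> returns, using a made-up grid of three x values and two y values rather than the full 291 by 291 grid:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">small.grid = expand.grid(list(x = c(10, 20, 30), y = c(10, 20)))
small.grid        # data frame with one row per combination, x varying fastest
nrow(small.grid)  # 3 * 2 = 6 rows</pre></div></div>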

<p>The <strong>predict</strong> function is then used to estimate the surface height at all of these combinations of x and y coordinates covering our grid region. This is saved as an object <strong>z</strong> which will be used by the <strong>base</strong> graphics function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">z = predict(elevation.loess, newdata = elevation.fit)</pre></div></div>

<p>The <strong>lattice</strong> and <strong>ggplot2</strong> packages expect the data in a different format so we make use of the <strong>as.numeric</strong> function to convert from a table of heights to a single column and append to the object we create based on all combinations of x and y coordinates:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">elevation.fit$Height = as.numeric(z)</pre></div></div>
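
<p>The conversion works because <strong>predict</strong> returns a table with one row per x value and one column per y value when the new data comes from <strong>expand.grid</strong>, and <strong>as.numeric</strong> unrolls that table column by column, which is the same order as the rows of the grid. A sketch on a made-up toy surface (not the elevation data) that checks this; all object names are hypothetical:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Hypothetical toy surface: a tilted plane with a little noise
set.seed(1)
toy = data.frame(x = runif(200, 0, 10), y = runif(200, 0, 10))
toy$z = toy$x + 2 * toy$y + rnorm(200, sd = 0.1)
toy.loess = loess(z ~ x*y, data = toy, degree = 2, span = 0.5)

# Coarse grid: four x values and three y values
toy.grid = expand.grid(list(x = seq(2, 8, 2), y = seq(2, 8, 3)))
toy.z = predict(toy.loess, newdata = toy.grid)
dim(toy.z)                            # 4 rows (x values) by 3 columns (y values)
toy.grid$Height = as.numeric(toy.z)   # one height per row of the grid</pre></div></div>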

<p>The data is now in a format that can be used to create the level plots in the various packages.</p>
<p><strong>Base Graphics</strong></p>
<p>The function <strong>image</strong> in the <strong>base</strong> graphics package is the function we use to create a level plot. This function requires a list of x and y values that cover the grid of vertical values that will be used to create the surface. These heights are specified as a table of values, which in our case was saved as the object <strong>z</strong> during the calculations on the local trend surface.</p>
<p>The text on the axis labels is specified by the <strong>xlab</strong> and <strong>ylab</strong> function arguments and the <strong>main</strong> argument determines the overall title for the graph. The function call below creates the level plot:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">image(seq(10, 300, 1), seq(10, 300, 1), z,
  xlab = &quot;X Coordinate (feet)&quot;, ylab = &quot;Y Coordinate (feet)&quot;,
  main = &quot;Surface elevation data&quot;)
box()</pre></div></div>

<p>After the <strong>image</strong> function is used we call the <strong>box</strong> function mainly for aesthetic purposes to ensure there is a line surrounding the level plot. The graph that is created is shown below:</p>
<div id="attachment_1012" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-base.jpg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-base-300x300.jpg" alt="Base Graphics Level Plot" title="Level plot Example" width="300" height="300" class="size-medium wp-image-1012" /></a><p class="wp-caption-text">Base Graphics Level Plot</p></div>
<p>The default colour scheme used by the <strong>base</strong> graphics produces an attractive level plot graph where we can easily see the variation in height across the grid region. It is basically a fancy version of a contour plot where the regions between the contour lines are coloured with different shades indicating the height in those regions.</p>
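<p>The layout that <strong>image</strong> expects can be tried on a small scale with the <strong>volcano</strong> matrix that ships with <strong>R</strong>, used here only as a stand-in for the fitted surface: the rows of the matrix correspond to the x values and the columns to the y values.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">x = seq_len(nrow(volcano))   # one x value per row of the matrix
y = seq_len(ncol(volcano))   # one y value per column of the matrix
image(x, y, volcano, xlab = &quot;Row&quot;, ylab = &quot;Column&quot;,
  main = &quot;volcano data set&quot;)
box()</pre></div></div>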
<p><strong>Lattice Graphics</strong></p>
<p>The <strong>lattice</strong> graphics package provides a function <strong>levelplot</strong> for this type of graphical display. We use the data stored in the object <strong>elevation.fit</strong> to create the graph with <strong>lattice</strong> graphics.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
levelplot(Height ~ x*y, data = elevation.fit,
  xlab = &quot;X Coordinate (feet)&quot;, ylab = &quot;Y Coordinate (feet)&quot;,
  main = &quot;Surface elevation data&quot;,
  col.regions = terrain.colors(100)
)</pre></div></div>

<p>The formula specifies which variables to use for the three axes, and the <strong>data</strong> argument names the data frame where the values are stored &#8211; as there are three dimensions it is the z axis that is specified on the left hand side of the formula. The axis labels and title are specified in the same way as the <strong>base</strong> graphics.</p>
<p>The range of colours used in the <strong>lattice</strong> level plot can be specified as a vector of colours to the <strong>col.regions</strong> argument of the function. We make use of the <strong>terrain.colors</strong> function to create this vector, which holds a range of 100 colours that are less striking than those used above with the <strong>base</strong> graphics. The level plot that we create is shown here:</p>
<div id="attachment_1014" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-lattice-300x300.jpg" alt="Lattice Graphics Level Plot" title="Level plot Example" width="300" height="300" class="size-medium wp-image-1014" /></a><p class="wp-caption-text">Lattice Graphics Level Plot</p></div>
<p>This is in general similar to the <strong>base</strong> graphics display but the actual plot region is a different shape that makes things look slightly different.</p>
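<p>The vector passed to <strong>col.regions</strong> is simply a character vector of colour codes, so it can be inspected or swapped for another palette. A short sketch; <strong>heat.colors</strong> is shown only as one alternative from the same family of palette functions in <strong>R</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">cols = terrain.colors(100)
length(cols)                  # 100 colours
head(cols, 3)                 # colour codes as hexadecimal strings
alt.cols = heat.colors(100)   # alternative palette of the same length</pre></div></div>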
<p><strong>ggplot2</strong></p>
<p>The <strong>ggplot2</strong> package also provides facilities for creating a level plot making use of the tile geom to create the desired graph. The function <strong>ggplot</strong> forms the basis of the graph and various other options are used to customise the graph:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(elevation.fit, aes(x, y, fill = Height)) + geom_tile() +
  xlab(&quot;X Coordinate (feet)&quot;) + ylab(&quot;Y Coordinate (feet)&quot;) +
  ggtitle(&quot;Surface elevation data&quot;) +
  scale_fill_gradient(limits = c(7000, 10000),low = &quot;black&quot;,high = &quot;white&quot;) +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0))</pre></div></div>

<p>The options that are added to the graph change various settings. The choice of colours for the heights used on the graph is made by the <strong>scale_fill_gradient</strong> function with colours ranging from black to white. The <strong>scale_x_continuous</strong> and <strong>scale_y_continuous</strong> options are used to stretch the tiles to cover the whole grid region, covering up the default gray background &#8211; this makes the graph more visually appealing. The graph that is produced is shown here:</p>
<div id="attachment_1013" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/05/levelplot-ggplot2-300x300.jpg" alt="ggplot2 Level Plot" title="Level plot Example" width="300" height="300" class="size-medium wp-image-1013" /></a><p class="wp-caption-text">ggplot2 Level Plot</p></div>
<p>The graph from <strong>ggplot2</strong> is visually as impressive as the other graphs &#8211; the colour gradient that was selected gives more smoothing between the colours, which blurs some of the lines that are visible on the other graphs.</p>
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/displaying-data-using-level-plots/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Summarising data using box and whisker plots</title>
		<link>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/#comments</comments>
		<pubDate>Sun, 25 Apr 2010 07:37:10 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[Box and Whisker]]></category>
		<category><![CDATA[boxplot]]></category>
		<category><![CDATA[bwplot]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=960</guid>
		<description><![CDATA[A box and whisker plot is a type of graphical display that can be used to summarise a set of data based on the five number summary of this data. The summary statistics used to create a box and whisker plot are the median of the data, the lower and upper quartiles (25% and 75%) [...]]]></description>
			<content:encoded><![CDATA[<p>A box and whisker plot is a type of graphical display that can be used to summarise a set of data based on the five number summary of this data. The summary statistics used to create a box and whisker plot are the median of the data, the lower and upper quartiles (25% and 75%) and the minimum and maximum values.<span id="more-960"></span></p>
<p>The box and whisker plot is an effective way to investigate the distribution of a set of data. For example, skewness can be identified from the box and whisker as the display does not make any assumptions about the underlying distribution of the data. The extreme values at either end of the scale are sometimes included on the display to show how far they extend beyond the majority of the data.</p>
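<p>The five number summary can be computed directly with the <strong>fivenum</strong> function, and <strong>boxplot.stats</strong> reports the values that a box and whisker plot would draw. A sketch on a made-up vector with one extreme value; note that <strong>fivenum</strong> uses hinges, which can differ slightly from the quartiles returned by <strong>quantile</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">x = c(2, 4, 4, 5, 7, 9, 20)   # hypothetical data with one extreme value
fivenum(x)                    # minimum, lower hinge, median, upper hinge, maximum: 2 4 5 8 20
bs = boxplot.stats(x)
bs$stats                      # whisker ends, hinges and median: 2 4 5 8 9
bs$out                        # points beyond the whiskers: 20</pre></div></div>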
<p>To illustrate creating box and whisker plots we consider meteorological data collected on a monthly basis at Southampton, UK between 1950 and 1999. This data is publicly available from the <a href="http://www.metoffice.gov.uk/">UK Met Office</a> and we will compare the range of temperatures recorded in each month of the year over this period by creating box and whisker plots with the different packages.</p>
<p>The data is assumed to have been imported into <strong>R</strong> and stored in a data frame called <strong>soton.df</strong>. An extract of the data is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">    Year Month Max.Temp Min.Temp Frost  Rain
1   1950   Jan      7.7      2.8     7  20.1
2   1950   Feb     10.3        4     4 127.0
3   1950   Mar     13.0      4.5     2  39.4
4   1950   Apr     13.6      4.7     0  62.0
5   1950   May     17.9      7.8     0  32.2</pre></div></div>

<p><strong>Base Graphics</strong></p>
<p><!--[Fast Tube]--><span id="Pe-48TAtBho" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/#Pe-48TAtBho"><img src="http://i.ytimg.com/vi/Pe-48TAtBho/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>base</strong> graphics approach makes use of the <strong>boxplot</strong> function to create box and whisker plots. In this situation the function can be used with a formula rather than specifying two separate vectors of data &#8211; we can specify a data frame to point towards a source of data to be used in the graph. For the temperature data we use this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">boxplot(Max.Temp ~ Month, data = soton.df,
  xlab = &quot;Month&quot;, ylab = &quot;Maximum Temperature&quot;,
  main = &quot;Temperature at Southampton Weather Station (1950-1999)&quot;
)</pre></div></div>

<p>The horizontal and vertical axes labels are specified using the <strong>xlab</strong> and <strong>ylab</strong> arguments respectively and the title of the plot is created using the <strong>main</strong> argument. The box and whisker plot is shown here:</p>
<div id="attachment_962" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-base.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-base-300x300.jpg" alt="Base Graphics Box and Whisker Plot" title="Box and Whisker plot Example" width="300" height="300" class="size-medium wp-image-962" /></a><p class="wp-caption-text">Base Graphics Box and Whisker Plot</p></div>
<p>The function <strong>boxplot</strong> makes it easy to create a reasonably attractive box and whisker plot. The variation in the distribution of temperatures across the year can be seen from the graph.</p>
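<p>The numbers behind each box can be inspected with the <strong>boxplot.stats</strong> helper that <strong>boxplot</strong> relies on. The sketch below uses a small made-up vector rather than the Southampton data, purely to show which values are drawn as the box, the whiskers and the separate extreme points.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Toy data (not the Southampton temperatures): nine ordinary values and one extreme value
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
bs = boxplot.stats(x)
# Lower whisker, lower hinge, median, upper hinge, upper whisker
bs$stats
# Values lying beyond the whiskers, which are drawn as individual points
bs$out</pre></div></div>

<p>Here the value 50 lies more than 1.5 times the height of the box above the upper hinge, so it is reported separately in <strong>out</strong> and the upper whisker stops at 9.</p>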
<p><strong>Lattice Graphics</strong></p>
<p><!--[Fast Tube]--><span id="RJcZ_7EOzv8" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/#RJcZ_7EOzv8"><img src="http://i.ytimg.com/vi/RJcZ_7EOzv8/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>lattice</strong> graphics package there is a function <strong>bwplot</strong> which is used to create box and whisker plots. The function call also uses a formula to specify the <strong>x</strong> and <strong>y</strong> variables to use on the graph. The function call arguments are identical to the <strong>boxplot</strong> function in <strong>base</strong> graphics:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">bwplot(Max.Temp ~ Month, data = soton.df,
  xlab = &quot;Month&quot;, ylab = &quot;Maximum Temperature&quot;,
  main = &quot;Temperature at Southampton Weather Station (1950-1999)&quot;
)</pre></div></div>

<p>The variable <strong>Month</strong> is categorical so a separate box and whisker summary is created for each month. The <strong>lattice</strong> version of the graph is shown here:</p>
<div id="attachment_963" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-lattice-300x300.jpg" alt="Lattice Graphics Box and Whisker Plot" title="Box and Whisker plot Example" width="300" height="300" class="size-medium wp-image-963" /></a><p class="wp-caption-text">Lattice Graphics Box and Whisker Plot</p></div>
<p>This is very similar to the box and whisker plot created by <strong>base</strong> graphics with a similar level of effort required. The main difference is the use of a circle rather than a line to identify the location of the median of the data.</p>
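<p>If a line is preferred for the median, the plotting symbol can be changed. A minimal sketch, using a small made-up data frame (<strong>toy.df</strong> is hypothetical and stands in for <strong>soton.df</strong>):</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
# Toy data standing in for soton.df: two months with five temperatures each
toy.df = data.frame(
  Month = factor(rep(c(&quot;Jan&quot;, &quot;Feb&quot;), each = 5), levels = c(&quot;Jan&quot;, &quot;Feb&quot;)),
  Max.Temp = c(7.7, 8.1, 6.9, 9.0, 7.2, 10.3, 9.5, 8.8, 11.0, 9.9)
)
# pch = &quot;|&quot; asks the box and whisker panel to mark the median with a line
p = bwplot(Max.Temp ~ Month, data = toy.df, pch = &quot;|&quot;,
  xlab = &quot;Month&quot;, ylab = &quot;Maximum Temperature&quot;)
print(p)</pre></div></div>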
<p><strong>ggplot2</strong></p>
<p><!--[Fast Tube]--><span id="WJQdYId2TUA" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/#WJQdYId2TUA"><img src="http://i.ytimg.com/vi/WJQdYId2TUA/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>ggplot2</strong> package there is a general function <strong>ggplot</strong> that is used to create graphs of any type. We make use of the boxplot geom to create a box and whisker plot following the standard approach. The first step is to specify a data frame to use to create the graph and then map the columns of this data frame, via the <strong>aes</strong> argument, to the different axes or other aesthetics (such as colour or symbol shape). The particular geom is used to specify the type of plot that we want to create. Our final step is to add on the various axes labels and an overall title to the graph.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(soton.df, aes(Month, Max.Temp)) + geom_boxplot() +
  ylab(&quot;Maximum Temperature&quot;) +
  ggtitle(&quot;Temperature at Southampton Weather Station (1950-1999)&quot;)</pre></div></div>

<p>The <strong>ggplot2</strong> version of box and whisker plots is shown here:</p>
<div id="attachment_964" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/boxwhisker-ggplot2-300x300.jpg" alt="ggplot2 Graphics Box and Whisker Plot" title="Box and Whisker plot Example" width="300" height="300" class="size-medium wp-image-964" /></a><p class="wp-caption-text">ggplot2 Graphics Box and Whisker Plot</p></div>
<p>The distinctive gray background used by <strong>ggplot2</strong> is an obvious visual difference compared to the default clear background used in the other two approaches. The boxes themselves have a cleaner look in this graph than the other two methods and the overall look is slick.</p>
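<p>The gray background is only a default. As a sketch, adding one of the built-in themes such as <strong>theme_bw</strong> gives a white background closer to the other two systems. The data frame <strong>toy.df</strong> below is made up for illustration and is not the Southampton data.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)
# Toy data standing in for soton.df: two months with five temperatures each
toy.df = data.frame(
  Month = factor(rep(c(&quot;Jan&quot;, &quot;Feb&quot;), each = 5), levels = c(&quot;Jan&quot;, &quot;Feb&quot;)),
  Max.Temp = c(7.7, 8.1, 6.9, 9.0, 7.2, 10.3, 9.5, 8.8, 11.0, 9.9)
)
# theme_bw() replaces the default gray panel with a white one
p = ggplot(toy.df, aes(Month, Max.Temp)) + geom_boxplot() +
  ylab(&quot;Maximum Temperature&quot;) + theme_bw()
print(p)</pre></div></div>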
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-box-and-whisker-plots/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>R and Tolerance Intervals</title>
		<link>http://www.wekaleamstudios.co.uk/posts/r-and-tolerance-intervals/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/r-and-tolerance-intervals/#comments</comments>
		<pubDate>Mon, 19 Apr 2010 20:19:31 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Probability Distributions]]></category>
		<category><![CDATA[normtol.int]]></category>
		<category><![CDATA[tolerance]]></category>
		<category><![CDATA[Tolerance Intervals]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=905</guid>
		<description><![CDATA[Confidence intervals and prediction intervals are used by statisticians on a regular basis. Another useful interval is the tolerance interval that describes the range of values for a distribution with confidence limits calculated to a particular percentile of the distribution. The R package tolerance can be used to create a variety of tolerance intervals of [...]]]></description>
			<content:encoded><![CDATA[<p>Confidence intervals and prediction intervals are used by statisticians on a regular basis. Another useful interval is the tolerance interval, which gives a range of values expected to contain at least a specified proportion of the population with a stated level of confidence. The <strong>R</strong> package <strong>tolerance</strong> can be used to create a variety of tolerance intervals of interest.<span id="more-905"></span></p>
<p>These tolerance limits, taken from the estimated interval, are limits within which a stated proportion of the population is expected to occur. The function <strong>normtol.int</strong> from the <strong>tolerance</strong> package can be used to calculate a tolerance interval for data from a normal distribution.</p>
<p>The function arguments include the data itself in a vector denoted <strong>x</strong>. The confidence level associated with the tolerance interval is specified by <strong>alpha</strong>, where <strong>alpha</strong> is the difference between 100% and the confidence level &#8211; <strong>alpha</strong> is 0.05 for 95% confidence. The argument <strong>P</strong> is the proportion of the data to be included in the tolerance interval. The <strong>side</strong> argument determines whether a one-sided or two-sided interval is required.</p>
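<p>The meaning of the two percentages can be checked by simulation. The sketch below does not use the <strong>tolerance</strong> package; it assumes Howe's closed-form approximation for the two-sided normal tolerance factor and a made-up population with mean 100 and standard deviation 5. Roughly 95% of the simulated intervals should contain at least 90% of the population.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">set.seed(1)
n = 25
P = 0.90
alpha = 0.05
# Approximate two-sided tolerance factor (Howe's closed form, an assumption of this sketch)
k = qnorm((1 + P) / 2) * sqrt((n - 1) * (1 + 1 / n) / qchisq(alpha, n - 1))
# For each simulated sample, find the share of the N(100, 5^2) population inside the interval
content = replicate(2000, {
  x = rnorm(n, mean = 100, sd = 5)
  lower = mean(x) - k * sd(x)
  upper = mean(x) + k * sd(x)
  pnorm(upper, 100, 5) - pnorm(lower, 100, 5)
})
# Fraction of the intervals that capture at least 90% of the population
coverage = mean(content &gt;= P)
coverage</pre></div></div>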
<p>Consider a simulated set of data from a manufacturing process loaded into R, stored as vector object <strong>obs</strong>, as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">obs = c(102.17, 102.45, 106.23, 98.16, 100.82, 101.40, 90.51, 102.51, 97.93,
  96.98, 101.74, 104.34, 103.50, 94.72, 102.80, 103.92, 97.43, 102.76, 100.03,
  107.12, 104.96, 105.32, 87.06, 97.89, 100.23)</pre></div></div>

<p>A 95% tolerance interval for 90% of data of this type, based on the 25 observations above, is created with this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; normtol.int(x = obs, alpha = 0.05, P = 0.90, side = 2)
  alpha   P    x.bar 2-sided.lower 2-sided.upper
1  0.05 0.9 100.5192      90.07606      110.9623</pre></div></div>

<p>The <strong>alpha</strong> and <strong>P</strong> are as noted above and the average of the data is reported along with the lower and upper tolerance limits, in this case for both sides as we asked for a two-sided interval. This can be easily changed to cover 95% rather than 90% of the data:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; normtol.int(x = obs, alpha = 0.05, P = 0.95, side = 2)
  alpha    P    x.bar 2-sided.lower 2-sided.upper
1  0.05 0.95 100.5192      88.07543      112.9630</pre></div></div>
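<p>The first interval above (90% of the data with 95% confidence) can be reproduced approximately by hand as the sample mean plus or minus a factor times the sample standard deviation. The sketch below assumes the simple form of Howe's approximation for that factor; the package's default method may include further refinements, so the result should be close to, but not necessarily identical with, the limits printed above.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">obs = c(102.17, 102.45, 106.23, 98.16, 100.82, 101.40, 90.51, 102.51, 97.93,
  96.98, 101.74, 104.34, 103.50, 94.72, 102.80, 103.92, 97.43, 102.76, 100.03,
  107.12, 104.96, 105.32, 87.06, 97.89, 100.23)
n = length(obs)
P = 0.90
alpha = 0.05
# Howe's approximate two-sided tolerance factor (an assumption of this sketch)
k = qnorm((1 + P) / 2) * sqrt((n - 1) * (1 + 1 / n) / qchisq(alpha, n - 1))
# Lower and upper limits: mean minus and plus k standard deviations
limits = mean(obs) + c(-1, 1) * k * sd(obs)
limits</pre></div></div>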

<p>The package <strong>tolerance</strong> can create intervals for other data distributions.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/r-and-tolerance-intervals/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Summarising data using scatter plots</title>
		<link>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/#comments</comments>
		<pubDate>Sun, 18 Apr 2010 18:56:06 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[plot]]></category>
		<category><![CDATA[scatter plot]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[xyplot]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=912</guid>
		<description><![CDATA[A scatter plot is a graph used to investigate the relationship between two variables in a data set. The x and y axes are used for the values of the two variables and a symbol on the graph represents the combination for each pair of values in the data set. This type of graph is [...]]]></description>
			<content:encoded><![CDATA[<p>A scatter plot is a graph used to investigate the relationship between two variables in a data set. The x and y axes are used for the values of the two variables and a symbol on the graph represents the combination for each pair of values in the data set. This type of graph is used in many common situations and can convey a lot of useful information.<span id="more-912"></span></p>
<p>To illustrate creating a scatter plot we will use a simple data set for the population of the UK between 1992 and 2009. This data is saved in a data frame <strong>uk.df</strong> using the following command:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">uk.df = data.frame(Year = 1992:2009,
  Population = c(57770, 57933, 58096, 58258, 58418, 58577,
  58743, 58925, 59131, 59363, 59618, 59894, 60186, 60489,
  60804, 61129, 61461, 61796)
)</pre></div></div>

<p>For this example the data is recorded in thousands, which makes the graph easier to read; greater precision would bring no noticeable benefit.</p>
<p><strong>Base Graphics</strong></p>
<p><!--[Fast Tube]--><span id="aqXuiQR4bnY" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/#aqXuiQR4bnY"><img src="http://i.ytimg.com/vi/aqXuiQR4bnY/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>base</strong> graphics system the general purpose <strong>plot</strong> function can be used to create a scatter plot for the UK population data set that we created. The first two arguments to the <strong>plot</strong> function are the x and y variables respectively. The following code will create a scatter plot, including various labels:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plot(uk.df$Year, uk.df$Population,
  xlab = &quot;Year&quot;, ylab = &quot;Total Population (Thousands)&quot;,
  main = &quot;UK Population (1992-2009)&quot;, pch = 16)</pre></div></div>

<p>The labels for the x and y axes are specified via the <strong>xlab</strong> and <strong>ylab</strong> arguments to the plot function and the <strong>main</strong> argument specifies the title for the plot.</p>
<div id="attachment_919" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-base.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-base-300x300.jpg" alt="Base Graphics Scatter Plot" title="Scatter plot Example" width="300" height="300" class="size-medium wp-image-919" /></a><p class="wp-caption-text">Base Graphics Scatter Plot</p></div>
<p>The graph itself is plain and functional, with solid circles indicating the population (in thousands) for each of the years covered by the data.</p>
<p><strong>Lattice Graphics</strong></p>
<p><!--[Fast Tube]--><span id="NMTCIViCLOU" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/#NMTCIViCLOU"><img src="http://i.ytimg.com/vi/NMTCIViCLOU/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>lattice</strong> graphics package provides a function <strong>xyplot</strong> specifically to create scatter plots and the function is used in a similar way to the <strong>base</strong> graphics approach. The first argument to the function is a formula describing the relationship to be plotted on the graph, with the y variable preceding the x variable, as we are used to when writing a mathematical formula such as y=a+bx. The data frame is specified with the <strong>data</strong> argument to simplify the expression in the formula. The code used is as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xyplot(Population ~ Year, data = uk.df,
  xlab = &quot;Year&quot;, ylab = &quot;Total Population (Thousands)&quot;,
  main = &quot;UK Population (1992-2009)&quot;,
  scales = list(x = list(at = seq(1992, 2009, 2)))
)</pre></div></div>

<p>The axis labels and the overall title for the graph are specified in the same way as the <strong>base</strong> graphics system. We indulge in some fine tuning of the labels on the x axis via the <strong>scales</strong> argument &#8211; here we indicate that every second year should be labelled, starting in 1992, so the last label falls at 2008. The <strong>lattice</strong> graph is shown here for comparison with the graphs created using the other two packages:</p>
<div id="attachment_921" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-lattice-300x300.jpg" alt="Lattice Graphics Scatter Plot" title="Scatter plot Example" width="300" height="300" class="size-medium wp-image-921" /></a><p class="wp-caption-text">Lattice Graphics Scatter Plot</p></div>
<p>There are very few visual differences between the <strong>lattice</strong> and <strong>base</strong> graphics. In <strong>lattice</strong> graphics an object is created that can be edited to add or remove components and then printed to the screen. This approach is more flexible than the base graphics where the components are painted on top of each other and the use of themes in <strong>lattice</strong> will make it easier to keep a consistent look to all graphs in a document.</p>
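<p>The object-based approach can be sketched as follows: the plot is stored in a variable, modified with <strong>update</strong> and only drawn when it is printed. The variable names <strong>ticks</strong>, <strong>p</strong> and <strong>p2</strong> are chosen here for illustration.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
uk.df = data.frame(Year = 1992:2009,
  Population = c(57770, 57933, 58096, 58258, 58418, 58577,
  58743, 58925, 59131, 59363, 59618, 59894, 60186, 60489,
  60804, 61129, 61461, 61796)
)
# Every second year from 1992; the sequence stops at 2008 because 2010 would exceed 2009
ticks = seq(1992, 2009, 2)
# Nothing is drawn yet: the plot is held in an object of class trellis
p = xyplot(Population ~ Year, data = uk.df,
  scales = list(x = list(at = ticks)))
class(p)
# update() returns a modified copy, which is drawn when printed
p2 = update(p, main = &quot;UK Population (1992-2009)&quot;,
  ylab = &quot;Total Population (Thousands)&quot;)
print(p2)</pre></div></div>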
<p><strong>ggplot2</strong></p>
<p><!--[Fast Tube]--><span id="TagaAeIHKks" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/#TagaAeIHKks"><img src="http://i.ytimg.com/vi/TagaAeIHKks/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>ggplot2</strong> package the <strong>ggplot</strong> function is used to create graphs of all types rather than having a separate function defined for each type of graph. The first argument is a data frame with the data to be plotted and the <strong>aes</strong> argument specifies the aesthetics associated with the graph such as the point symbol, size or colour. In this case the <strong>Year</strong> variable appears on the x axis and the <strong>Population</strong> variable on the y axis. The code to create the scatter plot is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(uk.df, aes(Year, Population)) + geom_point() +
  xlab(&quot;Year&quot;) + ylab(&quot;Total Population (Thousands)&quot;) +
  ggtitle(&quot;UK Population (1992-2009)&quot;)</pre></div></div>

<p>The <strong>geom_point</strong> layer specifies the type of graph to create, a scatter plot in this situation. This highlights the flexibility of the <strong>ggplot2</strong> package, as changing the geom will create a new type of graph. The labels for the graph are created by adding them to the graph with the <strong>xlab</strong>, <strong>ylab</strong> and <strong>ggtitle</strong> functions. The graph is shown below:</p>
<div id="attachment_920" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/scatterplot-ggplot2-300x300.jpg" alt="ggplot2 Scatter plot" title="Scatter plot Example" width="300" height="300" class="size-medium wp-image-920" /></a><p class="wp-caption-text">ggplot2 Scatter plot</p></div>
<p>This graph is not greatly different to the scatter plot created using the <strong>base</strong> and <strong>lattice</strong> packages. The default theme in the <strong>ggplot2</strong> package has a gray background with white grid lines that allows easy visual recognition of graphs created using this package.</p>
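<p>To illustrate how changing the geom changes the type of graph, the sketch below keeps the same data and aesthetics and swaps <strong>geom_point</strong> for <strong>geom_line</strong>. The object names <strong>base.plot</strong>, <strong>p.points</strong> and <strong>p.line</strong> are chosen here for illustration.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)
uk.df = data.frame(Year = 1992:2009,
  Population = c(57770, 57933, 58096, 58258, 58418, 58577,
  58743, 58925, 59131, 59363, 59618, 59894, 60186, 60489,
  60804, 61129, 61461, 61796)
)
# Data and aesthetic mappings are shared; only the geom differs
base.plot = ggplot(uk.df, aes(Year, Population)) +
  xlab(&quot;Year&quot;) + ylab(&quot;Total Population (Thousands)&quot;)
p.points = base.plot + geom_point()  # scatter plot
p.line = base.plot + geom_line()     # line graph of the same data
print(p.line)</pre></div></div>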
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-scatter-plots/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Summarising data using histograms</title>
		<link>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/#comments</comments>
		<pubDate>Sun, 11 Apr 2010 08:53:16 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[hist]]></category>
		<category><![CDATA[histogram]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=870</guid>
		<description><![CDATA[The histogram is a standard type of graphic used to summarise univariate data where the range of values in the data set is divided into regions and a bar (usually vertical) is plotted in each of these regions with height proportional to the frequency of observations in that region. In some cases the proportion of [...]]]></description>
			<content:encoded><![CDATA[<p>The histogram is a standard type of graphic used to summarise univariate data where the range of values in the data set is divided into regions and a bar (usually vertical) is plotted in each of these regions with height proportional to the frequency of observations in that region. In some cases the proportion of data points in each region is shown instead of counts.<span id="more-870"></span></p>
<p>The shape of the histogram is determined by the width and number of regions that divide up the data. A histogram provides an indication of the following features of a set of data: the general shape, symmetry or skewness of the data and modality (uni-, bi- or multi-modal). There are some situations where a different type of graph would be preferable but histograms are useful for describing the general features of the distribution of a set of data.</p>
<p>To illustrate creating a histogram we consider data from the AFL sports league in Australia and the total number of points scored by the home team in each fixture. If we assume that the data is in a comma separated text file, called <strong>afl_2003_2007.csv</strong>, then we would import that data using the following command saving the results in a data frame:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">afl.df = read.csv(&quot;afl_2003_2007.csv&quot;)</pre></div></div>

<p>Edit: The data is available as <a href='http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/12/afl_2003_2007.txt'>AFL Data Set</a>. Change the file extension manually to <strong>csv</strong> or change the command to reflect the different file name.</p>
<p><strong>Base Graphics</strong></p>
<p><!--[Fast Tube]--><span id="4Q9vPuj4w8c" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/#4Q9vPuj4w8c"><img src="http://i.ytimg.com/vi/4Q9vPuj4w8c/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In <strong>base</strong> graphics the function <strong>hist</strong> is used to create a histogram with the first argument being the name of the vector that contains the data to be plotted. The <strong>x-axis</strong> is given a label using the <strong>xlab</strong> argument and the <strong>main</strong> argument is used to add a title to the graph. Code to create a histogram of home points is shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">hist(afl.df$Home.Total, xlab = &quot;Home Points&quot;,
  main = &quot;Histogram of Points Scored at Home\nAFL 2003-2007&quot;)</pre></div></div>

<p>The default option is to display bars representing the frequency of data values in each of the ranges and the overall look of the graph is basic as shown here:</p>
<div id="attachment_877" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-base.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-base-300x300.jpg" alt="Base Graphics Histogram" title="Histogram Example" width="300" height="300" class="size-medium wp-image-877" /></a><p class="wp-caption-text">Base Graphics Histogram</p></div>
<p>The default algorithm for selecting the number of bins to use for the histogram usually makes a sensible selection but this can be specified if required.</p>
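<p>As a sketch of specifying the bins directly, the <strong>breaks</strong> argument accepts a vector of bin boundaries. The scores below are simulated and only stand in for the AFL data; the bin width of 10 is an arbitrary choice.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">set.seed(42)
# Simulated scores standing in for afl.df$Home.Total (not the real AFL data)
home.points = round(rnorm(500, mean = 100, sd = 25))
# Bin boundaries every 10 points, widened to cover the whole range of the data
breaks = seq(floor(min(home.points) / 10) * 10,
  ceiling(max(home.points) / 10) * 10, by = 10)
h = hist(home.points, breaks = breaks, xlab = &quot;Home Points&quot;,
  main = &quot;Histogram with bins of width 10&quot;)
# The counts in each bin are returned as well as plotted
h$counts</pre></div></div>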
<p><strong>Lattice Graphics</strong></p>
<p><!--[Fast Tube]--><span id="hxQmEhzgWks" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/#hxQmEhzgWks"><img src="http://i.ytimg.com/vi/hxQmEhzgWks/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>lattice</strong> graphics package there is a function <strong>histogram</strong> and we make use of the formula to specify a single variable for the number of points scored by the home team. The specification for the axis labels and graph title are the same as for the <strong>base</strong> graphics package. The equivalent graph is created using the following code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">histogram( ~ Home.Total, data = afl.df, xlab = &quot;Home Points&quot;,
  main = &quot;Histogram of Points Scored at Home\nAFL 2003-2007&quot;)</pre></div></div>

<p>Here the default option is to work with proportions of the total number of data points rather than counts, and the shape of the distribution is slightly different when compared to the <strong>base</strong> graphics plot. The <strong>lattice</strong> version is shown below:</p>
<div id="attachment_880" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-lattice-300x300.jpg" alt="Lattice Graphics Histogram" title="Histogram Example" width="300" height="300" class="size-medium wp-image-880" /></a><p class="wp-caption-text">Lattice Graphics Histogram</p></div>
<p>The other main difference is the choice of colour for the bars in the histogram and these can be adjusted by changing the global theme for <strong>lattice</strong>.</p>
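<p>If counts are wanted on the vertical axis, as in the <strong>base</strong> graphics version, the <strong>type</strong> argument can be set. A minimal sketch using simulated scores (<strong>afl.sim</strong> is hypothetical and not the real AFL data):</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
set.seed(42)
# Simulated scores standing in for the AFL data
afl.sim = data.frame(Home.Total = round(rnorm(500, mean = 100, sd = 25)))
# type = &quot;count&quot; plots frequencies instead of the default relative scale
p = histogram( ~ Home.Total, data = afl.sim, type = &quot;count&quot;,
  xlab = &quot;Home Points&quot;)
print(p)</pre></div></div>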
<p><strong>ggplot2</strong></p>
<p><!--[Fast Tube]--><span id="47kWynt3b6M" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/#47kWynt3b6M"><img src="http://i.ytimg.com/vi/47kWynt3b6M/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>ggplot2</strong> library uses a general purpose graphics function called <strong>ggplot</strong> to create graphs of all types and the geom specifies the type of display to create, in this case a histogram. Components that make up the graph are added sequentially to build up the whole plot and in the example below we add axis labels and a main title.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(afl.df, aes(Home.Total)) + geom_histogram() +
  xlab(&quot;Home Points&quot;) + ylab(&quot;Frequency&quot;) +
  ggtitle(&quot;Histogram of Points Scored at Home\nAFL 2003-2007&quot;)</pre></div></div>

<p>The default theme for <strong>ggplot2</strong> is distinctive and the histogram is shown in the graph below:</p>
<div id="attachment_881" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/04/histogram-ggplot2-300x300.jpg" alt="ggplot2 Histogram" title="Histogram Example" width="300" height="300" class="size-medium wp-image-881" /></a><p class="wp-caption-text">ggplot2 Histogram</p></div>
<p>The default number of bins is larger than in <strong>base</strong> and <strong>lattice</strong> graphics, which gives a rougher outline of the distribution in this particular case. The online <a href="http://had.co.nz/ggplot2/">ggplot2</a> manual is a good source of information about customising graphs created using this approach.</p>
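<p>The bin width can be set explicitly to smooth out the display. The sketch below uses simulated scores (<strong>afl.sim</strong> is hypothetical, not the real AFL data) and an arbitrary width of 10 points.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)
set.seed(42)
# Simulated scores standing in for the AFL data
afl.sim = data.frame(Home.Total = round(rnorm(500, mean = 100, sd = 25)))
# binwidth is given in the units of the x variable, here points scored
p = ggplot(afl.sim, aes(Home.Total)) + geom_histogram(binwidth = 10) +
  xlab(&quot;Home Points&quot;) + ylab(&quot;Frequency&quot;)
print(p)</pre></div></div>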
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/?page_id=282">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-histograms/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Summarising data using dot plots</title>
		<link>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/#comments</comments>
		<pubDate>Fri, 26 Mar 2010 10:53:00 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[Cleveland]]></category>
		<category><![CDATA[dot plot]]></category>
		<category><![CDATA[dotplot]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[plot]]></category>
		<category><![CDATA[points]]></category>
		<category><![CDATA[trellis]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=847</guid>
		<description><![CDATA[A dot plot is a type of display that compares counts, frequencies, totals or other summary measures for a series of categories. The dot plot can be arranged with the categories either on the vertical or horizontal axis of the display to allow comparison between the different categories as well as comparison within categories where [...]]]></description>
			<content:encoded><![CDATA[<p>A dot plot is a type of display that compares counts, frequencies, totals or other summary measures for a series of categories. The dot plot can be arranged with the categories either on the vertical or horizontal axis of the display to allow comparison between the different categories as well as comparison within categories where there are multiple symbols used to denote, say, different years.<span id="more-847"></span></p>
<p>In this post we will consider creating a dot plot using the <strong>base</strong> graphics, <strong>lattice</strong> graphics and <strong>ggplot2</strong> approaches. To illustrate creating a dot plot we use data from the <a href="http://faostat.fao.org">FAO website</a> on the total irrigation area for Africa, Latin America, North America and Europe. We create a data frame using the following code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">irrigation.df = data.frame(
  # factor() keeps Region categorical in R 4.0 and later, where data.frame()
  # no longer converts character columns to factors by default
  Region = factor(rep(c(&quot;Africa&quot;, &quot;Latin America&quot;, &quot;North America&quot;, &quot;Europe&quot;), 4)),
  Year = factor(c(rep(1980, 4), rep(1990, 4), rep(2000, 4), rep(2007, 4))),
  Area = c(9.3, 12.7, 21.2, 18.8, 11.0, 15.5, 21.6, 25.3,
    13.2, 17.3, 23.3, 26.7, 13.6, 17.3, 23.8, 26.3)
)</pre></div></div>
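
<p>As a quick check on the hand-typed values, the data can be cross-tabulated by region and year. This is a minimal sketch rather than part of the original analysis; the object name <strong>irrigation.tab</strong> is an assumption made for illustration, and the data frame is rebuilt so that the code runs on its own:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Data from the post, repeated so the sketch is self-contained.
irrigation.df = data.frame(
  Region = factor(rep(c('Africa', 'Latin America', 'North America', 'Europe'), 4)),
  Year = factor(c(rep(1980, 4), rep(1990, 4), rep(2000, 4), rep(2007, 4))),
  Area = c(9.3, 12.7, 21.2, 18.8, 11.0, 15.5, 21.6, 25.3,
    13.2, 17.3, 23.3, 26.7, 13.6, 17.3, 23.8, 26.3)
)

# Regions become rows and years become columns; each cell holds one Area value.
irrigation.tab = xtabs(Area ~ Region + Year, data = irrigation.df)
print(irrigation.tab)</pre></div></div>

<p>Each region and year combination occurs exactly once, so the table simply lays out the sixteen values in a form that is easy to compare against the source.</p>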

<p><strong>Base Graphics</strong></p>
<p><!--[Fast Tube]--><span id="5izUzQKL1yw" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/#5izUzQKL1yw"><img src="http://i.ytimg.com/vi/5izUzQKL1yw/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>base</strong> graphics system we build up the <strong>dotplot</strong> with a series of commands. The first function call creates the graph region based on the data set but we do not plot any data by setting the <strong>type = &quot;n&quot;</strong> argument. The axis labels for the horizontal and vertical scales are set along with the title in the initial function call:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">plot(irrigation.df$Area, irrigation.df$Region, xlab = &quot;Area&quot;,
  ylab = &quot;Region&quot;, main = &quot;Irrigation Area by Region&quot;, type = &quot;n&quot;)</pre></div></div>

<p>To add the points with separate colours for each of the four years we use the <strong>points</strong> function and subset to the particular year by testing a condition on the year. The <strong>col</strong> argument is used with a text string to specify the colour for the symbols for the given year:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">points(irrigation.df$Area[irrigation.df$Year == 1980],
  irrigation.df$Region[irrigation.df$Year == 1980], col = &quot;black&quot;, pch = 16)
points(irrigation.df$Area[irrigation.df$Year == 1990],
  irrigation.df$Region[irrigation.df$Year == 1990], col = &quot;blue&quot;, pch = 16)
points(irrigation.df$Area[irrigation.df$Year == 2000],
  irrigation.df$Region[irrigation.df$Year == 2000], col = &quot;red&quot;, pch = 16)
points(irrigation.df$Area[irrigation.df$Year == 2007],
  irrigation.df$Region[irrigation.df$Year == 2007], col = &quot;green&quot;, pch = 16)</pre></div></div>

<p>The code is rather long-winded compared to that needed by the other two graphics packages. We can add a legend to the graph so that the years can be identified:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">legend(10, 4, legend = c(&quot;1980&quot;, &quot;1990&quot;, &quot;2000&quot;, &quot;2007&quot;),
  col = c(&quot;black&quot;, &quot;blue&quot;, &quot;red&quot;, &quot;green&quot;), pch = 16)</pre></div></div>

<p>The placement of the legend uses the <strong>x</strong> and <strong>y</strong> coordinates within the graph to position the box. All the code above produces the following graph:</p>
<div id="attachment_856" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-base.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-base-300x300.jpg" alt="Base Graphics Dot Plot" title="Dot Plot Example" width="300" height="300" class="size-medium wp-image-856" /></a><p class="wp-caption-text">Base Graphics Dot Plot</p></div>
<p>The graph is basic but we can consider the changes over time for the four regions. One downside is that the regions have been labelled with numbers rather than text strings.</p>
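
<p>One possible way to shorten the base graphics code and to label the regions by name is sketched below. It is an illustration under stated assumptions rather than the approach used in the video: the vector <strong>year.cols</strong> is a name introduced here, and the colours are matched to the years by indexing with the integer codes of the <strong>Year</strong> factor:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Data from the post, repeated so the sketch is self-contained.
irrigation.df = data.frame(
  Region = factor(rep(c('Africa', 'Latin America', 'North America', 'Europe'), 4)),
  Year = factor(c(rep(1980, 4), rep(1990, 4), rep(2000, 4), rep(2007, 4))),
  Area = c(9.3, 12.7, 21.2, 18.8, 11.0, 15.5, 21.6, 25.3,
    13.2, 17.3, 23.3, 26.7, 13.6, 17.3, 23.8, 26.3)
)

# One colour per level of Year, in the same order as levels(irrigation.df$Year).
year.cols = c('black', 'blue', 'red', 'green')

# as.integer() on a factor returns its level codes, which are used both to
# position the regions on the vertical axis and to pick a colour per point.
# yaxt = 'n' suppresses the default numeric vertical axis.
plot(irrigation.df$Area, as.integer(irrigation.df$Region),
  col = year.cols[as.integer(irrigation.df$Year)], pch = 16, yaxt = 'n',
  xlab = 'Area', ylab = 'Region', main = 'Irrigation Area by Region')

# Draw the vertical axis with the region names in place of the level codes.
axis(2, at = seq_along(levels(irrigation.df$Region)),
  labels = levels(irrigation.df$Region))

# A keyword position avoids working out x and y coordinates for the legend.
legend('bottomright', legend = levels(irrigation.df$Year),
  col = year.cols, pch = 16)</pre></div></div>

<p>The regions appear in alphabetical order because that is the default ordering of factor levels. Long region names may need a wider left margin, which can be set with <strong>par</strong>.</p>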
<p><strong>Lattice Graphics</strong></p>
<p><!--[Fast Tube]--><span id="-FGU6PMaSRY" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/#-FGU6PMaSRY"><img src="http://i.ytimg.com/vi/-FGU6PMaSRY/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>lattice</strong> graphics package has a function <strong>dotplot</strong> that is used to create dot plots. The first argument to the function is a formula describing the variables to use for the vertical (left hand side) and horizontal (right hand side) axes. We also specify the data frame to use for the graph and, with the <strong>groups</strong> argument, the column that determines the symbols and/or colours used to highlight groupings within the plot:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">dotplot(Region ~ Area, data = irrigation.df, groups = Year,
  main = &quot;Irrigation Area by Region&quot;)</pre></div></div>

<p>The lattice variant of the graph is shown here:</p>
<div id="attachment_857" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-lattice.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-lattice-300x300.jpg" alt="Lattice Graphics Dot Plot" title="Dot Plot Example" width="300" height="300" class="size-medium wp-image-857" /></a><p class="wp-caption-text">Lattice Graphics Dot Plot</p></div>
<p>The graph is simple and very similar to the one produced using the base graphics with the advantage that the R code is not as complicated.</p>
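
<p>The call above does not draw a legend for the years. A minimal sketch of one way to add it is to pass the <strong>auto.key</strong> argument, which asks lattice to build a key from the <strong>groups</strong> variable. The object name <strong>irrigation.plot</strong> is an assumption made for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)

# Data from the post, repeated so the sketch is self-contained.
irrigation.df = data.frame(
  Region = factor(rep(c('Africa', 'Latin America', 'North America', 'Europe'), 4)),
  Year = factor(c(rep(1980, 4), rep(1990, 4), rep(2000, 4), rep(2007, 4))),
  Area = c(9.3, 12.7, 21.2, 18.8, 11.0, 15.5, 21.6, 25.3,
    13.2, 17.3, 23.3, 26.7, 13.6, 17.3, 23.8, 26.3)
)

# auto.key builds a legend from the levels of the groups variable;
# space = 'right' places it to the right of the plotting panel.
irrigation.plot = dotplot(Region ~ Area, data = irrigation.df, groups = Year,
  auto.key = list(space = 'right'), main = 'Irrigation Area by Region')

# Lattice functions return an object; printing it draws the graph.
print(irrigation.plot)</pre></div></div>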
<p><strong>ggplot2</strong></p>
<p><!--[Fast Tube]--><span id="y1CsT-jAWZQ" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/#y1CsT-jAWZQ"><img src="http://i.ytimg.com/vi/y1CsT-jAWZQ/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>ggplot</strong> function is used to create the dot plot: we first specify the name of the data frame with the information to be displayed and then use the <strong>aes</strong> function to list the variables to plot on the horizontal and vertical axes. The <strong>colour</strong> argument names the (usually categorical) variable whose values determine the colours of the points.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># ggtitle() replaces opts(title = ...), which was removed from later ggplot2 releases
ggplot(irrigation.df, aes(x = Area, y = Region, colour = Year)) +
  geom_point() + ggtitle(&quot;Irrigation Area by Region&quot;)</pre></div></div>

<p>The <strong>ggplot2</strong> version of the dot plot is shown below:</p>
<div id="attachment_858" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/03/dotplot-ggplot2-300x300.jpg" alt="ggplot2 Dot Plot" title="Dot Plot Example" width="300" height="300" class="size-medium wp-image-858" /></a><p class="wp-caption-text">ggplot2 Dot Plot</p></div>
<p>This graph is very similar to the ones produced using the other graphics packages but has the distinctive background and legend style that is used as the default option in <strong>ggplot2</strong>.</p>
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/?page_id=282">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-dot-plots/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Summarising data using bar charts</title>
		<link>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-bar-charts/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-bar-charts/#comments</comments>
		<pubDate>Sat, 12 Dec 2009 08:52:33 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Trellis Graphics]]></category>
		<category><![CDATA[bar chart]]></category>
		<category><![CDATA[barchart]]></category>
		<category><![CDATA[barplot]]></category>
		<category><![CDATA[FAO]]></category>
		<category><![CDATA[geom_bar]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[lattice]]></category>
		<category><![CDATA[trellis]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=664</guid>
		<description><![CDATA[A bar graph is a frequently used type of display that compares counts, frequencies, totals or other summary measures for a series of categories, e.g. sales in different market sectors or in quarters in a financial year. The bar graph can be laid out with the categories either on the vertical or horizontal axis of [...]]]></description>
			<content:encoded><![CDATA[<p>A bar graph is a frequently used type of display that compares counts, frequencies, totals or other summary measures for a series of categories, e.g. sales in different market sectors or in quarters in a financial year. The bar graph can be laid out with the categories on either the vertical or horizontal axis of the display, depending on whether a vertical or horizontal comparison makes the graph easier to interpret.<span id="more-664"></span></p>
<p>In <strong>R</strong> there are multiple ways of creating graphs, including the base graphics, lattice graphics and the ggplot2 grammar of graphics approach. To illustrate how we can create a bar chart using these packages we will make use of some data taken from the <a href="http://faostat.fao.org">FAO</a> statistics website for the UK in 2007. The data gives the production (in metric tonnes) of the five food and agricultural commodities with the highest production.</p>
<p>The first step before creating the graphs is to prepare the data in a format that can be used by the graphing functions. As this dataset is small we can manually create the data object. To make the labels on the graph less cluttered the production is recorded as 1,000s of metric tonnes.</p>
<p>The <strong>R</strong> code to create the data object is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">uk2007 = data.frame(Commodity =
  factor(c(&quot;Cow milk&quot;, &quot;Wheat&quot;, &quot;Sugar beet&quot;, &quot;Potatoes&quot;, &quot;Barley&quot;),
    levels = c(&quot;Cow milk&quot;, &quot;Wheat&quot;, &quot;Sugar beet&quot;, &quot;Potatoes&quot;, &quot;Barley&quot;)),
  Production = c(14023, 13221, 6500, 5635, 5079))</pre></div></div>

<p>The <strong>levels</strong> argument is explicitly defined to make sure that the categories are ordered as required, from largest to smallest production, rather than alphabetically, which is how they would be ordered otherwise.</p>
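<p>The effect of the <strong>levels</strong> argument can be seen in a short sketch. The object names <strong>commodities</strong>, <strong>default.order</strong> and <strong>explicit.order</strong> are assumptions introduced for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">commodities = c('Cow milk', 'Wheat', 'Sugar beet', 'Potatoes', 'Barley')

# Without a levels argument the levels are sorted alphabetically:
# Barley, Cow milk, Potatoes, Sugar beet, Wheat
default.order = levels(factor(commodities))
print(default.order)

# Supplying levels keeps the order in which the commodities were listed:
# Cow milk, Wheat, Sugar beet, Potatoes, Barley
explicit.order = levels(factor(commodities, levels = commodities))
print(explicit.order)</pre></div></div>

<p>The graphing functions place categories in level order, which is why the ordering is fixed when the data frame is created rather than when the graph is drawn.</p>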
<p><strong>Base Graphics</strong></p>
<p><!--[Fast Tube]--><span id="fVhdPbntKdw" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-bar-charts/#fVhdPbntKdw"><img src="http://i.ytimg.com/vi/fVhdPbntKdw/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The <strong>base</strong> graphics in R provide a function <strong>barplot</strong> that we can use to create a bar chart. The first argument to the function is the name of the object with the data. The <strong>names</strong> argument (a partial match for the full argument name <strong>names.arg</strong>) is used to provide the labels for the categories in the graph. We also specify the text for the labels for the x-axis, y-axis and title of the graph with the <strong>xlab</strong>, <strong>ylab</strong> and <strong>main</strong> arguments respectively.</p>
<p>The function call is:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">barplot(uk2007$Production, names = uk2007$Commodity,
  xlab = &quot;Commodity&quot;, ylab = &quot;Production (1,000 MT)&quot;,
  main = &quot;UK 2007 Top 5 Food and Agricultural Commodities&quot;)</pre></div></div>

<p>to produce the following graph:</p>
<div id="attachment_685" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/12/barchart-base1.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/12/barchart-base1-300x299.jpg" alt="Base Graphics Bar Chart" title="Barchart Example" width="300" height="299" class="size-medium wp-image-685" /></a><p class="wp-caption-text">Base Graphics Bar Chart</p></div>
<p>This graph is visually appealing with sensible space between the bars for the five commodity categories.</p>
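<p>As noted at the start of the post, the categories can also be placed on the vertical axis. A minimal sketch using the <strong>horiz</strong> argument of <strong>barplot</strong> follows; the margin values and the object names <strong>old.par</strong> and <strong>bar.mid</strong> are assumptions chosen for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Data from the post, repeated so the sketch is self-contained.
uk2007 = data.frame(Commodity =
  factor(c('Cow milk', 'Wheat', 'Sugar beet', 'Potatoes', 'Barley'),
    levels = c('Cow milk', 'Wheat', 'Sugar beet', 'Potatoes', 'Barley')),
  Production = c(14023, 13221, 6500, 5635, 5079))

# Widen the left margin so the commodity names fit; keep the old settings.
old.par = par(mar = c(5, 7, 4, 2) + 0.1)

# horiz = TRUE draws horizontal bars, so production is now on the x-axis;
# las = 1 keeps all axis labels horizontal for easier reading.
bar.mid = barplot(uk2007$Production, names.arg = as.character(uk2007$Commodity),
  horiz = TRUE, las = 1, xlab = 'Production (1,000 MT)',
  main = 'UK 2007 Top 5 Food and Agricultural Commodities')

# Restore the previous graphical parameters.
par(old.par)</pre></div></div>

<p>With horizontal bars the first category is drawn at the bottom of the graph, so the data may need to be reversed if the largest value should appear at the top.</p>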
<p><strong>Lattice Graphics</strong></p>
<p><!--[Fast Tube]--><span id="KvQOjlkseBA" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-bar-charts/#KvQOjlkseBA"><img src="http://i.ytimg.com/vi/KvQOjlkseBA/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>In the <strong>lattice</strong> graphics package the <strong>barchart</strong> function is used to create bar charts. The <strong>x</strong> and <strong>y</strong> variables are specified using a formula, which is the standard way when using Trellis graphics. The variable on the vertical axis is specified on the left hand side of the formula and the variable for the horizontal axis is on the right hand side, where they are separated by the tilde character.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">barchart(Production ~ Commodity, data = uk2007, xlab = &quot;Commodity&quot;,
  ylab = &quot;Production (1,000 MT)&quot;,
  main = &quot;UK 2007 Top 5 Food and Agricultural Commodities&quot;)</pre></div></div>

<p>This code produces the following graph:</p>
<div id="attachment_686" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/12/barchart-lattice1.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/12/barchart-lattice1-300x299.jpg" alt="Lattice Graphics Bar Chart" title="Barchart Example" width="300" height="299" class="size-medium wp-image-686" /></a><p class="wp-caption-text">Lattice Graphics Bar Chart</p></div>
<p>The main visual difference compared to the base graphics example is the default colour of the bars, which is much brighter. There is also a large gap between the bars in the display.</p>
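<p>Because the formula decides which variable goes on which axis, swapping its two sides gives a horizontal bar chart. The sketch below also sets <strong>origin = 0</strong> so that the bars start at zero; the object name <strong>uk.bar</strong> is an assumption made for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)

# Data from the post, repeated so the sketch is self-contained.
uk2007 = data.frame(Commodity =
  factor(c('Cow milk', 'Wheat', 'Sugar beet', 'Potatoes', 'Barley'),
    levels = c('Cow milk', 'Wheat', 'Sugar beet', 'Potatoes', 'Barley')),
  Production = c(14023, 13221, 6500, 5635, 5079))

# The factor is now on the left of the tilde, so the bars are horizontal.
# origin = 0 makes each bar start at zero rather than at the axis minimum.
uk.bar = barchart(Commodity ~ Production, data = uk2007, origin = 0,
  xlab = 'Production (1,000 MT)', ylab = 'Commodity',
  main = 'UK 2007 Top 5 Food and Agricultural Commodities')

# Lattice functions return an object; printing it draws the graph.
print(uk.bar)</pre></div></div>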
<p><strong>ggplot2</strong></p>
<p><!--[Fast Tube]--><span id="4jSfbKFdrTo" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/summarising-data-using-bar-charts/#4jSfbKFdrTo"><img src="http://i.ytimg.com/vi/4jSfbKFdrTo/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>To create the bar chart in the <strong>ggplot2</strong> package we use the <strong>ggplot</strong> function to specify the data to appear in the graph then gradually add in the other components of the graph.</p>
<p>We specify the data frame where the data is stored and then use the <strong>aes</strong> function to identify the <strong>x</strong> and <strong>y</strong> variables. The <strong>geom_bar</strong> function is used to create a bar chart display with the specified data and the last three options in the example are for creating the various labels to be added to the graph.</p>
<p>The graph itself is constructed piece by piece to add the various layers and components on top of the base layer:</p>

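<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># stat = &quot;identity&quot; plots the Production values themselves rather than counts;
# ggtitle() replaces opts(title = ...), which was removed from later ggplot2 releases
ggplot(uk2007, aes(Commodity, Production)) + geom_bar(stat = &quot;identity&quot;) +
  xlab(&quot;Commodity&quot;) + ylab(&quot;Production (1,000 MT)&quot;) +
  ggtitle(&quot;UK 2007 Top 5 Food and Agricultural Commodities&quot;)</pre></div></div>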

<p>This code produces the following graph:</p>
<div id="attachment_691" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/12/barchart-ggplot2.jpeg"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/12/barchart-ggplot2-300x299.jpg" alt="ggplot2 Bar Chart" title="Barchart Example" width="300" height="299" class="size-medium wp-image-691" /></a><p class="wp-caption-text">ggplot2 Bar Chart</p></div>
<p>The layout of this graph differs mainly in its grid background, which by default is grey with white lines.</p>
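<p>In more recent ggplot2 releases the same chart can be written with <strong>geom_col</strong>, which uses the <strong>y</strong> values directly as bar heights, and turned on its side with <strong>coord_flip</strong>. This is a sketch assuming a reasonably current ggplot2 installation; the object name <strong>uk.plot</strong> is introduced here for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)

# Data from the post, repeated so the sketch is self-contained.
uk2007 = data.frame(Commodity =
  factor(c('Cow milk', 'Wheat', 'Sugar beet', 'Potatoes', 'Barley'),
    levels = c('Cow milk', 'Wheat', 'Sugar beet', 'Potatoes', 'Barley')),
  Production = c(14023, 13221, 6500, 5635, 5079))

# geom_col() draws one bar per row with height taken from Production;
# coord_flip() swaps the axes so that the bars run horizontally.
uk.plot = ggplot(uk2007, aes(x = Commodity, y = Production)) +
  geom_col() + coord_flip() +
  labs(x = 'Commodity', y = 'Production (1,000 MT)',
    title = 'UK 2007 Top 5 Food and Agricultural Commodities')

print(uk.plot)</pre></div></div>

<p>The <strong>labs</strong> function sets the axis labels and the title in a single call, which keeps the layered construction short.</p>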
<p>This blog post is summarised in a pdf leaflet on the <a href="http://www.wekaleamstudios.co.uk/?page_id=282">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/summarising-data-using-bar-charts/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Creating scatter plots using ggplot2</title>
		<link>http://www.wekaleamstudios.co.uk/posts/creating-scatter-plots-using-ggplot2/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/creating-scatter-plots-using-ggplot2/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 09:02:19 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[qplot]]></category>
		<category><![CDATA[scatter plot]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=504</guid>
		<description><![CDATA[The ggplot2 package can be used as an alternative to lattice for producing high quality graphics in R. The package provides a framework and hopefully simple interface to producing graphs and is inspired by the grammar of graphics. The main function for producing graphs in this package is qplot, which stands for quick plot. The [...]]]></description>
			<content:encoded><![CDATA[<p>The <strong>ggplot2</strong> package can be used as an alternative to <strong>lattice</strong> for producing high quality graphics in <strong>R</strong>. The package provides a framework and hopefully simple interface to producing graphs and is inspired by the grammar of graphics.<span id="more-504"></span></p>
<p>The main function for producing graphs in this package is <strong>qplot</strong>, which stands for quick plot. The first two arguments to the function are the names of the objects that contain the <strong>x</strong> and <strong>y</strong> variables for the plot that is being created. Like many functions in <strong>R</strong> there is a <strong>data</strong> argument that can be used to specify a data frame to look in for the variables.</p>
<p>As a first example we create a scatterplot of age and circumference for the data set in <strong>R</strong> that has measurements of the growth of Orange trees. The code to produce this graph is very simple and is shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">qplot(age, circumference, data = Orange)</pre></div></div>

<p>This produces the following graph:</p>
<div id="attachment_512" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/posts/creating-scatter-plots-using-ggplot2/ggplot-1/" rel="attachment wp-att-512"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/11/ggplot-1-300x300.png" alt="Scatterplot Example 1" title="Scatterplot Example 1" width="300" height="300" class="size-medium wp-image-512" /></a><p class="wp-caption-text">Scatterplot Example 1</p></div>
<p>The main thing to note about this graph is that we are ignoring the different trees and looking at the overall trend. If we want to distinguish between the growth of the individual trees we can use different colours for the plotting symbols and add a legend to indicate which colour corresponds to a given tree. The <strong>colour</strong> argument is used to specify a variable and <strong>qplot</strong> will automatically create a legend based on the levels of this categorical variable. We adjust our code to be:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">qplot(age, circumference, data = Orange, colour = Tree)</pre></div></div>

<p>and the graph now looks like:</p>
<div id="attachment_513" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/posts/creating-scatter-plots-using-ggplot2/ggplot-2/" rel="attachment wp-att-513"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/11/ggplot-2-300x300.png" alt="Scatterplot Example 2" title="Scatterplot Example 2" width="300" height="300" class="size-medium wp-image-513" /></a><p class="wp-caption-text">Scatterplot Example 2</p></div>
<p>That is a nice improvement on the initial graph as we can visually compare the growth trends for the five trees.</p>
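<p>The same graph can also be written with the layered <strong>ggplot</strong> interface, which may be preferable in recent ggplot2 releases where <strong>qplot</strong> is marked as deprecated. The sketch below is an equivalent under that assumption rather than the code used in the post; the object name <strong>orange.plot</strong> is introduced for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)

# Orange is a data set that ships with R, so no data preparation is needed.
# aes() maps age and circumference to the axes and Tree to the point colour;
# geom_point() then draws the scatterplot layer.
orange.plot = ggplot(Orange, aes(x = age, y = circumference, colour = Tree)) +
  geom_point()

print(orange.plot)</pre></div></div>

<p>The legend is created automatically from the levels of <strong>Tree</strong>, just as it is with <strong>qplot</strong>.</p>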
<p>We can build additional elements into our graph, such as adding a smoother to show a trend, by making use of the <strong>geom</strong> argument which is used to specify what type of display is being created. The package has a nice feature that allows us to specify a vector with multiple elements, each of which adds a further element to the graph. We can add a smoother to the original plot with the code below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">qplot(age, circumference, data = Orange, geom = c(&quot;point&quot;, &quot;smooth&quot;))</pre></div></div>

<p>This produces the following graph:</p>
<div id="attachment_514" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/posts/creating-scatter-plots-using-ggplot2/ggplot-3/" rel="attachment wp-att-514"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/11/ggplot-3-300x300.png" alt="Scatterplot Example 3" title="Scatterplot Example 3" width="300" height="300" class="size-medium wp-image-514" /></a><p class="wp-caption-text">Scatterplot Example 3</p></div>
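<p>In the layered interface each element of the <strong>geom</strong> vector corresponds to one layer added with the plus operator. A minimal sketch follows, leaving <strong>geom_smooth</strong> at its default settings; the object name <strong>orange.smooth</strong> is an assumption made for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)

# The points and the smoother are separate layers on the same axes.
# geom_smooth() picks a default smoothing method and reports its choice
# in a message when the graph is drawn.
orange.smooth = ggplot(Orange, aes(x = age, y = circumference)) +
  geom_point() + geom_smooth()

print(orange.smooth)</pre></div></div>

<p>Because the data has only a few distinct ages, the default smoother may emit warnings when fitting; these do not prevent the graph from being drawn.</p>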
<p>An alternative would be to change from plotting with symbols to joining the points with lines. This change again makes use of the <strong>geom</strong> argument as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">qplot(age, circumference, data = Orange, colour = Tree, geom = &quot;line&quot;)</pre></div></div>

<p>The graph now looks like:</p>
<div id="attachment_515" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/posts/creating-scatter-plots-using-ggplot2/ggplot-4/" rel="attachment wp-att-515"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/11/ggplot-4-300x300.png" alt="Scatterplot Example 4" title="Scatterplot Example 4" width="300" height="300" class="size-medium wp-image-515" /></a><p class="wp-caption-text">Scatterplot Example 4</p></div>
<p>with a separate coloured line for each tree.</p>
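<p>Lines and points can also be combined so that the individual measurements remain visible on each growth curve. This is a sketch using the layered <strong>ggplot</strong> interface rather than <strong>qplot</strong>; the object name <strong>orange.lines</strong> is introduced for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(ggplot2)

# Mapping colour to Tree also groups the data, so geom_line() draws one
# line per tree; geom_point() marks the individual measurements.
orange.lines = ggplot(Orange, aes(x = age, y = circumference, colour = Tree)) +
  geom_line() + geom_point()

print(orange.lines)</pre></div></div>

<p><strong>Tree</strong> is stored as an ordered factor in the <strong>Orange</strong> data set, so some ggplot2 versions choose a sequential colour scale for it instead of the usual distinct hues.</p>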
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/creating-scatter-plots-using-ggplot2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

