<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Software for Exploratory Data Analysis and Statistical Modelling &#187; Data Summary</title>
	<atom:link href="http://www.wekaleamstudios.co.uk/topics/r-environment/data-summary/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wekaleamstudios.co.uk</link>
	<description>Statistical Modelling with R</description>
	<lastBuildDate>Wed, 01 Feb 2012 19:44:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Programming with R &#8211; Processing Football League Data Part II</title>
		<link>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-ii/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-ii/#comments</comments>
		<pubDate>Fri, 03 Dec 2010 10:26:39 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[File Import/Export]]></category>
		<category><![CDATA[S Programming]]></category>
		<category><![CDATA[as.numeric]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data frame]]></category>
		<category><![CDATA[England]]></category>
		<category><![CDATA[football]]></category>
		<category><![CDATA[ifelse]]></category>
		<category><![CDATA[Premiership]]></category>
		<category><![CDATA[results]]></category>
		<category><![CDATA[table]]></category>
		<category><![CDATA[tapply]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1459</guid>
		<description><![CDATA[Following on from the previous post about creating a football result processing function for data from the football-data.co.uk website we will add code to the function to generate a league table based on the results to date. To create the league table we need to count various things such as the number of games played, [...]]]></description>
			<content:encoded><![CDATA[<p>Following on from the previous <a href="http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-i/">post</a> about creating a football result processing function for data from the <a href="http://www.football-data.co.uk">football-data.co.uk</a> website we will add code to the function to generate a league table based on the results to date.<span id="more-1459"></span></p>
<p>To create the league table we need to count various things such as the number of games played, number of wins/draws/losses, goals scored etc. This information is available in the results object that is loaded from a <strong>csv</strong> file in the function as it stands.</p>
<p>To facilitate these calculations we create a data frame with a row for each team in the division and then calculate the statistics required &#8211; this is one reason why the <strong>HomeTeam</strong> and <strong>AwayTeam</strong> columns of the results table were converted to factors with their levels in the same order as the team list. The data frame is created with the code below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable = data.frame(Team = teams,
    Games = 0, Win = 0, Draw = 0, Loss = 0,
    HomeGames = 0, HomeWin = 0, HomeDraw = 0, HomeLoss = 0,
    AwayGames = 0, AwayWin = 0, AwayDraw = 0, AwayLoss = 0,
    Points = 0,
    HomeFor = 0, HomeAgainst = 0,
    AwayFor = 0, AwayAgainst = 0,
    For = 0, Against = 0, GoalDifference = 0)</pre></div></div>

<p>A number of these columns may be redundant in the final league table but are used for intermediate calculations, such as <strong>HomeWin</strong> and <strong>AwayWin</strong>, which are combined to find the total number of victories for a team.</p>
<p>The number of games played by each team at home and away is counted by applying the table command to the two columns respectively.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$HomeGames = as.numeric(table(tmpResults$HomeTeam))
tmpTable$AwayGames = as.numeric(table(tmpResults$AwayTeam))</pre></div></div>

<p>The labels created by the table command are discarded using the as.numeric function to retain only the number of games. The table command is also used to count the number of wins, draws and losses at home and away for each team. The commands are shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$HomeWin =
    as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;H&quot;]))
tmpTable$HomeDraw =
    as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;D&quot;]))
tmpTable$HomeLoss =
    as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;A&quot;]))
&nbsp;
tmpTable$AwayWin =
    as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;A&quot;]))
tmpTable$AwayDraw =
    as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;D&quot;]))
tmpTable$AwayLoss =
    as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;H&quot;]))</pre></div></div>

<p>Note that we subset on the values in the <strong>FTR</strong> column, which holds the full-time result (H, D or A), and then count. The subsetting is reversed when looking at the away fixtures because a victory for the team is now an away win rather than a home win.</p>
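<p>As a minimal sketch of this counting step, the toy example below uses three made-up teams and two made-up results (none of these names or scores come from the real data). It illustrates that subsetting a factor keeps all of its levels, so a team with no matching results still receives a count of zero.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Toy data (made up for illustration): three teams, two matches
teams = c("Alpha", "Beta", "Gamma")
res = data.frame(
    HomeTeam = factor(c("Alpha", "Beta"), levels = teams),
    AwayTeam = factor(c("Beta", "Alpha"), levels = teams),
    FTR = c("H", "A"))
## Home wins: keep rows where the home side won, then count per team
homeWin = as.numeric(table(res$HomeTeam[res$FTR == "H"]))
## Away wins: the same count on AwayTeam, but selecting the "A" results
awayWin = as.numeric(table(res$AwayTeam[res$FTR == "A"]))
print(data.frame(Team = teams, HomeWin = homeWin, AwayWin = awayWin))</pre></div></div>

<p>Here Alpha is credited with one home win and one away win, while Beta and Gamma both get zeros, Gamma without having played at all.</p>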
<p>This information is then combined to get total games played, won etc.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$Games = tmpTable$HomeGames + tmpTable$AwayGames
tmpTable$Win = tmpTable$HomeWin + tmpTable$AwayWin
tmpTable$Draw = tmpTable$HomeDraw + tmpTable$AwayDraw
tmpTable$Loss = tmpTable$HomeLoss + tmpTable$AwayLoss</pre></div></div>

<p>The points total is calculated by multiplying the number of wins, draws and losses by the number of points awarded for each match outcome and adding up the results.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$Points = winPoints * tmpTable$Win +
    drawPoints * tmpTable$Draw + lossPoints * tmpTable$Loss</pre></div></div>
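
<p>A small worked example of the points calculation, using hypothetical records for two made-up teams and the default values of the three points arguments:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Default points per outcome, as in the function arguments
winPoints = 3
drawPoints = 1
lossPoints = 0
## Hypothetical win/draw/loss records (made up for illustration)
rec = data.frame(Team = c("Alpha", "Beta"),
    Win = c(2, 0), Draw = c(1, 2), Loss = c(0, 1))
rec$Points = winPoints * rec$Win +
    drawPoints * rec$Draw + lossPoints * rec$Loss
print(rec)</pre></div></div>

<p>Alpha collects 3 * 2 + 1 * 1 = 7 points and Beta 1 * 2 = 2 points. Changing <strong>winPoints</strong> to 2 would reduce the total for Alpha to 5.</p>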

<p>The next set of calculations counts the number of goals scored, goals conceded and the goal difference. The <strong>tapply</strong> function is used for these calculations.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$HomeFor =
    as.numeric(tapply(tmpResults$FTHG, tmpResults$HomeTeam, sum, na.rm = TRUE))
tmpTable$HomeAgainst =
    as.numeric(tapply(tmpResults$FTAG, tmpResults$HomeTeam, sum, na.rm = TRUE))
&nbsp;
tmpTable$AwayFor =
    as.numeric(tapply(tmpResults$FTAG, tmpResults$AwayTeam, sum, na.rm = TRUE))
tmpTable$AwayAgainst =
    as.numeric(tapply(tmpResults$FTHG, tmpResults$AwayTeam, sum, na.rm = TRUE))</pre></div></div>

<p>The <strong>tapply</strong> function applies <strong>sum</strong> to the goals scored and the goals conceded by each team in the division, separately for home and away fixtures. These are then combined to give overall totals for goals scored and conceded:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$For =
    ifelse(is.na(tmpTable$HomeFor), 0, tmpTable$HomeFor) +
    ifelse(is.na(tmpTable$AwayFor), 0, tmpTable$AwayFor)
tmpTable$Against =
    ifelse(is.na(tmpTable$HomeAgainst), 0, tmpTable$HomeAgainst) +
    ifelse(is.na(tmpTable$AwayAgainst), 0, tmpTable$AwayAgainst)</pre></div></div>

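<p>The <strong>ifelse</strong> function is used to handle situations where a team hasn&#8217;t played a home and/or away fixture yet; in that case <strong>tapply</strong> returns a missing value for the team, which is replaced by zero. The goal difference is easy to calculate:</p>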

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$GoalDifference = tmpTable$For - tmpTable$Against</pre></div></div>
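
<p>To see why the missing values need handling, here is a short sketch with made-up data in which one team has played two home games and the other two teams have only played away:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Toy data (made up): Alpha plays at home twice
teams = c("Alpha", "Beta", "Gamma")
res = data.frame(
    HomeTeam = factor(c("Alpha", "Alpha"), levels = teams),
    AwayTeam = factor(c("Beta", "Gamma"), levels = teams),
    FTHG = c(2, 1), FTAG = c(0, 3))
## tapply gives NA for factor levels that have no rows
homeFor = as.numeric(tapply(res$FTHG, res$HomeTeam, sum, na.rm = TRUE))
awayFor = as.numeric(tapply(res$FTAG, res$AwayTeam, sum, na.rm = TRUE))
print(homeFor)   ## 3 NA NA
print(awayFor)   ## NA 0 3
## Replace each NA by zero before adding home and away goals
goalsFor = ifelse(is.na(homeFor), 0, homeFor) +
    ifelse(is.na(awayFor), 0, awayFor)
print(goalsFor)  ## 3 0 3</pre></div></div>

<p>Without the replacement, adding the two vectors directly would give a missing total for every team in this example.</p>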

<p>Now that all of the statistics have been calculated we sort the table based on the number of points, goal difference and finally alphabetically. There might be different ways that we can order the teams but this is what we will use for the time being:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable =
  tmpTable[order(- tmpTable$Points, - tmpTable$GoalDifference, tmpTable$Team),]</pre></div></div>

<p>The ordering might look odd, but we want to rank from highest to lowest on points and goal difference (hence the minus signs) and then in ascending alphabetical order by team name.</p>
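<p>The effect of the three sort keys can be checked on a small made-up table with ties on points and on goal difference:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Made-up table with ties to exercise every sort key
tab = data.frame(
    Team = c("Gamma", "Alpha", "Beta", "Delta"),
    Points = c(6, 4, 6, 4),
    GoalDifference = c(2, 1, 5, 1))
## Negating a numeric key reverses its direction inside order()
tab = tab[order(- tab$Points, - tab$GoalDifference, tab$Team), ]
print(as.character(tab$Team))   ## "Beta" "Gamma" "Alpha" "Delta"</pre></div></div>

<p>Beta and Gamma are tied on points and are separated by goal difference, while Alpha and Delta are tied on both and fall back to alphabetical order.</p>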
<p>The whole function is now:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">football.process.v2 = function(datafile, country, divname, season, teams, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
## Validation Function Arguments
&nbsp;
if (missing(datafile))
{
stop(&quot;Results csv file not specified.&quot;)
}
&nbsp;
if (missing(country))
{
warning(&quot;Country of league not specified.&quot;)
country = &quot;&quot;
}
&nbsp;
if (missing(divname))
{
warning(&quot;Name of league division not specified.&quot;)
divname = &quot;&quot;
}
&nbsp;
## Import Results
&nbsp;
tmpResults = read.csv(datafile)[,c(&quot;Date&quot;,&quot;HomeTeam&quot;,&quot;AwayTeam&quot;,&quot;FTR&quot;,&quot;FTHG&quot;,&quot;FTAG&quot;)]
&nbsp;
if (missing(teams))
{
warning(&quot;Team names not specified - extracted from results data.&quot;)
teams = sort(unique(c(as.character(tmpResults$HomeTeam), as.character(tmpResults$AwayTeam))))
}
&nbsp;
tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)
&nbsp;
## Create Empty League Table
&nbsp;
tmpTable = data.frame(Team = teams,
Games = 0, Win = 0, Draw = 0, Loss = 0,
HomeGames = 0, HomeWin = 0, HomeDraw = 0, HomeLoss = 0,
AwayGames = 0, AwayWin = 0, AwayDraw = 0, AwayLoss = 0,
Points = 0,
HomeFor = 0, HomeAgainst = 0,
AwayFor = 0, AwayAgainst = 0,
For = 0, Against = 0, GoalDifference = 0)
&nbsp;
## Count Number of Games Played
&nbsp;
tmpTable$HomeGames = as.numeric(table(tmpResults$HomeTeam))
tmpTable$AwayGames = as.numeric(table(tmpResults$AwayTeam))
&nbsp;
## Count Number of Wins/Draws/Losses
&nbsp;
tmpTable$HomeWin = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;H&quot;]))
tmpTable$HomeDraw = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;D&quot;]))
tmpTable$HomeLoss = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;A&quot;]))
&nbsp;
tmpTable$AwayWin = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;A&quot;]))
tmpTable$AwayDraw = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;D&quot;]))
tmpTable$AwayLoss = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;H&quot;]))
&nbsp;
tmpTable$Games = tmpTable$HomeGames + tmpTable$AwayGames
tmpTable$Win = tmpTable$HomeWin + tmpTable$AwayWin
tmpTable$Draw = tmpTable$HomeDraw + tmpTable$AwayDraw
tmpTable$Loss = tmpTable$HomeLoss + tmpTable$AwayLoss
tmpTable$Points = winPoints * tmpTable$Win + drawPoints * tmpTable$Draw + lossPoints * tmpTable$Loss
&nbsp;
## Count Goals Scored and Conceded
&nbsp;
tmpTable$HomeFor = as.numeric(tapply(tmpResults$FTHG, tmpResults$HomeTeam, sum, na.rm = TRUE))
tmpTable$HomeAgainst = as.numeric(tapply(tmpResults$FTAG, tmpResults$HomeTeam, sum, na.rm = TRUE))
&nbsp;
tmpTable$AwayFor = as.numeric(tapply(tmpResults$FTAG, tmpResults$AwayTeam, sum, na.rm = TRUE))
tmpTable$AwayAgainst = as.numeric(tapply(tmpResults$FTHG, tmpResults$AwayTeam, sum, na.rm = TRUE))
&nbsp;
tmpTable$For = ifelse(is.na(tmpTable$HomeFor), 0, tmpTable$HomeFor) +
ifelse(is.na(tmpTable$AwayFor), 0, tmpTable$AwayFor)
tmpTable$Against = ifelse(is.na(tmpTable$HomeAgainst), 0, tmpTable$HomeAgainst) +
ifelse(is.na(tmpTable$AwayAgainst), 0, tmpTable$AwayAgainst)
&nbsp;
tmpTable$GoalDifference = tmpTable$For - tmpTable$Against
&nbsp;
## Sort Table
## By Points
## By Goal Difference
## By Team Name (Alphabetical)
&nbsp;
tmpTable = tmpTable[order(- tmpTable$Points, - tmpTable$GoalDifference, tmpTable$Team),]
&nbsp;
tmpTable = tmpTable[,c(&quot;Team&quot;, &quot;Games&quot;, &quot;Win&quot;, &quot;Draw&quot;, &quot;Loss&quot;, &quot;Points&quot;, &quot;For&quot;, &quot;Against&quot;, &quot;GoalDifference&quot;)]
&nbsp;
## Return Division Information
&nbsp;
tmpSummary = list(Country = country, Division = divname, Season = season, Teams = teams,
Results = tmpResults, Table = tmpTable)
&nbsp;
invisible(tmpSummary)
}</pre></div></div>

<p>There is other functionality that we might want to add to the function.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-ii/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Programming with R &#8211; Processing Football League Data Part I</title>
		<link>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-i/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-i/#comments</comments>
		<pubDate>Tue, 23 Nov 2010 14:14:45 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[File Import/Export]]></category>
		<category><![CDATA[S Programming]]></category>
		<category><![CDATA[csv]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[England]]></category>
		<category><![CDATA[football]]></category>
		<category><![CDATA[list]]></category>
		<category><![CDATA[Premiership]]></category>
		<category><![CDATA[print]]></category>
		<category><![CDATA[read.csv]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1447</guid>
		<description><![CDATA[In this post we will make use of football results data from the football-data.co.uk website to demonstrate creating functions in R to automate a series of standard operations that would be required for results data from various leagues and divisions. The first step is to consider what control options should be available as part of [...]]]></description>
			<content:encoded><![CDATA[<p>In this post we will make use of football results data from the <a href="http://www.football-data.co.uk">football-data.co.uk</a> website to demonstrate creating functions in <strong>R</strong> to automate a series of standard operations that would be required for results data from various leagues and divisions.<span id="more-1447"></span></p>
<p>The first step is to consider what control options should be available as part of the function and here is a list of some arguments that will be used for this implementation of a football result data processing function:</p>
<ul>
<li>The name of a <strong>csv</strong> data file from the <a href="http://www.football-data.co.uk">football-data.co.uk</a> website.</li>
<li>A text string to specify the country and division for the data.</li>
<li>A text string specifying the season.</li>
<li>A list of teams in the division (optional), which could be used to test for data entry errors in the data file.</li>
<li>The number of points for a win, draw or loss. This might seem a strange option initially but different leagues might award different points for the three outcomes.</li>
</ul>
<p>Some of this information might appear optional but is included so that we can write a custom <strong>print</strong> function at a later date to display a meaningful summary of the object (list) that will be created by the function.</p>
<p>The first part of our function is concerned with checking the various values provided to the function arguments. Our skeleton function is as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">football.process.v1 = function(datafile, country, divname, season,
  teams, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
&nbsp;
}</pre></div></div>

<p>Here we have specified default options for three of the arguments with the most likely number of points for each match outcome, i.e. 3 points for a win and 1 point for a draw.</p>
<p>To illustrate the working of the result processing function we will use a small excerpt from the start of the 2010/2011 English Premiership season, which is shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee
E0,14/8/2010,Aston Villa,West Ham,3,0,H,2,0,H,M Dean
E0,14/8/2010,Blackburn,Everton,1,0,H,1,0,H,P Dowd
E0,14/8/2010,Bolton,Fulham,0,0,D,0,0,D,S Attwell
E0,14/8/2010,Chelsea,West Brom,6,0,H,2,0,H,M Clattenburg
E0,14/8/2010,Sunderland,Birmingham,2,2,D,1,0,H,A Taylor
E0,14/8/2010,Tottenham,Man City,0,0,D,0,0,D,A Marriner
E0,14/8/2010,Wigan,Blackpool,0,4,A,0,3,A,M Halsey
E0,14/8/2010,Wolves,Stoke,2,1,H,2,0,H,L Probert
E0,15/8/2010,Liverpool,Arsenal,1,1,D,0,0,D,M Atkinson
E0,16/8/2010,Man United,Newcastle,3,0,H,2,0,H,C Foy
E0,21/8/2010,Arsenal,Blackpool,6,0,H,3,0,H,M Jones
E0,21/8/2010,Birmingham,Blackburn,2,1,H,0,0,D,M Oliver
E0,21/8/2010,Everton,Wolves,1,1,D,1,0,H,L Mason
E0,21/8/2010,Stoke,Tottenham,1,2,A,1,2,A,C Foy
E0,21/8/2010,West Brom,Sunderland,1,0,H,0,0,D,K Friend
E0,21/8/2010,West Ham,Bolton,1,3,A,0,0,D,A Marriner
E0,21/8/2010,Wigan,Chelsea,0,6,A,0,1,A,M Dean
E0,22/8/2010,Fulham,Man United,2,2,D,0,1,A,P Walton
E0,22/8/2010,Newcastle,Aston Villa,6,0,H,3,0,H,M Atkinson
E0,23/8/2010,Man City,Liverpool,3,0,H,1,0,H,P Dowd
E0,28/8/2010,Blackburn,Arsenal,1,2,A,1,1,D,C Foy
E0,28/8/2010,Blackpool,Fulham,2,2,D,0,1,A,M Oliver
E0,28/8/2010,Chelsea,Stoke,2,0,H,1,0,H,M Atkinson
E0,28/8/2010,Man United,West Ham,3,0,H,1,0,H,M Clattenburg
E0,28/8/2010,Tottenham,Wigan,0,1,A,0,0,D,P Dowd
E0,28/8/2010,Wolves,Newcastle,1,1,D,1,0,H,S Attwell
E0,29/8/2010,Aston Villa,Everton,1,0,H,1,0,H,M Jones
E0,29/8/2010,Bolton,Birmingham,2,2,D,0,1,A,K Friend
E0,29/8/2010,Liverpool,West Brom,1,0,H,0,0,D,L Probert
E0,29/8/2010,Sunderland,Man City,1,0,H,0,0,D,M Dean</pre></div></div>

<p>This is stored in a file <strong>E0test.csv</strong> so that we can use the <strong>read.csv</strong> function to import the results data and then process it.</p>
<p>The first series of commands that we add to the function are for checking various function arguments specified by the user to ensure that they are sensible. First up we check whether a results data file has been specified as we cannot do any processing without any results. The simple test is for whether a file name has been specified:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">if (missing(datafile))
{
    stop(&quot;Results csv file not specified.&quot;)
}</pre></div></div>

<p>It might be sensible to check whether the object <strong>datafile</strong> is actually a character string specifying a file, but this hasn&#8217;t been done for now. We then check whether the country name and division have been specified and set them to blank strings if they haven&#8217;t been set by the user.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">if (missing(country))
{
    warning(&quot;Country of league not specified.&quot;)
    country = &quot;&quot;
}
&nbsp;
if (missing(divname))
{
    warning(&quot;Name of league division not specified.&quot;)
    divname = &quot;&quot;
}</pre></div></div>
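
<p>The stricter check on <strong>datafile</strong> mentioned above is not part of the function, but a minimal sketch of what it could look like is given here. The helper name <strong>check.datafile</strong> and its error messages are assumptions made for illustration only.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Hypothetical helper, not part of the original function
check.datafile = function(datafile)
{
    ## The argument should be one character string
    if (!is.character(datafile) || length(datafile) != 1)
    {
        stop("datafile should be a single character string.")
    }
    ## The named file should exist on disk
    if (!file.exists(datafile))
    {
        stop("Results csv file not found: ", datafile)
    }
    invisible(TRUE)
}
## Usage with a temporary file that does exist
tmp = tempfile(fileext = ".csv")
writeLines("Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR", tmp)
check.datafile(tmp)</pre></div></div>

<p>Calling the helper with a number or with the name of a file that does not exist stops with an error before any attempt is made to read the data.</p>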

<p>Next up we import the data file and only save the columns of interest (at this point of the development of the function at least). There are many more columns of information in the raw data from the website than we need.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpResults =
    read.csv(datafile)[,c(&quot;Date&quot;,&quot;HomeTeam&quot;,&quot;AwayTeam&quot;,&quot;FTR&quot;,&quot;FTHG&quot;,&quot;FTAG&quot;)]</pre></div></div>

<p>The square brackets are used to select a subset of the columns so that only these are saved. Then we check whether the team names have been specified by the user and if not extract them from the data provided:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">if (missing(teams))
{
    warning(&quot;Team names not specified - extracted from results data.&quot;)
    teams = sort(unique(c(as.character(tmpResults$HomeTeam),
        as.character(tmpResults$AwayTeam))))
}</pre></div></div>

<p>The sort function is used to order the team names alphabetically, which is the order often used in league tables, especially when no games have been played. We then convert the columns <strong>HomeTeam</strong> and <strong>AwayTeam</strong> into factors, which allows teams that haven&#8217;t played a fixture yet to be included in the table.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)</pre></div></div>
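
<p>Fixing the factor levels also gives a simple way to spot data entry errors when a team list is supplied, because any name that is not in the list becomes a missing value. The function itself does not perform this test; the sketch below, with a deliberately misspelt made-up entry, shows one way it might be done.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">teams = c("Arsenal", "Chelsea", "Wigan")
## Made-up column with a deliberate typo in the second entry
home = c("Arsenal", "Chelsae", "Wigan")
homeFactor = factor(home, levels = teams)
print(homeFactor)
## Names that match no level are converted to NA
bad = unique(home[is.na(homeFactor)])
if (length(bad) != 0)
{
    warning("Unrecognised team names: ", paste(bad, collapse = ", "))
}</pre></div></div>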

<p>To round off the first part of creating the result processing function we create a list object to return at the end of the function.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpSummary = list(Country = country, Division = divname,
    Season = season, Teams = teams, Results = tmpResults)</pre></div></div>

<p>The function so far:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">football.process.v1 = function(datafile, country, divname, season, teams, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
## Validation Function Arguments
&nbsp;
if (missing(datafile))
{
stop(&quot;Results csv file not specified.&quot;)
}
&nbsp;
if (missing(country))
{
warning(&quot;Country of league not specified.&quot;)
country = &quot;&quot;
}
&nbsp;
if (missing(divname))
{
warning(&quot;Name of league division not specified.&quot;)
divname = &quot;&quot;
}
&nbsp;
## Import Results
&nbsp;
tmpResults = read.csv(datafile)[,c(&quot;Date&quot;,&quot;HomeTeam&quot;,&quot;AwayTeam&quot;,&quot;FTR&quot;,&quot;FTHG&quot;,&quot;FTAG&quot;)]
&nbsp;
if (missing(teams))
{
warning(&quot;Team names not specified - extracted from results data.&quot;)
teams = sort(unique(c(as.character(tmpResults$HomeTeam), as.character(tmpResults$AwayTeam))))
}
&nbsp;
tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)
&nbsp;
## Return Division Information
&nbsp;
tmpSummary = list(Country = country, Division = divname, Season = season, Teams = teams,
Results = tmpResults)
&nbsp;
invisible(tmpSummary)
}</pre></div></div>
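
<p>The function finishes with <strong>invisible</strong>, which returns the list without printing it at the console. A stripped-down sketch of this behaviour, using a toy function <strong>f</strong> that is not part of the original code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Toy function that returns its result invisibly
f = function()
{
    tmpSummary = list(Country = "England", Division = "Premiership")
    invisible(tmpSummary)
}
f()                               ## nothing is printed
res = f()                         ## the value can still be stored
print(res$Country)                ## "England"
print(withVisible(f())$visible)   ## FALSE</pre></div></div>

<p>This is why an explicit call to <strong>print</strong> is needed to display the object returned by the function.</p>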

<p>We then test this function with the data file shown above. First up we create our own list of teams in the English Premiership for 2010/2011 and specify some of the other function arguments while using the defaults for points.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; E0teams.1011 = c(&quot;Arsenal&quot;, &quot;Aston Villa&quot;, &quot;Birmingham&quot;, &quot;Blackburn&quot;,
+ &quot;Blackpool&quot;, &quot;Bolton&quot;, &quot;Chelsea&quot;, &quot;Everton&quot;, &quot;Fulham&quot;, &quot;Liverpool&quot;,
+ &quot;Man City&quot;, &quot;Man United&quot;, &quot;Newcastle&quot;, &quot;Stoke&quot;, &quot;Sunderland&quot;,
+ &quot;Tottenham&quot;, &quot;West Brom&quot;, &quot;West Ham&quot;, &quot;Wigan&quot;, &quot;Wolves&quot;)
&gt; print(football.process.v1(&quot;E0test.csv&quot;, &quot;England&quot;, &quot;Premiership&quot;,
    &quot;2010-2011&quot;, E0teams.1011))
$Country
[1] &quot;England&quot;
&nbsp;
$Division
[1] &quot;Premiership&quot;
&nbsp;
$Season
[1] &quot;2010-2011&quot;
&nbsp;
$Teams
 [1] &quot;Arsenal&quot;     &quot;Aston Villa&quot; &quot;Birmingham&quot;  &quot;Blackburn&quot;   &quot;Blackpool&quot;  
 [6] &quot;Bolton&quot;      &quot;Chelsea&quot;     &quot;Everton&quot;     &quot;Fulham&quot;      &quot;Liverpool&quot;  
[11] &quot;Man City&quot;    &quot;Man United&quot;  &quot;Newcastle&quot;   &quot;Stoke&quot;       &quot;Sunderland&quot; 
[16] &quot;Tottenham&quot;   &quot;West Brom&quot;   &quot;West Ham&quot;    &quot;Wigan&quot;       &quot;Wolves&quot;     
&nbsp;
$Results
        Date    HomeTeam    AwayTeam FTR FTHG FTAG
1  14/8/2010 Aston Villa    West Ham   H    3    0
2  14/8/2010   Blackburn     Everton   H    1    0
3  14/8/2010      Bolton      Fulham   D    0    0
4  14/8/2010     Chelsea   West Brom   H    6    0
5  14/8/2010  Sunderland  Birmingham   D    2    2
6  14/8/2010   Tottenham    Man City   D    0    0
7  14/8/2010       Wigan   Blackpool   A    0    4
8  14/8/2010      Wolves       Stoke   H    2    1
9  15/8/2010   Liverpool     Arsenal   D    1    1
10 16/8/2010  Man United   Newcastle   H    3    0
11 21/8/2010     Arsenal   Blackpool   H    6    0
12 21/8/2010  Birmingham   Blackburn   H    2    1
13 21/8/2010     Everton      Wolves   D    1    1
14 21/8/2010       Stoke   Tottenham   A    1    2
15 21/8/2010   West Brom  Sunderland   H    1    0
16 21/8/2010    West Ham      Bolton   A    1    3
17 21/8/2010       Wigan     Chelsea   A    0    6
18 22/8/2010      Fulham  Man United   D    2    2
19 22/8/2010   Newcastle Aston Villa   H    6    0
20 23/8/2010    Man City   Liverpool   H    3    0
21 28/8/2010   Blackburn     Arsenal   A    1    2
22 28/8/2010   Blackpool      Fulham   D    2    2
23 28/8/2010     Chelsea       Stoke   H    2    0
24 28/8/2010  Man United    West Ham   H    3    0
25 28/8/2010   Tottenham       Wigan   A    0    1
26 28/8/2010      Wolves   Newcastle   D    1    1
27 29/8/2010 Aston Villa     Everton   H    1    0
28 29/8/2010      Bolton  Birmingham   D    2    2
29 29/8/2010   Liverpool   West Brom   H    1    0
30 29/8/2010  Sunderland    Man City   H    1    0</pre></div></div>

<p>Other useful resources are provided on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-i/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Charting the performance of cricket all-rounders &#8211; IT Botham</title>
		<link>http://www.wekaleamstudios.co.uk/posts/charting-the-performance-of-cricket-all-rounders-it-botham/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/charting-the-performance-of-cricket-all-rounders-it-botham/#comments</comments>
		<pubDate>Mon, 16 Aug 2010 19:59:54 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Grammar of Graphics]]></category>
		<category><![CDATA[all-rounder]]></category>
		<category><![CDATA[botham]]></category>
		<category><![CDATA[catches]]></category>
		<category><![CDATA[cricket]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[histogram]]></category>
		<category><![CDATA[runs]]></category>
		<category><![CDATA[wickets]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1321</guid>
		<description><![CDATA[Cricket is a sport that generates a large volume of performance data and corresponding debate about the relative qualities of various players over their careers and in relation to their contemporaries. The cricinfo website has an extensive database of statistics for professional cricketers that can be searched to access the information in various formats. As [...]]]></description>
			<content:encoded><![CDATA[<p>Cricket is a sport that generates a large volume of performance data and corresponding debate about the relative qualities of various players over their careers and in relation to their contemporaries. The <a href="http://www.cricinfo.com/">cricinfo</a> website has an extensive database of statistics for professional cricketers that can be searched to access the information in various formats.<span id="more-1321"></span></p>
<p>As an initial example we will consider the English legend Sir Ian Botham, who played 102 test matches for England between his debut in 1977 and his final game in 1992.</p>
<p>The first obvious breakdown is to consider how Botham performed against the six countries that he played against during his test career. A summary of his statistics is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"> Opposition Matches Bat Inns Runs NO Bowl Inns Wicket Catch
  Australia      36       49 1673  2       66     148    57
      India      14       16 1201  0       23      59    14
New Zealand      15       22  846  2       28      64    14
   Pakistan      14       20  647  1       18      40    14
  Sri Lanka       3        3   41  0        6      11     2
West Indies      20       37  792  1       27      61    19</pre></div></div>

<p>Botham only played three matches against Sri Lanka so it is difficult to properly assess his performance against them. If the above table is stored in a data frame <strong>itb.opp</strong> then we can create a bar chart of the total runs (or wickets) by opposition country:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(itb.opp, aes(Opposition, Runs)) + geom_bar() + xlab(&quot;Country&quot;) +
  ylab(&quot;Total Runs&quot;)</pre></div></div>

<p>This code produces the following graph:</p>
<div id="attachment_1355" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Runs.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Runs-300x300.png" alt="IT Botham Total Runs by Opposition" title="IT Botham Total Runs" width="300" height="300" class="size-medium wp-image-1355" /></a><p class="wp-caption-text">IT Botham Total Runs by Opposition</p></div>
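<p>For readers who want to reproduce the charts, the <strong>itb.opp</strong> data frame could be entered by hand along the following lines. This is only a sketch: the values are copied from the summary table above and just the columns used for plotting, plus matches and catches, are included.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Values copied from the summary table above
itb.opp = data.frame(
    Opposition = c("Australia", "India", "New Zealand",
        "Pakistan", "Sri Lanka", "West Indies"),
    Matches = c(36, 14, 15, 14, 3, 20),
    Runs = c(1673, 1201, 846, 647, 41, 792),
    Wicket = c(148, 59, 64, 40, 11, 61),
    Catch = c(57, 14, 14, 14, 2, 19))
## Quick check: the matches add up to the 102 tests played
print(sum(itb.opp$Matches))   ## 102</pre></div></div>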
<p>The total wickets graph is produced by the next code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(itb.opp, aes(Opposition, Wicket)) + geom_bar() + xlab(&quot;Country&quot;) +
  ylab(&quot;Total Wickets&quot;)</pre></div></div>

<div id="attachment_1356" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Wickets.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Total-Wickets-300x300.png" alt="IT Botham Total Wickets by Opposition" title="IT Botham Total Wickets" width="300" height="300" class="size-medium wp-image-1356" /></a><p class="wp-caption-text">IT Botham Total Wickets by Opposition</p></div>
<p>We may now want to delve deeper into the performance against different nations to take into account the number of games or innings where Botham batted or bowled. The traditional way to assess performance is to calculate batting and bowling averages and we can do this by opposition which provides the following data frame:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; itb.opp.sum
 Opposition Discipline  Average
  Australia    Batting 29.35088
      India    Batting 70.64706
New Zealand    Batting 42.30000
   Pakistan    Batting 32.35000
  Sri Lanka    Batting 13.66667
West Indies    Batting 21.40541
  Australia    Bowling 27.65541
      India    Bowling 26.40678
New Zealand    Bowling 23.43750
   Pakistan    Bowling 31.77500
  Sri Lanka    Bowling 28.18182
West Indies    Bowling 35.18033</pre></div></div>

<p>This can be converted into a dot plot so we can see whether Botham had a higher batting average than bowling average, which is often taken to be one of the signs of an all-rounder.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(itb.opp.sum, aes(Average, Opposition, colour = Discipline)) +
  geom_point() + xlab(&quot;Average&quot;) + ylab(&quot;&quot;)</pre></div></div>

<p>The graph is shown here:</p>
<div id="attachment_1362" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country-300x300.png" alt="IT Botham Batting and Bowling Averages by Opposition" title="IT Botham Batting and Bowling Averages" width="300" height="300" class="size-medium wp-image-1362" /></a><p class="wp-caption-text">IT Botham Batting and Bowling Averages by Opposition</p></div>
<p>We can see the differences in performance based on the opposition. Botham&#8217;s performances against the West Indies, by far the strongest team during most of his international career, were worse than against most of the other countries. However, his averages were far from embarrassing when compared to other players at the time. The graph also shows that Botham enjoyed batting and bowling against India.</p>
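<p>The comparison of batting and bowling averages can also be made numerically. The following is a minimal sketch that re-enters the averages printed in <strong>itb.opp.sum</strong> above; the object names <strong>opp</strong>, <strong>batting</strong>, <strong>bowling</strong> and <strong>avg.diff</strong> are introduced here purely for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Averages copied from the itb.opp.sum output shown earlier
opp = c(&quot;Australia&quot;, &quot;India&quot;, &quot;New Zealand&quot;, &quot;Pakistan&quot;, &quot;Sri Lanka&quot;,
  &quot;West Indies&quot;)
batting = c(29.35088, 70.64706, 42.30000, 32.35000, 13.66667, 21.40541)
bowling = c(27.65541, 26.40678, 23.43750, 31.77500, 28.18182, 35.18033)

# Positive values mean the batting average exceeds the bowling average
avg.diff = batting - bowling
names(avg.diff) = opp
round(avg.diff, 2)</pre></div></div>

<p>The difference is positive against Australia, India, New Zealand and Pakistan, and negative against Sri Lanka and the West Indies.</p>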
<p>We can divide this data further based on whether the matches were played in England or outside of England and this data is shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; itb.opp.ha.sum
  Opposition Venue Discipline  Average
   Australia  Away    Batting 30.22581
       India  Away    Batting 61.55556
 New Zealand  Away    Batting 50.44444
    Pakistan  Away    Batting 16.00000
   Sri Lanka  Away    Batting 13.00000
 West Indies  Away    Batting 14.17647
   Australia  Home    Batting 28.30769
       India  Home    Batting 80.87500
 New Zealand  Home    Batting 35.63636
    Pakistan  Home    Batting 34.16667
   Sri Lanka  Home    Batting 14.00000
 West Indies  Home    Batting 27.55000
   Australia  Away    Bowling 28.44928
       India  Away    Bowling 25.53333
 New Zealand  Away    Bowling 27.44444
    Pakistan  Away    Bowling 45.00000
   Sri Lanka  Away    Bowling 21.66667
 West Indies  Away    Bowling 39.50000
   Australia  Home    Bowling 26.96203
       India  Home    Bowling 27.31034
 New Zealand  Home    Bowling 20.51351
    Pakistan  Home    Bowling 31.07895
   Sri Lanka  Home    Bowling 30.62500
 West Indies  Home    Bowling 31.97143</pre></div></div>

<p>A dot plot is created from this data with a separate panel for each of the six opposition countries and the averages divided into batting and bowling performances. The coloured dots in the graph indicate whether the average is for matches at home or away.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">ggplot(itb.opp.ha.sum, aes(Average, Discipline, colour = Venue)) +
  geom_point() + facet_wrap( ~ Opposition) +
  xlab(&quot;Average&quot;) + ylab(&quot;&quot;)</pre></div></div>

<p>This graph is shown below:</p>
<div id="attachment_1366" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country-HomeAway.png"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2010/08/ITB-Averages-Country-HomeAway-300x300.png" alt="IT Botham Batting and Bowling Averages by Country and Home/Away" title="IT Botham Batting and Bowling Averages" width="300" height="300" class="size-medium wp-image-1366" /></a><p class="wp-caption-text">IT Botham Batting and Bowling Averages by Country and Home/Away</p></div>
<p>We can see that the difference between home and away performance is, in general, not very large for bowling averages but in some cases there is a noticeable difference in batting averages. When looking at Botham&#8217;s performances against the West Indies his statistics at home are much better than his away performance, suggesting that his main struggles against the strong West Indies team were in the Caribbean. This might be due to his swing bowling being more suitable to English conditions compared to pitches in the West Indies.</p>
<p>To round off this brief look at the career of IT Botham let us consider some other important statistics, in particular games where he performed with both bat and ball.</p>
<ul>
<li>Overall Botham scored 14 hundreds and 22 fifties out of 161 innings so he reached fifty runs every five innings or so.</li>
<li>He also took 27 five wicket hauls and 17 four wicket hauls so he took four or more wickets every four innings or so.</li>
<li>He took 120 catches.</li>
</ul>
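<p>The rate quoted in the first bullet can be checked with a line of arithmetic. The variable names below are illustrative only:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">hundreds = 14
fifties = 22
innings = 161
# Number of innings per score of fifty or more
innings.per.fifty = innings / (hundreds + fifties)
innings.per.fifty</pre></div></div>

<p>This works out at a little under four and a half innings per score of fifty or more.</p>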
<p>Individual matches of excellence include five games with a century and at least five wickets:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">Year  Opposition       Ground Venue Runs Wicket
1978 New Zealand Christchurch  Away  133      8
1978    Pakistan       Lord's  Home  108      8
1980       India       Mumbai  Away  114     13
1981   Australia        Leeds  Home  199      7
1984 New Zealand   Wellington  Away  138      6</pre></div></div>

<p>These performances and others show why Botham was considered such a great player as he produced some sustained periods of excellent all-round cricket rather than having one discipline more dominant for a long period of time.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/charting-the-performance-of-cricket-all-rounders-it-botham/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>R and Tolerance Intervals</title>
		<link>http://www.wekaleamstudios.co.uk/posts/r-and-tolerance-intervals/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/r-and-tolerance-intervals/#comments</comments>
		<pubDate>Mon, 19 Apr 2010 20:19:31 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Probability Distributions]]></category>
		<category><![CDATA[normtol.int]]></category>
		<category><![CDATA[tolerance]]></category>
		<category><![CDATA[Tolerance Intervals]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=905</guid>
		<description><![CDATA[Confidence intervals and prediction intervals are used by statisticians on a regular basis. Another useful interval is the tolerance interval that describes the range of values for a distribution with confidence limits calculated to a particular percentile of the distribution. The R package tolerance can be used to create a variety of tolerance intervals of [...]]]></description>
			<content:encoded><![CDATA[<p>Confidence intervals and prediction intervals are used by statisticians on a regular basis. Another useful interval is the tolerance interval, which gives a range of values that is expected, with a stated level of confidence, to contain at least a specified proportion of the population. The <strong>R</strong> package <strong>tolerance</strong> can be used to create a variety of tolerance intervals of interest.<span id="more-905"></span></p>
<p>These tolerance limits, taken from the estimated interval, are limits within which a stated proportion of the population is expected to occur. The function <strong>normtol.int</strong> from the <strong>tolerance</strong> package can be used to calculate a tolerance interval for data from a normal distribution.</p>
<p>The function arguments include the data itself in a vector denoted <strong>x</strong>. The confidence level associated with the tolerance interval is specified by <strong>alpha</strong>, where <strong>alpha</strong> is the difference between 100% and the confidence level &#8211; <strong>alpha</strong> is 0.05 for 95% confidence. The argument <strong>P</strong> is the proportion of the data to be included in the tolerance interval. The <strong>side</strong> argument determines whether a one-sided or two-sided interval is required.</p>
<p>Consider a simulated set of data from a manufacturing process loaded into R, stored as vector object <strong>obs</strong>, as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">obs = c(102.17, 102.45, 106.23, 98.16, 100.82, 101.40, 90.51, 102.51, 97.93,
  96.98, 101.74, 104.34, 103.50, 94.72, 102.80, 103.92, 97.43, 102.76, 100.03,
  107.12, 104.96, 105.32, 87.06, 97.89, 100.23)</pre></div></div>

<p>A 95% tolerance interval for 90% of data of this type, based on the 25 observations above, is created with this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; normtol.int(x = obs, alpha = 0.05, P = 0.90, side = 2)
  alpha   P    x.bar 2-sided.lower 2-sided.upper
1  0.05 0.9 100.5192      90.07606      110.9623</pre></div></div>

<p>The <strong>alpha</strong> and <strong>P</strong> values are as noted above, and the average of the data is reported along with the lower and upper tolerance limits, both of which appear because we asked for a two-sided interval. This can be easily changed to cover 95% rather than 90% of the data:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; normtol.int(x = obs, alpha = 0.05, P = 0.95, side = 2)
  alpha    P    x.bar 2-sided.lower 2-sided.upper
1  0.05 0.95 100.5192      88.07543      112.9630</pre></div></div>

<p>The package <strong>tolerance</strong> can create intervals for other data distributions.</p>
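<p>As a rough check on what the function reports, a two-sided normal tolerance interval has the form of the sample mean plus or minus a factor <strong>k</strong> times the sample standard deviation. The sketch below uses only base R and assumes Howe&#8217;s approximation for <strong>k</strong>. This is an assumption about the method: <strong>normtol.int</strong> may compute the factor differently, so the limits need not agree with the output above to every decimal place. The names <strong>n</strong>, <strong>z</strong>, <strong>chi</strong> and <strong>k</strong> are illustrative.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">obs = c(102.17, 102.45, 106.23, 98.16, 100.82, 101.40, 90.51, 102.51, 97.93,
  96.98, 101.74, 104.34, 103.50, 94.72, 102.80, 103.92, 97.43, 102.76, 100.03,
  107.12, 104.96, 105.32, 87.06, 97.89, 100.23)
n = length(obs)
alpha = 0.05
P = 0.90

# Howe approximation to the two-sided tolerance factor (assumed method)
z = qnorm((1 + P) / 2)
chi = qchisq(alpha, df = n - 1)
k = sqrt((n - 1) * (1 + 1 / n) * z^2 / chi)

# Approximate lower and upper tolerance limits
mean(obs) + c(-1, 1) * k * sd(obs)</pre></div></div>

<p>Increasing <strong>P</strong> or lowering <strong>alpha</strong> increases <strong>k</strong> and so widens the interval.</p>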
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/r-and-tolerance-intervals/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Box and Whisker Plots for Summarising Data</title>
		<link>http://www.wekaleamstudios.co.uk/posts/box-and-whisker-plots-for-summarising-data/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/box-and-whisker-plots-for-summarising-data/#comments</comments>
		<pubDate>Tue, 11 Aug 2009 21:10:10 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[Box and Whisker Plot]]></category>
		<category><![CDATA[bwplot]]></category>
		<category><![CDATA[summary]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=249</guid>
		<description><![CDATA[We have considered using a histogram to summarise univariate data but there are other types of plot such as the box and whisker plot that can be used to summarise univariate data. The box and whisker plot is a graphical method for summarising numerical data based on a five-number summary. These five numbers are the minimum, [...]]]></description>
			<content:encoded><![CDATA[<p>We have considered using a histogram to summarise univariate data but there are other types of plot such as the <strong>box and whisker</strong> plot that can be used to summarise univariate data. The <strong>box and whisker</strong> plot is a graphical method for summarising numerical data based on a five-number summary. These five numbers are the minimum, lower quartile, median, upper quartile and maximum value.<span id="more-249"></span></p>
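<p>The five-number summary can be computed directly with the <strong>quantile</strong> function. The data vector <strong>x</strong> in this sketch is made up for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Hypothetical data, not taken from the olive oil data set
x = c(2, 4, 4, 5, 7, 9, 10, 12, 15)
# Minimum, lower quartile, median, upper quartile and maximum
five.num = quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
five.num</pre></div></div>

<p>For these values the summary is 2, 4, 7, 10 and 15. Plotting functions may use a slightly different definition of the quartiles, so the edges of a box need not always match the output of <strong>quantile</strong> exactly.</p>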
<p>The <strong>lattice</strong> library has a function <strong>bwplot</strong> that can be used to create a box and whisker plot for some data. Using the standard mechanism, individual plots can be produced for different factor levels to divide up the data into meaningful groups.</p>
<p>As an example we can use the olive oil data to produce a box and whisker summary of the <strong>palmitic</strong> variable for each of the areas in the data:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(lattice)
bwplot(Area ~ palmitic, data = olive.df)</pre></div></div>

<p>The first argument is a formula to describe the variables to include in the plot and because <strong>Area</strong> is a factor the function interprets this to mean that we want a separate summary for each of the levels of this factor. The graph produced looks like this:<br />
<div id="attachment_397" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.wekaleamstudios.co.uk/?attachment_id=397" rel="attachment wp-att-397"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/08/bwplot1-300x300.png" alt="Olive Oil Data Box and Whisker Summary" title="Box and Whisker Plot" width="300" height="300" class="size-medium wp-image-397" /></a><p class="wp-caption-text">Olive Oil Data Box and Whisker Summary</p></div></p>
<p>This type of plot can also be useful when exploring residuals from a fitted model.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/box-and-whisker-plots-for-summarising-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Histograms to Summarise Data</title>
		<link>http://www.wekaleamstudios.co.uk/posts/using-histograms-to-summarise-data/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/using-histograms-to-summarise-data/#comments</comments>
		<pubDate>Mon, 08 Jun 2009 20:44:22 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Base Graphics]]></category>
		<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[Lattice Graphics]]></category>
		<category><![CDATA[hist]]></category>
		<category><![CDATA[histogram]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=218</guid>
		<description><![CDATA[Tabular displays are not the only way to summarise a data set, and we will often be interested in using a graphical display as this might be a more effective way to visualise our data than using statistics such as the mean or standard deviation. The histogram is a commonly used graphical [...]]]></description>
			<content:encoded><![CDATA[<p>Tabular displays are not the only way to summarise a data set, and we will often be interested in using a graphical display as this might be a more effective way to visualise our data than using statistics such as the mean or standard deviation.<span id="more-218"></span></p>
<p>The histogram is a commonly used graphical display for summarising univariate data and it provides a visual indication of the location and variation in the data. Histograms are constructed by dividing the data into ranges and counting the number of data points that occur in each range; the height of each bar is based on this count.</p>
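<p>The counting step can be reproduced without drawing anything. In this sketch the vector <strong>x</strong> and the break points are made up for illustration; <strong>cut</strong> assigns each value to a range and <strong>table</strong> counts the values in each range:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Hypothetical data and break points
x = c(1.5, 2.2, 2.8, 3.1, 3.3, 3.9, 4.6)
breaks = c(1, 2, 3, 4, 5)
# Count the number of values falling in each range
bin.counts = table(cut(x, breaks = breaks))
bin.counts
# hist returns the same counts when plotting is switched off
hist.counts = hist(x, breaks = breaks, plot = FALSE)$counts
hist.counts</pre></div></div>

<p>Both calls give counts of 1, 2, 3 and 1 for the four ranges, which are the bar heights of the corresponding histogram.</p>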
<p>We can create a histogram using either the <strong>base</strong> graphics or <strong>lattice</strong> graphics in <strong>R</strong>. The function <strong>hist</strong> is part of the <strong>base</strong> graphics and the first argument we specify in the function call is the actual data to be used in the histogram. An example of creating a histogram would use the following code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">hist(olive.df$palmitic, xlab = &quot;Palmitic&quot;, main = &quot;Histogram&quot;)</pre></div></div>

<p>In this example we have also specified a label for the x-axis as well as the main title. The resulting graph looks like this:<br />
<div id="attachment_225" class="wp-caption aligncenter" style="width: 310px"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/05/histogram1-300x300.png" alt="Demonstration of using a histogram to summarise data" title="Histogram Example" width="300" height="300" class="size-medium wp-image-225" /><p class="wp-caption-text">Demonstration of using a histogram to summarise data</p></div></p>
<p>We can make use of the <strong>histogram</strong> function in the <strong>lattice</strong> library to create this plot and the syntax that we use is slightly different.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">histogram( ~ palmitic, data = olive.df)</pre></div></div>

<p>The first argument is a model formula that specifies the data to be used for the histogram as the independent variable (right-hand side) component of the formula, and the data argument is used to specify a data frame in which the function will look for the data. The histogram looks slightly different using this library:<br />
<div id="attachment_229" class="wp-caption aligncenter" style="width: 310px"><img src="http://www.wekaleamstudios.co.uk/wp-content/uploads/2009/05/histogram2-300x300.png" alt="Demonstration of using a histogram to summarise data" title="Lattice Histogram Example" width="300" height="300" class="size-medium wp-image-229" /><p class="wp-caption-text">Demonstration of using a histogram to summarise data</p></div></p>
<p>There are other types of graph that can be used to summarise univariate data which include the box and whisker plot, density plot, strip plot or dot plot. These will be covered in subsequent posts either using the <strong>base</strong> graphics system or <strong>lattice</strong> graphics.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/using-histograms-to-summarise-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Working with Probability Distributions</title>
		<link>http://www.wekaleamstudios.co.uk/posts/working-with-probability-distributions/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/working-with-probability-distributions/#comments</comments>
		<pubDate>Sun, 31 May 2009 08:42:30 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Binomial]]></category>
		<category><![CDATA[cdf]]></category>
		<category><![CDATA[Chi-squared]]></category>
		<category><![CDATA[Normal]]></category>
		<category><![CDATA[pdf]]></category>
		<category><![CDATA[Poisson]]></category>
		<category><![CDATA[Probability Distribution]]></category>
		<category><![CDATA[quantiles]]></category>
		<category><![CDATA[random sampling]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=178</guid>
		<description><![CDATA[Probability distributions have a central role in Statistics and the R software has functions to work with a large range of distributions &#8211; the syntax has been selected to provide some consistency based on the type of information required about a distribution. There are four functions that are defined for each distribution that is available [...]]]></description>
			<content:encoded><![CDATA[<p>Probability distributions have a central role in Statistics and the <strong>R</strong> software has functions to work with a large range of distributions &#8211; the syntax has been selected to provide some consistency based on the type of information required about a distribution.<span id="more-178"></span></p>
<p>There are four functions that are defined for each distribution that is available within <strong>R</strong>. These functions are:</p>
<ul>
<li>The density function &#8211; name starts with a d.</li>
<li>The cumulative distribution function &#8211; name starts with a p.</li>
<li>The quantile function &#8211; name starts with a q.</li>
<li>Random number generation &#8211; name starts with an r.</li>
</ul>
<p>Both discrete and continuous distributions are available in <strong>R</strong>. Distributions that we can access include: Beta, Binomial, Chi-squared, F, Logistic, Normal, Poisson, Student&#8217;s t and Weibull.</p>
<p>There is a <em>base</em> name for each of the distributions and we use the prefix letter mentioned above to access the requisite information. If we consider the Normal distribution as an example then <strong>dnorm</strong> is the function that will provide the density:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; dnorm(1.96, mean = 0, sd = 1)
[1] 0.05844094</pre></div></div>

<p>The <strong>mean</strong> and <strong>sd</strong> arguments are used to specify a particular pair of parameters for the Normal distribution. The cumulative distribution function is <strong>pnorm</strong> and the syntax is very similar to the <strong>dnorm</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; pnorm(1.96, mean = 0, sd = 1)
[1] 0.9750021</pre></div></div>

<p>The default option is for the function to return the cumulative probability for values less than or equal to the specified figure. The quantile function, <strong>qnorm</strong>, allows us to work back from probabilities to values on the original data scale:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; qnorm(0.95, mean = 0, sd = 1)
[1] 1.644854</pre></div></div>
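<p>Two related calls are worth noting. The <strong>lower.tail</strong> argument of <strong>pnorm</strong> switches to the upper tail probability, and <strong>qnorm</strong> is the inverse of <strong>pnorm</strong>, so applying one after the other recovers the starting value. The object names in this short sketch are illustrative:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Probability of a value greater than 1.96 (upper tail)
upper = pnorm(1.96, mean = 0, sd = 1, lower.tail = FALSE)
upper
# qnorm reverses pnorm and recovers 1.96
back = qnorm(pnorm(1.96, mean = 0, sd = 1), mean = 0, sd = 1)
back</pre></div></div>

<p>The upper tail probability is one minus the lower tail value of 0.9750021 shown earlier.</p>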

<p>As with the previous functions we can define the parameters of the distribution where required. The last option of interest is the function that allows us to generate random samples for a particular distribution, which in the case of the Normal distribution is <strong>rnorm</strong>. Here we specify the number of samples to be drawn from the distribution along with the parameters of the distribution:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; rnorm(n = 20, mean = 0, sd = 1)
 [1] -1.1322606 -2.8320170 -0.5768220  1.0569513  1.0824524  1.4925396 -0.3010086 -0.4345893  2.6813322
[10]  0.3774106  1.7226911  0.5922038  0.0770510  1.4015955 -0.9998051  0.1924921  0.7181194  1.0107967
[19]  1.3224979 -0.1511634</pre></div></div>

<p>The previous command samples twenty observations from the standard Normal distribution.</p>
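<p>The same four prefixes apply to the other distributions. As a sketch, the Binomial distribution has the base name <strong>binom</strong>, with <strong>size</strong> and <strong>prob</strong> as its parameters. The call to <strong>set.seed</strong> is an addition here so that the random draws can be repeated, and the seed value is arbitrary:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Probability of exactly 3 successes in 10 trials
d3 = dbinom(3, size = 10, prob = 0.5)
# Probability of 3 or fewer successes
p3 = pbinom(3, size = 10, prob = 0.5)
# Median number of successes
q50 = qbinom(0.5, size = 10, prob = 0.5)
c(d3, p3, q50)
# Five random draws, repeatable because the seed is fixed
set.seed(1)
draws = rbinom(n = 5, size = 10, prob = 0.5)
draws</pre></div></div>

<p>The first three values are 0.1171875, 0.171875 and 5 respectively.</p>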
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/working-with-probability-distributions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Vector Calculations to avoid Explicit Loops</title>
		<link>http://www.wekaleamstudios.co.uk/posts/vector-calculations-to-avoid-explicit-loops/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/vector-calculations-to-avoid-explicit-loops/#comments</comments>
		<pubDate>Sat, 23 May 2009 11:09:05 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[S Programming]]></category>
		<category><![CDATA[apply]]></category>
		<category><![CDATA[data frame]]></category>
		<category><![CDATA[lapply]]></category>
		<category><![CDATA[max]]></category>
		<category><![CDATA[mean]]></category>
		<category><![CDATA[sapply]]></category>
		<category><![CDATA[tapply]]></category>
		<category><![CDATA[vector calculations]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=109</guid>
		<description><![CDATA[The S programming language has facilities for applying a function to all the individual elements of a vector, matrix or data frame which avoid the need to make explicit use of loops. In fact explicit loops in R are often slower than the vectorised alternatives, but there will of course be some [...]]]></description>
			<content:encoded><![CDATA[<p>The <strong>S</strong> programming language has facilities for applying a function to all the individual elements of a vector, matrix or data frame which avoid the need to make explicit use of loops. In fact explicit loops in R are often slower than the vectorised alternatives, but there will of course be some situations where they are unavoidable.<span id="more-109"></span></p>
<p>There is a function called <strong>apply</strong> that can be used to run a specific function on each of the rows or columns individually. For example we could calculate row or column means or variances using <strong>apply</strong>, or we could define a more complicated function that is more appropriate for the statistics that we want to calculate. If we take a look at the olive oil data used in some of the other posts we might be interested in calculating the mean of each variable (the columns in this case) and we would use this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">apply(olive.df[,c(&quot;palmitic&quot;, &quot;palmitoleic&quot;, &quot;stearic&quot;, &quot;oleic&quot;, &quot;linoleic&quot;,
  &quot;linolenic&quot;, &quot;arachidic&quot;, &quot;eicosenoic&quot;)], 2, mean)</pre></div></div>

<p>The first thing we do is indicate which columns we are interested in, as the Region and Area are not important for these mean calculations &#8211; the square brackets are used to specify a subset of our data frame and we provide a vector of column names after the comma. The output from this function call is:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">   palmitic palmitoleic     stearic       oleic    linoleic   linolenic   arachidic  eicosenoic 
 1231.74126   126.09441   228.86538  7311.74825   980.52797    31.88811    58.09790    16.28147</pre></div></div>

<p>We could quite easily adjust this function call to use a different function on the data. Let&#8217;s say that we are interested in the maximum values for each variable then we would replace <strong>mean</strong> with <strong>max</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">apply(olive.df[,c(&quot;palmitic&quot;, &quot;palmitoleic&quot;, &quot;stearic&quot;, &quot;oleic&quot;, &quot;linoleic&quot;,
  &quot;linolenic&quot;, &quot;arachidic&quot;, &quot;eicosenoic&quot;)], 2, max)</pre></div></div>

<p>which returns:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">   palmitic palmitoleic     stearic       oleic    linoleic   linolenic   arachidic  eicosenoic 
       1753         280         375        8410        1470          74         105          58</pre></div></div>

<p>There are other associated functions &#8211; <strong>tapply</strong>, <strong>lapply</strong> and <strong>sapply</strong> &#8211; that perform a similar routine on different types and formats of data, which will be discussed in subsequent posts.</p>
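<p>The second argument of <strong>apply</strong> selects the dimension to work over: 1 for rows and 2 for columns. A small sketch on a made-up matrix <strong>m</strong> shows the difference, along with the built-in shortcut <strong>colMeans</strong> for the common case of column means:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Hypothetical matrix with 2 rows and 3 columns, filled column by column
m = matrix(1:6, nrow = 2)
apply(m, 1, mean)   # row means: 3 4
apply(m, 2, mean)   # column means: 1.5 3.5 5.5
colMeans(m)         # same result as the previous line</pre></div></div>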
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/vector-calculations-to-avoid-explicit-loops/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cross-tabulation of Data</title>
		<link>http://www.wekaleamstudios.co.uk/posts/cross-tabulation-of-data/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/cross-tabulation-of-data/#comments</comments>
		<pubDate>Fri, 15 May 2009 19:50:13 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[columns]]></category>
		<category><![CDATA[contingency table]]></category>
		<category><![CDATA[Cross tabulation]]></category>
		<category><![CDATA[crosstab]]></category>
		<category><![CDATA[data frame]]></category>
		<category><![CDATA[rows]]></category>
		<category><![CDATA[table]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=103</guid>
		<description><![CDATA[The contingency table is used to summarise data when there are factors in the data set and we are interested in counting the number of occurrences of each combination of factor variables. In R there are different ways that these types of table can be produced and manipulated as required. The [...]]]></description>
			<content:encoded><![CDATA[<p>The contingency table is used to summarise data when there are factors in the data set and we are interested in counting the number of occurrences of each combination of factor variables. In <strong>R</strong> there are different ways that these types of table can be produced and manipulated as required.<span id="more-103"></span></p>
<p><!--[Fast Tube]--><span id="fJR9-g2WyKw" style="display:block;"><a title="Click here to watch this video!" href="http://www.wekaleamstudios.co.uk/posts/cross-tabulation-of-data/#fJR9-g2WyKw"><img src="http://i.ytimg.com/vi/fJR9-g2WyKw/0.jpg" alt="Fast Tube" border="0" width="320" height="240" /></a><br /><small>Fast Tube by <a title="Casper's Blog" href="http://blog.caspie.net/">Casper</a></small></span><!--[/Fast Tube]--></p>
<p>The two main functions that are used to produce contingency tables are <strong>table</strong> and <strong>xtabs</strong>. We can use these two functions to get one, two or higher dimension tables that summarise the number of records that correspond to the combination of variables used to create the table.</p>
<p>The simplest case is where we are interested in a summary based on a single variable and the syntax is straightforward. Here the function <strong>table</strong> takes a single argument that corresponds to a vector of data. For example, if we are working with a data frame based on an unbalanced design and wanted to count the number of observations corresponding to each treatment we might run some code like:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">table(temp.design3$Treatment)</pre></div></div>

<p>which would produce a simple summary table:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">A B C D 
7 7 3 7</pre></div></div>

<p>If there was a second factor in the data set corresponding to different plots, labelled 1 to 4, then we could generate a two dimensional contingency table by adding a second argument to the function call like:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">table(temp.design3$Treatment, temp.design3$Plot)</pre></div></div>

<p>and the output would be of the form:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">    1 2 3 4
  A 2 2 2 1
  B 2 2 2 1
  C 1 1 1 0
  D 2 2 2 1</pre></div></div>

<p>The function <strong>xtabs</strong> can be used to create the same contingency tables but the function works using a formula in a similar vein to the modelling functions. So to get the one dimensional table we would write code similar to this:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xtabs(~ Plot, data = temp.design3)</pre></div></div>

<p>which would summarise the data by plot:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">Plot
1 2 3 4 
7 7 7 3</pre></div></div>

<p>Note that the output is slightly different to using the <strong>table</strong> function. The two dimensional table would be created like this:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xtabs(~ Treatment + Plot, data = temp.design3)</pre></div></div>

<p>and the output would be:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">         Plot
Treatment 1 2 3 4
        A 2 2 2 1
        B 2 2 2 1
        C 1 1 1 0
        D 2 2 2 1</pre></div></div>

<p>These functions can be extended to higher dimensions and the output is based on two-way tables for each combination of the other variables.</p>
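<p>As a sketch of the higher-dimensional case, the data frame <strong>d</strong> below is made up for illustration, with a hypothetical third factor <strong>Year</strong> added to the treatments and plots. The <strong>ftable</strong> function gives a flattened display of the resulting three-way table:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Hypothetical data: one record for each Treatment, Plot and Year combination
d = expand.grid(Treatment = c(&quot;A&quot;, &quot;B&quot;), Plot = 1:2, Year = c(2008, 2009))
# Drop the first record so that the design is unbalanced
d = d[-1, ]
tab3 = xtabs(~ Treatment + Plot + Year, data = d)
dim(tab3)      # 2 2 2: a Treatment by Plot table for each Year
ftable(tab3)   # the same counts laid out in a single flat table</pre></div></div>

<p>Printing <strong>tab3</strong> directly shows one Treatment by Plot table per year, whereas <strong>ftable</strong> is usually easier to read when there are several classifying variables.</p>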
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/cross-tabulation-of-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Producing Data Summaries</title>
		<link>http://www.wekaleamstudios.co.uk/posts/producing-data-summaries/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/producing-data-summaries/#comments</comments>
		<pubDate>Mon, 11 May 2009 20:15:02 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[covariance]]></category>
		<category><![CDATA[max]]></category>
		<category><![CDATA[maximum]]></category>
		<category><![CDATA[mean]]></category>
		<category><![CDATA[min]]></category>
		<category><![CDATA[minimum]]></category>
		<category><![CDATA[missing]]></category>
		<category><![CDATA[range]]></category>
		<category><![CDATA[standard deviation]]></category>
		<category><![CDATA[summary]]></category>
		<category><![CDATA[trimmed mean]]></category>
		<category><![CDATA[var]]></category>
		<category><![CDATA[variance]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=97</guid>
		<description><![CDATA[The first stage of most investigations is to produce summaries of the data to identify any unusual records and to get an overall feel for the contents of the data. This initial data analysis usually involves tabulation and plotting of data and there are a variety of functions available in R to generate the required [...]]]></description>
			<content:encoded><![CDATA[<p>The first stage of most investigations is to produce summaries of the data to identify any unusual records and to get an overall feel for the contents of the data. This initial data analysis usually involves tabulation and plotting of data and there are a variety of functions available in R to generate the required summaries of interest.<span id="more-97"></span></p>
<p>In this post we will consider the numerical and tabular summaries of data of various types, e.g. numeric, categorical etc. The function <strong>mean</strong> calculates the average value of a vector of numeric data. As an example we could run the following:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; mean(CO2$conc)
[1] 435</pre></div></div>

<p>The <strong>$</strong> indicates that we are interested in a specific column from the CO2 data frame. A trimmed mean can be calculated with the <strong>trim</strong> argument, which gives the fraction of observations to drop from each end (minimum and maximum) of the sorted data before averaging. For the previous example, trimming 5% from each end (10% of the data in total) is done with this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; mean(CO2$conc, trim=0.05)
[1] 423.1579</pre></div></div>
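
<p>To see what the trimming does, the same value can be reproduced by hand. This is a sketch of the idea rather than the internal implementation, and the names <strong>x</strong>, <strong>n</strong> and <strong>k</strong> are introduced here only for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">x &lt;- sort(CO2$conc)            # order the 84 concentrations
n &lt;- length(x)
k &lt;- floor(n * 0.05)           # 4 observations dropped from each end
trimmed &lt;- mean(x[(k + 1):(n - k)])
trimmed                           # 423.1579, matching mean(CO2$conc, trim=0.05)</pre></div></div>

<p>Because the number dropped is rounded down, small changes to the trimming fraction may leave the result unchanged.</p>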

<p>If there is missing data then the function will return <strong>NA</strong>. To get around this we can instruct the mean function to ignore missing values with the <strong>na.rm</strong> argument:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; mean(CO2$conc, na.rm = TRUE)</pre></div></div>
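
<p>The CO2 data has no missing values, so the effect is easier to see on a small made-up vector (<strong>conc.gap</strong> is a hypothetical name used only for this sketch):</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">conc.gap &lt;- c(95, 175, NA, 350)   # made-up values with one missing reading
mean(conc.gap)                       # NA
mean(conc.gap, na.rm = TRUE)         # 206.6667, the mean of the three observed values</pre></div></div>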

<p>Other functions of interest are <strong>min</strong>, <strong>max</strong> and <strong>range</strong> which compute the values that we would expect based on the names of these functions:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; min(CO2$conc)
[1] 95
&gt; max(CO2$conc)
[1] 1000
&gt; range(CO2$conc)
[1]   95 1000</pre></div></div>

<p>The <strong>range</strong> function returns a vector with two elements corresponding to the minimum and maximum values of the data respectively.</p>
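
<p>As a short sketch of working with that two-element vector (the name <strong>r</strong> is introduced here for illustration), the elements can be picked out by position and the width of the range obtained with <strong>diff</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">r &lt;- range(CO2$conc)
r[1]       # 95, the minimum
r[2]       # 1000, the maximum
diff(r)    # 905, the width of the range</pre></div></div>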
<p>The variance and standard deviation are other useful summary statistics for a numeric variable. To get the standard deviation we take the square root of the value returned by the variance function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; var(CO2$conc)
[1] 87571.08
&gt; sqrt(var(CO2$conc))
[1] 295.9241</pre></div></div>
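
<p>As a sketch of what these two calls compute, the sample variance can be written out by hand with its n - 1 divisor, and the built-in <strong>sd</strong> function returns the standard deviation directly (the names <strong>x</strong> and <strong>manual.var</strong> are introduced here for illustration):</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">x &lt;- CO2$conc
manual.var &lt;- sum((x - mean(x))^2) / (length(x) - 1)
manual.var   # 87571.08, the same as var(x)
sd(x)        # 295.9241, the same as sqrt(var(x))</pre></div></div>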

]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/producing-data-summaries/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

