<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Software for Exploratory Data Analysis and Statistical Modelling &#187; Data Manipulation</title>
	<atom:link href="http://www.wekaleamstudios.co.uk/topics/r-environment/data-manipulation/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.wekaleamstudios.co.uk</link>
	<description>Statistical Modelling with R</description>
	<lastBuildDate>Wed, 01 Feb 2012 19:44:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Programming with R &#8211; Processing Football League Data Part II</title>
		<link>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-ii/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-ii/#comments</comments>
		<pubDate>Fri, 03 Dec 2010 10:26:39 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[File Import/Export]]></category>
		<category><![CDATA[S Programming]]></category>
		<category><![CDATA[as.numeric]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data frame]]></category>
		<category><![CDATA[England]]></category>
		<category><![CDATA[football]]></category>
		<category><![CDATA[ifelse]]></category>
		<category><![CDATA[Premiership]]></category>
		<category><![CDATA[results]]></category>
		<category><![CDATA[table]]></category>
		<category><![CDATA[tapply]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1459</guid>
		<description><![CDATA[Following on from the previous post about creating a football result processing function for data from the football-data.co.uk website we will add code to the function to generate a league table based on the results to date. To create the league table we need to count various things such as the number of games played, [...]]]></description>
			<content:encoded><![CDATA[<p>Following on from the previous <a href="http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-i/">post</a> about creating a football result processing function for data from the <a href="http://www.football-data.co.uk">football-data.co.uk</a> website we will add code to the function to generate a league table based on the results to date.<span id="more-1459"></span></p>
<p>To create the league table we need to count various things such as the number of games played, number of wins/draws/losses, goals scored etc. This information is available in the results object that is loaded from a <strong>csv</strong> file in the function as it stands.</p>
<p>To facilitate these calculations we create a data frame with a row for each team in the division and then calculate the statistics required &#8211; this was a reason for ordering the factors in the <strong>HomeTeam</strong> and <strong>AwayTeam</strong> columns of the results table. The data frame is created with the code below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable = data.frame(Team = teams,
    Games = 0, Win = 0, Draw = 0, Loss = 0,
    HomeGames = 0, HomeWin = 0, HomeDraw = 0, HomeLoss = 0,
    AwayGames = 0, AwayWin = 0, AwayDraw = 0, AwayLoss = 0,
    Points = 0,
    HomeFor = 0, HomeAgainst = 0,
    AwayFor = 0, AwayAgainst = 0,
    For = 0, Against = 0, GoalDifference = 0)</pre></div></div>

<p>A number of the columns may be redundant in the final league table but are used for intermediate calculations, such as <strong>HomeWin</strong> and <strong>AwayWin</strong>, which are combined to find the total number of victories for a team.</p>
<p>The number of games played by each team at home and away is counted using the table command on the two columns respectively.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$HomeGames = as.numeric(table(tmpResults$HomeTeam))
tmpTable$AwayGames = as.numeric(table(tmpResults$AwayTeam))</pre></div></div>

<p>The labels created by the table command are discarded using the as.numeric function to retain only the number of games. The table command is also used to count the number of wins, draws and losses at home and away for each team. The commands are shown here:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$HomeWin =
    as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;H&quot;]))
tmpTable$HomeDraw =
    as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;D&quot;]))
tmpTable$HomeLoss =
    as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;A&quot;]))
&nbsp;
tmpTable$AwayWin =
    as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;A&quot;]))
tmpTable$AwayDraw =
    as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;D&quot;]))
tmpTable$AwayLoss =
    as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;H&quot;]))</pre></div></div>

<p>Note that we subset on the values in the <strong>FTR</strong> column, which is full-time result, and then count. The subsetting is reversed when looking at the away fixtures because a victory for the team is now an away win rather than a home win.</p>
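<p>As a minimal sketch of why these counts line up with the rows of the league table, consider a made-up set of three results (the team names and the objects <strong>exTeams</strong> and <strong>exResults</strong> are invented for illustration and are not part of the function). Because the team columns are factors with a fixed set of levels, the table command reports a count, possibly zero, for every team in the same order:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Hypothetical example data, not taken from the results file
exTeams = c(&quot;Arsenal&quot;, &quot;Bolton&quot;, &quot;Chelsea&quot;)
exResults = data.frame(
    HomeTeam = factor(c(&quot;Arsenal&quot;, &quot;Chelsea&quot;, &quot;Arsenal&quot;), levels = exTeams),
    AwayTeam = factor(c(&quot;Bolton&quot;, &quot;Arsenal&quot;, &quot;Chelsea&quot;), levels = exTeams),
    FTR = c(&quot;H&quot;, &quot;D&quot;, &quot;A&quot;))
&nbsp;
## Home wins per team; teams with no home win are counted as zero
exHomeWin = as.numeric(table(exResults$HomeTeam[exResults$FTR == &quot;H&quot;]))
## Away wins per team, using the reversed result code
exAwayWin = as.numeric(table(exResults$AwayTeam[exResults$FTR == &quot;A&quot;]))</pre></div></div>

<p>In this sketch <strong>exHomeWin</strong> holds 1, 0, 0 and <strong>exAwayWin</strong> holds 0, 0, 1, in the order Arsenal, Bolton, Chelsea.</p>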
<p>This information is then combined to get total games played, won etc.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$Games = tmpTable$HomeGames + tmpTable$AwayGames
tmpTable$Win = tmpTable$HomeWin + tmpTable$AwayWin
tmpTable$Draw = tmpTable$HomeDraw + tmpTable$AwayDraw
tmpTable$Loss = tmpTable$HomeLoss + tmpTable$AwayLoss</pre></div></div>

<p>The total number of points is calculated by multiplying the number of wins, draws and losses by the number of points awarded for each match outcome.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$Points = winPoints * tmpTable$Win +
    drawPoints * tmpTable$Draw + lossPoints * tmpTable$Loss</pre></div></div>
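
<p>As a small worked example with invented figures, a team with two wins, one draw and one loss would earn 3 &times; 2 + 1 &times; 1 + 0 &times; 1 = 7 points under the default settings. The sketch below repeats the calculation with the defaults and with a hypothetical system that awards two points for a win; the object names are made up for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Hypothetical record for a single team
exWin = 2
exDraw = 1
exLoss = 1
&nbsp;
## Default points: 3 for a win, 1 for a draw, 0 for a loss
exPoints3 = 3 * exWin + 1 * exDraw + 0 * exLoss
## Alternative system awarding 2 points for a win
exPoints2 = 2 * exWin + 1 * exDraw + 0 * exLoss</pre></div></div>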

<p>The next set of calculations counts the number of goals scored, goals conceded and the goal difference. The <strong>tapply</strong> function is used for these calculations.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$HomeFor =
    as.numeric(tapply(tmpResults$FTHG, tmpResults$HomeTeam, sum, na.rm = TRUE))
tmpTable$HomeAgainst =
    as.numeric(tapply(tmpResults$FTAG, tmpResults$HomeTeam, sum, na.rm = TRUE))
&nbsp;
tmpTable$AwayFor =
    as.numeric(tapply(tmpResults$FTAG, tmpResults$AwayTeam, sum, na.rm = TRUE))
tmpTable$AwayAgainst =
    as.numeric(tapply(tmpResults$FTHG, tmpResults$AwayTeam, sum, na.rm = TRUE))</pre></div></div>

<p>The <strong>tapply</strong> function applies <strong>sum</strong> to the goals scored and the goals conceded, home or away, separately for each team in the division. These are then combined to create overall totals across home and away fixtures:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$For =
    ifelse(is.na(tmpTable$HomeFor), 0, tmpTable$HomeFor) +
    ifelse(is.na(tmpTable$AwayFor), 0, tmpTable$AwayFor)
tmpTable$Against =
    ifelse(is.na(tmpTable$HomeAgainst), 0, tmpTable$HomeAgainst) +
    ifelse(is.na(tmpTable$AwayAgainst), 0, tmpTable$AwayAgainst)</pre></div></div>

<p>The <strong>ifelse</strong> statement is used to handle situations where a team hasn&#8217;t played a home and/or away fixture yet. The goal difference is easy to calculate:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable$GoalDifference = tmpTable$For - tmpTable$Against</pre></div></div>
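
<p>The need for the <strong>ifelse</strong> step can be seen in a small sketch with invented data, where one team (Chelsea in this made-up example) has not yet played at home. The object names are hypothetical and are not part of the function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Hypothetical example data: Chelsea has no home fixture
exTeams = c(&quot;Arsenal&quot;, &quot;Bolton&quot;, &quot;Chelsea&quot;)
exHome = factor(c(&quot;Arsenal&quot;, &quot;Bolton&quot;, &quot;Arsenal&quot;), levels = exTeams)
exFTHG = c(2, 1, 3)
&nbsp;
## Sum of home goals per team; the empty Chelsea cell is returned as NA
exHomeFor = as.numeric(tapply(exFTHG, exHome, sum, na.rm = TRUE))
&nbsp;
## Replace the missing value with zero before adding totals together
exHomeForClean = ifelse(is.na(exHomeFor), 0, exHomeFor)</pre></div></div>

<p>Without the replacement, adding the home and away totals for Chelsea would give a missing value rather than the goals scored away from home.</p>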

<p>Now that all of the statistics have been calculated we sort the table based on the number of points, goal difference and finally alphabetically. There might be different ways that we can order the teams but this is what we will use for the time being:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpTable =
  tmpTable[order(- tmpTable$Points, - tmpTable$GoalDifference, tmpTable$Team),]</pre></div></div>

<p>The ordering might look odd, but we want to rank the teams from highest to lowest on points and goal difference, which is why those two columns are negated, and then in ascending alphabetical order of team name.</p>
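<p>A small sketch with an invented four-team table shows the effect of the three sort keys. The figures and object names are made up for illustration:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Hypothetical league table with two pairs of teams level on points
exTable = data.frame(Team = c(&quot;Wolves&quot;, &quot;Arsenal&quot;, &quot;Chelsea&quot;, &quot;Bolton&quot;),
    Points = c(4, 7, 7, 4), GoalDifference = c(1, 5, 8, 1))
&nbsp;
## Negated keys sort in descending order, the team name in ascending order
exSorted = exTable[order(- exTable$Points, - exTable$GoalDifference, exTable$Team),]</pre></div></div>

<p>Chelsea and Arsenal are separated on goal difference, while Bolton and Wolves are level on both points and goal difference and so fall back to alphabetical order.</p>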
<p>The whole function is now:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">football.process.v2 = function(datafile, country, divname, season, teams, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
## Validation Function Arguments
&nbsp;
if (missing(datafile))
{
stop(&quot;Results csv file not specified.&quot;)
}
&nbsp;
if (missing(country))
{
warning(&quot;Country of league not specified.&quot;)
country = &quot;&quot;
}
&nbsp;
if (missing(divname))
{
warning(&quot;Name of league division not specified.&quot;)
divname = &quot;&quot;
}
&nbsp;
## Import Results
&nbsp;
tmpResults = read.csv(datafile)[,c(&quot;Date&quot;,&quot;HomeTeam&quot;,&quot;AwayTeam&quot;,&quot;FTR&quot;,&quot;FTHG&quot;,&quot;FTAG&quot;)]
&nbsp;
if (missing(teams))
{
warning(&quot;Team names not specified - extracted from results data.&quot;)
teams = sort(unique(c(as.character(tmpResults$HomeTeam), as.character(tmpResults$AwayTeam))))
}
&nbsp;
tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)
&nbsp;
## Create Empty League Table
&nbsp;
tmpTable = data.frame(Team = teams,
Games = 0, Win = 0, Draw = 0, Loss = 0,
HomeGames = 0, HomeWin = 0, HomeDraw = 0, HomeLoss = 0,
AwayGames = 0, AwayWin = 0, AwayDraw = 0, AwayLoss = 0,
Points = 0,
HomeFor = 0, HomeAgainst = 0,
AwayFor = 0, AwayAgainst = 0,
For = 0, Against = 0, GoalDifference = 0)
&nbsp;
## Count Number of Games Played
&nbsp;
tmpTable$HomeGames = as.numeric(table(tmpResults$HomeTeam))
tmpTable$AwayGames = as.numeric(table(tmpResults$AwayTeam))
&nbsp;
## Count Number of Wins/Draws/Losses
&nbsp;
tmpTable$HomeWin = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;H&quot;]))
tmpTable$HomeDraw = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;D&quot;]))
tmpTable$HomeLoss = as.numeric(table(tmpResults$HomeTeam[tmpResults$FTR == &quot;A&quot;]))
&nbsp;
tmpTable$AwayWin = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;A&quot;]))
tmpTable$AwayDraw = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;D&quot;]))
tmpTable$AwayLoss = as.numeric(table(tmpResults$AwayTeam[tmpResults$FTR == &quot;H&quot;]))
&nbsp;
tmpTable$Games = tmpTable$HomeGames + tmpTable$AwayGames
tmpTable$Win = tmpTable$HomeWin + tmpTable$AwayWin
tmpTable$Draw = tmpTable$HomeDraw + tmpTable$AwayDraw
tmpTable$Loss = tmpTable$HomeLoss + tmpTable$AwayLoss
tmpTable$Points = winPoints * tmpTable$Win + drawPoints * tmpTable$Draw + lossPoints * tmpTable$Loss
&nbsp;
## Count Goals Scored and Conceded
&nbsp;
tmpTable$HomeFor = as.numeric(tapply(tmpResults$FTHG, tmpResults$HomeTeam, sum, na.rm = TRUE))
tmpTable$HomeAgainst = as.numeric(tapply(tmpResults$FTAG, tmpResults$HomeTeam, sum, na.rm = TRUE))
&nbsp;
tmpTable$AwayFor = as.numeric(tapply(tmpResults$FTAG, tmpResults$AwayTeam, sum, na.rm = TRUE))
tmpTable$AwayAgainst = as.numeric(tapply(tmpResults$FTHG, tmpResults$AwayTeam, sum, na.rm = TRUE))
&nbsp;
tmpTable$For = ifelse(is.na(tmpTable$HomeFor), 0, tmpTable$HomeFor) +
ifelse(is.na(tmpTable$AwayFor), 0, tmpTable$AwayFor)
tmpTable$Against = ifelse(is.na(tmpTable$HomeAgainst), 0, tmpTable$HomeAgainst) +
ifelse(is.na(tmpTable$AwayAgainst), 0, tmpTable$AwayAgainst)
&nbsp;
tmpTable$GoalDifference = tmpTable$For - tmpTable$Against
&nbsp;
## Sort Table
## By Points
## By Goal Difference
## By Team Name (Alphabetical)
&nbsp;
tmpTable = tmpTable[order(- tmpTable$Points, - tmpTable$GoalDifference, tmpTable$Team),]
&nbsp;
tmpTable = tmpTable[,c(&quot;Team&quot;, &quot;Games&quot;, &quot;Win&quot;, &quot;Draw&quot;, &quot;Loss&quot;, &quot;Points&quot;, &quot;For&quot;, &quot;Against&quot;, &quot;GoalDifference&quot;)]
&nbsp;
## Return Division Information
&nbsp;
tmpSummary = list(Country = country, Division = divname, Season = season, Teams = teams,
Results = tmpResults, Table = tmpTable)
&nbsp;
invisible(tmpSummary)
}</pre></div></div>

<p>There is other functionality that we might want to add to the function.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-ii/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Programming with R &#8211; Processing Football League Data Part I</title>
		<link>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-i/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-i/#comments</comments>
		<pubDate>Tue, 23 Nov 2010 14:14:45 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[File Import/Export]]></category>
		<category><![CDATA[S Programming]]></category>
		<category><![CDATA[csv]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[England]]></category>
		<category><![CDATA[football]]></category>
		<category><![CDATA[list]]></category>
		<category><![CDATA[Premiership]]></category>
		<category><![CDATA[print]]></category>
		<category><![CDATA[read.csv]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1447</guid>
		<description><![CDATA[In this post we will make use of football results data from the football-data.co.uk website to demonstrate creating functions in R to automate a series of standard operations that would be required for results data from various leagues and divisions. The first step is to consider what control options should be available as part of [...]]]></description>
			<content:encoded><![CDATA[<p>In this post we will make use of football results data from the <a href="http://www.football-data.co.uk">football-data.co.uk</a> website to demonstrate creating functions in <strong>R</strong> to automate a series of standard operations that would be required for results data from various leagues and divisions.<span id="more-1447"></span></p>
<p>The first step is to consider what control options should be available as part of the function and here is a list of some arguments that will be used for this implementation of a football result data processing function:</p>
<ul>
<li>The name of a <strong>csv</strong> data file from the <a href="http://www.football-data.co.uk">football-data.co.uk</a> website.</li>
<li>A text string to specify the country and division for the data.</li>
<li>A text string specifying the season.</li>
<li>A list of teams in the division (optional), which could be used to test for data entry errors in the data file.</li>
<li>The number of points for a win, draw or loss. This might seem a strange option initially but different leagues might award different points for the three outcomes.</li>
</ul>
<p>Some of this information might appear optional but is included so that we can write a custom <strong>print</strong> function at a later date to display a meaningful summary of the object (list) that will be created by the function.</p>
<p>The first part of our function is concerned with checking the various values provided to the function arguments. Our skeleton function is as follows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">football.process.v1 = function(datafile, country, divname, season,
  teams, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
&nbsp;
}</pre></div></div>

<p>Here we have specified default options for three of the arguments with the most likely number of points for each match outcome, i.e. 3 points for a win and 1 point for a draw.</p>
<p>To illustrate the working of the result processing function we will use a small excerpt from the start of the 2010/2011 English Premiership season, which is shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,HTAG,HTR,Referee
E0,14/8/2010,Aston Villa,West Ham,3,0,H,2,0,H,M Dean
E0,14/8/2010,Blackburn,Everton,1,0,H,1,0,H,P Dowd
E0,14/8/2010,Bolton,Fulham,0,0,D,0,0,D,S Attwell
E0,14/8/2010,Chelsea,West Brom,6,0,H,2,0,H,M Clattenburg
E0,14/8/2010,Sunderland,Birmingham,2,2,D,1,0,H,A Taylor
E0,14/8/2010,Tottenham,Man City,0,0,D,0,0,D,A Marriner
E0,14/8/2010,Wigan,Blackpool,0,4,A,0,3,A,M Halsey
E0,14/8/2010,Wolves,Stoke,2,1,H,2,0,H,L Probert
E0,15/8/2010,Liverpool,Arsenal,1,1,D,0,0,D,M Atkinson
E0,16/8/2010,Man United,Newcastle,3,0,H,2,0,H,C Foy
E0,21/8/2010,Arsenal,Blackpool,6,0,H,3,0,H,M Jones
E0,21/8/2010,Birmingham,Blackburn,2,1,H,0,0,D,M Oliver
E0,21/8/2010,Everton,Wolves,1,1,D,1,0,H,L Mason
E0,21/8/2010,Stoke,Tottenham,1,2,A,1,2,A,C Foy
E0,21/8/2010,West Brom,Sunderland,1,0,H,0,0,D,K Friend
E0,21/8/2010,West Ham,Bolton,1,3,A,0,0,D,A Marriner
E0,21/8/2010,Wigan,Chelsea,0,6,A,0,1,A,M Dean
E0,22/8/2010,Fulham,Man United,2,2,D,0,1,A,P Walton
E0,22/8/2010,Newcastle,Aston Villa,6,0,H,3,0,H,M Atkinson
E0,23/8/2010,Man City,Liverpool,3,0,H,1,0,H,P Dowd
E0,28/8/2010,Blackburn,Arsenal,1,2,A,1,1,D,C Foy
E0,28/8/2010,Blackpool,Fulham,2,2,D,0,1,A,M Oliver
E0,28/8/2010,Chelsea,Stoke,2,0,H,1,0,H,M Atkinson
E0,28/8/2010,Man United,West Ham,3,0,H,1,0,H,M Clattenburg
E0,28/8/2010,Tottenham,Wigan,0,1,A,0,0,D,P Dowd
E0,28/8/2010,Wolves,Newcastle,1,1,D,1,0,H,S Attwell
E0,29/8/2010,Aston Villa,Everton,1,0,H,1,0,H,M Jones
E0,29/8/2010,Bolton,Birmingham,2,2,D,0,1,A,K Friend
E0,29/8/2010,Liverpool,West Brom,1,0,H,0,0,D,L Probert
E0,29/8/2010,Sunderland,Man City,1,0,H,0,0,D,M Dean</pre></div></div>

<p>This is stored in a file <strong>E0test.csv</strong> so that we can use the <strong>read.csv</strong> function to import the results data and then process it.</p>
<p>The first series of commands that we add to the function are for checking various function arguments specified by the user to ensure that they are sensible. First up we check whether a results data file has been specified as we cannot do any processing without any results. The simple test is for whether a file name has been specified:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">if (missing(datafile))
{
    stop(&quot;Results csv file not specified.&quot;)
}</pre></div></div>

<p>It might be sensible to check whether the object <strong>datafile</strong> is actually a character string specifying a file, but this hasn&#8217;t been done for now. We then check whether the country name and division have been specified and set them to blank strings if they haven&#8217;t been set by the user.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">if (missing(country))
{
    warning(&quot;Country of league not specified.&quot;)
    country = &quot;&quot;
}
&nbsp;
if (missing(divname))
{
    warning(&quot;Name of league division not specified.&quot;)
    divname = &quot;&quot;
}</pre></div></div>
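
<p>One possible form for the extra check on <strong>datafile</strong> mentioned above is sketched here. It is only an illustration and is not part of the function developed in this post; the helper name <strong>check.datafile</strong> and its messages are made up:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Hypothetical helper, not used elsewhere in this post
check.datafile = function(datafile)
{
    ## The argument should be a single character string
    if (!is.character(datafile) || length(datafile) != 1)
    {
        stop(&quot;Results csv file name must be a single character string.&quot;)
    }
&nbsp;
    ## The named file should exist before we try to read it
    if (!file.exists(datafile))
    {
        stop(&quot;Results csv file not found.&quot;)
    }
&nbsp;
    invisible(TRUE)
}</pre></div></div>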

<p>Next up we import the data file and only save the columns of interest (at this point in the development of the function at least). There are many more columns of information in the raw data from the website than we need, so we keep only six of them:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpResults =
    read.csv(datafile)[,c(&quot;Date&quot;,&quot;HomeTeam&quot;,&quot;AwayTeam&quot;,&quot;FTR&quot;,&quot;FTHG&quot;,&quot;FTAG&quot;)]</pre></div></div>

<p>The square brackets are used to select a subset of the columns, so that only these are saved. Then we check whether the team names have been specified by the user and if not extract them from the data provided:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">if (missing(teams))
{
    warning(&quot;Team names not specified - extracted from results data.&quot;)
    teams = sort(unique(c(as.character(tmpResults$HomeTeam),
        as.character(tmpResults$AwayTeam))))
}</pre></div></div>

<p>The sort function is used to order the team names alphabetically which is the order often used in league tables, especially when no games have been played. We then convert the columns <strong>HomeTeam</strong> and <strong>AwayTeam</strong> into factors, which allows teams that haven&#8217;t played a fixture yet to be included in the table.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)</pre></div></div>
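
<p>A side effect of fixing the factor levels is that any team name in the data that does not appear in <strong>teams</strong> becomes a missing value, which gives one possible way to look for data entry errors. The sketch below uses invented data with a deliberately misspelt name; the object names are made up and this check is not part of the function as written:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## Hypothetical example data with a misspelt team name
exTeams = c(&quot;Arsenal&quot;, &quot;Bolton&quot;, &quot;Chelsea&quot;)
exHomeRaw = c(&quot;Arsenal&quot;, &quot;Bollton&quot;, &quot;Chelsea&quot;)
&nbsp;
exHomeTeam = factor(exHomeRaw, levels = exTeams)
&nbsp;
## Names not found in the list of teams are converted to NA
exUnknown = unique(exHomeRaw[is.na(exHomeTeam)])
&nbsp;
if (length(exUnknown) &gt; 0)
{
    warning(paste(&quot;Unrecognised team names:&quot;, paste(exUnknown, collapse = &quot;, &quot;)))
}</pre></div></div>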

<p>To round off the first part of creating the result processing function we create a list object to return at the end of the function.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">tmpSummary = list(Country = country, Division = divname,
    Season = season, Teams = teams, Results = tmpResults)</pre></div></div>

<p>The function so far:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">football.process.v1 = function(datafile, country, divname, season, teams, winPoints = 3, drawPoints = 1, lossPoints = 0)
{
## Validation Function Arguments
&nbsp;
if (missing(datafile))
{
stop(&quot;Results csv file not specified.&quot;)
}
&nbsp;
if (missing(country))
{
warning(&quot;Country of league not specified.&quot;)
country = &quot;&quot;
}
&nbsp;
if (missing(divname))
{
warning(&quot;Name of league division not specified.&quot;)
divname = &quot;&quot;
}
&nbsp;
## Import Results
&nbsp;
tmpResults = read.csv(datafile)[,c(&quot;Date&quot;,&quot;HomeTeam&quot;,&quot;AwayTeam&quot;,&quot;FTR&quot;,&quot;FTHG&quot;,&quot;FTAG&quot;)]
&nbsp;
if (missing(teams))
{
warning(&quot;Team names not specified - extracted from results data.&quot;)
teams = sort(unique(c(as.character(tmpResults$HomeTeam), as.character(tmpResults$AwayTeam))))
}
&nbsp;
tmpResults$HomeTeam = factor(tmpResults$HomeTeam, levels = teams)
tmpResults$AwayTeam = factor(tmpResults$AwayTeam, levels = teams)
&nbsp;
## Return Division Information
&nbsp;
tmpSummary = list(Country = country, Division = divname, Season = season, Teams = teams,
Results = tmpResults)
&nbsp;
invisible(tmpSummary)
}</pre></div></div>

<p>We then test this function with the data file shown above. First up we create our own list of teams in the English Premiership for 2010/2011 and specify some of the other function arguments while using the defaults for points.</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; E0teams.1011 = c(&quot;Arsenal&quot;, &quot;Aston Villa&quot;, &quot;Birmingham&quot;, &quot;Blackburn&quot;,
+ &quot;Blackpool&quot;, &quot;Bolton&quot;, &quot;Chelsea&quot;, &quot;Everton&quot;, &quot;Fulham&quot;, &quot;Liverpool&quot;,
+ &quot;Man City&quot;, &quot;Man United&quot;, &quot;Newcastle&quot;, &quot;Stoke&quot;, &quot;Sunderland&quot;,
+ &quot;Tottenham&quot;, &quot;West Brom&quot;, &quot;West Ham&quot;, &quot;Wigan&quot;, &quot;Wolves&quot;)
&gt; print(football.process.v1(&quot;E0test.csv&quot;, &quot;England&quot;, &quot;Premiership&quot;,
+ &quot;2010-2011&quot;, E0teams.1011))
$Country
[1] &quot;England&quot;
&nbsp;
$Division
[1] &quot;Premiership&quot;
&nbsp;
$Season
[1] &quot;2010-2011&quot;
&nbsp;
$Teams
 [1] &quot;Arsenal&quot;     &quot;Aston Villa&quot; &quot;Birmingham&quot;  &quot;Blackburn&quot;   &quot;Blackpool&quot;  
 [6] &quot;Bolton&quot;      &quot;Chelsea&quot;     &quot;Everton&quot;     &quot;Fulham&quot;      &quot;Liverpool&quot;  
[11] &quot;Man City&quot;    &quot;Man United&quot;  &quot;Newcastle&quot;   &quot;Stoke&quot;       &quot;Sunderland&quot; 
[16] &quot;Tottenham&quot;   &quot;West Brom&quot;   &quot;West Ham&quot;    &quot;Wigan&quot;       &quot;Wolves&quot;     
&nbsp;
$Results
        Date    HomeTeam    AwayTeam FTR FTHG FTAG
1  14/8/2010 Aston Villa    West Ham   H    3    0
2  14/8/2010   Blackburn     Everton   H    1    0
3  14/8/2010      Bolton      Fulham   D    0    0
4  14/8/2010     Chelsea   West Brom   H    6    0
5  14/8/2010  Sunderland  Birmingham   D    2    2
6  14/8/2010   Tottenham    Man City   D    0    0
7  14/8/2010       Wigan   Blackpool   A    0    4
8  14/8/2010      Wolves       Stoke   H    2    1
9  15/8/2010   Liverpool     Arsenal   D    1    1
10 16/8/2010  Man United   Newcastle   H    3    0
11 21/8/2010     Arsenal   Blackpool   H    6    0
12 21/8/2010  Birmingham   Blackburn   H    2    1
13 21/8/2010     Everton      Wolves   D    1    1
14 21/8/2010       Stoke   Tottenham   A    1    2
15 21/8/2010   West Brom  Sunderland   H    1    0
16 21/8/2010    West Ham      Bolton   A    1    3
17 21/8/2010       Wigan     Chelsea   A    0    6
18 22/8/2010      Fulham  Man United   D    2    2
19 22/8/2010   Newcastle Aston Villa   H    6    0
20 23/8/2010    Man City   Liverpool   H    3    0
21 28/8/2010   Blackburn     Arsenal   A    1    2
22 28/8/2010   Blackpool      Fulham   D    2    2
23 28/8/2010     Chelsea       Stoke   H    2    0
24 28/8/2010  Man United    West Ham   H    3    0
25 28/8/2010   Tottenham       Wigan   A    0    1
26 28/8/2010      Wolves   Newcastle   D    1    1
27 29/8/2010 Aston Villa     Everton   H    1    0
28 29/8/2010      Bolton  Birmingham   D    2    2
29 29/8/2010   Liverpool   West Brom   H    1    0
30 29/8/2010  Sunderland    Man City   H    1    0</pre></div></div>

<p>Other useful resources are provided on the <a href="http://www.wekaleamstudios.co.uk/supplementary-material/">Supplementary Material</a> page.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/programming-with-r-processing-football-league-data-part-i/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Useful functions for data frames</title>
		<link>http://www.wekaleamstudios.co.uk/posts/useful-functions-for-data-frames/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/useful-functions-for-data-frames/#comments</comments>
		<pubDate>Mon, 09 Aug 2010 20:38:15 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[data frame]]></category>
		<category><![CDATA[head]]></category>
		<category><![CDATA[str]]></category>
		<category><![CDATA[tail]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=1344</guid>
		<description><![CDATA[The R software system is primarily command line based so when there are large sets of data it is not easy to browse the data frames. There are various useful functions for working with data frames. For example, after loading data from a text file we might want to view the first few lines of [...]]]></description>
			<content:encoded><![CDATA[<p>The <strong>R</strong> software system is primarily command line based so when there are large sets of data it is not easy to browse the data frames. There are various useful functions for working with data frames.<span id="more-1344"></span></p>
<p>For example, after loading data from a text file we might want to view the first few lines of a set of data. The functions <strong>head</strong> and <strong>tail</strong> <em>return the first or last parts of a vector, matrix, table, data frame or function</em>.</p>
<p>Consider the <strong>Orange</strong> data set that is available in <strong>R</strong>. We can view the first few lines</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; head(Orange)
  Tree  age circumference
1    1  118            30
2    1  484            58
3    1  664            87
4    1 1004           115
5    1 1231           120
6    1 1372           142</pre></div></div>

<p>or the last few lines:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; tail(Orange)
   Tree  age circumference
30    5  484            49
31    5  664            81
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177</pre></div></div>
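
<p>Both functions accept an argument <strong>n</strong> to control how many rows are returned (six by default). A brief sketch, with made-up object names:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">## First three rows of the data
exTop = head(Orange, n = 3)
## Last two rows of the data
exBottom = tail(Orange, n = 2)
## A negative n drops rows instead: everything except the last 30 rows
exRest = head(Orange, n = -30)</pre></div></div>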

<p>Another useful function is <strong>str</strong>, which <em>compactly displays the internal structure of an R object</em>. On this set of data we get:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; str(Orange)
Classes ‘nfnGroupedData’, ‘nfGroupedData’, ‘groupedData’ and 'data.frame':      35 obs. of  3 variables:
 $ Tree         : Ord.factor w/ 5 levels &quot;3&quot;&lt;&quot;1&quot;&lt;&quot;5&quot;&lt;&quot;2&quot;&lt;..: 2 2 2 2 2 2 2 4 4 4 ...
 $ age          : num  118 484 664 1004 1231 ...
 $ circumference: num  30 58 87 115 120 142 145 33 69 111 ...
 - attr(*, &quot;formula&quot;)=Class 'formula' length 3 circumference ~ age | Tree
  .. ..- attr(*, &quot;.Environment&quot;)=&lt;environment: R_EmptyEnv&gt; 
 - attr(*, &quot;labels&quot;)=List of 2
  ..$ x: chr &quot;Time since December 31, 1968&quot;
  ..$ y: chr &quot;Trunk circumference&quot;
 - attr(*, &quot;units&quot;)=List of 2
  ..$ x: chr &quot;(days)&quot;
  ..$ y: chr &quot;(mm)&quot;</pre></div></div>

<p>There is quite a bit of additional information attached to this data frame, mainly due to it having more than one class.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/useful-functions-for-data-frames/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Creating Date Objects using Character Strings</title>
		<link>http://www.wekaleamstudios.co.uk/posts/creating-date-objects-using-character-strings/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/creating-date-objects-using-character-strings/#comments</comments>
		<pubDate>Thu, 10 Sep 2009 19:03:34 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=421</guid>
		<description><![CDATA[The use of dates can frequently be problematic because there is such a wide range of formats used to store date information. The R system has various facilities for defining and working with dates and can handle a wide range of formats that might be encountered in a set of data. The function as.Date can [...]]]></description>
			<content:encoded><![CDATA[<p>The use of dates can frequently be problematic because there is such a wide range of formats used to store date information. The R system has various facilities for defining and working with dates and can handle a wide range of formats that might be encountered in a set of data.<span id="more-421"></span></p>
<p>The function <strong>as.Date</strong> can convert a string into a Date object using a user specified format, which can take various forms. The first argument to the function is a character string with the date and the second argument is a second string that provides information about the specific format. The format specifiers use a percent symbol followed by a character, e.g. <strong>%d</strong> indicates that a day is being specified and <strong>%m</strong> corresponds to a month etc. If the parts of the date are separated by a symbol such as a slash or a dash then these are included in the format specifier.</p>
<p>If the date is in the form 01/04/2009 then the format string would be <strong>%d/%m/%Y</strong>. The upper case <strong>Y</strong> indicates that the year includes century information which is a safer format to use so that there is no ambiguity in any calculations. The following code will convert the date from a character string into a Date object:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; date1 = &quot;01/04/2009&quot;
&gt; as.Date(date1, format = &quot;%d/%m/%Y&quot;)
[1] &quot;2009-04-01&quot;</pre></div></div>

<p>A simple alternative would be where a dash is used to separate the day, month and year:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; date2 = &quot;07-04-2002&quot;
&gt; as.Date(date2, format = &quot;%d-%m-%Y&quot;)
[1] &quot;2002-04-07&quot;</pre></div></div>
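
<p>The same pattern extends to other layouts. As a small sketch using only base R (the variable names are purely illustrative): a lower case <strong>%y</strong> reads a two digit year, a string already in the form year-month-day needs no format at all, and a string that does not match the format gives <strong>NA</strong> rather than an error:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Two digit year: %y instead of %Y
short.year = as.Date(&quot;01/04/09&quot;, format = &quot;%d/%m/%y&quot;)

# Year-month-day strings are recognised without a format argument
iso.date = as.Date(&quot;2009-04-01&quot;)

# Both describe the same day, so this comparison is TRUE
short.year == iso.date

# The string does not match the format, so the result is NA
as.Date(&quot;2009-04-01&quot;, format = &quot;%d/%m/%Y&quot;)</pre></div></div>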

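<p>We can compare two dates to see whether one is before the other. The comparison should be made on Date objects rather than on the original character strings, which would be compared alphabetically:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; d1 = as.Date(date1, format = &quot;%d/%m/%Y&quot;)
&gt; d2 = as.Date(date2, format = &quot;%d-%m-%Y&quot;)
&gt; d1 &lt; d2
[1] FALSE</pre></div></div>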

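<p>There are other operations that we can perform on Date objects, such as subtracting one date from another to obtain the number of days between them or adding a number of days to a date.</p>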
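<p>As a brief sketch of what is available in base R, using the two dates from the examples above (the names <strong>d1</strong> and <strong>d2</strong> are just for illustration):</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Convert the two example strings into Date objects
d1 = as.Date(&quot;01/04/2009&quot;, format = &quot;%d/%m/%Y&quot;)
d2 = as.Date(&quot;07-04-2002&quot;, format = &quot;%d-%m-%Y&quot;)

# Number of days between the two dates
as.numeric(d1 - d2)

# Adding a number shifts the date by that many days
d1 + 30

# Write the date back out as a string in a chosen layout
format(d1, &quot;%d/%m/%Y&quot;)

# A regular sequence of dates, one month apart
seq(d2, by = &quot;month&quot;, length.out = 3)</pre></div></div>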
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/creating-date-objects-using-character-strings/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using R to Access Data in a MySQL database</title>
		<link>http://www.wekaleamstudios.co.uk/posts/using-r-to-access-data-in-a-mysql-database/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/using-r-to-access-data-in-a-mysql-database/#comments</comments>
		<pubDate>Fri, 24 Jul 2009 07:57:39 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[Database Connectivity]]></category>
		<category><![CDATA[data frame]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[dbConnect]]></category>
		<category><![CDATA[dbDisconnect]]></category>
		<category><![CDATA[dbListTables]]></category>
		<category><![CDATA[dbReadTable]]></category>
		<category><![CDATA[dbRemoveTable]]></category>
		<category><![CDATA[dbWriteTable]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[RMySQL]]></category>
		<category><![CDATA[table]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=359</guid>
		<description><![CDATA[The R import/export manual discusses various approaches to handling data and mentions that R is not suitable for working with large data sets because data objects are stored in memory during a session. There are situations where using a database to hold the data and making use of one of the R libraries for database [...]]]></description>
			<content:encoded><![CDATA[<p>The <strong>R</strong> import/export manual discusses various approaches to handling data and mentions that <strong>R</strong> is not suitable for working with large data sets because data objects are stored in memory during a session. There are situations where it is useful to hold the data in a database and to use one of the R libraries for database connectivity to access or save the data.<span id="more-359"></span></p>
<p><strong>MySQL</strong> is a popular database system and there is a library <strong>RMySQL</strong> that can be used to access this database. It is important to ensure that the version of <strong>MySQL</strong> matches the R library; if there isn&#8217;t a match then the system might exhibit erratic behaviour.</p>
<p>The first step is to make the <strong>RMySQL</strong> library available in the working session:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">library(RMySQL)</pre></div></div>

<p>If this command runs without problems then we need to create a connection object for our session:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">con = dbConnect(MySQL(), dbname = &quot;test&quot;, user = &quot;user1&quot;,
  password = &quot;pwd1&quot;)</pre></div></div>

<p>The first argument to this function, the call to <strong>MySQL()</strong>, <em>creates and initializes a MySQL client. It returns a driver object that allows you to connect to one or several MySQL servers</em>. The <strong>dbname</strong> argument specifies the name of the database, and the <strong>user</strong> and <strong>password</strong> arguments supply the credentials for the connection.</p>
<p>After creating our connection successfully we can get <strong>R</strong> to list the tables that are stored in this database:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; dbListTables(con)
[1] &quot;co2&quot;</pre></div></div>

<p>In this example there is only one table and its name is returned by the <strong>dbListTables</strong> function. To read the data from this table we use the <strong>dbReadTable</strong> function and specify the connection object as well as the name of the table in the <strong>MySQL</strong> database:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; dbReadTable(con, &quot;co2&quot;)
   Plant        Type  Treatment conc uptake
1    Qn1      Quebec nonchilled   95   16.0
2    Qn1      Quebec nonchilled  175   30.4
3    Qn1      Quebec nonchilled  250   34.8
...</pre></div></div>

<p>We can save the table to a data frame object rather than the default action of printing to the console.</p>
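<p>As a minimal sketch of this, assuming the connection <strong>con</strong> from above is still open: the result of <strong>dbReadTable</strong> can be assigned to a name and then used like any other data frame. The object names <strong>co2.df</strong> and <strong>high.conc</strong> and the cut-off of 500 are only illustrative, and <strong>dbGetQuery</strong> is a further function from the DBI interface, not covered above, that runs an SQL statement and returns the result as a data frame:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Store the whole table in a data frame instead of printing it
co2.df = dbReadTable(con, &quot;co2&quot;)
str(co2.df)

# Let the database do the subsetting and fetch only the matching rows
high.conc = dbGetQuery(con,
  &quot;SELECT Plant, conc, uptake FROM co2 WHERE conc &gt; 500&quot;)</pre></div></div>

<p>Pushing the selection into the SQL statement means that only the rows of interest are held in memory by <strong>R</strong>, which is the main reason for using a database with larger data sets.</p>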
<p>After undertaking some analysis we might want to save a data set to the database, and the <strong>dbWriteTable</strong> function is used for this:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; dbWriteTable(con, &quot;CO2&quot;, data.frame(CO2),
  overwrite = TRUE)
[1] TRUE</pre></div></div>

<p>The first argument is the connection object, the second is the name by which the table will be referred to in the database and the third is the data to be saved. In this case we have used the <strong>overwrite</strong> argument to replace any existing table of the same name.</p>
<p>We can delete a table using the <strong>dbRemoveTable</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; dbRemoveTable(con, &quot;CO2&quot;)
[1] TRUE</pre></div></div>

<p>and when we reach the end of our need for the connection, the <strong>dbDisconnect</strong> function will remove the connection that we have been using:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; dbDisconnect(con)
[1] TRUE</pre></div></div>

]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/using-r-to-access-data-in-a-mysql-database/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Sequences and Other Regular Arrangements of Data</title>
		<link>http://www.wekaleamstudios.co.uk/posts/sequences-and-other-regular-arrangements-of-data/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/sequences-and-other-regular-arrangements-of-data/#comments</comments>
		<pubDate>Tue, 26 May 2009 19:08:28 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[expand.grid]]></category>
		<category><![CDATA[factor]]></category>
		<category><![CDATA[grid]]></category>
		<category><![CDATA[rep]]></category>
		<category><![CDATA[repeat]]></category>
		<category><![CDATA[seq]]></category>
		<category><![CDATA[sequence]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=140</guid>
		<description><![CDATA[In statistical analysis there are frequently situations where regular structures occur, such as in designed experiments, and R has facilities for generating data frames in a simple way. The function expand.grid can be used to create a design by specifying a series of factors and the levels for these factors. A data frame with all [...]]]></description>
			<content:encoded><![CDATA[<p>In statistical analysis there are frequently situations where regular structures occur, such as in designed experiments, and <strong>R</strong> has facilities for generating data frames in a simple way.<span id="more-140"></span></p>
<p>The function <strong>expand.grid</strong> can be used to create a design by specifying a series of factors and the levels for these factors. A data frame with all the combinations of the factor levels will be created. For example, if we had a two factor experiment where the first factor had four levels labelled A, B, C and D and the second factor had three levels labelled I, II and III then we could create the data frame for the design using this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">expand.grid(Factor1 = c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;, &quot;D&quot;), Factor2 = c(&quot;I&quot;, &quot;II&quot;, &quot;III&quot;))</pre></div></div>

<p>which would produce the following output:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">   Factor1 Factor2
1        A       I
2        B       I
3        C       I
4        D       I
5        A      II
6        B      II
7        C      II
8        D      II
9        A     III
10       B     III
11       C     III
12       D     III</pre></div></div>
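
<p>In practice the design would usually be stored so that it can be used later. The sketch below is one way of doing this; the names <strong>design</strong>, <strong>design2</strong> and <strong>Rep</strong> are made up for the illustration. It also shows that a further argument simply multiplies the number of rows, here giving two replicates of every combination:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Store the design from the example above
design = expand.grid(Factor1 = c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;, &quot;D&quot;),
  Factor2 = c(&quot;I&quot;, &quot;II&quot;, &quot;III&quot;))
nrow(design)     # 4 x 3 = 12 combinations

# A third argument adds two replicates of each combination
design2 = expand.grid(Factor1 = c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;, &quot;D&quot;),
  Factor2 = c(&quot;I&quot;, &quot;II&quot;, &quot;III&quot;), Rep = 1:2)
nrow(design2)    # 4 x 3 x 2 = 24 rows

# Put the runs into a random order
design2[sample(nrow(design2)), ]</pre></div></div>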

<p>It is also possible to create various other sequences using the <strong>seq</strong> and <strong>rep</strong> commands. To create the numbers from 1 to 10 we could run this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; 1:10
 [1]  1  2  3  4  5  6  7  8  9 10</pre></div></div>

<p>Alternatively the <strong>seq</strong> function provides greater control over the start and end values and the step between successive values. A couple of examples are shown below:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; seq(1, 5)
[1] 1 2 3 4 5
&gt; seq(10, 1, -2)
[1] 10  8  6  4  2</pre></div></div>

<p>The negative step indicates that the sequence is decreasing.</p>
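<p>The step does not have to be a whole number, and the arguments can be named to make the intention clearer. As a short sketch, the <strong>by</strong> argument sets the step and the <strong>length.out</strong> argument asks for a given number of equally spaced values instead:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Steps of 0.25 between 0 and 1
seq(0, 1, by = 0.25)

# Five equally spaced values between 0 and 1, which gives the same result
seq(0, 1, length.out = 5)

# The decreasing sequence from above with the step named
seq(10, 1, by = -2)</pre></div></div>
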
<p>Another common pattern is where we might want to repeat a number a given number of times. To get ten replicates of the number one we use the <strong>rep</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; rep(1, 10)
 [1] 1 1 1 1 1 1 1 1 1 1</pre></div></div>

<p>If we want to repeat a sequence multiple times then we provide the sequence as the first argument to the <strong>rep</strong> function. So to repeat the numbers one to ten twice we would write:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; rep(1:10, 2)
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10</pre></div></div>

<p>So the <strong>1:10</strong> evaluates first to the numbers one to ten and the whole sequence is repeated twice. A further arrangement, where we repeat each element of the sequence a given number of times, is obtained by nesting one <strong>rep</strong> call inside another. The second argument then becomes a vector of the same length as the first argument, giving the number of times that each element is repeated. As an example:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; rep(1:5, rep(3, 5))
 [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5</pre></div></div>
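
<p>When every element is repeated the same number of times, the <strong>each</strong> argument of <strong>rep</strong> gives the same result without the nested call. The vector form is still needed when the counts differ, as this short sketch shows:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Same result as rep(1:5, rep(3, 5))
rep(1:5, each = 3)

# Different counts for each element: 1 2 2 3 3 3
rep(1:3, times = c(1, 2, 3))</pre></div></div>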

]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/sequences-and-other-regular-arrangements-of-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Transformations to Create New Variables</title>
		<link>http://www.wekaleamstudios.co.uk/posts/transformations-to-create-new-variables/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/transformations-to-create-new-variables/#comments</comments>
		<pubDate>Mon, 18 May 2009 20:21:05 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[data frame]]></category>
		<category><![CDATA[log]]></category>
		<category><![CDATA[logarithm]]></category>
		<category><![CDATA[new variable]]></category>
		<category><![CDATA[scale]]></category>
		<category><![CDATA[scaling]]></category>
		<category><![CDATA[sqrt]]></category>
		<category><![CDATA[square root]]></category>
		<category><![CDATA[transformation]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=107</guid>
		<description><![CDATA[There are many situations where we might be interested in creating a new variable by transforming one of the variables already in the data frame. The R programming language can be used for either simple transformations or more complicated mathematical expressions where necessary. There are many situations where the logarithmic scale [...]]]></description>
			<content:encoded><![CDATA[<p>There are many situations where we might be interested in creating a new variable by transforming one of the variables already in the data frame. The <strong>R</strong> programming language can be used for either simple transformations or more complicated mathematical expressions where necessary.<span id="more-107"></span></p>
<p>There are many situations where the logarithmic scale is used for data and if we have data on its original scale then we can use the <strong>log</strong> function in <strong>R</strong> to create a new variable. By default the <strong>log</strong> function returns the natural logarithm (base e). Referring back to the olive oil data set used in previous posts, if we wanted to create a new variable that is the logarithm of one of the numeric variables then we could use code such as:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">log(olive.df$palmitic)</pre></div></div>

<p>which produces the following output:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">  [1] 6.980076 6.992096 6.814543 6.873164 6.957497 6.814543 6.826545 7.003065 6.986566 6.944087
 [11] 6.957497 6.943122 6.979145 6.774224 6.858565 7.051856 6.849066 7.153052 6.867974 6.858565
 [21] 6.979145 6.902743 6.962243 6.970730 6.970730 7.181592 7.186144 7.214504 7.228388 7.166266
...</pre></div></div>
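
<p>Other bases are available when the natural logarithm is not what is wanted. As a small self-contained sketch, the vector <strong>x</strong> below (a name chosen for the illustration) holds the first three palmitic values from the olive oil data:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># First three palmitic values from the olive oil data
x = c(1075, 1088, 911)

log(x)              # natural logarithm, matching the output above
log10(x)            # base 10
log2(x)             # base 2
log(x, base = 10)   # the base argument gives the same result as log10</pre></div></div>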

<p>If we had wanted to take the square root of this variable instead we would use the <strong>sqrt</strong> function:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">sqrt(olive.df$palmitic)</pre></div></div>

<p>and the output would be:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">  [1] 32.78719 32.98485 30.18278 31.08054 32.41913 30.18278 30.36445 33.16625 32.89377 32.20248
 [11] 32.41913 32.18695 32.77194 29.58040 30.85450 33.98529 30.70831 35.74913 31.00000 30.85450
 [21] 32.77194 31.54362 32.49615 32.63434 32.63434 36.26293 36.34556 36.86462 37.12142 35.98611
...</pre></div></div>
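
<p>The calls above only print the transformed values. To keep them as new variables they need to be assigned to a column of the data frame. The sketch below uses a small stand-in data frame, <strong>toy.df</strong>, built from the first three palmitic values so that it runs on its own; the column names are arbitrary:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Small stand-in for the olive oil data frame
toy.df = data.frame(palmitic = c(1075, 1088, 911))

# Store the transformed values as new columns
toy.df$log.palmitic = log(toy.df$palmitic)
toy.df$sqrt.palmitic = sqrt(toy.df$palmitic)

# transform() adds a column without repeating the data frame name
toy.df = transform(toy.df, log10.palmitic = log10(palmitic))
toy.df</pre></div></div>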

<p>It is possible to create more complicated expressions based on different transformations using the <strong>R</strong> programming language.</p>
<p>For example we might want to <em>centre</em> a vector of data by subtracting its mean and then scale it to unit variance by dividing by its standard deviation. Based on the palmitic variable used in the previous examples we could use the <strong>scale</strong> function to perform this operation:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">&gt; scale(olive.df$palmitic)
              [,1]
  [1,] -0.92970611
  [2,] -0.85259700
  [3,] -1.90246724
...
[571,] -1.43388109
[572,] -1.61182519
attr(,&quot;scaled:center&quot;)
[1] 1231.741
attr(,&quot;scaled:scale&quot;)
[1] 168.5923</pre></div></div>

<p>The mean and standard deviation used for the scaling are saved as attributes by the function.</p>
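<p>The calculation can be checked by hand. The following sketch uses six palmitic values from the olive oil data as a stand-in for the full column (the names <strong>x</strong>, <strong>z</strong> and <strong>manual</strong> are only for illustration) and compares the result of <strong>scale</strong> with the explicit formula:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># A few palmitic values standing in for the full column
x = c(1075, 1088, 911, 1410, 1509, 1317)

z = scale(x)                      # one column matrix with attributes
manual = (x - mean(x)) / sd(x)    # centre, then divide by the standard deviation

all.equal(as.vector(z), manual)   # TRUE

# The values used for the scaling can be recovered from the attributes
attr(z, &quot;scaled:center&quot;)          # equal to mean(x)
attr(z, &quot;scaled:scale&quot;)           # equal to sd(x)</pre></div></div>
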
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/transformations-to-create-new-variables/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Cross-tabulation of Data</title>
		<link>http://www.wekaleamstudios.co.uk/posts/cross-tabulation-of-data/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/cross-tabulation-of-data/#comments</comments>
		<pubDate>Fri, 15 May 2009 19:50:13 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[Data Summary]]></category>
		<category><![CDATA[Exploratory Data Analysis]]></category>
		<category><![CDATA[columns]]></category>
		<category><![CDATA[contingency table]]></category>
		<category><![CDATA[Cross tabulation]]></category>
		<category><![CDATA[crosstab]]></category>
		<category><![CDATA[data frame]]></category>
		<category><![CDATA[rows]]></category>
		<category><![CDATA[table]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=103</guid>
		<description><![CDATA[The contingency table is used to summarise data when there are factors in the data set and we are interested in counting the number of occurrences of each combination of factor variables. In R there are different ways that these types of table can be produced and manipulated as required. The [...]]]></description>
			<content:encoded><![CDATA[<p>The contingency table is used to summarise data when there are factors in the data set and we are interested in counting the number of occurrences of each combination of factor variables. In <strong>R</strong> there are different ways that these types of table can be produced and manipulated as required.<span id="more-103"></span></p>
<p>The two main functions that are used to produce contingency tables are <strong>table</strong> and <strong>xtabs</strong>. We can use these two functions to get tables of one, two or more dimensions that summarise the number of records corresponding to each combination of the variables used to create the table.</p>
<p>The simplest case is where we are interested in a summary based on a single variable and the syntax is straightforward. The function <strong>table</strong> takes a single argument that corresponds to a vector of data. For example, if we are working with a data frame based on an unbalanced design and wanted to count the number of observations corresponding to each treatment we might run some code like:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">table(temp.design3$Treatment)</pre></div></div>

<p>which would produce a simple summary table:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">A B C D 
7 7 3 7</pre></div></div>

<p>If there was a second factor in the data set corresponding to different plots, labelled 1 to 4, then we could generate a two dimensional contingency table by adding a second argument to the function call like:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">table(temp.design3$Treatment, temp.design3$Plot)</pre></div></div>

<p>and the output would be of the form:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">    1 2 3 4
  A 2 2 2 1
  B 2 2 2 1
  C 1 1 1 0
  D 2 2 2 1</pre></div></div>

<p>The function <strong>xtabs</strong> can be used to create the same contingency tables but the function works using a formula in a similar vein to the modelling functions. So to get the one dimensional table we would write code similar to this:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xtabs(~ Plot, data = temp.design3)</pre></div></div>

<p>which would summarise the data by plot:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">Plot
1 2 3 4 
7 7 7 3</pre></div></div>

<p>Note that the output differs slightly from that of the <strong>table</strong> function because <strong>xtabs</strong> labels the table with the name of the variable. The two dimensional table would be created like this:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">xtabs(~ Treatment + Plot, data = temp.design3)</pre></div></div>

<p>and the output would be:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">         Plot
Treatment 1 2 3 4
        A 2 2 2 1
        B 2 2 2 1
        C 1 1 1 0
        D 2 2 2 1</pre></div></div>
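
<p>Totals and proportions can be derived from a table once it has been created. Since the original data frame is not reproduced here, the sketch below rebuilds the two dimensional table from the counts printed above (the names <strong>counts</strong> and <strong>tab</strong> are made up for the illustration) and then applies the base functions <strong>margin.table</strong>, <strong>addmargins</strong> and <strong>prop.table</strong>:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Rebuild the table of counts shown above (filled column by column)
counts = matrix(c(2, 2, 1, 2,  2, 2, 1, 2,  2, 2, 1, 2,  1, 1, 0, 1), nrow = 4,
  dimnames = list(Treatment = c(&quot;A&quot;, &quot;B&quot;, &quot;C&quot;, &quot;D&quot;),
    Plot = c(&quot;1&quot;, &quot;2&quot;, &quot;3&quot;, &quot;4&quot;)))
tab = as.table(counts)

margin.table(tab, 1)   # totals for each treatment
margin.table(tab, 2)   # totals for each plot
addmargins(tab)        # table with a Sum row and column added
prop.table(tab, 1)     # proportions within each treatment</pre></div></div>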

<p>These functions can be extended to higher dimensions, in which case the output is printed as a series of two dimensional tables, one for each combination of levels of the remaining variables.</p>
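<p>As a sketch of the higher dimensional case, the example below uses the <strong>mtcars</strong> data set that ships with <strong>R</strong>, because the design data above has only two factors. The function <strong>ftable</strong>, which has not been discussed above, flattens the result into a single table that is often easier to read:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Three-way table: cylinders by gears by transmission type
tab3 = xtabs(~ cyl + gear + am, data = mtcars)

dim(tab3)      # 3 x 3 x 2
tab3           # printed as one cyl by gear table for each level of am
ftable(tab3)   # the same counts as a single flat table</pre></div></div>
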
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/cross-tabulation-of-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Working with Subsets of Data</title>
		<link>http://www.wekaleamstudios.co.uk/posts/working-with-subsets-of-data/</link>
		<comments>http://www.wekaleamstudios.co.uk/posts/working-with-subsets-of-data/#comments</comments>
		<pubDate>Fri, 08 May 2009 18:30:45 +0000</pubDate>
		<dc:creator>Ralph</dc:creator>
				<category><![CDATA[Data Manipulation]]></category>
		<category><![CDATA[columns]]></category>
		<category><![CDATA[data.frame]]></category>
		<category><![CDATA[ggobi]]></category>
		<category><![CDATA[rows]]></category>
		<category><![CDATA[subset]]></category>

		<guid isPermaLink="false">http://www.wekaleamstudios.co.uk/?p=83</guid>
		<description><![CDATA[There are often situations where we might be interested in a subset of our complete data and there are simple mechanisms for viewing and editing particular subsets of a data frame or other objects in R. We might be interested in using one of the variables to select a particular subset. A square bracket notation [...]]]></description>
			<content:encoded><![CDATA[<p>There are often situations where we might be interested in a subset of our complete data and there are simple mechanisms for viewing and editing particular subsets of a data frame or other objects in <strong>R</strong>.<span id="more-83"></span></p>
<p>We might be interested in using one of the variables to select a particular subset. A square bracket notation is used after the name of an object to indicate that we are interested in specific rows or columns of the data and there are a large number of options that could be used. For example, if we consider the olive oil data set used in <strong>ggobi</strong> demonstrations, we could view the data for one of the regions using the following code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">olive.df[olive.df$Area == &quot;North-Apulia&quot;,]</pre></div></div>

<p>which would give the following output:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">   Region         Area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
1       1 North-Apulia     1075          75     226  7823      672        36        60         29
2       1 North-Apulia     1088          73     224  7709      781        31        61         29
3       1 North-Apulia      911          54     246  8113      549        31        63         29
...</pre></div></div>

<p>Within the square brackets in this example we test a condition which returns a vector of TRUE and FALSE values, and <strong>R</strong> interprets our intention as viewing only those rows where the condition returned TRUE. The comma separates the row selection from the column selection, as we have two dimensions in the data frame. All columns are included as there is no expression after the comma in the example above.</p>
<p>It is possible to work with multiple conditions, so for the olive oil data we could select one of the other areas, South Apulia, and only the data points where the stearic variable is greater than 250 units. The code used in this case would be:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">olive.df[olive.df$Area == &quot;South-Apulia&quot; &amp; olive.df$stearic &gt; 250,]</pre></div></div>

<p>to give output:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">    Region         Area palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
88       1 South-Apulia     1410         232     280  6715     1233        32        60         24
89       1 South-Apulia     1509         209     257  6647     1240        42        62         30
90       1 South-Apulia     1317         197     256  7036     1067        40        60         22
...</pre></div></div>

<p>If we were interested in a particular column of data then we would specify the name of the column(s) after the comma in the square brackets. For example we could view the palmitic column only with this code:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">olive.df[,&quot;palmitic&quot;]</pre></div></div>

<p>Multiple columns could be selected by providing a vector of column names, such as:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;">olive.df[,c(&quot;palmitic&quot;,&quot;oleic&quot;)]</pre></div></div>

<p>More complicated expressions are possible with a bit of imagination. For example, if we wanted to view only the even numbered rows among the first ten then we could use seq(2,10,2) as the row index, which provides the numbers 2, 4, 6, 8 and 10.</p>
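<p>These ideas can be combined. The sketch below uses a small stand-in data frame, <strong>toy.df</strong>, built from six of the rows printed above so that it runs on its own; the object names are made up for the illustration. It selects the even numbered rows of the whole data frame, and uses the base function <strong>subset</strong>, which has not been discussed above, as an alternative to the square bracket notation:</p>

<div class="wp_syntax"><div class="code"><pre class="text" style="font-family:monospace;"># Small stand-in for the olive oil data frame
toy.df = data.frame(
  Area = c(&quot;North-Apulia&quot;, &quot;North-Apulia&quot;, &quot;North-Apulia&quot;,
    &quot;South-Apulia&quot;, &quot;South-Apulia&quot;, &quot;South-Apulia&quot;),
  palmitic = c(1075, 1088, 911, 1410, 1509, 1317),
  stearic = c(226, 224, 246, 280, 257, 256))

# Even numbered rows, however many rows there are
even.rows = toy.df[seq(2, nrow(toy.df), by = 2), ]

# Rows and columns selected in one call with subset()
south.high = subset(toy.df, Area == &quot;South-Apulia&quot; &amp; stearic &gt; 256,
  select = c(palmitic, stearic))

# drop = FALSE keeps a single column as a data frame rather than a vector
palmitic.df = toy.df[, &quot;palmitic&quot;, drop = FALSE]</pre></div></div>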
]]></content:encoded>
			<wfw:commentRss>http://www.wekaleamstudios.co.uk/posts/working-with-subsets-of-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

