Measuring the length of time to run a function

March 16th, 2010

When writing R code it is useful to be able to assess the amount of time that a particular function takes to run. We might be interested in measuring the increase in time required by our function as the size of the data increases.

To illustrate the system.time function, which reports the time taken to evaluate an expression, consider a set of football results where we use a logistic regression model to determine the factors that change the probability of a home win. If we fit the model with the glm function, with variables for the home and away teams, we can time the fit by placing the function call inside system.time.
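For readers without the original match data, here is a minimal sketch of the idea. The simulated data frame, the team labels, the home-win probability, and the helper name simulate_results are all assumptions made for illustration; they are not the data set used in this post.

```r
# Hypothetical helper: simulate n match results between made-up teams.
simulate_results <- function(n, n_teams = 20) {
  teams <- paste0("Team", seq_len(n_teams))
  data.frame(
    Home    = factor(sample(teams, n, replace = TRUE), levels = teams),
    Away    = factor(sample(teams, n, replace = TRUE), levels = teams),
    HomeWin = rbinom(n, size = 1, prob = 0.45)  # 1 = home win, 0 = otherwise
  )
}

set.seed(1)
results.df <- simulate_results(1000)

# system.time evaluates the expression and returns the time taken in seconds.
timing <- system.time(
  fit <- glm(HomeWin ~ Home + Away, family = binomial, data = results.df)
)
print(timing)

# Individual components can be extracted by name.
timing[["elapsed"]]
```

The user and system columns report CPU time used by the R process and by the operating system on its behalf, while elapsed is the wall-clock time.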

If the data are stored in a data frame called results.df, the call to fit the logistic regression model would be something of this form:

glm(HomeWin ~ Home + Away, family = binomial, data = results.df)

The family = binomial argument is what makes this a logistic regression; without it glm fits an ordinary Gaussian linear model. Wrapping the call in system.time gives:

> system.time(glm(HomeWin ~ Home + Away, family = binomial, data = results.df))
   user  system elapsed 
   1.62    0.08    1.72

The output is measured in seconds and is based on a set of data with 1,000 match results. We could extend the data set to 2,000 match results to see how the time to fit the model increases. If the new data set is stored in the data frame results.df2 then the function call would be:

> system.time(glm(HomeWin ~ Home + Away, family = binomial, data = results.df2))
   user  system elapsed 
   4.37    0.14    4.55

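Based on these two runs, doubling the number of match results increases the time to run the function by a factor of approximately 2.7 (user time: 4.37 / 1.62 ≈ 2.70; elapsed time: 4.55 / 1.72 ≈ 2.65). This use of system.time provides some elementary information about the time taken for the expression to be evaluated.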
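The same comparison can be repeated over several data sizes in one go. The following is a rough sketch under stated assumptions: the data are simulated, and the helper make_data, the team labels, and the chosen sizes are hypothetical stand-ins for the real match results.

```r
# Assumption: simulated data stand in for the real match results.
set.seed(42)
teams <- paste0("Team", 1:20)

make_data <- function(n) {
  data.frame(
    Home    = factor(sample(teams, n, replace = TRUE), levels = teams),
    Away    = factor(sample(teams, n, replace = TRUE), levels = teams),
    HomeWin = rbinom(n, size = 1, prob = 0.45)
  )
}

sizes <- c(1000, 2000, 4000)

# Elapsed time (seconds) to fit the model at each data size.
elapsed <- sapply(sizes, function(n) {
  d <- make_data(n)
  system.time(
    glm(HomeWin ~ Home + Away, family = binomial, data = d)
  )[["elapsed"]]
})

timings <- data.frame(rows = sizes, elapsed = elapsed)

# Growth relative to the smallest data set. This can be NaN or Inf when a
# fit is faster than the resolution of the timer.
timings$ratio <- timings$elapsed / timings$elapsed[1]
print(timings)
```

On a modern machine these fits may take only a few milliseconds each, so larger sizes may be needed before the ratios become meaningful.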

Posted by Ralph at 11:50 pm

4 responses to “Measuring the length of time to run a function”

fabians says:

March 17, 2010 at 9:21 am

If you really want to profile your code, running system.time just once is a really bad idea, especially for tasks that only take a couple of seconds, because the variability between runs is large. Wrap it in a replicate statement to get a more reliable picture:

> # n, x1, x2 and y are defined beforehand (definitions not shown)
> replicate(10, system.time(glm(y ~ x1 + x2, family = binomial())))
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
user.self 1.67 1.22 1.30 1.22 1.23 1.48 1.44 1.35 1.32   1.5
sys.self  0.01 0.02 0.00 0.02 0.00 0.00 0.00 0.00 0.00   0.0
elapsed   1.74 1.22 1.34 1.21 1.34 1.50 1.48 1.36 1.38   1.5
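The matrix returned by replicate has one column per run, so the spread of the timings can be summarised directly. A minimal sketch, assuming simulated x1, x2 and y (the sample size and coefficients below are made up and are not the values used in this comment):

```r
set.seed(123)
n  <- 5000                      # assumed sample size
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(0.5 * x1 - 0.25 * x2))

# Each column of the result holds the timings from one run.
times <- replicate(10, system.time(glm(y ~ x1 + x2, family = binomial())))

# Summarise the wall-clock times across the ten runs.
elapsed <- times["elapsed", ]
c(mean = mean(elapsed), sd = sd(elapsed), min = min(elapsed), max = max(elapsed))
```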

Tal Galili says:

March 17, 2010 at 10:15 am

Hi Ralph,

BTW, another great tool is the rbenchmark package, see:
http://code.google.com/p/rbenchmark/
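As a rough illustration of how the package's benchmark function can be used, the sketch below compares two model fits over repeated runs. It assumes rbenchmark is installed (for example via install.packages("rbenchmark")); the simulated data and the choice of expressions to compare are hypothetical.

```r
# Assumption: the rbenchmark package is installed; skip quietly otherwise.
if (requireNamespace("rbenchmark", quietly = TRUE)) {
  set.seed(1)
  n  <- 2000                    # assumed sample size
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  y  <- rbinom(n, size = 1, prob = plogis(x1 - x2))

  # Run each expression 10 times and report total and relative elapsed time.
  res <- rbenchmark::benchmark(
    logistic = glm(y ~ x1 + x2, family = binomial()),
    linear   = lm(y ~ x1 + x2),
    replications = 10,
    columns = c("test", "replications", "elapsed", "relative")
  )
  print(res)
}
```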

Ralph says:

March 17, 2010 at 8:32 pm

Thanks for your comment Fabians,

Naturally, running the analysis a single time will not provide information about the range of run times. Subsequent posts will look at alternative ways of profiling code and address the variability in run time with increasing numbers of rows and/or columns in a data set.

When I ran the analysis on ten separate occasions the timings observed were 1.74, 1.63, 1.62, 1.63, 1.66, 1.59, 1.64, 1.60, 1.60 and 1.64, highlighting your point about assessing variability.
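These ten timings can be summarised with a few lines of R. This is only a quick sketch; the variable name timings is chosen here for illustration.

```r
# The ten timings quoted above, in seconds.
timings <- c(1.74, 1.63, 1.62, 1.63, 1.66, 1.59, 1.64, 1.60, 1.60, 1.64)

mean(timings)   # average run time
sd(timings)     # run-to-run variability
range(timings)  # fastest and slowest runs
```

The mean is 1.635 seconds, with the runs ranging from 1.59 to 1.74 seconds.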

Ralph