When writing **R** code it is useful to be able to assess the amount of time that a particular function takes to run. We might be interested in measuring the increase in time required by our function as the size of the data increases.

To illustrate the use of the **system.time** function to measure the time taken to run an expression, consider a set of football results, where a logistic regression model is used to determine the factors that change the probability of a home win. If we fit a logistic regression model with the **glm** function to a data set with variables for the home and away teams, we can embed the function call inside **system.time**.

If the data is stored in a data frame called **results.df**, the call to fit the logistic regression model takes this form (note the **family = binomial** argument, which is what makes **glm** fit a logistic rather than a linear model):

glm(HomeWin ~ Home + Away, data = results.df, family = binomial)

Wrapped in **system.time**, the call and its output look like this:

> system.time(glm(HomeWin ~ Home + Away, data = results.df, family = binomial))
   user  system elapsed 
   1.62    0.08    1.72 

The output is measured in seconds and is based on a set of data with 1,000 match results. We could extend the data set to 2,000 match results to see how the time to fit the model increases. If the new data set is stored in the data frame **results.df2** then the function call would be:

> system.time(glm(HomeWin ~ Home + Away, data = results.df2, family = binomial))
   user  system elapsed 
   4.37    0.14    4.55 

The time to run the function increased by a factor of roughly 2.7 between these two runs. This use of **system.time** provides some elementary information about the time taken to evaluate an expression.
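Since **results.df** itself is not shown in the post, a self-contained version of the timing above can be sketched by simulating a comparable data set. The team names, sample size and home-win probability below are illustrative assumptions, not values from the original data.

```r
# Hypothetical stand-in for results.df: home/away team factors and a
# binary home-win outcome, matching the model formula used in the post.
set.seed(42)
n <- 1000
teams <- paste0("Team", 1:20)
results.df <- data.frame(
  Home    = factor(sample(teams, n, replace = TRUE)),
  Away    = factor(sample(teams, n, replace = TRUE)),
  HomeWin = rbinom(n, 1, 0.45)
)

# system.time returns user, system and elapsed times in seconds
timing <- system.time(
  glm(HomeWin ~ Home + Away, data = results.df, family = binomial)
)
timing["elapsed"]
```

The absolute numbers will differ from those in the post, since they depend on the machine and the data.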

If you really want to profile your code, running **system.time** just once is a bad idea, especially for tasks that only take a couple of seconds, because the variability is large. Wrap it in a **replicate** call to get more precision, since a single measurement isn't reliable:

> n <- ...; x1 <- ...; x2 <- ...; y <- ...   # set-up values not shown
> replicate(10, system.time(glm(y ~ x1 + x2, family = binomial())))
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
user.self 1.67 1.22 1.30 1.22 1.23 1.48 1.44 1.35 1.32  1.5
sys.self  0.01 0.02 0.00 0.02 0.00 0.00 0.00 0.00 0.00  0.0
elapsed   1.74 1.22 1.34 1.21 1.34 1.50 1.48 1.36 1.38  1.5
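Because **replicate** returns a matrix with one column per run, the spread of the timings can be summarised directly. The simulated model below is a small stand-in (my own choice of data, not from the comment) so the snippet runs on its own:

```r
# Repeat a timed glm fit and summarise the elapsed times.
set.seed(1)
x <- rnorm(1e4)
y <- rbinom(1e4, 1, plogis(x))

# Keep only the "elapsed" entry from each system.time() result
times <- replicate(10, system.time(glm(y ~ x, family = binomial()))["elapsed"])
c(mean = mean(times), sd = sd(times), min = min(times), max = max(times))
</```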

Hi Ralph,

BTW, another great tool is the **rbenchmark** package, see:

http://code.google.com/p/rbenchmark/

Thanks for your comment Fabians,

Naturally, running the analysis a single time will not provide information about the range of run times. Subsequent posts will look at alternative ways of profiling code and will address the variability in run time as the number of rows and/or columns in a data set increases.

When I ran the analysis on ten separate occasions the elapsed times were 1.74, 1.63, 1.62, 1.63, 1.66, 1.59, 1.64, 1.60, 1.60 and 1.64 seconds, which highlights your point about assessing variability.
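The ten elapsed times listed above can be summarised in a couple of lines:

```r
# Elapsed times (seconds) from the ten runs reported in the reply
timings <- c(1.74, 1.63, 1.62, 1.63, 1.66, 1.59, 1.64, 1.60, 1.60, 1.64)
mean(timings)  # 1.635
sd(timings)    # approximately 0.043
```

So the spread is small relative to the mean here, but it is still worth measuring.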

Ralph
