Code performance in R: Which part of the code is slow?

2021-04-26 by Mira Céline Klein

This is the first part of our series about code performance in R.

Let's assume you have written some code: it works and computes the results you need, but it is really slow. If you don't want to be slowed down in your work, you have no choice but to improve the code's performance. But how to start? The best approach is to first find out which part of the code needs optimizing.

It is not always obvious which part of the code makes it slow, or which of several alternatives is fastest. There is a risk of spending a lot of time optimizing the wrong part of the code. Fortunately, there are ways to systematically measure how long a computation takes. An easy way is the function system.time. Just wrap your code in this function and you will get the time it took to run. Note that system.time returns only the timing, so assign the result inside the call if you want to keep it.

Let's draw a lot of random numbers as an example:

system.time(runif(n = 1000000))

##      user      system       elapsed 
##      0.024     0.003        0.027

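The value for user shows the "time charged for the execution of user instructions of the calling process" (see ?proc.time). The value for system is the CPU time the operating system spent on behalf of that process, and elapsed is the wall-clock time that passed, which is usually the most relevant number when you are waiting for a result.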
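As a minimal sketch of how the timing and the result can both be kept, the following example times a deliberately slow loop. The helper slow_sum is hypothetical and exists only for illustration; it is not part of the original example.

```r
# Hypothetical helper for illustration: sums square roots in an explicit loop
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + sqrt(i)
  }
  total
}

# Assign inside system.time() so that the result survives the call
timing <- system.time(result <- slow_sum(1e6))

timing[["elapsed"]]    # wall-clock seconds
timing[["user.self"]]  # CPU seconds spent on user instructions of the R process
```

The object returned by system.time is a named numeric vector, so individual components can be extracted and stored, for example to compare several runs later.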

system.time is only useful for computations that take a bit longer (at least a few seconds). If a single execution takes only microseconds, the result of system.time will vary widely due to small tasks your system happens to be doing in the background. It is then better to run the code several times and look at the mean or median of the computation times.

Luckily, the microbenchmark package already provides a tool for this. In addition, it lets you compare different pieces of code. microbenchmark runs an expression several times (100 times by default) and computes summary statistics for the computation times. Let's try it out by comparing two different ways of subsetting:

library(microbenchmark)

microbenchmark(
   "dollarSign" = iris$Sepal.Length[10],
   "brackets" = iris[10, "Sepal.Length"]
)

## Unit: nanoseconds
##        expr  min   lq     mean median     uq   max neval
##  dollarSign  584  651  1036.28  906.5 1042.5 15651   100
##    brackets 8391 9054 11322.02 9355.0 9885.0 49936   100
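The benchmark above can also be stored in an object and inspected programmatically. The following is a sketch under a few assumptions: the value times = 1000 and the unit "us" are arbitrary choices for illustration, and the identical() check is an addition that confirms both expressions return the same value before their speed is compared.

```r
library(microbenchmark)

# Confirm that both expressions return the same value before comparing speed
same_result <- identical(iris$Sepal.Length[10], iris[10, "Sepal.Length"])

# Store the benchmark; times = 1000 is an arbitrary choice (the default is 100)
mb <- microbenchmark(
  dollarSign = iris$Sepal.Length[10],
  brackets   = iris[10, "Sepal.Length"],
  times = 1000
)

# Summary as a data frame, with all timings reported in microseconds
res <- summary(mb, unit = "us")
res[, c("expr", "median", "neval")]
```

More repetitions make the summary statistics more stable, at the cost of a longer benchmark run.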

Although both versions lead to the same result, the bracket version clearly took longer on average. But a look at min and max reveals that a single run could have suggested the opposite: the slowest dollarSign run (15651 ns) took longer than the fastest brackets run (8391 ns).

Optimizing a whole script by spending hours adding system.time or microbenchmark to every single function may sound a bit inefficient to you. A better way is to use the profiler from the profvis package in RStudio. Just wrap your code in profvis() and run it. The profiler interface will open and give you a clear overview of which part of the code takes how much time. In the following example, we simulate some data, create two plots, and fit the same statistical model twice using two different functions.

library(profvis)

profvis({
  library(dplyr)
  library(ggplot2)
  
  # Simulate data
  n <- 5000000
  dat <- data.frame(norm = rnorm(n),
                    unif = runif(n),
                    poisson = rpois(n, lambda = 5))
  
  # Compute more variables
  dat <- dat %>%
    mutate(var1 = norm + unif,
           var2 = poisson - unif + min(poisson - unif),
           var3 = 3 * unif - 0.5 * norm)
  
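  # Plots (print() is needed because ggplot objects are not drawn
  # automatically inside a code block)
  print(ggplot(dat, aes(x = var1, y = var3)) +
    geom_point() +
    geom_smooth(method = "lm"))
  print(ggplot(dat, aes(var1)) +
    geom_histogram() +
    geom_vline(xintercept = 0, color = "red"))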
  
  # Models
  modLm <- lm(var1 ~ var2 + var3, data = dat)
  summary(modLm)
  modGlm <- glm(var1 ~ var2 + var3, data = dat,
                family = gaussian(link = "identity"))
  summary(modGlm)
})
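profvis builds on R's built-in sampling profiler, Rprof. When the interactive profvis interface is not available, the same kind of information can be collected with base R alone. The following is only a sketch under stated assumptions: the data set is much smaller than in the example above, the variable names x1, x2, and y are made up for illustration, and the models are fitted five times so that the profiler collects enough samples.

```r
set.seed(1)

# Assumption: a smaller simulated data set than in the article
n <- 2e5
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$y <- dat$x1 + 2 * dat$x2 + rnorm(n)

# Start the sampling profiler and write the samples to a temporary file
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)

# Fit the same model with lm and glm, repeated to collect enough samples
for (i in 1:5) {
  modLm  <- lm(y ~ x1 + x2, data = dat)
  modGlm <- glm(y ~ x1 + x2, data = dat,
                family = gaussian(link = "identity"))
}

# Stop profiling and summarise the time spent per function
Rprof(NULL)
prof <- summaryRprof(prof_file)
head(prof$by.total)
```

The by.total table lists, for each function, the time spent in it including the functions it calls, which makes it possible to compare the lm and glm calls without a graphical interface.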

Note: The computation times always fluctuate due to random influences. This means that the results of system.time, microbenchmark, and the profiler will vary a bit with each execution.
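The fluctuation can be made visible with base R by timing the same expression repeatedly and summarising the elapsed times. This is a minimal sketch; the expression being timed and the number of repetitions (20) are arbitrary choices for illustration.

```r
# Time the same expression 20 times and keep only the elapsed seconds
timings <- replicate(20, system.time(sort(runif(1e5)))[["elapsed"]])

# The spread between minimum and maximum shows the run-to-run variation
summary(timings)

# The median is less affected by occasional slow runs than the mean
median(timings)
```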

Further parts of the article series:

2021-08-05 by Mira Céline Klein

Code performance in R: Working with large datasets

2021-06-30 by Mira Céline Klein

Code performance in R: Parallelization

2021-05-05 by Mira Céline Klein