Code performance in R: Working with large datasets

2021-08-05 by Mira Céline Klein

This is the fourth part of our series about code performance in R. In the first part, I introduced methods to measure which part of a given code is slow. The second part lists general techniques to make R code faster. The third part deals with parallelization. In this part we are going to have a look at the challenges that come with large datasets.

Whether your dataset is "large" depends not only on the number of rows, but also on the method you are going to use. It's easy to compute the mean or sum of as many as 10,000 numbers, but a nonlinear regression with many variables can already take some time with a sample size of 1,000.

Sometimes it helps to parallelize (see part 3 of the series). But with large datasets, parallelization only helps up to the point where working memory becomes the limiting factor, and some tasks cannot be parallelized at all. In these cases, the strategies from part 2 of this series may be helpful, and there are some further options:

Sampling and data splitting

For large datasets, some computations not only become very slow but outright impossible, for example because they exceed the available working memory. The good news is that it is often sufficient to work on a sample, for instance to compute summary statistics or to estimate a regression model. At least during code development, this can be very useful.

Another option is to divide your data into multiple parts, do your computations on each part separately, and recombine the results (e.g., by averaging regression coefficients). Sometimes you can even execute those computations in parallel, although working memory was not sufficient to run them on the whole dataset. This sounds counterintuitive, but the reason is the following: Many methods (e.g., regression analysis) work with matrices. The size of these matrices often grows quadratically with the number of observations, and so do the computational costs and the required working memory. Doing the computation with half the sample size therefore requires only about a quarter of the resources, not half.
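Both ideas can be sketched in a few lines. This is only a minimal illustration on simulated data, assuming a simple linear model; the object names (bigData, sampleData, chunkId), the sample size, and the number of parts are made up for this example, and averaging coefficients is a rough way of recombining results rather than a general recipe. The last lines illustrate the "quarter of the resources" argument for a square matrix of doubles.

```r
set.seed(1)

# Simulated "large" dataset (hypothetical example data)
n <- 100000
bigData <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
bigData$y <- 1 + 2 * bigData$x1 - 0.5 * bigData$x2 + rnorm(n)

# Option 1: work on a random sample of the rows
sampleData <- bigData[sample(nrow(bigData), 5000), ]
coefSample <- coef(lm(y ~ x1 + x2, data = sampleData))

# Option 2: split the data into parts, fit each part separately,
# and recombine the results by averaging the coefficients
nChunks <- 10
chunkId <- sample(rep(seq_len(nChunks), length.out = nrow(bigData)))
coefList <- lapply(split(bigData, chunkId), function(part) {
  coef(lm(y ~ x1 + x2, data = part))
})
coefSplit <- Reduce(`+`, coefList) / nChunks

# Compare both estimates
round(rbind(sample = coefSample, split = coefSplit), 2)

# Memory needed for an n x n matrix of doubles (8 bytes per entry), in GB
matrixGb <- function(n) 8 * n^2 / 1e9
matrixGb(20000)  # 3.2
matrixGb(10000)  # 0.8, i.e., a quarter
```

Because the parts are independent of each other, the lapply() call is also the natural place to plug in one of the parallel alternatives from part 3.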

Free working memory!

If you run into working memory problems, check whether there are large objects in your workspace that you no longer need. Remove them with rm() and then trigger a so-called "garbage collection" with gc() to return the memory to the operating system.

Garbage collections take place automatically on a regular basis, but calling gc() explicitly ensures that one happens right away.
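In practice, this could look as follows. The object largeObject is a hypothetical stand-in for whatever takes up space in your own workspace.

```r
# Hypothetical large object: 10 million random numbers
largeObject <- rnorm(1e7)

# List the objects in the workspace by size (in bytes), largest first
objNames <- ls()
objSizes <- vapply(objNames,
                   function(nm) as.numeric(object.size(get(nm))),
                   numeric(1))
head(sort(objSizes, decreasing = TRUE))

# Remove the object once it is no longer needed ...
rm(largeObject)

# ... and trigger a garbage collection right away
invisible(gc())
```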

Base R vs. dplyr vs. data.table

Especially for data handling, dplyr is much more elegant than base R, and often faster. But there is an even faster alternative: the data.table package. The difference is already visible for very small operations such as selecting columns or computing the mean for subgroups:

library(data.table)
library(dplyr)
library(microbenchmark)

cols <- c("Sepal.Length", "Sepal.Width")
irisDt <- as.data.table(iris)

# Select columns
microbenchmark("dplyr" = iris %>% select(all_of(cols)),
               "data.table" = irisDt[, cols, with = FALSE])

## Unit: microseconds
## expr            min        lq      mean    median        uq      max neval
## dplyr      1890.744 2099.4445 2770.3403 2401.4760 3132.7005 9259.750   100
## data.table   62.763   76.5215  179.3211  110.4575  147.2455 5923.169   100

# Compute grouped mean
microbenchmark("dplyr" = iris %>%
                 group_by(Species) %>%
                 summarise(mean(Sepal.Length)),
               "data.table" = irisDt[,.(meanSL = mean(Sepal.Length)),
                                     by = Species])

## Unit: microseconds
## expr            min       lq      mean   median        uq       max neval
## dplyr      3758.252 4686.548 5769.8606 5533.120 6430.0995 14503.304   100
## data.table  415.039  512.455  665.5811  613.622  718.2905  1646.667   100

The differences for more time-consuming operations are equally impressive.
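To check this on your own machine, you can repeat the grouped mean on a larger dataset. The following is a rough sketch on simulated data; the names bigDf and bigDt and the size of one million rows are arbitrary choices for this example, and the timings will depend on your hardware and package versions.

```r
library(data.table)
library(dplyr)

set.seed(1)

# Simulated dataset with one million rows and 26 groups (hypothetical)
n <- 1e6
bigDf <- data.frame(group = sample(letters, n, replace = TRUE),
                    value = rnorm(n))
bigDt <- as.data.table(bigDf)

# Time the grouped mean with dplyr
timeDplyr <- system.time(
  resDplyr <- bigDf %>%
    group_by(group) %>%
    summarise(meanValue = mean(value))
)

# Time the same computation with data.table
timeDt <- system.time(
  resDt <- bigDt[, .(meanValue = mean(value)), by = group]
)

# Elapsed time in seconds
c(dplyr = timeDplyr[["elapsed"]], data.table = timeDt[["elapsed"]])
```

A single system.time() call is much noisier than microbenchmark, so run it a few times before drawing conclusions.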

Use a database

Instead of loading your complete data into R before each analysis, you can store it in a database, e.g., an SQL database. This has several advantages:

When you retrieve data from the database, you can specify which rows and columns you need in your database query. You don't need to load the whole dataset into the working memory.
You can even do some data handling steps in the query (e.g., sorting, grouped computations).
You can also create and store preprocessed datasets (e.g., aggregated or combined datasets) in the database and access them from R. Databases are in general quite good and fast for these kinds of computations.
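These three points can be illustrated with the DBI package. The sketch below uses an in-memory SQLite database (via the RSQLite package) purely as a stand-in for a real database server; the table names iris and iris_means and the renamed columns are assumptions made for this example.

```r
library(DBI)

# In-memory SQLite database, used here only as a stand-in for a real database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Store the data once in the database; the columns are renamed
# so that they are easy to use in SQL
irisDb <- iris
names(irisDb) <- c("sepal_length", "sepal_width",
                   "petal_length", "petal_width", "species")
irisDb$species <- as.character(irisDb$species)
dbWriteTable(con, "iris", irisDb)

# 1. Load only the rows and columns that are needed
setosa <- dbGetQuery(con, "
  SELECT sepal_length, sepal_width
  FROM iris
  WHERE species = 'setosa'
")

# 2. Do a grouped computation directly in the query
speciesMeans <- dbGetQuery(con, "
  SELECT species, AVG(sepal_length) AS mean_sl
  FROM iris
  GROUP BY species
  ORDER BY species
")

# 3. Store a preprocessed dataset in the database and read it back later
dbExecute(con, "
  CREATE TABLE iris_means AS
  SELECT species, AVG(sepal_length) AS mean_sl
  FROM iris
  GROUP BY species
")
storedMeans <- dbReadTable(con, "iris_means")

dbDisconnect(con)
```

With a real database, only the dbConnect() call changes; the queries stay largely the same, apart from differences in SQL dialect.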
