R data.table Example

Author

Matthew DeHaven

Published

March 31, 2024

Loading data.table

We are going to do some comparisons with the tidyverse, so go ahead and load that too.

library(tidyverse)
library(data.table)

Air Traffic Data

nycflights13 is a package that bundles several datasets related to flights in and out of NYC airports in 2013.

If you didn’t install this last time, run

renv::install("nycflights13")
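If you are not managing packages with renv, a plain install.packages() call works just as well:

install.packages("nycflights13")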

Then load it into our session.

library(nycflights13)

So we’ve got data about airlines, airports, flights, planes, and weather.
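If you want to see exactly which datasets the package exports and what the main table looks like, here is a quick peek (just a sketch, not required for anything that follows):

ls("package:nycflights13") ## list the datasets the package exports
glimpse(flights)           ## column names and types of the main table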

Data Work

Both weather and flights are fairly large datasets.

dim(weather)
[1] 26115    15
dim(flights)
[1] 336776     19

They can be merged using the “origin” and “time_hour” columns.
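Before merging, it is worth confirming that both tables actually have those columns, and that weather has at most one row per origin and hour (otherwise the join would duplicate flight rows). A quick check, sketched with the tidyverse verbs already loaded:

## both join columns exist in both tables
c("origin", "time_hour") %in% names(flights)
c("origin", "time_hour") %in% names(weather)

## any duplicated origin/hour combinations in weather?
weather |>
  count(origin, time_hour) |>
  filter(n > 1)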

Let’s merge these two datasets three ways: with the tidyverse, with data.table, and with data.table after setting keys. We’ll measure how long each takes.

tidyverse left_join

start <- Sys.time() ## Get current time from computer

tb <- flights |>
  left_join(weather, by = join_by(origin, time_hour))

end <- Sys.time()

difftime(end, start, units = "secs")
Time difference of 0.05079699 secs

data.table merge

weather_dt <- as.data.table(weather)
flights_dt <- as.data.table(flights)

start <- Sys.time()

dt <- flights_dt |>
  merge(weather_dt, by = c("origin", "time_hour"), all.x = TRUE)

end <- Sys.time()

difftime(end, start, units = "secs")
Time difference of 0.157572 secs

data.table merge with keys

setkey(weather_dt, "origin", "time_hour")
setkey(flights_dt, "origin", "time_hour")

start <- Sys.time()

dt2 <- flights_dt |>
  merge(weather_dt, by = c("origin", "time_hour"), all.x = TRUE)

end <- Sys.time()

difftime(end, start, units = "secs")
Time difference of 0.1394958 secs

In this example data.table is actually slower than the tidyverse! That probably wouldn’t remain true if the datasets were 10x larger. You can see a modest speedup for data.table once a key is set before merging (roughly 0.14 seconds versus 0.16 seconds here).
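As an aside, once keys are set data.table also has its own join syntax, X[Y], which joins Y to X on the key columns. A minimal sketch of the same left join of weather onto flights (the result matches the merge() above up to column order and the naming of the overlapping year/month/day/hour columns):

## keep every row of flights_dt, looking up weather by the key (origin, time_hour)
dt_bracket <- weather_dt[flights_dt]

nrow(dt_bracket) == nrow(flights_dt) ## TRUE if weather has one row per key combination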

Comparing Summarizing speeds

tidyverse summarize

start <- Sys.time() 

tb <- flights |>
  group_by(origin, dest, flight) |>
  drop_na(dep_delay, arr_delay) |>
  summarize(
    avg_dep_delay = mean(dep_delay),
    avg_arr_delay = mean(arr_delay)
  )
`summarise()` has grouped output by 'origin', 'dest'. You can override using
the `.groups` argument.
end <- Sys.time()

difftime(end, start, units = "secs")
Time difference of 0.2266259 secs

data.table summarize

flights_dt <- as.data.table(flights)

start <- Sys.time()

dt <- flights_dt[
  !is.na(dep_delay) & !is.na(arr_delay),
  .(avg_dep_delay = mean(dep_delay), avg_arr_delay = mean(arr_delay)),
  by = .(origin, dest, flight)
  ]

end <- Sys.time()

difftime(end, start, units = "secs")
Time difference of 0.02212501 secs

data.table summarize with keys

setkey(flights_dt, "origin", "dest", "flight")

start <- Sys.time()

dt2 <- flights_dt[
  !is.na(dep_delay) & !is.na(arr_delay),
  .(avg_dep_delay = mean(dep_delay), avg_arr_delay = mean(arr_delay)),
  by = .(origin, dest, flight)
  ]

end <- Sys.time()

difftime(end, start, units = "secs")
Time difference of 0.1122601 secs

Now we see that data.table was about 10x faster than the tidyverse. Setting a key doesn’t speed up the grouped summary, however; on this run the keyed version was actually slower than the unkeyed one.
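One caveat on all of these numbers: each timing is a single run with Sys.time(), which is noisy at sub-second scales. If you have the bench package installed (an assumption here; microbenchmark works similarly), you can rerun each expression several times and compare medians, for example for the grouped summary:

library(bench)

bench::mark(
  tidyverse = flights |>
    group_by(origin, dest, flight) |>
    drop_na(dep_delay, arr_delay) |>
    summarize(
      avg_dep_delay = mean(dep_delay),
      avg_arr_delay = mean(arr_delay),
      .groups = "drop"
    ),
  data.table = flights_dt[
    !is.na(dep_delay) & !is.na(arr_delay),
    .(avg_dep_delay = mean(dep_delay), avg_arr_delay = mean(arr_delay)),
    by = .(origin, dest, flight)
  ],
  check = FALSE,  ## outputs differ in class and row order, so skip the equality check
  iterations = 10
)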