library(tidyverse)
library(data.table)R data.table Example
Loading data.table
We are going to do some comparisons with tidyverse so go ahead and load that too.
Air Traffic Data
nycflights13 is a package which stores a bunch of datasets related to flights in and out of NYC airports in 2013.
If you didn’t install this last time, run
renv::install("nycflights13")Then load it into our session.
library(nycflights13)So we’ve got data about the airlines, about airports, number of flights, planes, and weather.
Data Work
Both weather and flights are fairly large datasets.
dim(weather)[1] 26115 15
dim(flights)[1] 336776 19
They can be merged by using “time_hour” and “origin” columns.
Let’s merge these two datasets using tidyverse, data.table, and data.table with keys. We’ll measure how long each takes.
tidyverse full_join
start <- Sys.time() ## Get current time from computer
tb <- flights |>
left_join(weather, by = join_by(origin, time_hour))
end <- Sys.time()
difftime(end, start, units = "secs")Time difference of 0.05079699 secs
data.table merge
weather_dt <- as.data.table(weather)
flights_dt <- as.data.table(flights)
start <- Sys.time()
dt <- flights_dt |>
merge(weather_dt, by = c("origin", "time_hour"), all.x = TRUE)
end <- Sys.time()
difftime(end, start, units = "secs")Time difference of 0.157572 secs
data.table merge with keys
setkey(weather_dt, "origin", "time_hour")
setkey(flights_dt, "origin", "time_hour")
start <- Sys.time()
dt2 <- flights_dt |>
merge(weather_dt, by = c("time_hour", "origin"), all.x = T)
end <- Sys.time()
difftime(end, start, units = "secs")Time difference of 0.1394958 secs
In this example data.table is actually slower than the tidyverse! That probably wouldn’t remain true if the datasets were 10x larger. You can see the speed up for data.table once you set a key before merging.
Comparing Summarizing speeds
tidyverse summarize
start <- Sys.time()
tb <- flights |>
group_by(origin, dest, flight) |>
drop_na(dep_delay, arr_delay) |>
summarize(
avg_dep_delay = mean(dep_delay),
avg_arr_delay = mean(arr_delay)
)`summarise()` has grouped output by 'origin', 'dest'. You can override using
the `.groups` argument.
end <- Sys.time()
difftime(end, start, units = "secs")Time difference of 0.2266259 secs
data.table summarize
flights_dt <- as.data.table(flights)
start <- Sys.time()
dt <- flights_dt[
!is.na(dep_delay) & !is.na(arr_delay),
.(avg_dep_delay = mean(dep_delay), avg_arr_delay = mean(arr_delay)),
by = .(origin, dest, flight)
]
end <- Sys.time()
difftime(end, start, units = "secs")Time difference of 0.02212501 secs
data.table summarize with keys
setkey(flights_dt, "origin", "dest", "flight")
start <- Sys.time()
dt2 <- flights_dt[
!is.na(dep_delay) & !is.na(arr_delay),
.(avg_dep_delay = mean(dep_delay), avg_arr_delay = mean(arr_delay)),
by = .(origin, dest, flight)
]
end <- Sys.time()
difftime(end, start, units = "secs")Time difference of 0.1122601 secs
Now we see that data.table was 10x faster than the tidyverse. Setting a key doesn’t speed things up by much, however.