library(tidyverse)
library(data.table)
R data.table Example
Loading data.table
We are going to do some comparisons with tidyverse
so go ahead and load that too.
Air Traffic Data
nycflights13
is a package which stores a bunch of datasets related to flights in and out of NYC airports in 2013.
If you didn’t install this last time, run
::install("nycflights13") renv
Then load it into our session.
library(nycflights13)
So we’ve got data about the airlines, about airports, number of flights, planes, and weather.
Data Work
Both weather and flights are fairly large datasets.
dim(weather)
[1] 26115 15
dim(flights)
[1] 336776 19
They can be merged by using “time_hour” and “origin” columns.
Let’s merge these two datasets using tidyverse
, data.table
, and data.table
with keys. We’ll measure how long each takes.
tidyverse full_join
<- Sys.time() ## Get current time from computer
start
<- flights |>
tb left_join(weather, by = join_by(origin, time_hour))
<- Sys.time()
end
difftime(end, start, units = "secs")
Time difference of 0.05079699 secs
data.table merge
<- as.data.table(weather)
weather_dt <- as.data.table(flights)
flights_dt
<- Sys.time()
start
<- flights_dt |>
dt merge(weather_dt, by = c("origin", "time_hour"), all.x = TRUE)
<- Sys.time()
end
difftime(end, start, units = "secs")
Time difference of 0.157572 secs
data.table merge with keys
setkey(weather_dt, "origin", "time_hour")
setkey(flights_dt, "origin", "time_hour")
<- Sys.time()
start
<- flights_dt |>
dt2 merge(weather_dt, by = c("time_hour", "origin"), all.x = T)
<- Sys.time()
end
difftime(end, start, units = "secs")
Time difference of 0.1394958 secs
In this example data.table
is actually slower than the tidyverse
! That probably wouldn’t remain true if the datasets were 10x larger. You can see the speed up for data.table
once you set a key before merging.
Comparing Summarizing speeds
tidyverse summarize
<- Sys.time()
start
<- flights |>
tb group_by(origin, dest, flight) |>
drop_na(dep_delay, arr_delay) |>
summarize(
avg_dep_delay = mean(dep_delay),
avg_arr_delay = mean(arr_delay)
)
`summarise()` has grouped output by 'origin', 'dest'. You can override using
the `.groups` argument.
<- Sys.time()
end
difftime(end, start, units = "secs")
Time difference of 0.2266259 secs
data.table summarize
<- as.data.table(flights)
flights_dt
<- Sys.time()
start
<- flights_dt[
dt !is.na(dep_delay) & !is.na(arr_delay),
avg_dep_delay = mean(dep_delay), avg_arr_delay = mean(arr_delay)),
.(= .(origin, dest, flight)
by
]
<- Sys.time()
end
difftime(end, start, units = "secs")
Time difference of 0.02212501 secs
data.table summarize with keys
setkey(flights_dt, "origin", "dest", "flight")
<- Sys.time()
start
<- flights_dt[
dt2 !is.na(dep_delay) & !is.na(arr_delay),
avg_dep_delay = mean(dep_delay), avg_arr_delay = mean(arr_delay)),
.(= .(origin, dest, flight)
by
]
<- Sys.time()
end
difftime(end, start, units = "secs")
Time difference of 0.1122601 secs
Now we see that data.table
was 10x faster than the tidyverse
. Setting a key doesn’t speed things up by much, however.