R tidyverse Example

Author

Matthew DeHaven

Published

March 31, 2024

Loading the tidyverse

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Reading in data

One of the benefits of using R is the ease of reading in data.

Some commonly used packages for different formats,

  • readr::read_csv() for csvs

  • data.table::fread() for csvs as well

  • readxl::read_excel() for Excel xlsx and xls files

  • haven::read_dta() for Stata files

  • haven::read_sas() for SAS files

and many more.

airports <- readr::read_csv("https://github.com/tidyverse/nycflights13/raw/main/data-raw/airports.csv")
Rows: 1458 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): faa, name, dst, tzone
dbl (4): lat, lon, alt, tz

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airports
# A tibble: 1,458 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# ℹ 1,448 more rows

This is just an example to show how you can use readr::read_csv() pointed at a csv file.

Note that you can pass the url to a csv and the function will automatically download and then parse it. Very convenient!

But if your data is within your project you will probably have something that looks more like:

data <- readr::read_csv("./data/filename.csv")

Air Traffic Data

nycflights13 is a package which stores a bunch of datasets related to flights in and out of NYC airports in 2013. We will use it to practice our tidyverse functions.

First, we must install the data package.

renv::install("nycflights13")

Then load it into our session.

library(nycflights13)

Attaching package: 'nycflights13'
The following object is masked _by_ '.GlobalEnv':

    airports

Note how the package overwrote our airports dataset we read in earlier.

That’s okay becuase they are the same dataset, but you should be careful reading in new packages!

## You can just call data(package = "nycflights13")
## I'm just making it print nicely for the document 
data(package = "nycflights13")$results |> as_tibble() |> select(Package, Item, Title)
# A tibble: 5 × 3
  Package      Item     Title              
  <chr>        <chr>    <chr>              
1 nycflights13 airlines Airline names.     
2 nycflights13 airports Airport metadata   
3 nycflights13 flights  Flights data       
4 nycflights13 planes   Plane metadata.    
5 nycflights13 weather  Hourly weather data

So we’ve got data about the airlines, about airports, number of flights, planes, and weather.

Data exploration

head(airlines)
# A tibble: 6 × 2
  carrier name                    
  <chr>   <chr>                   
1 9E      Endeavor Air Inc.       
2 AA      American Airlines Inc.  
3 AS      Alaska Airlines Inc.    
4 B6      JetBlue Airways         
5 DL      Delta Air Lines Inc.    
6 EV      ExpressJet Airlines Inc.
head(airports)
# A tibble: 6 × 8
  faa   name                             lat   lon   alt    tz dst   tzone      
  <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>      
1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America/Ne…
2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America/Ch…
3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America/Ch…
4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America/Ne…
5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America/Ne…
6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America/Ne…
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
head(planes)
# A tibble: 6 × 9
  tailnum  year type               manufacturer model engines seats speed engine
  <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
2 N102UW   1998 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
3 N103US   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
4 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
5 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
6 N105UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
head(weather)
# A tibble: 6 × 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA
# ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>

One other nice function for checking out datasets is glimpse().

glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

This maximizes the space in your current consol for showing data, but breaks the alignment of table values.

Merging two datasets

We saw in the slides that we can use our family of join functions.

plane_flights <- flights |>
  left_join(planes)
Joining with `by = join_by(year, tailnum)`

Note that we didn’t specify a “by=” variable(s). The _join functions will be smart and look for variables with the same name across the two datasets, in this case, “year” and “tailnum”.

But that’s incorrect here! “year” in the flights dataset is the year of the flight (always 2013 in this data). But in the planes data it is the year of the plane’s construction.

Instead, we should be explicit and just merge on “tailnum”.

plane_flights <- flights |>
  left_join(planes, by = join_by(tailnum))
plane_flights
# A tibble: 336,776 × 27
   year.x month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1   2013     1     1      517            515         2      830            819
 2   2013     1     1      533            529         4      850            830
 3   2013     1     1      542            540         2      923            850
 4   2013     1     1      544            545        -1     1004           1022
 5   2013     1     1      554            600        -6      812            837
 6   2013     1     1      554            558        -4      740            728
 7   2013     1     1      555            600        -5      913            854
 8   2013     1     1      557            600        -3      709            723
 9   2013     1     1      557            600        -3      838            846
10   2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 19 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, year.y <int>, type <chr>,
#   manufacturer <chr>, model <chr>, engines <int>, seats <int>, speed <int>,
#   engine <chr>

This is close to what we would like, but now note how we have two “year” variales: “year.x” and “year.y”

We could rename the column in each using rename() before merging, or we can use the “suffix =” argument in left_join to attach more meaningful suffixes than the default “.x” and “.y”.

plane_flights <- flights |>
  left_join(planes, by = join_by(tailnum), suffix = c("_flight", "_plane"))
plane_flights
# A tibble: 336,776 × 27
   year_flight month   day dep_time sched_dep_time dep_delay arr_time
         <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1        2013     1     1      517            515         2      830
 2        2013     1     1      533            529         4      850
 3        2013     1     1      542            540         2      923
 4        2013     1     1      544            545        -1     1004
 5        2013     1     1      554            600        -6      812
 6        2013     1     1      554            558        -4      740
 7        2013     1     1      555            600        -5      913
 8        2013     1     1      557            600        -3      709
 9        2013     1     1      557            600        -3      838
10        2013     1     1      558            600        -2      753
# ℹ 336,766 more rows
# ℹ 20 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#   year_plane <int>, type <chr>, manufacturer <chr>, model <chr>,
#   engines <int>, seats <int>, speed <int>, engine <chr>

Departure delays

What’s the average departure and arrival delay across all of these flights?

plane_flights |>
  summarise(
    avg_dep_delay = mean(dep_delay),
    avg_arr_delay = mean(arr_delay)
  )
# A tibble: 1 × 2
  avg_dep_delay avg_arr_delay
          <dbl>         <dbl>
1            NA            NA
Important

Why did we get NA?

Remember, NAs are infectious, every calculation they touch becomes NA.

Let’s see what the NA values are,

plane_flights |>
  filter(is.na(arr_delay))
# A tibble: 9,430 × 27
   year_flight month   day dep_time sched_dep_time dep_delay arr_time
         <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1        2013     1     1     1525           1530        -5     1934
 2        2013     1     1     1528           1459        29     2002
 3        2013     1     1     1740           1745        -5     2158
 4        2013     1     1     1807           1738        29     2251
 5        2013     1     1     1939           1840        59       29
 6        2013     1     1     1952           1930        22     2358
 7        2013     1     1     2016           1930        46       NA
 8        2013     1     1       NA           1630        NA       NA
 9        2013     1     1       NA           1935        NA       NA
10        2013     1     1       NA           1500        NA       NA
# ℹ 9,420 more rows
# ℹ 20 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#   year_plane <int>, type <chr>, manufacturer <chr>, model <chr>,
#   engines <int>, seats <int>, speed <int>, engine <chr>

Looks like some of these flights were canceled (never departed), but some departed…and never landed? seems bad.

For now, we will just drop NA values in these two columns.

plane_flights |>
  drop_na(dep_delay, arr_delay) |>
  summarise(
    avg_dep_delay = mean(dep_delay),
    avg_arr_delay = mean(arr_delay)
  )
# A tibble: 1 × 2
  avg_dep_delay avg_arr_delay
          <dbl>         <dbl>
1          12.6          6.90

Seems like there is a bit more of a departure delay than arrival.

Let’s get more information about the distribution of departure delays.

plane_flights |>
  drop_na(dep_delay) |>
  summarise(
    min = min(dep_delay),
    pctl_25 = quantile(dep_delay, 0.25),
    mean = mean(dep_delay),
    median = median(dep_delay),
    pctl_75 =  quantile(dep_delay, 0.75),
    max = max(dep_delay),
    n = n()
  )
# A tibble: 1 × 7
    min pctl_25  mean median pctl_75   max      n
  <dbl>   <dbl> <dbl>  <dbl>   <dbl> <dbl>  <int>
1   -43      -5  12.6     -2      11  1301 328521

Quite a long tail to the right!

What’s causing the delays?

Maybe older planes have more delays?

plane_flights |>
  mutate(age_5years = (2013 - year_plane) %% 5) |>
  group_by(age_5years) |>
  drop_na(dep_delay) |>
  summarise(
    min = min(dep_delay),
    pctl_25 = quantile(dep_delay, 0.25),
    mean = mean(dep_delay),
    median = median(dep_delay),
    pctl_75 =  quantile(dep_delay, 0.75),
    max = max(dep_delay),
    n = n()
  )
# A tibble: 6 × 8
  age_5years   min pctl_25  mean median pctl_75   max     n
       <dbl> <dbl>   <dbl> <dbl>  <dbl>   <dbl> <dbl> <int>
1          0   -24      -5 13.3      -1      12   911 61539
2          1   -32      -5 12.9      -2      11  1014 62760
3          2   -27      -5 13.4      -1      12  1301 55477
4          3   -33      -5 12.6      -1      11   960 47844
5          4   -43      -5 13.6      -1      12   825 47176
6         NA   -25      -6  9.88     -3       7  1137 53725

Doesn’t really look like it!

Maybe it’s weather

Let’s merge the weather data onto our flights instead. We’ll only select a few columns to simplify things. It looks like we can merge on the “time_hour” variable.

weather_flights <- left_join(
  flights |> select(dep_delay, time_hour, flight, tailnum, origin, dest),
  weather |> select(time_hour, temp, wind_speed, precip, visib),
  by = join_by(time_hour)
)
Warning in left_join(select(flights, dep_delay, time_hour, flight, tailnum, : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 5 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

We got a warning! We weren’t explicit about what merge relatinship we expected, and it found a ‘many-to-many’ in the data merge id.

weather_flights <- left_join(
  flights |> select(dep_delay, time_hour, flight, tailnum, origin, dest),
  weather |> select(time_hour, temp, wind_speed, precip, visib),
  by = join_by(time_hour),
  relationship = "many-to-one"
)
Error in `left_join()`:
! Each row in `x` must match at most 1 row in `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.

Now it gave me an actual error! It detected a many-to-many merge, but I told it was only a many-to-one.

What’s the problem?

weather |> filter(time_hour == flights$time_hour[[1]])
# A tibble: 3 × 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 EWR     2013     1     1     5  39.0  28.0  64.4      260       12.7      NA  
2 JFK     2013     1     1     5  39.0  27.0  61.6      260       15.0      NA  
3 LGA     2013     1     1     5  39.9  25.0  54.8      250       15.0      21.9
# ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>

I forgot that there were multiple airport locations…

weather_flights <- left_join(
  flights |> select(dep_delay, origin, time_hour, flight, tailnum, origin, dest),
  weather |> select(origin, time_hour, temp, wind_speed, precip, visib),
  by = join_by(origin, time_hour),
  relationship = "many-to-one"
)
weather_flights
# A tibble: 336,776 × 10
   dep_delay origin time_hour           flight tailnum dest   temp wind_speed
       <dbl> <chr>  <dttm>               <int> <chr>   <chr> <dbl>      <dbl>
 1         2 EWR    2013-01-01 05:00:00   1545 N14228  IAH    39.0       12.7
 2         4 LGA    2013-01-01 05:00:00   1714 N24211  IAH    39.9       15.0
 3         2 JFK    2013-01-01 05:00:00   1141 N619AA  MIA    39.0       15.0
 4        -1 JFK    2013-01-01 05:00:00    725 N804JB  BQN    39.0       15.0
 5        -6 LGA    2013-01-01 06:00:00    461 N668DN  ATL    39.9       16.1
 6        -4 EWR    2013-01-01 05:00:00   1696 N39463  ORD    39.0       12.7
 7        -5 EWR    2013-01-01 06:00:00    507 N516JB  FLL    37.9       11.5
 8        -3 LGA    2013-01-01 06:00:00   5708 N829AS  IAD    39.9       16.1
 9        -3 JFK    2013-01-01 06:00:00     79 N593JB  MCO    37.9       13.8
10        -2 LGA    2013-01-01 06:00:00    301 N3ALAA  ORD    39.9       16.1
# ℹ 336,766 more rows
# ℹ 2 more variables: precip <dbl>, visib <dbl>

Now it works!

Let’s see if temperature cuases delays.

weather_flights |>
  mutate(temp_buckets  = floor(temp / 10) * 10 ) |>
  drop_na(dep_delay) |>
  group_by(temp_buckets) |>
  summarise(
    min = min(dep_delay),
    pctl_25 = quantile(dep_delay, 0.25),
    mean = mean(dep_delay),
    median = median(dep_delay),
    pctl_75 =  quantile(dep_delay, 0.75),
    max = max(dep_delay),
    n = n()
  )
# A tibble: 11 × 8
   temp_buckets   min pctl_25  mean median pctl_75   max     n
          <dbl> <dbl>   <dbl> <dbl>  <dbl>   <dbl> <dbl> <int>
 1           10   -18      -5 13.5      -1    11     478  2895
 2           20   -33      -5 11.0      -1    10     853 14108
 3           30   -43      -5 11.3      -1    10     911 55943
 4           40   -30      -5 10.3      -2     8    1301 49990
 5           50   -32      -5  9.97     -2     7     896 50970
 6           60   -23      -5 10.8      -2     8    1137 61162
 7           70   -26      -5 16.3      -1    15     899 55519
 8           80   -23      -4 20.0       0    21    1005 31170
 9           90   -16      -4 18.7       1    21     404  5172
10          100    -5       1 26.3      15    44.5   138    47
11           NA   -18      -4 13.5      -1    14     264  1545

More delays when it’s hot? That’s a surprise.

weather_flights |>
  mutate(wind_speed_buckets  = floor(wind_speed / 10) * 10 ) |>
  drop_na(dep_delay) |>
  group_by(wind_speed_buckets) |>
  summarise(
    min = min(dep_delay),
    pctl_25 = quantile(dep_delay, 0.25),
    mean = mean(dep_delay),
    median = median(dep_delay),
    pctl_75 =  quantile(dep_delay, 0.75),
    max = max(dep_delay),
    n = n()
  )
# A tibble: 6 × 8
  wind_speed_buckets   min pctl_25  mean median pctl_75   max      n
               <dbl> <dbl>   <dbl> <dbl>  <dbl>   <dbl> <dbl>  <int>
1                  0   -43   -5     11.0     -2     8    1301 146283
2                 10   -27   -5     13.6     -1    12    1014 160644
3                 20   -32   -5     16.4     -1    17     825  19097
4                 30   -13   -3     28.6      2    31     960    861
5                 40   -11   -2.75  12.1     -1    12.5   110     30
6                 NA   -18   -4     13.0     -1    13     264   1606

This seems more promising! High wind makes it hard to takeoff!

Want more flight data?

Try the package anyflights which can construct datasets like we just used for any airport, any year!