R Data Wrangling:
the tidyverse

Matthew DeHaven

Course Home Page

2024-01-01

Lecture Goals

  • Introduction to environments
  • Introduction to our first package: dplyr
  • Using the rest of the tidyverse

What is a package?

A package is

  • a set of functions

  • defined outside the current project (usually by someone else).

For R, most packages are hosted on CRAN (Comprehensive R Archive Network), which makes installing packages easy.

You can also easily install packages directly from github.

Packages are incredibly useful and will be a part of any project you work on.

Installing and Loading Packages

The base command for installing packages is

install.packages("nameOfPackage")

This will download the package from CRAN, and store it on your computer.

Installing and Loading Packages

If you then want to use a function from the package, you have two options:

  1. Explicit reference for each function
nameOfPackage::some_function()
  1. Loading the whole package
library(nameOfPackage)

some_function()

Neither of these options will work if the package isn’t installed!

Package Versions

Packages are functions managed by someone else.

Sometimes those people update the functions.

  • Great! Probably the new functions are better!
  • But sometimes this is catastrophic for you.

If your code relies entirely on v0.9 of some_function() which always returned a boolean,

and now v1.0 returns an integer,

all of your code using some_function() just broke.

Environments

Environments will let us be explicit about which package versions we are using.

  • Allow you to keep track of what packages you use, and their versions.

  • Load those specific versions of packages at any later date.

This makes them incredibly important for reproducibility.

R environments: renv

We will be using renv to handle all of our packages.

Environments

Whenver you start a project,

  • Install renv
  • Initialize the environment
  • Install the rest of your packages

Setting up renv

If you have never installed renv before,

install.packages(renv)

Then, we tell renv to start tracking packages for our project.

This is as simple as,

renv::init()

Setting up renv

renv::init()

To start with, the only packge being tracked is renv itself, and the version of R we are using.

Go ahead and restart the R Session. You should get a new message from renv at the start of the new session.

VS Code can’t find jsonlite, rlang

We also get a warning message at the start of the new session:

VSCode R Session Watcher requires jsonlite, rlang. Please install manually in order to use VSCode-R.

Why is this happening?

renv has created an isolated environment for our project.

The only package our system can see is renv itself!

All of the other packages we installed when setting up R and VS Code still exist on our machine, but they cannot be accessed.

Fixing VS Code renv issue

To fix this, we just need to tell renv to install the packages.

renv::install("languageserver")  ## Takes a few minutes the first time
renv::install("httpgd")

If you’ve used renv to install these packages before,

you should see the message “linked from cache”.

renv is smart enough to know it has installed these packages before and instead of redownloading them,

simply provides a link to the right version on our machine.

Reload the session. No more VS Code warning message!

renv Files

.Rprofile

  • This activates renv at the start of any R session.

renv/

  • Where packages for the isolated environment are installed.

renv.lock

  • A small file listing packages and their version numbers.

renv Files

renv.lock is the only piece needed to recreate your environment!

So renv/ doesn’t get commited to git, which is good, because it can be large.

The tidyverse

Introducing the tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.1

Developed in part by Hadley Wickham (a name we will see again) and supported by the company behind RStudio.

The most popular collection of R packages.

Specifically designed for readable data science.

Installing the tidyverse

renv::install("tidyverse")

Once it is installed for our environment, we can call,

library(tidyverse)

Which prints a nice message showing the packages loaded.

Some tidyverse packages

tibble

  • A replacement for data.frame object, with better printing

dplyr

  • Data science functions for adding new columns, filtering data, summarizing, etc.

lubridate

  • Makes it easier to work with dates.

stringr

  • Better regular expression and string matching.

readr

  • Import and export functions (for CSVs)

tibble printing

data.frame(
  x = 1:10,
  y = rnorm(10)
)
    x           y
1   1 -0.01960776
2   2 -1.09146477
3   3 -0.26890158
4   4  0.11244840
5   5 -1.02307010
6   6 -0.27465501
7   7  3.46283567
8   8 -2.47424493
9   9 -0.99634981
10 10 -1.30513531
tibble(
  x = 1:25,
  y = rnorm(25)
)
# A tibble: 25 × 2
       x       y
   <int>   <dbl>
 1     1  0.0447
 2     2 -1.07  
 3     3 -0.741 
 4     4  0.245 
 5     5  1.23  
 6     6  1.05  
 7     7  0.550 
 8     8  1.47  
 9     9 -0.950 
10    10 -0.130 
# ℹ 15 more rows

Shows variable types, dimensions.

Limits printing to not flood console.

tibble list columns

A powerful feature of the tibble type is being able to store any object in a tibble column.

tibble(
  x = c(1, 2),
  example = list(
    c(a = 5, b = 5),
    c(a = 4, b = 10)
  )
)
# A tibble: 2 × 2
      x example  
  <dbl> <list>   
1     1 <dbl [2]>
2     2 <dbl [2]>

tibble list columns

This means you can have a tibble with

  • a column of data sets (nested tibbles)
  • a column of fitted models
  • a column of model summary stats (R^2, etc.)

dplyr

A set of verbs to work with tibbles.

We will work with a built in dataset: mtcars.

tb <- as_tibble(mtcars)
tb
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

You can see a list of the built in datasets by calling data().

dplyr filter

Filtering datasets allows you to select rows based off of a logical condition.

The first argument is the tibble, the second the condition.

filter(tb, cyl == 6)
# A tibble: 7 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
4  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
5  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
6  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
7  19.7     6  145    175  3.62  2.77  15.5     0     1     5     6

dplyr filter

You can filter on multiple conditions using the & operator.

filter(tb, cyl == 6 & gear == 5)
# A tibble: 1 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  19.7     6   145   175  3.62  2.77  15.5     0     1     5     6

Piping

dplyr verbs are designed to be used with a pipe.

A pipe in programming takes the output from the left, and plugs it in to the first argument on the right.

In R, the pipe operation is: |>

tb |> filter(cyl == 6)
# A tibble: 7 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
3  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
4  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
5  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
6  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
7  19.7     6  145    175  3.62  2.77  15.5     0     1     5     6

You may also see %>% used as a pipe. This is from the magrittr package, and was used before R had a base pipe.

Chaining pipes

tb |> filter(cyl == 6) |> filter(gear == 5)
# A tibble: 1 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  19.7     6   145   175  3.62  2.77  15.5     0     1     5     6

This is even more readable with a return after each pipe.

tb |>
  filter(cyl == 6) |>
  filter(gear == 5)

The idea behind piping is that you can read them from

left-to-right, or top-to-down, which is more natural.

Rather than usual function calls, which are read inside-to-out.

filter(filter(tb, cyl == 6), gear == 5)

dplyr mutate

Possibly the most useful data science verb.

Mutate lets you create new columns in your tibble.

tb |>
  mutate(cyl_sq = cyl ^ 2)
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cyl_sq
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     36
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     36
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     16
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     36
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     64
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     36
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     64
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     16
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     16
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     36
# ℹ 22 more rows

dplyr multiple mutates

If you want to do multiple mutates, you can either

  • add another mutate function
  • add another declaration separated by a comma
tb |>
  mutate(cyl_sq = cyl ^ 2) |>
  mutate(cyl_cb = cyl ^ 3)
tb |>
  mutate(
    cyl_sq = cyl ^ 2,
    cyl_cb = cyl ^ 3
  )

dplyr mutate functions

Mutate works with any function that takes a returns a vector of the same length.

tb |>
  mutate(cyl_sqrt = sqrt(cyl))
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb cyl_sqrt
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     2.45
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     2.45
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     2   
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     2.45
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     2.83
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     2.45
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     2.83
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     2   
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     2   
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     2.45
# ℹ 22 more rows

This is very useful, as it can work with custom functions you write as well.

dplyr select

We may not need all of the columns of our data.

select() lets you choose which columns to keep.

tb |>
  mutate(cyl_sq = cyl ^ 2) |>
  select(mpg, cyl, cyl_sq)
# A tibble: 32 × 3
     mpg   cyl cyl_sq
   <dbl> <dbl>  <dbl>
 1  21       6     36
 2  21       6     36
 3  22.8     4     16
 4  21.4     6     36
 5  18.7     8     64
 6  18.1     6     36
 7  14.3     8     64
 8  24.4     4     16
 9  22.8     4     16
10  19.2     6     36
# ℹ 22 more rows

dplyr summarize

Summarize lets you collapse your data using functions that take a vector and return a single value.

tb |>
  summarize(mpg_avg = mean(mpg))
# A tibble: 1 × 1
  mpg_avg
    <dbl>
1    20.1

We could summarize multiple columns or operations at once.

tb |>
  summarize(
    mpg_avg = mean(mpg),
    mpg_max = max(mpg),
    mpg_min = min(mpg),
    n = n()
    )
# A tibble: 1 × 4
  mpg_avg mpg_max mpg_min     n
    <dbl>   <dbl>   <dbl> <int>
1    20.1    33.9    10.4    32

dplyr group_by

What if we want summaries for each quantity of cylinders (cyl)?

We could filter the data first, then summarise, and repeat for each cylinder value…

… or we tell dplyr to group the data by cyl.

tb |>
  group_by(cyl) |>
  summarise(
    mpg_avg = mean(mpg),
    mpg_max = max(mpg),
    mpg_min = min(mpg),
    n = n()
    )
# A tibble: 3 × 5
    cyl mpg_avg mpg_max mpg_min     n
  <dbl>   <dbl>   <dbl>   <dbl> <int>
1     4    26.7    33.9    21.4    11
2     6    19.7    21.4    17.8     7
3     8    15.1    19.2    10.4    14

dplyr main verbs

  • filter()

  • mutate()

  • select()

  • summarize()

  • group_by

  • the pipe: |>

With all of these, you can do most data wrangling!

Other useful dplyr verbs

  • arrange() sort your data by a column
  • distinct() keep only unique rows
  • rename() rename a column

Non-tibble functions

  • if_else() vectorized if-else, useful within mutates
  • case_when() use instead of nested if_else statements
  • lag() and lead() shift values forward or back

dplyr documentation

Merging

dplyr _join functions

dplyr has a family of functions for merging two datasets.

Here are two example data.frames that we will use to understand the differences.

tb1 <- tibble(
  id = c("A", "B", "C", "D", "E"),
  x = c(4, 9, 2, 3, 8)
)
tb2 <- tibble(
  id = c("D", "E", "F", "G", "H"),
  y = c(7, 4, 1, 2, 6)
)

All of the join functions have the same basic structure.

[]_join(x, y, by = ___)

Here are all of the options:

full_join(tb1, tb2, by = "id")
left_join(tb1, tb2, by = "id")
right_join(tb1, tb2, by = "id")
inner_join(tb1, tb2, by = "id")
semi_join(tb1, tb2, by = "id")
anti_join(tb1, tb2, by = "id")

dplyr full_join()

Keeps all observations from both the x (left) and y (right) datasets.

full_join(tb1, tb2, by = "id")
# A tibble: 8 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 A         4    NA
2 B         9    NA
3 C         2    NA
4 D         3     7
5 E         8     4
6 F        NA     1
7 G        NA     2
8 H        NA     6

dplyr left_join()

Keeps all observations from the x (left) dataset,

but only the matches from the y (right) dataset.

left_join(tb1, tb2, by = "id")
# A tibble: 5 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 A         4    NA
2 B         9    NA
3 C         2    NA
4 D         3     7
5 E         8     4

dplyr right_join()

Keeps all observations from the y (right) dataset,

but only the matches from the x (left) dataset.

right_join(tb1, tb2, by = "id")
# A tibble: 5 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 D         3     7
2 E         8     4
3 F        NA     1
4 G        NA     2
5 H        NA     6

dplyr inner_join()

Keeps only matches between the two datasets.

inner_join(tb1, tb2, by = "id")
# A tibble: 2 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 D         3     7
2 E         8     4

dplyr filtering joins

  • semi_join()

Keep observations in x (left) with a possible match in y (right).

semi_join(tb1, tb2, by = "id")
# A tibble: 2 × 2
  id        x
  <chr> <dbl>
1 D         3
2 E         8

Note: only observations in x are returned, no values from y.

dplyr filtering joins

  • anti_join()

Keep observations in x (left) without a possible match in y (right).

anti_join(tb1, tb2, by = "id")
# A tibble: 3 × 2
  id        x
  <chr> <dbl>
1 A         4
2 B         9
3 C         2

Note: only observations in x are returned, no values from y.

Merging by different column names

Often you will have slightly different identifying column names between two datasets.

tb1 <- tibble(
  id = c("A", "B", "C", "D", "E"),
  x = c(4, 9, 2, 3, 8)
)
tb2 <- tibble(
  ID = c("D", "E", "F", "G", "H"),
  y = c(7, 4, 1, 2, 6)
)

We can handle this using join_by() in the “by” argument.

inner_join(tb1, tb2, by = join_by(id == ID))
# A tibble: 2 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 D         3     7
2 E         8     4

With the join_by() function you can do inequality, rolling, or overlapping joins (see the documentation).

tidy Data

“tidy” data

  • Each variable is a column
  • Each observation is a row

This sounds almost tautological.

But a lot of data is not stored this way.

Which of the following tables of stock prices is tidy?

# A tibble: 3 × 3
  stock `2024-01-01` `2024-01-02`
  <chr>        <dbl>        <dbl>
1 A              100          105
2 B               98           95
3 C               99          103
# A tibble: 6 × 3
  stock date       price
  <chr> <date>     <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103

Why tidy?

Imagine we want to add another variable—volume—to the stock price tables.

# A tibble: 6 × 3
  stock   `2024-01-01` `2024-01-02`
  <chr>          <dbl>        <dbl>
1 A_price          100          105
2 B_price           98           95
3 C_price           99          103
4 A_vol             50           70
5 B_vol             30           55
6 C_vol             23           60
# A tibble: 6 × 4
  stock date       price volume
  <chr> <date>     <dbl>  <dbl>
1 A     2024-01-01   100     50
2 A     2024-01-02   105     70
3 B     2024-01-01    98     30
4 B     2024-01-02    95     55
5 C     2024-01-01    99     23
6 C     2024-01-02   103     60

This is very easy for tidy data.

But it is hard for nontidy data.

  • Do we add a “_price” and “_vol” to each stock row?
  • Or to each date column?
  • Or store them as two tables?

Making everything tidy

The tidyverse obviously has strong support for tidy data.

In particular, the package tidyr.

We will cover two crucial functions:

  • tidyr::pivot_longer()
  • tidyr::pivot_wider()

pivot_longer

Pivot functions let you transform a data.frame between wide and long formats.

Let’s look at the stock price data from earlier.

pivot_longer

nontidy_tb
# A tibble: 3 × 3
  stock `2024-01-01` `2024-01-02`
  <chr>        <dbl>        <dbl>
1 A              100          105
2 B               98           95
3 C               99          103

Main argument is “cols =”, which takes the columns to lengthen.

nontidy_tb |>
  pivot_longer(cols = c("2024-01-01", "2024-01-02"))
# A tibble: 6 × 3
  stock name       value
  <chr> <chr>      <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103

pivot_longer

nontidy_tb
# A tibble: 3 × 3
  stock `2024-01-01` `2024-01-02`
  <chr>        <dbl>        <dbl>
1 A              100          105
2 B               98           95
3 C               99          103

Instead, we could list the columns not to lengthen.

nontidy_tb |>
  pivot_longer(cols = -c("stock"))
# A tibble: 6 × 3
  stock name       value
  <chr> <chr>      <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103

pivot_longer names

pivot_longer defaults to naming new columns “name” and “value”, but you can override these.

nontidy_tb |>
  pivot_longer(cols = -c("stock"), names_to = "date", values_to = "price")
# A tibble: 6 × 3
  stock date       price
  <chr> <chr>      <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103

pivot_longer data types

We should also convert the date column to the date-type.

nontidy_tb |>
  pivot_longer(cols = -c("stock"), names_to = "date", values_to = "price") |>
  mutate(date = as.Date(date))
# A tibble: 6 × 3
  stock date       price
  <chr> <date>     <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103

pivot_wider

Sometimes, you will want your data in wide format.

While this isn’t always “tidy” it can be useful for some calculations, and then you can go back to long.

pivot_wider

tidy_tb
# A tibble: 6 × 3
  stock date       price
  <chr> <date>     <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103

The two key arguments: “names_from =” and “values_from =”.

tidy_tb |>
  pivot_wider(names_from = "stock", values_from = "price")
# A tibble: 2 × 4
  date           A     B     C
  <date>     <dbl> <dbl> <dbl>
1 2024-01-01   100    98    99
2 2024-01-02   105    95   103

Pivoting

With

  • pivot_longer
  • pivot_wider

it’s easy to convert from long to wide data foramts.

With these tools, you can “tidy” most any dataset.

Once data is “tidy” it’s easy to lengthen or widen the data for any calculations.

Summary

  • renv for managing packages and versions

  • The tidyverse for easy data science

    • tibbles as better data.frames

    • dplyr for working with tibbles

    • |> for piping

    • tidyr for pivoting data long or wide

Live Session

  • Practice reading in a csv

  • Tidying it with dplyr and tidyr

  • Creating summary tables