2024-01-01
dplyr
tidyverse
A package is a set of functions defined outside the current project (usually by someone else).
For R, most packages are hosted on CRAN (Comprehensive R Archive Network), which makes installing packages easy.
You can also easily install packages directly from GitHub.
Packages are incredibly useful and will be a part of any project you work on.
The base command for installing packages is install.packages().
This will download the package from CRAN, and store it on your computer.
If you then want to use a function from the package, you have two options: attach the package with library(), or call the function with the package prefix, as in package::function().
Neither of these options will work if the package isn’t installed!
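As a minimal sketch of installing and then using a package both ways, assuming dplyr as the example package (any CRAN package works the same way):

```r
# Install once per machine or environment (downloads from CRAN).
# Guarded so the example does not reinstall a package that is already present.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Option 1: attach the package, then call its functions by name.
library(dplyr)
six_cyl <- filter(mtcars, cyl == 6)

# Option 2: leave the package unattached and prefix the function with `package::`.
six_cyl_prefixed <- dplyr::filter(mtcars, cyl == 6)
```

The prefixed form is more typing, but it makes clear which package a function comes from.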
Packages are functions managed by someone else.
Sometimes those people update the functions.
If your code relies entirely on v0.9 of some_function(), which always returned a boolean, and now v1.0 returns an integer, all of your code using some_function() just broke.
Environments will let us be explicit about which package versions we are using.
They allow you to keep track of what packages you use, and their versions.
They let you load those specific versions of packages at any later date.
This makes them incredibly important for reproducibility.
R environments: renv
We will be using renv to handle all of our packages.
Whenever you start a project, set up renv first.
If you have never installed renv before, install it with install.packages("renv").
Then, we tell renv to start tracking packages for our project.
This is as simple as calling renv::init().
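The setup steps above can be sketched as follows. This is a setup fragment meant to be run once from the project's root directory, not part of an analysis script:

```r
# One-time install of renv itself (skipped if it is already available).
if (!requireNamespace("renv", quietly = TRUE)) {
  install.packages("renv")
}

# Start tracking packages for the current project.
# This creates renv.lock, the renv/ folder, and a project-level .Rprofile.
renv::init()
```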
renv
To start with, the only package being tracked is renv itself, along with the version of R we are using.
Go ahead and restart the R session. You should get a new message from renv at the start of the new session.
jsonlite, rlang
We also get a warning message at the start of the new session:
VSCode R Session Watcher requires jsonlite, rlang. Please install manually in order to use VSCode-R.
Why is this happening?
renv has created an isolated environment for our project.
The only package our system can see is renv itself!
All of the other packages we installed when setting up R and VS Code still exist on our machine, but they cannot be accessed.
renv issue
To fix this, we just need to tell renv to install the packages.
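A minimal sketch of that fix, using the two package names from the warning message:

```r
# Install the packages the VS Code session watcher asked for
# into the project's isolated library.
renv::install(c("jsonlite", "rlang"))
```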
If you’ve used renv to install these packages before, you should see the message “linked from cache”.
renv is smart enough to know it has installed these packages before, and instead of redownloading them, it simply provides a link to the right version on our machine.
Reload the session. No more VS Code warning message!
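renv Files
.Rprofile: activates renv at the start of any R session.
renv/: holds the project's own package library.
renv.lock: records the packages in use and their versions.
renv Files
renv.lock is the only piece needed to recreate your environment!
So the package library in renv/ doesn’t get committed to git, which is good, because it can be large.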
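A short sketch of how the lockfile is typically written and used. It assumes the standard renv functions renv::snapshot() and renv::restore(), which the slides do not name explicitly:

```r
# Record the packages the project currently uses (and their versions) in renv.lock.
renv::snapshot()

# Later, or on another machine: reinstall the versions listed in renv.lock.
renv::restore()
```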
tidyverse
tidyverse
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Developed in part by Hadley Wickham (a name we will see again) and supported by the company behind RStudio.
The most popular collection of R packages.
Specifically designed for readable data science.
tidyverse
Once it is installed for our environment, we can call library(tidyverse), which prints a nice message showing the packages loaded.
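A minimal sketch of installing and loading the tidyverse inside an renv project (the install step is an assumption about how it was added to the environment):

```r
# Install tidyverse into the project library (this can take a while the first time).
renv::install("tidyverse")

# Attach the core tidyverse packages; the startup message lists what was loaded.
library(tidyverse)
```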
tidyverse packages
tibble
dplyr
lubridate
stringr
readr
tibble printing
Shows variable types, dimensions.
Limits printing to not flood console.
tibble list columns
A powerful feature of the tibble type is being able to store any object in a tibble column.
tibble list columns
This means you can have a tibble with
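As an illustration of list columns, here is a small sketch in which one column holds vectors of different lengths and another holds fitted model objects. The tibble and column names are hypothetical:

```r
library(tibble)

# Each list column holds one arbitrary R object per row.
experiments <- tibble(
  name   = c("weight_only", "weight_and_power"),
  values = list(1:3, c(2.5, 3.5)),        # vectors of different lengths
  model  = list(
    lm(mpg ~ wt, data = mtcars),          # fitted model objects
    lm(mpg ~ wt + hp, data = mtcars)
  )
)

experiments

# Pull a single object back out with [[ ]].
coef(experiments$model[[1]])
```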
dplyr
A set of verbs to work with tibbles.
We will work with a built-in dataset: mtcars.
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
You can see a list of the built-in datasets by calling data().
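The printed table is a tibble, while mtcars ships as a plain data.frame. A small sketch of the conversion, assuming the name mtcars_tbl (not from the original slides):

```r
library(tibble)

# Convert the built-in data.frame to a tibble (row names are dropped by default).
mtcars_tbl <- as_tibble(mtcars)
mtcars_tbl
```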
dplyr filter
Filtering datasets allows you to select rows based on a logical condition.
The first argument is the tibble, the second is the condition.
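A sketch of a filter call that keeps the six-cylinder cars, assuming mtcars has been converted to a tibble named mtcars_tbl (a hypothetical name):

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# First argument: the tibble. Second argument: the logical condition.
six_cyl <- filter(mtcars_tbl, cyl == 6)
six_cyl
```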
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
5 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
6 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
7 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
dplyr filter
You can filter on multiple conditions using the & operator.
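For example, a sketch that combines two conditions. The choice of am == 1 (manual transmission in mtcars) is only an illustration:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# Keep rows that satisfy BOTH conditions: six cylinders and a manual transmission.
six_cyl_manual <- filter(mtcars_tbl, cyl == 6 & am == 1)
six_cyl_manual
```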
dplyr verbs are designed to be used with a pipe.
A pipe in programming takes the output from the left, and plugs it into the first argument on the right.
In R, the base pipe operator is |>
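A minimal sketch showing that the piped form is the same call written left to right. The base pipe requires R 4.1 or later, and mtcars_tbl is a hypothetical name:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# These two calls are equivalent: the pipe passes mtcars_tbl
# as the first argument of filter().
nested <- filter(mtcars_tbl, cyl == 6)
piped  <- mtcars_tbl |> filter(cyl == 6)
```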
# A tibble: 7 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
5 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
6 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
7 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
You may also see %>% used as a pipe. This is from the magrittr package, and was used before R had a base pipe.
# A tibble: 1 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
This is even more readable with a return after each pipe.
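A sketch of a chained pipeline with one step per line. The two conditions are an assumption, chosen because they leave the single six-cylinder, five-gear car in mtcars:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# One verb per line: read top to bottom as "take the data, then ..., then ...".
six_cyl_five_gear <- mtcars_tbl |>
  filter(cyl == 6) |>
  filter(gear == 5)

six_cyl_five_gear
```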
dplyr mutate
Possibly the most useful data science verb.
Mutate lets you create new columns in your tibble.
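A sketch of a mutate call that adds a squared-cylinders column. The formula cyl^2 is an assumption that matches the cyl_sq values in the printed output:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# Add a new column; existing columns are kept unchanged.
with_cyl_sq <- mtcars_tbl |>
  mutate(cyl_sq = cyl^2)

with_cyl_sq
```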
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb cyl_sq
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 36
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 36
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 16
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 36
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 64
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 36
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 64
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 16
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 16
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 36
# ℹ 22 more rows
dplyr multiple mutates
If you want to do multiple mutates, you can either
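The slide text is cut off here. Two common ways to apply several mutates are sketched below, as an assumption about what the options were: chain separate mutate() calls, or list several expressions in one call. The column hp_per_cyl is hypothetical.

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# Option A: one mutate() per new column, chained with the pipe.
chained <- mtcars_tbl |>
  mutate(cyl_sq = cyl^2) |>
  mutate(hp_per_cyl = hp / cyl)

# Option B: several new columns in a single mutate() call.
combined <- mtcars_tbl |>
  mutate(
    cyl_sq = cyl^2,
    hp_per_cyl = hp / cyl
  )
```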
dplyr mutate functions
Mutate works with any function that takes a vector and returns a vector of the same length.
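For instance, sqrt() takes a vector and returns a vector of the same length, so it can be used directly inside mutate. This sketch assumes that is how the cyl_sqrt column in the printed output was built:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# sqrt() is vectorized: one output value per input row.
with_cyl_sqrt <- mtcars_tbl |>
  mutate(cyl_sqrt = sqrt(cyl))

with_cyl_sqrt
```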
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb cyl_sqrt
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2.45
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 2.45
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 2
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 2.45
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 2.83
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 2.45
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 2.83
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 2.45
# ℹ 22 more rows
This is very useful, as it can work with custom functions you write as well.
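A small sketch with a custom function. The helper standardize() and the column mpg_z are hypothetical names:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# Hypothetical helper: center a vector on its mean and scale by its standard deviation.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# The helper returns one value per row, so it works inside mutate().
with_mpg_z <- mtcars_tbl |>
  mutate(mpg_z = standardize(mpg))
```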
dplyr select
We may not need all of the columns of our data.
select() lets you choose which columns to keep.
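A minimal sketch of select(); the three columns chosen are only an example:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# Keep only the named columns, in the order given.
small <- mtcars_tbl |>
  select(mpg, cyl, hp)

small
```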
dplyr summarize
Summarize lets you collapse your data using functions that take a vector and return a single value.
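A sketch of summarize() with two functions that each reduce a column to a single value. The output column names are hypothetical:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# mean() and max() each return one value, so the result has a single row.
overall <- mtcars_tbl |>
  summarize(
    mean_mpg = mean(mpg),
    max_hp   = max(hp)
  )

overall
```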
dplyr group_by
What if we want summaries for each quantity of cylinders (cyl)?
We could filter the data first, then summarise, and repeat for each cylinder value…
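Rather than repeating the filter-then-summarize step by hand, a sketch with group_by() computes the summary once per cylinder value. The output column names are hypothetical:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# group_by() marks the groups; summarize() then returns one row per group.
by_cyl <- mtcars_tbl |>
  group_by(cyl) |>
  summarize(
    mean_mpg = mean(mpg),
    n_cars   = n()
  )

by_cyl
```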
dplyr main verbs
filter()
mutate()
select()
summarize()
group_by()
|>
With all of these, you can do most data wrangling!
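As an illustration of how the verbs combine, here is one pipeline that uses all of them. The threshold and the derived column hp_per_cyl are arbitrary choices for the example:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

wrangled <- mtcars_tbl |>
  filter(hp > 100) |>                      # keep the more powerful cars
  mutate(hp_per_cyl = hp / cyl) |>         # derive a new column
  select(cyl, mpg, hp_per_cyl) |>          # drop columns we do not need
  group_by(cyl) |>                         # one group per cylinder count
  summarize(
    mean_mpg        = mean(mpg),
    mean_hp_per_cyl = mean(hp_per_cyl)
  )

wrangled
```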
dplyr verbs
arrange(): sort your data by a column
distinct(): keep only unique rows
rename(): rename a column
Non-tibble functions
if_else(): vectorized if-else, useful within mutates
case_when(): use instead of nested if_else statements
lag() and lead(): shift values forward or back
dplyr _join functions
dplyr has a family of functions for merging two datasets.
Here are two example data.frames that we will use to understand the differences.
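The original example data did not survive extraction. As a stand-in, here are two small hypothetical tables, x and y, that share an id column and only partly overlap:

```r
library(tibble)

# Hypothetical example data: ids 1 and 2 appear in both tables,
# id 3 only in x, id 4 only in y.
x <- tibble(id = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), y_val = c("y1", "y2", "y4"))

x
y
```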
dplyr full_join()
Keeps all observations from both the x (left) and y (right) datasets.
dplyr left_join()
Keeps all observations from the x (left) dataset, but only the matches from the y (right) dataset.
dplyr right_join()
Keeps all observations from the y (right) dataset, but only the matches from the x (left) dataset.
dplyr inner_join()
Keeps only matches between the two datasets.
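The four joins above can be compared side by side on two small hypothetical tables (x and y are illustrative, not the original example data):

```r
library(dplyr)

# Hypothetical data: ids 1 and 2 match, id 3 is only in x, id 4 is only in y.
x <- tibble(id = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), y_val = c("y1", "y2", "y4"))

full  <- full_join(x, y, by = "id")   # ids 1, 2, 3, 4 (NA where there is no match)
left  <- left_join(x, y, by = "id")   # ids 1, 2, 3
right <- right_join(x, y, by = "id")  # ids 1, 2, 4
inner <- inner_join(x, y, by = "id")  # ids 1, 2
```

Unmatched rows are filled with NA in the columns that come from the other table.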
dplyr filtering joins
semi_join()
Keep observations in x (left) that have a match in y (right).
Note: only observations in x are returned, no values from y.
dplyr filtering joins
anti_join()
Keep observations in x (left) that have no match in y (right).
Note: only observations in x are returned, no values from y.
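A sketch of both filtering joins on the same kind of hypothetical tables (x and y are illustrative names and values):

```r
library(dplyr)

# Hypothetical data: ids 1 and 2 match, id 3 is only in x, id 4 is only in y.
x <- tibble(id = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 4), y_val = c("y1", "y2", "y4"))

# Rows of x that have a match in y; no columns from y are added.
matched <- semi_join(x, y, by = "id")

# Rows of x that have no match in y.
unmatched <- anti_join(x, y, by = "id")
```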
Often you will have slightly different identifying column names between two datasets.
We can handle this using join_by() in the “by” argument.
With the join_by() function you can do inequality, rolling, or overlapping joins (see the documentation).
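A minimal sketch of joining on differently named key columns. It assumes dplyr 1.1.0 or later (where join_by() is available) and uses hypothetical tables in which the key is called id in x and key in y:

```r
library(dplyr)

# Hypothetical data: the same identifier has a different column name in each table.
x <- tibble(id  = c(1, 2, 3), x_val = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), y_val = c("y1", "y2", "y4"))

# Match x$id against y$key.
joined <- left_join(x, y, by = join_by(id == key))
joined
```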
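Tidy data: every column is a variable, every row is an observation, and every cell is a single value.
This sounds almost tautological.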
But a lot of data is not stored this way.
Which of the following tables of stock prices is tidy?
# A tibble: 3 × 3
stock `2024-01-01` `2024-01-02`
<chr> <dbl> <dbl>
1 A 100 105
2 B 98 95
3 C 99 103
# A tibble: 6 × 3
stock date price
<chr> <date> <dbl>
1 A 2024-01-01 100
2 A 2024-01-02 105
3 B 2024-01-01 98
4 B 2024-01-02 95
5 C 2024-01-01 99
6 C 2024-01-02 103
Imagine we want to add another variable—volume—to the stock price tables.
# A tibble: 6 × 3
stock `2024-01-01` `2024-01-02`
<chr> <dbl> <dbl>
1 A_price 100 105
2 B_price 98 95
3 C_price 99 103
4 A_vol 50 70
5 B_vol 30 55
6 C_vol 23 60
# A tibble: 6 × 4
stock date price volume
<chr> <date> <dbl> <dbl>
1 A 2024-01-01 100 50
2 A 2024-01-02 105 70
3 B 2024-01-01 98 30
4 B 2024-01-02 95 55
5 C 2024-01-01 99 23
6 C 2024-01-02 103 60
This is very easy for tidy data.
But it is hard for nontidy data.
The tidyverse obviously has strong support for tidy data, in particular the package tidyr.
We will cover two crucial functions:
tidyr::pivot_longer()
tidyr::pivot_wider()
pivot_longer
Pivot functions let you transform a data.frame between wide and long formats.
Let’s look at the stock price data from earlier.
pivot_longer
# A tibble: 3 × 3
stock `2024-01-01` `2024-01-02`
<chr> <dbl> <dbl>
1 A 100 105
2 B 98 95
3 C 99 103
Instead, we could list the columns not to lengthen.
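A sketch of both ways to specify the columns, rebuilding the wide stock table from the printed output under the hypothetical name stocks_wide:

```r
library(tibble)
library(tidyr)

stocks_wide <- tibble(
  stock        = c("A", "B", "C"),
  `2024-01-01` = c(100, 98, 99),
  `2024-01-02` = c(105, 95, 103)
)

# Option 1: list the columns to lengthen.
long_listed <- stocks_wide |>
  pivot_longer(cols = c(`2024-01-01`, `2024-01-02`))

# Option 2: list the columns NOT to lengthen.
long_negated <- stocks_wide |>
  pivot_longer(cols = !stock)
```

The second form scales better when there are many date columns.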
pivot_longer names
pivot_longer defaults to naming new columns “name” and “value”, but you can override these.
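A sketch of overriding the default column names with the names_to and values_to arguments (stocks_wide is a hypothetical name for the wide stock table):

```r
library(tibble)
library(tidyr)

stocks_wide <- tibble(
  stock        = c("A", "B", "C"),
  `2024-01-01` = c(100, 98, 99),
  `2024-01-02` = c(105, 95, 103)
)

# Old column names go into "date"; the cell values go into "price".
stocks_long <- stocks_wide |>
  pivot_longer(
    cols      = !stock,
    names_to  = "date",
    values_to = "price"
  )

stocks_long
```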
pivot_longer data types
We should also convert the date column to the date-type.
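After pivoting, the dates are still character strings because they came from column names. One way to convert them, sketched here with base as.Date() inside mutate (lubridate::ymd() would be an alternative):

```r
library(dplyr)
library(tidyr)

stocks_wide <- tibble(
  stock        = c("A", "B", "C"),
  `2024-01-01` = c(100, 98, 99),
  `2024-01-02` = c(105, 95, 103)
)

stocks_long <- stocks_wide |>
  pivot_longer(cols = !stock, names_to = "date", values_to = "price") |>
  mutate(date = as.Date(date))   # "2024-01-01" (character) becomes a Date

stocks_long
```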
pivot_wider
Sometimes, you will want your data in wide format.
While this isn’t always “tidy” it can be useful for some calculations, and then you can go back to long.
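A sketch of going from long back to wide, rebuilding the long stock table from the printed output under the hypothetical name stocks_long:

```r
library(tibble)
library(tidyr)

stocks_long <- tibble(
  stock = rep(c("A", "B", "C"), each = 2),
  date  = rep(as.Date(c("2024-01-01", "2024-01-02")), times = 3),
  price = c(100, 105, 98, 95, 99, 103)
)

# Each distinct date becomes its own column, filled with the matching price.
stocks_wide <- stocks_long |>
  pivot_wider(names_from = date, values_from = price)

stocks_wide
```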
pivot_wider
# A tibble: 6 × 3
stock date price
<chr> <date> <dbl>
1 A 2024-01-01 100
2 A 2024-01-02 105
3 B 2024-01-01 98
4 B 2024-01-02 95
5 C 2024-01-01 99
6 C 2024-01-02 103
With pivot_longer and pivot_wider it’s easy to convert between long and wide data formats.
With these tools, you can “tidy” most any dataset.
Once data is “tidy” it’s easy to lengthen or widen the data for any calculations.
renv for managing packages and versions
The tidyverse for easy data science
tibbles as better data.frames
dplyr for working with tibbles
|> for piping
tidyr for pivoting data long or wide
Practice reading in a csv
Tidying it with dplyr and tidyr
Creating summary tables