defined outside the current project (usually by someone else).
For R, most packages are hosted on CRAN (Comprehensive R Archive Network), which makes installing packages easy.
You can also easily install packages directly from GitHub.
Packages are incredibly useful and will be a part of any project you work on.
Installing and Loading Packages
The base command for installing packages is
install.packages("nameOfPackage")
This will download the package from CRAN, and store it on your computer.
If you then want to use a function from the package, you have two options:
Explicit reference for each function
nameOfPackage::some_function()
Loading the whole package
library(nameOfPackage)
some_function()
Neither of these options will work if the package isn’t installed!
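As a small sketch of the two options and of the failure mode, the following uses the stats package in place of nameOfPackage, since it ships with R and needs no installation. The name notARealPackage123 is a made-up placeholder that is assumed not to exist on your machine.

```r
# Option 1: explicit reference, without attaching the package
m1 <- stats::median(c(1, 5, 9))

# Option 2: attach the whole package, then call the function directly
library(stats)
m2 <- median(c(1, 5, 9))

# requireNamespace() returns TRUE or FALSE instead of stopping with an error,
# so it can be used to check whether a package is installed.
has_stats <- requireNamespace("stats", quietly = TRUE)
has_fake  <- requireNamespace("notARealPackage123", quietly = TRUE)  # made-up name
```

Calling library() on a package that is not installed stops with an error, which is why a check like this can be handy at the top of a script.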
Package Versions
Packages are collections of functions maintained by someone else.
Sometimes those people update the functions.
Great! Probably the new functions are better!
But sometimes this is catastrophic for you.
If your code relies entirely on v0.9 of some_function() which always returned a boolean,
and now v1.0 returns an integer,
all of your code using some_function() just broke.
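To make the breakage concrete, here is a minimal sketch. Both versions of some_function() are hypothetical stand-ins written out side by side, and is_positive() is an invented piece of downstream code that was written against v0.9.

```r
# Two hypothetical versions of some_function()
some_function_v0_9 <- function(x) x > 0               # returns a logical
some_function_v1_0 <- function(x) as.integer(x > 0)   # returns an integer

# Downstream code written against v0.9: it expects exactly TRUE or FALSE
is_positive <- function(result) isTRUE(result)

old <- is_positive(some_function_v0_9(3))   # TRUE
new <- is_positive(some_function_v1_0(3))   # FALSE, because 1L is not a logical

# To see which version of an installed package is currently in use:
stats_version <- packageVersion("stats")
```

Nothing errors here, which is part of the problem: the result silently changes when the package is updated.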
Environments
Environments will let us be explicit about which package versions we are using.
They allow you to keep track of which packages you use, and their versions.
They let you load those specific versions of packages at any later date.
This makes environments incredibly important for reproducibility.
R environments: renv
We will be using renv to handle all of our packages.
Whenever you start a project,
Install renv
Initialize the environment
Install the rest of your packages
Setting up renv
If you have never installed renv before,
install.packages("renv")
Then, we tell renv to start tracking packages for our project.
This is as simple as,
renv::init()
To start with, the only things being tracked are renv itself and the version of R we are using.
Go ahead and restart the R Session. You should get a new message from renv at the start of the new session.
VS Code can’t find jsonlite, rlang
We also get a warning message at the start of the new session:
VSCode R Session Watcher requires jsonlite, rlang. Please install manually in order to use VSCode-R.
Why is this happening?
renv has created an isolated environment for our project.
The only package our system can see is renv itself!
All of the other packages we installed when setting up R and VS Code still exist on our machine, but they cannot be accessed.
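One way to see this isolation for yourself is to inspect the library search path. The sketch below only uses base R functions; the expectation that the first entry points inside renv/ applies when it is run from a session where renv is active.

```r
# Directories R searches for installed packages, in order.
# Inside an renv project the first entry is expected to be a folder under renv/.
lib_paths <- .libPaths()
print(lib_paths)

# Names of the packages installed in the first library on the search path
visible <- rownames(installed.packages(lib.loc = lib_paths[1]))
print(head(visible))
```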
Fixing VS Code renv issue
To fix this, we just need to tell renv to install the packages.
renv::install("languageserver") ## Takes a few minutes the first time
renv::install("httpgd")
If you’ve used renv to install these packages before,
you should see the message “linked from cache”.
renv is smart enough to know it has installed these packages before and instead of redownloading them,
simply provides a link to the right version on our machine.
Reload the session. No more VS Code warning message!
renv Files
.Rprofile
This activates renv at the start of any R session.
renv/
Where packages for the isolated environment are installed.
renv.lock
A small file listing packages and their version numbers.
renv.lock is the only piece needed to recreate your environment!
So renv/ doesn’t get committed to git, which is good, because it can be large.
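A rough sketch of how a collaborator might rebuild the environment from renv.lock is given below. It assumes renv is already installed and that the working directory is the project root; the guard simply skips the call when either assumption does not hold. renv::snapshot() is the companion call that writes the packages currently in use into renv.lock.

```r
# Assumes the working directory is the project root (where renv.lock lives).
lockfile_present <- file.exists("renv.lock")

if (lockfile_present && requireNamespace("renv", quietly = TRUE)) {
  # Reinstall the package versions recorded in renv.lock without asking for confirmation
  renv::restore(prompt = FALSE)
}
```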
The tidyverse
Introducing the tidyverse
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Developed in part by Hadley Wickham (a name we will see again) and supported by the company behind RStudio.
The most popular collection of R packages.
Specifically designed for readable data science.
Installing the tidyverse
renv::install("tidyverse")
Once it is installed for our environment, we can call,
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
This prints a nice message showing the packages loaded and any function name conflicts.
Some tidyverse packages
tibble: A replacement for the data.frame object, with better printing.
dplyr: Data science functions for adding new columns, filtering data, summarizing, etc.
lubridate: Makes it easier to work with dates.
stringr: Better regular expressions and string matching.
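As a quick illustration of how these four packages fit together, here is a minimal sketch. The trades table and its values are made up for the example and do not come from the course data.

```r
library(tibble)
library(dplyr)
library(lubridate)
library(stringr)

# Made-up data, for illustration only
trades <- tibble(
  ticker = c("AAA", "AAB", "BBC"),
  date   = c("2024-01-01", "2024-01-02", "2024-01-02"),
  price  = c(100, 98, 99)
)

result <- trades |>
  mutate(
    date = ymd(date),                           # lubridate: parse text into a Date
    year = year(date),                          # lubridate: extract the year
    starts_with_a = str_detect(ticker, "^A")    # stringr: regular expression match
  ) |>
  filter(starts_with_a)                         # dplyr: keep matching rows
```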
All of the join functions have the same basic structure.
[]_join(x, y, by = ___)
Here are all of the options:
full_join(tb1, tb2, by = "id")
left_join(tb1, tb2, by = "id")
right_join(tb1, tb2, by = "id")
inner_join(tb1, tb2, by = "id")
semi_join(tb1, tb2, by = "id")
anti_join(tb1, tb2, by = "id")
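The definitions of tb1 and tb2 are not shown in this section. The sketch below reconstructs them from the printed join results, so treat the exact definitions as an assumption rather than the original code.

```r
library(tibble)
library(dplyr)

# Reconstructed from the printed results; the original definitions are not shown here.
tb1 <- tibble(id = c("A", "B", "C", "D", "E"), x = c(4, 9, 2, 3, 8))
tb2 <- tibble(id = c("D", "E", "F", "G", "H"), y = c(7, 4, 1, 2, 6))

# Number of rows each mutating join returns
n_full  <- nrow(full_join(tb1, tb2, by = "id"))
n_left  <- nrow(left_join(tb1, tb2, by = "id"))
n_right <- nrow(right_join(tb1, tb2, by = "id"))
n_inner <- nrow(inner_join(tb1, tb2, by = "id"))
```

Only ids D and E appear in both tables, which is why the inner join has two rows and the full join has eight.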
dplyr::full_join()
Keeps all observations from both the x (left) and y (right) datasets.
full_join(tb1, tb2, by = "id")
# A tibble: 8 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 A         4    NA
2 B         9    NA
3 C         2    NA
4 D         3     7
5 E         8     4
6 F        NA     1
7 G        NA     2
8 H        NA     6
dplyr::left_join()
Keeps all observations from the x (left) dataset,
but only the matches from the y (right) dataset.
left_join(tb1, tb2, by = "id")
# A tibble: 5 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 A         4    NA
2 B         9    NA
3 C         2    NA
4 D         3     7
5 E         8     4
dplyr::right_join()
Keeps all observations from the y (right) dataset,
but only the matches from the x (left) dataset.
right_join(tb1, tb2, by = "id")
# A tibble: 5 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 D         3     7
2 E         8     4
3 F        NA     1
4 G        NA     2
5 H        NA     6
dplyr::inner_join()
Keeps only matches between the two datasets.
inner_join(tb1, tb2, by = "id")
# A tibble: 2 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 D         3     7
2 E         8     4
dplyr filtering joins
semi_join()
Keep observations in x (left) that have a match in y (right).
semi_join(tb1, tb2, by = "id")
# A tibble: 2 × 2
  id        x
  <chr> <dbl>
1 D         3
2 E         8
Note: only observations in x are returned, no values from y.
anti_join()
Keep observations in x (left) that have no match in y (right).
anti_join(tb1, tb2, by = "id")
# A tibble: 3 × 2
  id        x
  <chr> <dbl>
1 A         4
2 B         9
3 C         2
Note: only observations in x are returned, no values from y.
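One way to think about the filtering joins is as a filter() on x that uses y only to decide which rows to keep. The sketch below checks that equivalence; tb1 and tb2 are reconstructed from the printed results and are an assumption about the original tables.

```r
library(tibble)
library(dplyr)

# Reconstructed from the printed results; the original definitions are not shown here.
tb1 <- tibble(id = c("A", "B", "C", "D", "E"), x = c(4, 9, 2, 3, 8))
tb2 <- tibble(id = c("D", "E", "F", "G", "H"), y = c(7, 4, 1, 2, 6))

semi_result <- semi_join(tb1, tb2, by = "id")
anti_result <- anti_join(tb1, tb2, by = "id")

# The same rows, selected with filter() and %in%
semi_manual <- tb1 |> filter(id %in% tb2$id)
anti_manual <- tb1 |> filter(!(id %in% tb2$id))
```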
Merging by different column names
Often you will have slightly different identifying column names between two datasets.
We can handle this using join_by() in the “by” argument.
inner_join(tb1, tb2, by = join_by(id == ID))
# A tibble: 2 × 3
  id        x     y
  <chr> <dbl> <dbl>
1 D         3     7
2 E         8     4
With the join_by() function you can do inequality, rolling, or overlapping joins (see the documentation).
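A self-contained sketch of the equality case follows. The table tb2_upper is a hypothetical copy of tb2 whose key column has been renamed to ID, and both tables are reconstructed from the printed results rather than taken from the original code. join_by() requires dplyr 1.1.0 or later.

```r
library(tibble)
library(dplyr)

# Reconstructed from the printed results; the original definitions are not shown here.
tb1 <- tibble(id = c("A", "B", "C", "D", "E"), x = c(4, 9, 2, 3, 8))
tb2 <- tibble(id = c("D", "E", "F", "G", "H"), y = c(7, 4, 1, 2, 6))

# Hypothetical table whose key column is named ID instead of id
tb2_upper <- rename(tb2, ID = id)

# Match tb1$id against tb2_upper$ID
joined <- inner_join(tb1, tb2_upper, by = join_by(id == ID))
```

For an equality join like this one, only the key column from x is kept in the result, so the output has columns id, x, and y.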
“tidy” Data
Each variable is a column
Each observation is a row
This sounds almost tautological.
But a lot of data is not stored this way.
Which of the following tables of stock prices is tidy?
# A tibble: 3 × 3
  stock `2024-01-01` `2024-01-02`
  <chr>        <dbl>        <dbl>
1 A              100          105
2 B               98           95
3 C               99          103
# A tibble: 6 × 3
  stock date       price
  <chr> <date>     <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103
Why tidy?
Imagine we want to add another variable—volume—to the stock price tables.
# A tibble: 6 × 3
  stock name       value
  <chr> <chr>      <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103
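A brief sketch of why the long layout makes this easy: in tidy form a new variable is just one more column. The volume numbers below are made up purely for illustration, and tidy_tb is rebuilt here from the printed table.

```r
library(tibble)
library(dplyr)

# Rebuilt from the printed tidy table
tidy_tb <- tibble(
  stock = rep(c("A", "B", "C"), each = 2),
  date  = rep(as.Date(c("2024-01-01", "2024-01-02")), times = 3),
  price = c(100, 105, 98, 95, 99, 103)
)

# Made-up volume numbers, for illustration only
tidy_with_volume <- tidy_tb |>
  mutate(volume = c(10, 12, 7, 9, 20, 18))
```

In the wide table there is no single place to put volume: each date would need a second column, and the column names would have to encode both the date and the variable.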
pivot_longer
nontidy_tb
# A tibble: 3 × 3
  stock `2024-01-01` `2024-01-02`
  <chr>        <dbl>        <dbl>
1 A              100          105
2 B               98           95
3 C               99          103
Instead, we could list the columns not to lengthen.
nontidy_tb |> pivot_longer(cols = -c("stock"))
# A tibble: 6 × 3
  stock name       value
  <chr> <chr>      <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103
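The two ways of choosing columns can be compared directly. In this sketch nontidy_tb is rebuilt from the printed table, which is an assumption about how it was originally created; listing the date columns explicitly and excluding stock should give the same result.

```r
library(tibble)
library(tidyr)

# Rebuilt from the printed wide table
nontidy_tb <- tibble(
  stock        = c("A", "B", "C"),
  `2024-01-01` = c(100, 98, 99),
  `2024-01-02` = c(105, 95, 103)
)

# List the columns to lengthen ...
by_listing <- nontidy_tb |>
  pivot_longer(cols = c("2024-01-01", "2024-01-02"))

# ... or list the columns to leave alone
by_excluding <- nontidy_tb |>
  pivot_longer(cols = -c("stock"))
```

Excluding columns tends to be more convenient when there are many date columns, since the call does not change as new dates are added.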
pivot_longer names
pivot_longer defaults to naming new columns “name” and “value”, but you can override these.
# A tibble: 6 × 3
  stock date       price
  <chr> <chr>      <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103
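A call that would produce output like the table above might look as follows. The names_to and values_to arguments set the new column names; nontidy_tb is rebuilt from the printed table, so its construction is an assumption.

```r
library(tibble)
library(tidyr)

# Rebuilt from the printed wide table
nontidy_tb <- tibble(
  stock        = c("A", "B", "C"),
  `2024-01-01` = c(100, 98, 99),
  `2024-01-02` = c(105, 95, 103)
)

long_tb <- nontidy_tb |>
  pivot_longer(
    cols      = -c("stock"),
    names_to  = "date",    # old column names go here
    values_to = "price"    # cell values go here
  )
```

At this point date is still a character column, because it was built from column names.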
pivot_longer data types
We should also convert the date column to the date-type.
# A tibble: 6 × 3
  stock date       price
  <chr> <date>     <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103
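One possible way to do the conversion is to add a mutate() step after pivoting, as sketched below. The use of base R's as.Date() is a choice made for this example; the original slides may have used a different function. nontidy_tb is rebuilt from the printed table.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Rebuilt from the printed wide table
nontidy_tb <- tibble(
  stock        = c("A", "B", "C"),
  `2024-01-01` = c(100, 98, 99),
  `2024-01-02` = c(105, 95, 103)
)

tidy_tb <- nontidy_tb |>
  pivot_longer(cols = -c("stock"), names_to = "date", values_to = "price") |>
  mutate(date = as.Date(date))   # "2024-01-01" text becomes a Date value
```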
pivot_wider
Sometimes, you will want your data in wide format.
While this isn’t always “tidy”, it can be useful for some calculations, and then you can go back to long.
tidy_tb
# A tibble: 6 × 3
  stock date       price
  <chr> <date>     <dbl>
1 A     2024-01-01   100
2 A     2024-01-02   105
3 B     2024-01-01    98
4 B     2024-01-02    95
5 C     2024-01-01    99
6 C     2024-01-02   103
The two key arguments: names_from = and values_from =.
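A minimal sketch of those two arguments in use: names_from says which column supplies the new column names, and values_from says which column fills the cells. tidy_tb is rebuilt here from the printed table.

```r
library(tibble)
library(tidyr)

# Rebuilt from the printed tidy table
tidy_tb <- tibble(
  stock = rep(c("A", "B", "C"), each = 2),
  date  = rep(as.Date(c("2024-01-01", "2024-01-02")), times = 3),
  price = c(100, 105, 98, 95, 99, 103)
)

wide_tb <- tidy_tb |>
  pivot_wider(
    names_from  = date,    # one new column per distinct date
    values_from = price    # filled with the matching price
  )
```

This should give back one row per stock with one column per date, the same shape as the non-tidy table.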