2024-01-01
renv
data.table
renv
R by default has one library of packages for all projects.
You can see where this library lives by calling,
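.libPaths()   # base R function that returns the active library paths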
Notice there are actually two library paths returned!
“User” Library
“System” Library
Installing packages puts them into the User Library. Calling library() loads a package.
There are two problems with R’s default package setup:
Installing or updating a package for one project changes it for every other project.
There is no record of which package versions a project used, so the setup cannot be reproduced later.
These are obviously related, which is why renv solves both.
renv
Separates Project Libraries
renv::init() makes isolated libraries for each project.
renv
Cache
In order to avoid duplicate copies and downloads, renv stores a cache of all the packages it installs behind the scenes.
This solves projects changing each other’s packages, but does not yet give us any way to track and restore packages.
renv
lockfile
The lockfile keeps track of which packages are used in a project. It does not default to including everything in the Project Library.
(for example, the languageserver package)
renv
Record Packages to the Lockfile
In order to record a new package to the lockfile, run
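renv::snapshot()   # records the packages your project uses into renv.lock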
This causes renv to search all of your project’s R files, see what packages are being used, and save the current versions to the lockfile.
renv
Status
To see if your lockfile is up-to-date, you can run
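renv::status()   # reports any packages that are out of sync with the lockfile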
renv will also tell you if everything is up-to-date when you launch a new R terminal.
renv
Restore
If you open a project with a renv lockfile and want to recreate the package environment, run
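renv::restore()   # reinstalls the exact package versions recorded in the lockfile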
You will see this in the next assignment.
renv
Summary
data.table
data.table is another package focused on data science. Its goals: speed and memory efficiency. This is in contrast to the tidyverse, which is focused on readability.
data.table
Again, we use renv to install data.table for any project we are using.
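For example (a sketch; renv::install() is renv’s installer, and install.packages() also works inside an renv project):
renv::install("data.table")   # install into this project's library
renv::snapshot()              # then record it in the lockfile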
data.table
objects
data.tables are a new object to replace data.frames. Similar to tibbles, they have nicer default printing:
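A minimal sketch of the comparison (the x and y values printed below came from a different random draw):
library(data.table)

df <- data.frame(x = rnorm(200), y = rnorm(200))
df   # a data.frame prints every row

dt <- data.table(x = rnorm(200), y = rnorm(200))
dt   # a data.table prints only the first and last five rows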
            x           y
1  0.83377917  0.42855072
2 -0.36077184  0.53474323
3  2.11178136 -0.07476166
[... a data.frame prints all 200 rows ...]
x y
<num> <num>
1: 0.4144923 1.5800321
2: -0.0475305 1.3920581
3: 0.0200618 -0.7373673
4: 0.8364404 2.4736258
5: -0.7871103 0.7911376
---
196: 0.3394136 -0.3338586
197: -0.2314435 1.2737760
198: -1.8162878 0.2241776
199: 0.3350380 0.3556842
200: 1.1777340 1.0143951
data.table
square brackets
Remember, from base R, that we can subset vectors, matrices, and data.frames using [i,j].
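For example, in base R:
mtcars[1:3, c("mpg", "cyl")]   # rows 1 to 3 of two columns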
data.table
square brackets
data.table takes square brackets a huge step further, by allowing expressions that filter rows in “i”, compute on columns in “j”, and group with “by=”.
We will look at each of these in turn.
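The filtering examples that follow use a small table with two random columns, built with something like this (the printed values came from a different random draw):
dt <- data.table(a = rnorm(10), b = rnorm(10))
dt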
data.table
filtering
            a           b
<num> <num>
1: -0.1435757 -0.83542558
2: -0.7093246 0.73930375
3: 0.5523134 0.01372678
4: 1.0869389 -0.56078594
5: -1.0121181 0.44603375
6: 0.3343524 -1.63500508
7: -0.1386005 0.85560953
8: 1.5839076 0.46814137
9: -0.9337572 1.02040010
10: 0.4589008 -1.03079198
What if we want all the values where “a” is less than 0?
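dt[a < 0, ]   # keep only the rows where a is negative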
Instead we could do all the values where “a” is less than “b”.
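dt[a < b, ]   # keep only the rows where a is less than b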
In fact, we can drop the comma, as data.table will assume a single argument inside the brackets is the filtering (row) argument.
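dt[a < b]   # same result as dt[a < b, ]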
tidyverse
filtering equivalent
Here is a comparison of the data.table and tidyverse syntaxes side-by-side.
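Roughly:
dt[a < b]             # data.table
dt |> filter(a < b)   # tidyverse (dplyr::filter)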
Pretty similar, but data.table is shorter.
data.table
column subsetting
Remember that normal table subsetting uses “i” for rows, and “j” for columns. data.table follows this.
If you want two columns, you have to pass a list.
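dt[, b]         # one column, returned as a vector
dt[, .(a, b)]   # two columns; .() is shorthand for list() and returns a data.table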
data.table
variable declaration
What if we want to add a new column? data.table does this using the column subsetting place, “j”.
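dt[, newcol := a + b]   # add a new column in the j place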
:= is a data.table operator meaning “defined as”: here, newcol is defined as “a” plus “b”. Some people also call it the “walrus” operator.
data.table
variable declaration
Would this work?
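dt[newcol := a + b]   # note: no comma, so this lands in the i place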
No.
data.table will assume we are using the “i” place for rows:
Error: Operator := detected in i, the first argument inside DT[...], but is only valid in the second argument, j. Most often, this happens when forgetting the first comma (e.g. DT[newvar := 5] instead of DT[ , new_var := 5]). Please double-check the syntax. Run traceback(), and debugger() to get a line number.
data.table
variable editing
We can use the exact same syntax to edit current columns.
a b
<int> <num>
1: 1 1.1949701
2: 2 -0.1591262
3: 3 1.4724939
4: 4 -1.2206118
5: 5 -1.3712154
data.table
conditional editing
What if we want to remove values of “b” that are less than 0?
We could conditionally declare values of “b” to be NA.
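dt[b < 0, b := NA]   # for rows where b is negative, set b to NA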
data.table
Remember in tidyverse we could write:
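mtcars |> summarize(avg = mean(mpg))   # a sketch using dplyr; avg = mean(mpg) matches the column added below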
The equivalent in data.table is to pass a list .() to the “j” place:
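mt <- as.data.table(mtcars)   # make mtcars a data.table first
mt[, .(avg = mean(mpg))]      # .() in j returns a one-row summary table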
Note that I had to make “mtcars” a data.table first.
data.table
Instead, I could simply make the mean a new column.
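mt[, avg := mean(mpg)]   # := adds avg as a new column instead
mt                       # := prints nothing, so print mt explicitly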
mpg cyl disp hp drat wt qsec vs am gear carb avg
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 20.09062
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 20.09062
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 20.09062
4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 20.09062
5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 20.09062
6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 20.09062
7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 20.09062
8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 20.09062
9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 20.09062
10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 20.09062
11: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 20.09062
12: 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 20.09062
13: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 20.09062
14: 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 20.09062
15: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 20.09062
16: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 20.09062
17: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 20.09062
18: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 20.09062
19: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 20.09062
20: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20.09062
21: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 20.09062
22: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 20.09062
23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 20.09062
24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 20.09062
25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 20.09062
26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 20.09062
27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 20.09062
28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 20.09062
29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 20.09062
30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 20.09062
31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 20.09062
32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 20.09062
mpg cyl disp hp drat wt qsec vs am gear carb avg
data.table will recycle (spread) any single value to the length of the table.
We can summarize multiple values or columns by simply adding arguments to .().
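mt[, .(avg = mean(mpg), med = median(mpg), n = .N)]   # .N is the row count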
How do we do this by group?
We could use the filtering spot, “i”, and append each result
or…
We can use the “hidden” third square bracket argument: “by=”.
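mt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = cyl]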
cyl avg med n
<num> <num> <num> <int>
1: 6 19.74286 19.7 7
2: 4 26.66364 26.0 11
3: 8 15.10000 15.2 14
We can pass multiple columns to “by=” using the list shorthand .().
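mt[, .(avg_mpg = mean(mpg)), by = .(cyl, gear)]   # group by two columns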
data.table
argument summary
dt[i, j, by = ]
i: filtering rows
j: selecting columns, creating new columns, summarizing
by =: grouping columns
tidyverse
equivalent verbs
i: filter()
j: select(), mutate(), summarize()
by =: group_by()
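For example (a reconstruction; the exact code isn’t shown, but filtering to hp > 100 reproduces the outputs below):
mtcars |>
  filter(hp > 100) |>
  group_by(cyl, gear) |>
  summarize(avg_mpg = mean(mpg))                          # tidyverse

mt[hp > 100, .(avg_mpg = mean(mpg)), by = .(cyl, gear)]   # data.table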
# A tibble: 7 × 3
# Groups: cyl [3]
cyl gear avg_mpg
<dbl> <dbl> <dbl>
1 4 4 21.4
2 4 5 30.4
3 6 3 19.8
4 6 4 19.8
5 6 5 19.7
6 8 3 15.0
7 8 5 15.4
cyl gear avg_mpg
<num> <num> <num>
1: 6 4 19.75
2: 6 3 19.75
3: 8 3 15.05
4: 4 5 30.40
5: 8 5 15.40
6: 6 5 19.70
7: 4 4 21.40
data.table syntax is more concise, but harder to read.
The main answer: which one you use mostly comes down to personal preference.
Second answer: data.table is much faster.
data.table is much faster
A simulation exercise by mdharris makes this clear.
“site_count” is a measure of the simulation size (i.e. the size of the data.frames). Some costly, but not absurd, calculations are then performed on the data.
data.table clearly scales much better with size.
Why is data.table faster?
data.table modifies data in place (“by reference”); tidyverse functions instead “copy-on-modify”.
Copy-on-modify: changing the data creates a new copy of it.
Modify-in-place (modify “by reference”): avoids duplicating data, which is costly for your computer’s memory, especially with large datasets.
You may have noticed that I had to add an extra call to print data.tables after modifying them. This is because a line like dt[, y := c("a", "b")] doesn’t print its result: it edits the data in place rather than creating (and returning) a visible copy.
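dt <- data.table(x = 1:2)   # illustrative table; the original dt isn't shown
dt[, y := c("a", "b")]      # modifies dt in place; nothing is printed
dt                          # an extra call is needed to see the result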
Modifying by reference can be dangerous if you are not careful.
What happens to “dt” if we remove a column from “dt2”?
“dt” and “dt2” reference the same underlying data.
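A sketch of the danger (copy() is data.table’s function for making a true, independent copy):
dt2 <- dt          # NOT a copy: both names point to the same data
dt2[, y := NULL]   # removing a column from dt2...
dt                 # ...removes it from dt too
dt3 <- copy(dt)    # copy() gives an independent data.table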
Keys provide some organization to your data.
This makes it much faster to later:
filter
group by
merge
data.table lets you add keys in a variety of ways.
You can also pass a vector of keys, c("x", "y"), to use.
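For example (illustrative calls; setkey(), setkeyv(), and the key= argument are all real data.table options):
setkey(dt, x)                            # sort and index dt by column x
setkeyv(dt, c("x", "y"))                 # or pass a character vector of key columns
dt <- data.table(x = 3:1, y = 1:3, key = "x")   # or set the key at creation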
Keys provide an index for the data.
Think of this like a phone book.
If you wanted to look my phone number up in the phone book,
you would search under “D” for DeHaven
you would not search the entire phone book!
When we tell data.table to use some columns as a key, it presorts and indexes the data on those columns.
data.table
For data.table, all types of merges are handled by one function:
dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(x = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by = "x")
Key: <x>
x y z
<char> <int> <int>
1: a 1 4
2: c 3 5
By default, merge() does an “inner” join (matches in both datasets).
data.table
Instead, we can force a “left” join, or an “outer” join.
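merge(dt, dt2, by = "x", all.x = TRUE)   # "left" join: keep every row of dt
merge(dt, dt2, by = "x", all = TRUE)     # "outer" join: keep every row of both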
We can also be explicit about which columns to use from “x” and from “y”,
dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(XX = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by.x = "x", by.y = "XX")
Key: <x>
x y z
<char> <int> <int>
1: a 1 4
2: c 3 5
But it’s probably better to clean up your column names beforehand.
data.table
pivots
Two main functions:
melt(): go from wide to long
dcast(): go from long to wide
We will see how to use each.
data.table has no pivot_longer() or pivot_wider(); there is instead a package, tidyfast, which offers dt_pivot_longer() and dt_pivot_wider() to match the syntax of tidyr but with the speed of data.table.
melt
Use it to go from wide to long data.
You can provide the “id.vars=” argument, i.e. the columns not to be melted.
dt <- data.table(
measure = c("unem", "GDP"),
RI = c(0.05, 100),
MA = c(0.04, 102),
CT = c(0.06, 101)
)
dt
measure RI MA CT
<char> <num> <num> <num>
1: unem 5e-02 0.04 0.06
2: GDP 1e+02 102.00 101.00
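melt(dt, id.vars = "measure")   # every column except measure gets melted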
measure variable value
<char> <fctr> <num>
1: unem RI 0.05
2: GDP RI 100.00
3: unem MA 0.04
4: GDP MA 102.00
5: unem CT 0.06
6: GDP CT 101.00
melt
Instead, you can provide the “measure.vars=” argument, i.e. the columns to be melted.
dt <- data.table(
measure = c("unem", "GDP"),
RI = c(0.05, 100),
MA = c(0.04, 102),
CT = c(0.06, 101)
)
melt(dt, measure.vars = c("RI", "MA", "CT"))
measure variable value
<char> <fctr> <num>
1: unem RI 0.05
2: GDP RI 100.00
3: unem MA 0.04
4: GDP MA 102.00
5: unem CT 0.06
6: GDP CT 101.00
Generally there are more measure columns than ID columns, so it’s easier to list “id.vars=”.
melt
You can overwrite the default “variable” and “value” column names,
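For example (variable.name= and value.name= are melt()’s arguments; “state” and “amount” are just illustrative names):
melt(dt, id.vars = "measure",
     variable.name = "state", value.name = "amount")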
dcast
It has a bit of an odd syntax: it takes the data.table and a “formula”. The formula is of the form “LHS ~ RHS”: the LHS variables identify the rows, and the unique values of the RHS variables become the new columns.
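long <- melt(dt, id.vars = "measure")   # the long data from before
dcast(long, measure ~ variable)         # back to wide: one column per state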
tidyverse or data.table?
In practice, I (and many people) use a mix of both.
I personally prefer data.table’s syntax for filtering, modifying columns, and summarizing.
But I find melt() and dcast() much clunkier than the pivot_ functions.
If you really want the speed of data.table and the syntax of dplyr…
dtplyr
…there is a special package that combines the two!
dtplyr allows you to use all of the tidyverse verbs:
mutate()
filter()
group_by()
But behind the scenes it runs everything using data.table.
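A sketch of how dtplyr is used (lazy_dt() wraps the data; as_tibble() collects the result):
library(dplyr)
library(dtplyr)

mtcars |>
  lazy_dt() |>                     # wrap the data; the work is done by data.table
  filter(hp > 100) |>
  group_by(cyl) |>
  summarize(avg_mpg = mean(mpg)) |>
  as_tibble()                      # collect the result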
You can use pipes with data.table as well. The underscore character _ is known as a “placeholder” and represents the object being piped.
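For example, with the native pipe (R 4.3+ allows _ as the object being subset):
as.data.table(mtcars) |>
  _[hp > 100, .(avg_mpg = mean(mpg)), by = cyl]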
This is a bit less useful here because the data.table syntax is already very short.
renv
snapshot() packages to lockfiles
data.table
dt[i, j, by=]
merge()
melt() and dcast()
Pick your own journey if you prefer data.table or tidyverse!
Though I will make you use both on this next assignment…