2024-01-01
renv
data.table
renv
R by default has one library of packages for all projects.
You can see where this library lives by calling,
Notice there are actually two library paths returned!
“User” Library
“System” Library
Installing packages puts them into the User Library. Calling library()
loads a package.
There are two problems with R’s default package setup.
These are obviously related, which is why renv
solves both.
renv
Separates Project Librariesrenv::init()
makes isolated libraries for each project.
renv
CacheIn order to avoid duplicate copies and downloads, renv
stores a cache of all the packages it installs behind the scenes.
This solves projects changing each other’s packages, but does not yet give us any way to track and restore packages.
renv
lockfileThe lockfile keeps tracks of what packages are used in a project.
It does not default to everything in the Project Library.
languageserver
packagerenv
Record Packages to the LockfileIn order to record a new package to the lockfile, run
This causes renv
to search all of your project’s R files, see what packages are being used, and save the current version to the lockfile.
renv
StatusTo see if your lockfile is up-to-date, you can run
renv
will also tell you if everything is up-to-date when you launch a new R terminal.
renv
RestoreIf you open a project with a renv
lockfile and want to recreate the package environment, run
You will see this in the next assignment.
renv
Summarydata.table
data.table
Is another package focused on data science.
It’s goals:
In contrast to the tidyverse
which is focused on readability.
data.table
Again, we use renv
to install data.table for any project we are using.
data.table
objectsdata.table
s are a new object to replace data.frame
.
Similar to tibble
s, they have nicer default printing
x y
1 0.2176180759 0.845299920
2 0.8315502429 -1.912088839
3 0.4336599381 1.249713280
4 -0.6064116759 -0.492438762
5 0.1047437378 0.692491473
6 0.3954585758 0.363160549
7 -0.6322579392 0.461267335
8 2.0097489559 -1.794764289
9 0.1238147480 -0.897051101
10 0.4085475701 -0.596935133
11 0.3041320144 -0.961204785
12 1.6307634028 0.122357807
13 -1.4995724509 0.494226730
14 -0.8058411664 0.834436270
15 -0.4607678624 1.065866434
16 0.3728639124 0.898016393
17 0.2873486773 -1.232859638
18 1.5832830672 -0.835078286
19 0.7696143682 -0.714059577
20 1.3027338917 0.754586810
21 -0.5517187078 -1.295293331
22 -0.7377118245 0.311617171
23 -0.0502708149 -1.233521659
24 -0.6336227276 -1.340867827
25 1.1508161177 -0.119841729
26 0.1282261261 -0.006074305
27 -1.3645556345 1.045876379
28 -0.7225296817 -0.609757552
29 -0.0710353360 1.021934442
30 0.4490436085 -0.852037359
31 -0.3453445626 0.392332497
32 -0.2237668796 -1.554617727
33 0.2287305416 0.747754548
34 -0.1431252640 -0.715315296
35 0.8608447455 0.127446580
36 -0.0407475106 1.245841770
37 0.5347260810 0.604684439
38 0.8707140324 1.103927155
39 0.0665660930 -0.473655327
40 -0.8437202918 -0.296073582
41 -0.5167530696 -0.200588433
42 0.1830690735 0.589045879
43 -0.1814497300 1.236846776
44 1.0407452099 -0.369801577
45 -1.1621759916 -0.699248569
46 -0.0706461030 -0.169438587
47 0.4881736124 -1.487003804
48 -0.2875757311 -0.980333901
49 -0.6151910329 -1.289897742
50 0.1479521868 0.131813032
51 0.9285833469 0.154239001
52 -1.4326248346 -0.844985139
53 0.5484043089 -0.990673508
54 0.2310564403 -0.816596765
55 -0.6424217470 1.996500471
56 -0.3262261102 0.306810375
57 0.9434319958 2.521753953
58 0.5484243505 -0.123019321
59 0.4295029437 -1.041834460
60 -0.4339641311 -0.906090005
61 -0.3022943673 -0.720878989
62 1.5578969895 -0.350037612
63 -0.3240115684 0.562773753
64 -0.8786397075 0.284150591
65 1.5131266848 -0.538517829
66 1.4506597254 1.040535004
67 0.1666648344 0.130587574
68 -2.0610876556 -0.395751313
69 -0.3894989597 -0.790612269
70 -0.2244559728 0.616398583
71 0.8438949857 1.393255004
72 1.3669475878 0.365977405
73 0.3479722869 0.816954126
74 0.6983610707 -1.479526273
75 -1.5830269844 -0.633090571
76 -0.0031007704 0.232942847
77 -1.5341464783 -0.021729267
78 0.6597705557 1.115202442
79 -1.3384155751 0.707655326
80 0.4796858415 0.557570314
81 0.7251979941 0.271521812
82 1.4251148491 0.738065918
83 0.9406016120 0.148378969
84 -0.3575855515 0.641216967
85 -0.6777377542 1.852589696
86 -1.3610219367 1.420314426
87 0.0539814614 0.338860244
88 -0.5988386125 -0.084309737
89 1.5994385376 0.351866923
90 -2.4144181291 -0.728598194
91 0.5595766221 -2.389937515
92 0.0972633356 0.504934616
93 0.8174738717 -0.188326693
94 -0.0194549340 0.016072918
95 -0.2639540897 -0.071135151
96 0.7203733441 1.239665018
97 1.6143563937 0.197653511
98 0.8921546644 -0.776755726
99 0.5003449229 -0.420460639
100 -0.7035744803 0.385971926
101 -0.6476566830 0.174015284
102 -0.4324603826 0.553896856
103 0.3151218822 0.529222141
104 1.5386630204 -0.670138704
105 0.3775913719 0.312006778
106 -1.1759612228 0.613993723
107 0.7506937497 -0.109901740
108 -1.3398935087 0.343893655
109 -0.3849629188 -1.600246580
110 -0.4634256708 -0.502375041
111 2.3872909513 0.399459501
112 -2.1184162857 0.937245325
113 1.1100140935 -0.766755258
114 -1.5148303806 -0.218489097
115 1.2057237133 -0.292945891
116 1.4373222907 -0.800390299
117 -1.0837885668 1.120067314
118 0.8953420196 -1.078227278
119 -0.3597390619 -0.349590498
120 1.6101408730 0.477827312
121 -0.3571059713 -1.219136523
122 0.2103909051 1.175544623
123 0.7434510791 -0.687909615
124 0.7884631157 -0.277844921
125 0.8051439795 1.671374901
126 -1.0227864563 0.998307121
127 1.3255777717 0.786876129
128 -0.4262748214 0.788845679
129 -1.3381830514 -0.025603406
130 -0.2886872702 -1.778837381
131 0.0273585255 0.031655419
132 0.0738927619 0.522333715
133 1.4057382752 -1.777153028
134 0.3559898698 -0.327881723
135 -0.1970144183 1.231251409
136 0.8782132330 1.444496074
137 -2.5476176862 -0.366210316
138 1.4173314139 0.781311418
139 0.4800506036 0.032877313
140 -0.3520628645 0.089089469
141 0.4517307219 1.350319879
142 -1.0068895688 -1.295883083
143 0.8896057310 1.251802086
144 -0.0237487748 1.245044351
145 -0.2346514033 -1.129440506
146 0.5333118573 2.037994871
147 -1.0192772118 0.464508929
148 0.3487913416 -0.216491297
149 -0.4091664460 -1.455393331
150 -0.2366783042 0.558416532
151 0.5953597924 1.937447736
152 0.8193235631 -0.373634136
153 0.0737412483 -0.001285201
154 0.3153821430 1.118826940
155 0.4229540634 1.710937613
156 1.8123798460 0.281283542
157 -0.3028493645 1.292680564
158 0.2810965731 -0.056968763
159 -1.5322632907 -0.025268407
160 -0.6603110360 -1.043771853
161 0.0181564449 -0.450405893
162 0.0060277508 1.301952383
163 -0.8765741073 -1.546954587
164 0.3374973999 2.163462794
165 1.5868941296 -0.215335337
166 -0.9869984369 -0.041920017
167 -1.3139744370 -1.274100225
168 -1.1966182645 -1.125421692
169 -1.3411635694 1.093781900
170 -2.0317296438 -0.835090228
171 -1.2795675426 -0.020887600
172 -0.1587601353 -0.410541121
173 0.8078697894 1.013471640
174 0.6711548706 -1.121446590
175 -0.6752803493 -1.611938891
176 -0.1133926579 0.817988591
177 -0.7549755668 -1.645389354
178 -0.4449720950 -0.148669054
179 -0.1603629544 0.303937067
180 -0.4825257439 0.586513872
181 1.7719553903 0.190627776
182 0.2748432332 -1.716396322
183 -0.0589740749 1.190722022
184 -1.8216225913 0.149191449
185 -0.1231964074 -0.519365064
186 -0.0391327213 0.793177325
187 -0.7987688907 -1.411250361
188 -2.3390080938 -0.070518619
189 -0.3773146067 -0.049133419
190 -0.9954655132 0.263681756
191 -0.0002018366 -1.157001243
192 -1.0395385565 0.238082780
193 -0.5684007849 -0.561563040
194 -0.9005807751 0.111836140
195 0.5015315683 1.121351091
196 -1.4971604623 0.347850056
197 0.8480539903 0.646499501
198 -0.0707157342 1.071113214
199 0.3868853709 2.175954241
200 0.5887495941 2.570317001
x y
<num> <num>
1: -1.00897359 -0.4087155
2: 2.55870500 -0.2288996
3: 1.18031640 -2.4433400
4: -1.76549680 -1.1524490
5: 2.10837263 1.0708214
---
196: -0.82412022 0.6333711
197: -1.76409078 1.0891432
198: 0.58533120 0.4220185
199: -0.04761425 0.4647622
200: -0.80502935 0.4626533
data.table
square bracketsRemember, from base R, than we can subset vectors, matrices, and data.frames using [i,j]
.
data.table
square bracketsdata.table
takes square brackets a huge step further, by allowing
We will look at each of these in turn.
data.table
filtering a b
<num> <num>
1: -0.319324871 -0.2243540
2: -1.003274392 0.3469449
3: -1.064288996 -0.4809599
4: 0.005162979 0.1318561
5: -1.822072621 0.3770973
6: -0.750549664 -0.2684060
7: 1.282488595 -0.9357365
8: 0.791203323 0.1288643
9: 0.507654181 1.6424895
10: -2.061684596 0.7237209
What if we want all the values where “a” is less than 0?
data.table
filtering a b
<num> <num>
1: -0.319324871 -0.2243540
2: -1.003274392 0.3469449
3: -1.064288996 -0.4809599
4: 0.005162979 0.1318561
5: -1.822072621 0.3770973
6: -0.750549664 -0.2684060
7: 1.282488595 -0.9357365
8: 0.791203323 0.1288643
9: 0.507654181 1.6424895
10: -2.061684596 0.7237209
Instead we could do all the values where “a” is less than “b”.
data.table
filtering a b
<num> <num>
1: -0.319324871 -0.2243540
2: -1.003274392 0.3469449
3: -1.064288996 -0.4809599
4: 0.005162979 0.1318561
5: -1.822072621 0.3770973
6: -0.750549664 -0.2684060
7: 1.282488595 -0.9357365
8: 0.791203323 0.1288643
9: 0.507654181 1.6424895
10: -2.061684596 0.7237209
In fact, we can drop the comma, as data.table
will assume that only one item is always the filtering argument.
tidyverse
filtering equivalentHere is a comparison of the data.table
and tidyverse
syntaxes side-by-side.
Pretty similar, but data.table
is shorter.
data.table
column subsettingRemember that normal table subsetting uses “i” for rows, and “j” for columns.
data.table
follows this.
If you want two columns, you have to pass a list.
data.table
variable declarationWhat if we want to add a new column?
data.table
does this using the column subsetting place, “j”.
:=
is a data.table
function representing “defined”.
i.e. newcol defined as ‘a’ plus ‘b’
Also called the “walrus” operator by some people.
data.table
variable declarationWould this work?
No.
data.table
will assume we are using the “i” place for rows.
Error: Operator := detected in i, the first argument inside DT[...], but is only valid in the second argument, j. Most often, this happens when forgetting the first comma (e.g. DT[newvar := 5] instead of DT[ , new_var := 5]). Please double-check the syntax. Run traceback(), and debugger() to get a line number.
data.table
variable editingWe can use the exact same syntax to edit current columns.
a b
<int> <num>
1: 1 0.7227419
2: 2 0.5990643
3: 3 0.2738779
4: 4 -0.0761493
5: 5 -2.0419961
data.table
conditional editingWhat if we want to remove values of “b” that are less than 0?
We could conditionally declare values of “b” to be NA
.
data.table
Remember in tidyverse
we could write:
The equivalent in data.table
is to pass a list .()
to the “j” place:
Note that I had to make “mtcars” a data.table
first.
data.table
Instead, I could simply make the mean a new column.
mpg cyl disp hp drat wt qsec vs am gear carb avg
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 20.09062
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 20.09062
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 20.09062
4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 20.09062
5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 20.09062
6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 20.09062
7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 20.09062
8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 20.09062
9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 20.09062
10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 20.09062
11: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 20.09062
12: 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 20.09062
13: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 20.09062
14: 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 20.09062
15: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 20.09062
16: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 20.09062
17: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 20.09062
18: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 20.09062
19: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 20.09062
20: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20.09062
21: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 20.09062
22: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 20.09062
23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 20.09062
24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 20.09062
25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 20.09062
26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 20.09062
27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 20.09062
28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 20.09062
29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 20.09062
30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 20.09062
31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 20.09062
32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 20.09062
mpg cyl disp hp drat wt qsec vs am gear carb avg
data.table
will spread any single value to the length of the table.
We can summarize multiple values or columns by simply adding arguments to .()
How do we do this by group?
We could use the filtering spot, “i”, and append each result
or…
We can use the “hidden” third square bracket argument: “by=”.
cyl avg med n
<num> <num> <num> <int>
1: 6 19.74286 19.7 7
2: 4 26.66364 26.0 11
3: 8 15.10000 15.2 14
We can pass multiple columns to by using the short list: .()
data.table
argument summarydt[
i ,
j ,
by = ]
filtering rows
filtering columns, new columns, summarize
group by columns
tidyverse
equivalent verbs
filter()
select()
, mutate()
, summarize()
groub_by()
# A tibble: 7 × 3
# Groups: cyl [3]
cyl gear avg_mgp
<dbl> <dbl> <dbl>
1 4 4 21.4
2 4 5 30.4
3 6 3 19.8
4 6 4 19.8
5 6 5 19.7
6 8 3 15.0
7 8 5 15.4
cyl gear avg_mpg
<num> <num> <num>
1: 6 4 19.75
2: 6 3 19.75
3: 8 3 15.05
4: 4 5 30.40
5: 8 5 15.40
6: 6 5 19.70
7: 4 4 21.40
data.table
syntax is more concise, but harder to read.
The main answer:
Second answer:
data.table
is much fasterA simulation exercise by mdharris makes this clear.
“site_count” is a measure of the simulation size (i.e. size of the data.frames). Then some costly, but not absurd calculations are performed on the data.
data.table
clearly scales much better with size.
data.table
faster?tidyverse
functions instead “copy-on-modify”.
Copy-on-modify
Modify-in-place (modify “by reference”)
Avoids duplicating data, which is costly for your computer’s memory, especially with large datasets.
You may have noticed that I had to add an extra call to print data.table
s after modifying them.
This is because the line dt[, y := c("a", "b")]
doesn’t return anything.
Because it edited the data in place, not by creating a copy.
Modifying by reference can be dangerous if you are not careful.
What happens to “dt” if we remove a column from “dt2”?
“dt” and “dt2” reference the same underlying data.
Keys provide some organization to your data.
This makes it much faster to later:
filter
group by
merge
data.table
lets you add keys in a variety of ways.
You can also pass a vector of keys c("x", "y")
to use.
Keys provide an index for the data.
Think of this like a phone book.
If you wanted to look my phone number up in the phone book,
you would search under “D” for DeHaven
you would not search the entire phone book!
When we tell data.table
to use some columns as a key, it presorts and indexes the data on those columns.
data.table
For data.table
all types of merges are handled by one function:
dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(x = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by = "x")
Key: <x>
x y z
<char> <int> <int>
1: a 1 4
2: c 3 5
By default, merge()
does an “inner” join (matches in both datasets).
data.table
Instead, we can force
A “left” join,
Or an “outer” join
We can also be explicit about which columns to use from “x” and from “y”,
dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(XX = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by.x = "x", by.y = "XX")
Key: <x>
x y z
<char> <int> <int>
1: a 1 4
2: c 3 5
But it’s probably better to clean up your column names before hand.
data.table
data.table
pivotsTwo main functions:
melt()
go from wide to long
dcast()
go from long to wide
We will see how to use each.
There is instead a package tidyfast
which offers a dt_pivot_longer()
and dt_pivot_wider()
to match the syntax of tidyr
but with the speed of data.table
.
melt
Use to go from wide to long data.
You can provide the “id.vars=” argument, i.e. the columns not to be melted.
dt <- data.table(
measure = c("unem", "GDP"),
RI = c(0.05, 100),
MA = c(0.04, 102),
CT = c(0.06, 101)
)
dt
measure RI MA CT
<char> <num> <num> <num>
1: unem 5e-02 0.04 0.06
2: GDP 1e+02 102.00 101.00
measure variable value
<char> <fctr> <num>
1: unem RI 0.05
2: GDP RI 100.00
3: unem MA 0.04
4: GDP MA 102.00
5: unem CT 0.06
6: GDP CT 101.00
melt
Instead, you can provide the “measure.vars=” argument, i.e. the columns to be melted.
dt <- data.table(
measure = c("unem", "GDP"),
RI = c(0.05, 100),
MA = c(0.04, 102),
CT = c(0.06, 101)
)
melt(dt, measure.vars = c("RI", "MA", "CT"))
measure variable value
<char> <fctr> <num>
1: unem RI 0.05
2: GDP RI 100.00
3: unem MA 0.04
4: GDP MA 102.00
5: unem CT 0.06
6: GDP CT 101.00
Generally there are more of those, so it’s easier to list the “id.vars=”.
melt
You can overwrite the default “variable” and “value” column names,
dcast
Has a bit of an odd syntax, it takes the data.table
and a “formula”.
The formula is of the form “LHS ~ RHS”,
tidyverse
or data.table
?tidyverse
or data.table
?In practice, I (and many people) use a mix of both.
I personally prefer data.tables
syntax for filtering, modifying columns, and summarizing.
But I find melt
and dcast
much clunkier than the pivot_
functions.
If you really want the speed of data.table
and the syntax of dplyr
…
dtplyr
…there is a special package that combines the two!
dtplyr
allows you to use all of the tidyverse
verbs:
mutate()
filter()
groupby()
But behind the scenes runs everything using data.table
.
You can use pipes with data.table
as well.
The underscore character _
is known as a “placeholder”,
and represents the object being piped.
A bit less useful because the data.table
syntax is already very short.
renv
snapshot()
packages to lockfilesdata.table
dt[i, j, by=]
merge()
melt()
and dcast()
Pick your own journey if you prefer data.table
or tidyverse
!
Though I will make you use both on this next assignment…