R Data Wrangling: data.table
Using packages in R for data science
Lecture Summary
- Talking more about
renv data.table
More renv
R Library
R by default has one library of packages for all projects.
R Library
You can see where this library lives by calling,
.libPaths(). . .

. . .
Notice there are actually two library paths returned!
. . .
“User” Library
- For any packages you install
“System” Library
- For default packages when base R is installed
R Library
Installing packages puts them into the User Library. Calling library() loads a package.
The Problems
There are two problems with R’s default package setup.
. . .
- All projects share the same library
- Updating a package for one project updates it for all
. . .
- It is not easy to record and recreate the exact packages used for a project
. . .
These are obviously related, which is why renv solves both.
renv Separates Project Libraries
renv::init() makes isolated libraries for each project.
renv Cache
In order to avoid duplicate copies and downloads, renv stores a cache of all the packages it installs behind the scenes.
. . .
This solves projects changing each other’s packages, but does not yet give us any way to track and restore packages.
renv lockfile
The lockfile keeps tracks of what packages are used in a project.
. . .
It does not default to everything in the Project Library.
- You may have some packages you use just for development that are not needed for all of your code to run.
. . .
- An example of this is the
languageserverpackage
renv Record Packages to the Lockfile
In order to record a new package to the lockfile, run
renv::snapshot(). . .
This causes renv to search all of your project’s R files, see what packages are being used, and save the current version to the lockfile.
renv Status
To see if your lockfile is up-to-date, you can run
renv::status(). . .
renv will also tell you if everything is up-to-date when you launch a new R terminal.
renv Restore
If you open a project with a renv lockfile and want to recreate the package environment, run
renv::restore(). . .
renv Summary

Image from renv’s documentation
data.table
data.table
Is another package focused on data science.
. . .
It’s goals:
- speed
. . .
- brevity
. . .
- no dependencies
. . .
In contrast to the tidyverse which is focused on readability.
Installing data.table
Again, we use renv to install data.table for any project we are using.
renv::install("data.table"). . .
Then we can load the package,
library(data.table)data.table objects
data.tables are a new object to replace data.frame.
. . .
Similar to tibbles, they have nicer default printing
data.frame(
x = rnorm(200),
y = rnorm(200)
) x y
1 0.32043156 0.255292484
2 -1.52156818 0.780172869
3 0.43904608 -2.289433190
4 -0.47216306 -0.934598397
5 1.28320089 0.546684013
6 0.57003809 -0.504787402
7 1.94341023 -0.856221565
8 1.15111036 -0.861376094
9 0.07349453 -1.714759807
10 0.90727820 -1.200123841
11 -0.02418606 0.274953010
12 0.01644661 -0.972020374
13 0.48830663 -1.153462981
14 0.74558022 0.180113591
15 0.89357158 0.595864478
16 1.00197169 -0.033486449
17 -0.74418794 1.287716070
18 0.31925527 0.167462080
19 -0.02374163 -1.446287271
20 0.52041651 -0.642466157
21 -0.75527718 -1.145122196
22 1.09576519 0.133047830
23 0.67527062 -0.074318948
24 -0.14879240 0.603668696
25 0.25864742 0.169725595
26 0.51900653 1.401554569
27 -1.44287935 0.270066647
28 0.58709278 -0.715203191
29 0.13398771 -1.120194780
30 0.19247806 0.134025204
31 -0.41755746 0.519240400
32 3.10831047 -1.001759194
33 0.20698485 -0.617488666
34 -1.31289165 1.270845033
35 -1.93444660 -0.444588449
36 0.17445051 -0.719792120
37 1.13481275 -0.268528563
38 0.72639175 -1.867091867
39 1.53655654 0.113343080
40 1.61039413 0.208378008
41 0.06639953 1.356375890
42 1.03420331 0.807229218
43 1.18910007 0.974370871
44 1.63308257 0.545667561
45 -0.25922699 1.505171424
46 -3.02996930 0.679779418
47 -0.99946826 -0.968191106
48 -1.46360515 0.085206322
49 -0.49978479 -0.442087739
50 0.95273736 -0.181899439
51 -0.30697742 0.489584687
52 -0.78924959 -0.187788104
53 0.69805805 0.765143520
54 -1.15901427 -0.565375828
55 -0.31897956 -0.740002592
56 0.22418744 1.224100696
57 -0.55604853 1.693238194
58 -0.04937874 -0.544029110
59 0.36596818 0.744066799
60 -0.91637705 0.408629168
61 -1.48921584 0.605874043
62 -1.11475813 0.316318864
63 0.81703789 0.646829669
64 1.32409425 0.729301939
65 -1.35527071 -0.024242493
66 -0.94590106 -0.345005007
67 0.49221902 -0.514970101
68 -0.79174755 -0.425284602
69 -1.36381233 0.674711342
70 -0.20665742 0.077557183
71 1.22243972 -0.296124744
72 1.09508760 -0.093748790
73 0.38238055 1.781639630
74 -0.22285146 0.992609479
75 -1.35818369 -0.024005653
76 -1.18089790 0.881957340
77 2.08851670 0.099149260
78 -0.41609538 0.010346044
79 -0.61208034 1.684050440
80 0.15706091 -0.937919698
81 0.74656601 0.719188824
82 0.28116569 0.386819700
83 0.05015600 -0.604327634
84 -2.22987390 -0.227338114
85 0.73350017 -1.384444667
86 -2.08337171 0.833925136
87 0.85599783 1.967301148
88 0.10921493 -1.295980382
89 0.10757758 1.117635835
90 -0.65842760 0.559917885
91 1.09782937 0.037720302
92 0.62406038 0.088822499
93 0.90593044 -0.139525437
94 1.83003542 -0.624084017
95 0.32911513 -0.576363062
96 -0.24675501 -1.745966392
97 -0.23612645 0.413709539
98 1.13515423 -0.755389959
99 -0.42067878 1.564925091
100 0.61928263 1.498008224
101 -1.46530885 1.495112804
102 0.08179590 -1.172137837
103 -0.43079213 0.729141824
104 -2.70732860 1.909422809
105 -0.07566363 0.428489566
106 -1.04536505 -0.334600774
107 -0.30211222 0.254321454
108 0.72270657 2.187983453
109 0.61865421 0.016845715
110 -0.61857051 -0.544445134
111 1.34592201 0.880067581
112 0.28685358 -1.823377896
113 -1.67583874 0.098851798
114 0.17749971 -1.751483507
115 -1.00863473 0.814252284
116 0.88887653 1.328688073
117 0.26386249 1.851143915
118 0.51129711 1.006128458
119 -2.50349066 -1.922437032
120 -0.48358079 0.003556990
121 3.09321135 -1.622522483
122 -1.34191653 0.824196461
123 0.58832407 -1.601493670
124 -0.19528333 -0.344344054
125 -1.39163330 1.526550889
126 -1.23018851 -0.186076691
127 -2.12083974 1.183320508
128 0.67860159 -1.290013969
129 -0.38560187 -0.644706402
130 -1.57409618 0.861269237
131 1.15679985 1.396595495
132 1.53500346 0.961147898
133 2.20402969 2.006467175
134 0.14174903 -0.409713811
135 0.90564795 -0.064135908
136 0.03977575 0.383090530
137 -0.09084079 1.678219779
138 0.87318458 -0.017240497
139 -0.80322980 0.631212252
140 0.39828993 -0.723220261
141 0.16136968 -0.781766695
142 0.64955312 0.458474796
143 0.41745918 -1.872256958
144 1.75946479 -1.509325067
145 -0.58695446 -1.119325704
146 -0.44877868 0.312101857
147 -0.15159862 0.674558291
148 1.77932594 -0.242251820
149 0.94139394 -0.931496771
150 -1.20808923 0.230492068
151 -1.03437094 1.802138894
152 0.22999593 -1.698665398
153 -0.40819777 -0.420539945
154 -0.18608621 -0.965874452
155 0.03660006 -0.528939636
156 -0.22005470 -1.127982790
157 -0.01765128 -0.102772000
158 -0.61780392 0.390727671
159 -0.64951037 1.183026732
160 0.54238296 1.295157955
161 0.79281110 -2.007003896
162 0.08755999 0.957376677
163 -1.43468894 -0.393436570
164 -1.14073345 -0.400682489
165 -0.66871910 -1.012278734
166 0.89460639 -0.413898606
167 -1.16682202 -0.877006514
168 0.75553614 1.903142444
169 1.33910779 1.508373725
170 -1.40488824 0.041895511
171 -0.76628668 1.016769504
172 1.20708480 0.103263702
173 -0.77278257 -0.413291878
174 0.86003059 -0.015786051
175 0.43449037 1.614497312
176 -0.77228654 0.567409911
177 0.71516670 -0.649307206
178 0.22850834 -0.587293187
179 -0.33895897 -1.691036146
180 0.02866362 -1.287643257
181 -0.84981182 1.552754555
182 -2.19818586 1.787115442
183 0.84234870 -0.779222442
184 -0.78278462 0.123003064
185 1.22907822 -1.023748030
186 -0.78079368 0.224536438
187 -1.44322337 0.187418961
188 1.40568901 -0.010079845
189 0.24230999 -0.006904995
190 1.46567854 0.026041782
191 -0.14865742 0.297176559
192 0.84164598 0.957362048
193 0.97925553 1.544227491
194 2.50872712 -0.951613682
195 -0.00588793 -0.306146125
196 0.73207565 -1.510935990
197 0.45234444 0.155306863
198 -0.56784000 2.060873615
199 0.74717576 -0.435403313
200 -1.53414795 0.404397048
data.table(
x = rnorm(200),
y = rnorm(200)
) x y
<num> <num>
1: -1.0112152 -0.50032928
2: -1.1969889 0.32248802
3: 0.4072197 1.03141524
4: -0.3486641 0.27314338
5: 1.0354548 0.63720437
---
196: 0.5688225 0.92180051
197: 0.2324362 -2.07045111
198: 0.6067250 1.23063655
199: 0.6402398 0.05148768
200: 2.2433042 -0.14723650
data.table square brackets
Remember, from base R, than we can subset vectors, matrices, and data.frames using [i,j].
. . .
This is true for data.table as well.
x <- data.table(a = rnorm(10), b = rnorm(10))
x[1,2] b
<num>
1: 1.356775
data.table square brackets
data.table takes square brackets a huge step further, by allowing
. . .
- filtering in the ‘i’ place
. . .
- new variable declaration in the ‘j’ place
. . .
We will look at each of these in turn.
data.table filtering
x <- data.table(a = c(-0.2, 0.1, 0.25, 0, -0.3), b = c(0.5, 0, -0.2, 0.1, 0.3))
x a b
<num> <num>
1: -0.20 0.5
2: 0.10 0.0
3: 0.25 -0.2
4: 0.00 0.1
5: -0.30 0.3
What if we want all the values where “a” is less than 0?
. . .
x[a < 0, ] a b
<num> <num>
1: -0.2 0.5
2: -0.3 0.3
data.table filtering
x a b
<num> <num>
1: -0.20 0.5
2: 0.10 0.0
3: 0.25 -0.2
4: 0.00 0.1
5: -0.30 0.3
Instead we could do all the values where “a” is less than “b”.
x[a < b, ] a b
<num> <num>
1: -0.2 0.5
2: 0.0 0.1
3: -0.3 0.3
data.table filtering
x a b
<num> <num>
1: -0.20 0.5
2: 0.10 0.0
3: 0.25 -0.2
4: 0.00 0.1
5: -0.30 0.3
In fact, we can drop the comma, as data.table will assume that only one item is always the filtering argument.
x[a < b] a b
<num> <num>
1: -0.2 0.5
2: 0.0 0.1
3: -0.3 0.3
tidyverse filtering equivalent
Here is a comparison of the data.table and tidyverse syntaxes side-by-side.
x[a < 0] a b
<num> <num>
1: -0.2 0.5
2: -0.3 0.3
Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':
between, first, last
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
x |>
filter(a < 0)# A tibble: 2 × 2
a b
<dbl> <dbl>
1 -0.2 0.5
2 -0.3 0.3
. . .
Pretty similar, but data.table is shorter.
data.table column subsetting
Remember: data.frame subsetting uses “i” for rows, and “j” for columns. data.table follows this.
. . .
dt <- data.table(a = rnorm(5), b = rnorm(5), c = rnorm(5))
dt[, a][1] -0.4476028 0.4165121 -1.3462013 0.7440658 -0.6615837
. . .
For two columns, pass a list:
dt[, list(a, b)] a b
<num> <num>
1: -0.4476028 -0.04112529
2: 0.4165121 -0.72898296
3: -1.3462013 0.42793992
4: 0.7440658 -0.32325760
5: -0.6615837 -0.87804230
Or .(a,b) as equivalent:
dt[, .(a, b)] a b
<num> <num>
1: -0.4476028 -0.04112529
2: 0.4165121 -0.72898296
3: -1.3462013 0.42793992
4: 0.7440658 -0.32325760
5: -0.6615837 -0.87804230
data.table variable declaration
What if we want to add a new column?
. . .
data.table does this using the column subsetting place, “j”.
. . .
dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, newcol := a + b]
dt a b newcol
<num> <num> <num>
1: 0.1762745 1.05950844 1.2357829
2: -0.2409535 0.03222178 -0.2087317
3: 0.5940959 0.96573903 1.5598350
4: 0.2615928 -0.64545535 -0.3838626
5: -0.1991795 0.58154695 0.3823674
. . .
:= is a data.table function representing “defined”. i.e. newcol defined as ‘a’ plus ‘b’.
Also called the “walrus” operator by some people.
data.table variable declaration
Would this work?
dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[newcol := a + b]. . .
No.
data.table will assume we are using the “i” place for rows.
dt[newcol := a + b]Error in `value[[3L]]()`:
! Operator := detected in i, the first argument inside DT[...], but is only valid in the second argument, j. Most often, this happens when forgetting the first comma (e.g. DT[newvar := 5] instead of DT[ , new_var := 5]). Please double-check the syntax. Run traceback(), and debugger() to get a line number.
data.table variable editing
We can use the exact same syntax to edit current columns.
dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, a := 1:5]
dt a b
<int> <num>
1: 1 -1.5621630
2: 2 -0.4067241
3: 3 -0.3234394
4: 4 0.4222994
5: 5 1.1477452
data.table variable editing
Or to remove a column,
dt[, b := NULL]
dt a
<int>
1: 1
2: 2
3: 3
4: 4
5: 5
data.table conditional editing
What if we want to remove values of “b” that are less than 0?
dt <- data.table(a = -2:2, b = -2:2). . .
We could conditionally declare values of “b” to be NA.
dt[b < 0, b := NA]
dt a b
<int> <int>
1: -2 NA
2: -1 NA
3: 0 0
4: 1 1
5: 2 2
. . .
In tidyverse, the equivalent is somewhat wordy:
dt |> mutate(b = if_else(b < 0, NA, b))Summarizing with data.table
Remember in tidyverse we could write:
mtcars |> summarize(avg = mean(mpg)) avg
1 20.09062
. . .
The equivalent in data.table is to pass a list .() to the “j” place:
dt <- as.data.table(mtcars)
dt[, .(avg = mean(mpg))] avg
<num>
1: 20.09062
. . .
Note that I had to make “mtcars” a data.table first.
Summarizing with data.table
Instead, I could simply make the mean a new column.
dt[, avg := mean(mpg)]
dt mpg cyl disp hp drat wt qsec vs am gear carb avg
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 20.09062
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 20.09062
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 20.09062
4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 20.09062
5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 20.09062
6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 20.09062
7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 20.09062
8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 20.09062
9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 20.09062
10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 20.09062
11: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 20.09062
12: 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 20.09062
13: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 20.09062
14: 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 20.09062
15: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 20.09062
16: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 20.09062
17: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 20.09062
18: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 20.09062
19: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 20.09062
20: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 20.09062
21: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 20.09062
22: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 20.09062
23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 20.09062
24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 20.09062
25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 20.09062
26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 20.09062
27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 20.09062
28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 20.09062
29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 20.09062
30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 20.09062
31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 20.09062
32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 20.09062
mpg cyl disp hp drat wt qsec vs am gear carb avg
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
data.table will spread any single value to the length of the table.
Multiple summaries
We can summarize multiple values or columns by simply adding arguments to .()
dt[, .(avg = mean(mpg), med = median(mpg), n = .N)] avg med n
<num> <num> <int>
1: 20.09062 19.2 32
. . .
How do we do this by group?
. . .
We could use the filtering spot, “i”, and append each result
dt[cyl == 6, .(avg = mean(mpg), med = median(mpg), n = .N)] avg med n
<num> <num> <int>
1: 19.74286 19.7 7
. . .
or…
Summarizing by group
We can use the “hidden” third square bracket argument: by=
dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = cyl] cyl avg med n
<num> <num> <num> <int>
1: 6 19.74286 19.7 7
2: 4 26.66364 26.0 11
3: 8 15.10000 15.2 14
. . .
We can pass multiple columns to by using the short list: .()
dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = .(cyl, gear)] cyl gear avg med n
<num> <num> <num> <num> <int>
1: 6 4 19.750 20.10 4
2: 4 4 26.925 25.85 8
3: 6 3 19.750 19.75 2
4: 8 3 15.050 15.20 12
5: 4 3 21.500 21.50 1
6: 4 5 28.200 28.20 2
7: 8 5 15.400 15.40 2
8: 6 5 19.700 19.70 1
data.table argument summary
dt[
i ,
j ,
by = ]
. . .
filtering rows
filtering columns, new columns, summarize
group by columns
. . .
tidyverse equivalent verbs
filter()
select(), mutate(), summarize()
groub_by()
Comparison
mtcars |>
filter(hp > 100) |>
group_by(cyl, gear) |>
summarize(m = mean(mpg))`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.
# A tibble: 7 × 3
# Groups: cyl [3]
cyl gear m
<dbl> <dbl> <dbl>
1 4 4 21.4
2 4 5 30.4
3 6 3 19.8
4 6 4 19.8
5 6 5 19.7
6 8 3 15.0
7 8 5 15.4
dt[hp > 100, .(m = mean(mpg)), by = .(cyl, gear)] cyl gear m
<num> <num> <num>
1: 6 4 19.75
2: 6 3 19.75
3: 8 3 15.05
4: 4 5 30.40
5: 8 5 15.40
6: 6 5 19.70
7: 4 4 21.40
. . .
data.table syntax is more concise, but harder to read.
Which should I use?
The main answer:
- personal preference (which syntax do you prefer?)
. . .
Second answer:
data.tableis much faster
. . .
- especially for large datasets
Speed comparison
A simulation exercise by mdharris makes this clear.

Why is data.table faster?
- optimized for speed
. . .
- Modifies “by reference”
. . .
tidyverse functions instead “copy-on-modify”.
Modify by reference vs. copy
Copy-on-modify
- Creates a copy of the data
- Modifies the copy
- Returns the copy
. . .
Modify-in-place (modify “by reference”)
- References the data
- Modifies the original data
. . .
Avoids duplicating data, which is costly for your computer’s memory, especially with large datasets.
Modify by reference vs. copy
You may have noticed that I had to add an extra call to print data.tables after modifying them.
. . .
This is because the line dt[, y := c("a", "b")] doesn’t return anything.
. . .
Because it edited the data in place, not by creating a copy.
Dangers of editing by reference
Modifying by reference can be dangerous if you are not careful.
. . .
dt <- data.table(a = 1:2, b = 3:4)
dt2 <- dt
dt2 a b
<int> <int>
1: 1 3
2: 2 4
. . .
What happens to “dt” if we remove a column from “dt2”?
. . .
dt2[, b := NULL]
dt a
<int>
1: 1
2: 2
. . .
“dt” and “dt2” reference the same underlying data.
Dangers of editing by reference
If you truly want a new copy of a data.table, use
dt2 <- copy(dt)Even Faster!
Keys
Keys provide some organization to your data.
. . .
This makes it much faster to later:
filter
group by
merge
. . .
data.table lets you add keys in a variety of ways.
data.table(x = c("a", "a", "b"), y = 1:3, key = "x")
dt <- as.data.table(df, key = "x")
setkey(dt, "x"). . .
You can also pass a vector of keys c("x", "y") to use.
What are keys?
Keys provide an index for the data.
. . .
Think of this like a phone book.
. . .
If you wanted to look my phone number up in the phone book,
you would search under “D” for DeHaven
you would not search the entire phone book!
. . .
When we tell data.table to use some columns as a key, it presorts and indexes the data on those columns.
Merging with data.table
For data.table all types of merges are handled by one function:
dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(x = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by = "x")Key: <x>
x y z
<char> <int> <int>
1: a 1 4
2: c 3 5
. . .
By default, merge() does an “inner” join (matches in both datasets).
Merging with data.table
A “left” join,
merge(dt, dt2, by = "x", all.x = TRUE)Key: <x>
x y z
<char> <int> <int>
1: a 1 4
2: b 2 NA
3: c 3 5
A “right” join,
merge(dt, dt2, by = "x", all.y = TRUE)Key: <x>
x y z
<char> <int> <int>
1: a 1 4
2: c 3 5
3: d NA 6
Or an “outer” join
merge(dt, dt2, by = "x", all = TRUE)Key: <x>
x y z
<char> <int> <int>
1: a 1 4
2: b 2 NA
3: c 3 5
4: d NA 6
Non-matching column names
We can also be explicit about which columns to use from “x” and from “y”,
dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(XX = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by.x = "x", by.y = "XX")Key: <x>
x y z
<char> <int> <int>
1: a 1 4
2: c 3 5
. . .
But it’s probably better to clean up your column names before hand.
data.table pivots
A package tidyfast which offers
dt_pivot_longer()dt_pivot_wider()
to match the syntax of tidyr but with the speed of data.table.
tidyverse or data.table?
tidyverse or data.table?
In practice, I (and many people) use a mix of both.
. . .
I personally prefer data.tables syntax for filtering, modifying columns, and summarizing.
. . .
If you really want the speed of data.table and the syntax of dplyr…
dtplyr
…there is a special package that combines the two!
dtplyr allows you to use all of the tidyverse verbs:
mutate()filter()groupby()
. . .
But behind the scenes runs everything using data.table.
A note on pipes
You can use pipes with data.table as well.
. . .
dt <- data.table(x = 1:2)
dt[, y := 3:4]
dt[, z := 5:6]
dt x y z
<int> <int> <int>
1: 1 3 5
2: 2 4 6
dt <- data.table(x = 1:2)
dt |>
_[, y := 3:4] |>
_[, z := 5:6]
dt x y z
<int> <int> <int>
1: 1 3 5
2: 2 4 6
. . .
The underscore character _ is known as a “placeholder”,
and represents the object being piped.
. . .
A bit less useful because the data.table syntax is already very short.
Summary
renv- Separate libraries
snapshot()packages to lockfiles
- Introduced to
data.table- Using square bracket positions
dt[i, j, by=] - Fast and concise
- Keys
merge()melt()anddcast()
- Using square bracket positions
Live Coding Example
- Loading
data.table
Coding Exercise
- Install and load
data.tablein a temporary workspace - Create a
data.tablewith columnsaandbof length 10 with random values - Add new column
cwhich is the sum ofaandb - Add a new column
dwhich is the product ofaandbwhenais greater than 0