R Data Wrangling: data.table

Using packages in R for data science

Author

Matthew DeHaven

Published

February 11, 2026

Lecture Summary

  • Talking more about renv
  • data.table

More renv

R Library

R by default has one library of packages for all projects.

G A User Library B Project 1 A--B C Project 2 A--C D Project 3 A--D

R Library

You can see where this library lives by calling,

.libPaths()

. . .

. . .

Notice there are actually two library paths returned!

. . .

“User” Library

  • For any packages you install

“System” Library

  • For default packages when base R is installed

R Library

Installing packages puts them into the User Library. Calling library() loads a package.

G A User Library B1 Project 1 A--B1 library() B2 Project 1 A--B2 B3 Project 1 A--B3 library() C CRAN C--A install.packages()

The Problems

There are two problems with R’s default package setup.

. . .

  1. All projects share the same library
  • Updating a package for one project updates it for all

. . .

  1. It is not easy to record and recreate the exact packages used for a project

. . .

These are obviously related, which is why renv solves both.

renv Separates Project Libraries

renv::init() makes isolated libraries for each project.

G A1 Library 1 B Project 1 A1--B A2 Library 2 C Project 2 A2--C A3 Library 3 D Project 3 A3--D

renv Cache

In order to avoid duplicate copies and downloads, renv stores a cache of all the packages it installs behind the scenes.

G Z renv Cache A1 Library 1 Z--A1 A2 Library 2 Z--A2 A3 Library 3 Z--A3 B Project 1 A1--B C Project 2 A2--C D Project 3 A3--D

. . .

This solves projects changing each other’s packages, but does not yet give us any way to track and restore packages.

renv lockfile

The lockfile keeps tracks of what packages are used in a project.

. . .

It does not default to everything in the Project Library.

  • You may have some packages you use just for development that are not needed for all of your code to run.

. . .

  • An example of this is the languageserver package

renv Record Packages to the Lockfile

In order to record a new package to the lockfile, run

renv::snapshot()

. . .

This causes renv to search all of your project’s R files, see what packages are being used, and save the current version to the lockfile.

renv Status

To see if your lockfile is up-to-date, you can run

renv::status()

. . .

renv will also tell you if everything is up-to-date when you launch a new R terminal.

renv Restore

If you open a project with a renv lockfile and want to recreate the package environment, run

renv::restore()

. . .

You will see this in the next assignment.

renv Summary

Image from renv’s documentation

data.table

data.table

Is another package focused on data science.

. . .

It’s goals:

  • speed

. . .

  • brevity

. . .

  • no dependencies

. . .

In contrast to the tidyverse which is focused on readability.

Installing data.table

Again, we use renv to install data.table for any project we are using.

renv::install("data.table")

. . .

Then we can load the package,

library(data.table)

data.table objects

data.tables are a new object to replace data.frame.

. . .

Similar to tibbles, they have nicer default printing

data.frame(
  x = rnorm(200),
  y = rnorm(200)
)
              x            y
1    0.32043156  0.255292484
2   -1.52156818  0.780172869
3    0.43904608 -2.289433190
4   -0.47216306 -0.934598397
5    1.28320089  0.546684013
6    0.57003809 -0.504787402
7    1.94341023 -0.856221565
8    1.15111036 -0.861376094
9    0.07349453 -1.714759807
10   0.90727820 -1.200123841
11  -0.02418606  0.274953010
12   0.01644661 -0.972020374
13   0.48830663 -1.153462981
14   0.74558022  0.180113591
15   0.89357158  0.595864478
16   1.00197169 -0.033486449
17  -0.74418794  1.287716070
18   0.31925527  0.167462080
19  -0.02374163 -1.446287271
20   0.52041651 -0.642466157
21  -0.75527718 -1.145122196
22   1.09576519  0.133047830
23   0.67527062 -0.074318948
24  -0.14879240  0.603668696
25   0.25864742  0.169725595
26   0.51900653  1.401554569
27  -1.44287935  0.270066647
28   0.58709278 -0.715203191
29   0.13398771 -1.120194780
30   0.19247806  0.134025204
31  -0.41755746  0.519240400
32   3.10831047 -1.001759194
33   0.20698485 -0.617488666
34  -1.31289165  1.270845033
35  -1.93444660 -0.444588449
36   0.17445051 -0.719792120
37   1.13481275 -0.268528563
38   0.72639175 -1.867091867
39   1.53655654  0.113343080
40   1.61039413  0.208378008
41   0.06639953  1.356375890
42   1.03420331  0.807229218
43   1.18910007  0.974370871
44   1.63308257  0.545667561
45  -0.25922699  1.505171424
46  -3.02996930  0.679779418
47  -0.99946826 -0.968191106
48  -1.46360515  0.085206322
49  -0.49978479 -0.442087739
50   0.95273736 -0.181899439
51  -0.30697742  0.489584687
52  -0.78924959 -0.187788104
53   0.69805805  0.765143520
54  -1.15901427 -0.565375828
55  -0.31897956 -0.740002592
56   0.22418744  1.224100696
57  -0.55604853  1.693238194
58  -0.04937874 -0.544029110
59   0.36596818  0.744066799
60  -0.91637705  0.408629168
61  -1.48921584  0.605874043
62  -1.11475813  0.316318864
63   0.81703789  0.646829669
64   1.32409425  0.729301939
65  -1.35527071 -0.024242493
66  -0.94590106 -0.345005007
67   0.49221902 -0.514970101
68  -0.79174755 -0.425284602
69  -1.36381233  0.674711342
70  -0.20665742  0.077557183
71   1.22243972 -0.296124744
72   1.09508760 -0.093748790
73   0.38238055  1.781639630
74  -0.22285146  0.992609479
75  -1.35818369 -0.024005653
76  -1.18089790  0.881957340
77   2.08851670  0.099149260
78  -0.41609538  0.010346044
79  -0.61208034  1.684050440
80   0.15706091 -0.937919698
81   0.74656601  0.719188824
82   0.28116569  0.386819700
83   0.05015600 -0.604327634
84  -2.22987390 -0.227338114
85   0.73350017 -1.384444667
86  -2.08337171  0.833925136
87   0.85599783  1.967301148
88   0.10921493 -1.295980382
89   0.10757758  1.117635835
90  -0.65842760  0.559917885
91   1.09782937  0.037720302
92   0.62406038  0.088822499
93   0.90593044 -0.139525437
94   1.83003542 -0.624084017
95   0.32911513 -0.576363062
96  -0.24675501 -1.745966392
97  -0.23612645  0.413709539
98   1.13515423 -0.755389959
99  -0.42067878  1.564925091
100  0.61928263  1.498008224
101 -1.46530885  1.495112804
102  0.08179590 -1.172137837
103 -0.43079213  0.729141824
104 -2.70732860  1.909422809
105 -0.07566363  0.428489566
106 -1.04536505 -0.334600774
107 -0.30211222  0.254321454
108  0.72270657  2.187983453
109  0.61865421  0.016845715
110 -0.61857051 -0.544445134
111  1.34592201  0.880067581
112  0.28685358 -1.823377896
113 -1.67583874  0.098851798
114  0.17749971 -1.751483507
115 -1.00863473  0.814252284
116  0.88887653  1.328688073
117  0.26386249  1.851143915
118  0.51129711  1.006128458
119 -2.50349066 -1.922437032
120 -0.48358079  0.003556990
121  3.09321135 -1.622522483
122 -1.34191653  0.824196461
123  0.58832407 -1.601493670
124 -0.19528333 -0.344344054
125 -1.39163330  1.526550889
126 -1.23018851 -0.186076691
127 -2.12083974  1.183320508
128  0.67860159 -1.290013969
129 -0.38560187 -0.644706402
130 -1.57409618  0.861269237
131  1.15679985  1.396595495
132  1.53500346  0.961147898
133  2.20402969  2.006467175
134  0.14174903 -0.409713811
135  0.90564795 -0.064135908
136  0.03977575  0.383090530
137 -0.09084079  1.678219779
138  0.87318458 -0.017240497
139 -0.80322980  0.631212252
140  0.39828993 -0.723220261
141  0.16136968 -0.781766695
142  0.64955312  0.458474796
143  0.41745918 -1.872256958
144  1.75946479 -1.509325067
145 -0.58695446 -1.119325704
146 -0.44877868  0.312101857
147 -0.15159862  0.674558291
148  1.77932594 -0.242251820
149  0.94139394 -0.931496771
150 -1.20808923  0.230492068
151 -1.03437094  1.802138894
152  0.22999593 -1.698665398
153 -0.40819777 -0.420539945
154 -0.18608621 -0.965874452
155  0.03660006 -0.528939636
156 -0.22005470 -1.127982790
157 -0.01765128 -0.102772000
158 -0.61780392  0.390727671
159 -0.64951037  1.183026732
160  0.54238296  1.295157955
161  0.79281110 -2.007003896
162  0.08755999  0.957376677
163 -1.43468894 -0.393436570
164 -1.14073345 -0.400682489
165 -0.66871910 -1.012278734
166  0.89460639 -0.413898606
167 -1.16682202 -0.877006514
168  0.75553614  1.903142444
169  1.33910779  1.508373725
170 -1.40488824  0.041895511
171 -0.76628668  1.016769504
172  1.20708480  0.103263702
173 -0.77278257 -0.413291878
174  0.86003059 -0.015786051
175  0.43449037  1.614497312
176 -0.77228654  0.567409911
177  0.71516670 -0.649307206
178  0.22850834 -0.587293187
179 -0.33895897 -1.691036146
180  0.02866362 -1.287643257
181 -0.84981182  1.552754555
182 -2.19818586  1.787115442
183  0.84234870 -0.779222442
184 -0.78278462  0.123003064
185  1.22907822 -1.023748030
186 -0.78079368  0.224536438
187 -1.44322337  0.187418961
188  1.40568901 -0.010079845
189  0.24230999 -0.006904995
190  1.46567854  0.026041782
191 -0.14865742  0.297176559
192  0.84164598  0.957362048
193  0.97925553  1.544227491
194  2.50872712 -0.951613682
195 -0.00588793 -0.306146125
196  0.73207565 -1.510935990
197  0.45234444  0.155306863
198 -0.56784000  2.060873615
199  0.74717576 -0.435403313
200 -1.53414795  0.404397048
data.table(
  x = rnorm(200),
  y = rnorm(200)
)
              x           y
          <num>       <num>
  1: -1.0112152 -0.50032928
  2: -1.1969889  0.32248802
  3:  0.4072197  1.03141524
  4: -0.3486641  0.27314338
  5:  1.0354548  0.63720437
 ---                       
196:  0.5688225  0.92180051
197:  0.2324362 -2.07045111
198:  0.6067250  1.23063655
199:  0.6402398  0.05148768
200:  2.2433042 -0.14723650

data.table square brackets

Remember, from base R, than we can subset vectors, matrices, and data.frames using [i,j].

. . .

This is true for data.table as well.

x <- data.table(a = rnorm(10), b = rnorm(10))
x[1,2]
          b
      <num>
1: 1.356775

data.table square brackets

data.table takes square brackets a huge step further, by allowing

. . .

  • filtering in the ‘i’ place

. . .

  • new variable declaration in the ‘j’ place

. . .

We will look at each of these in turn.

data.table filtering

x <- data.table(a = c(-0.2, 0.1, 0.25, 0, -0.3), b = c(0.5, 0, -0.2, 0.1, 0.3))
x
       a     b
   <num> <num>
1: -0.20   0.5
2:  0.10   0.0
3:  0.25  -0.2
4:  0.00   0.1
5: -0.30   0.3

What if we want all the values where “a” is less than 0?

. . .

x[a < 0, ]
       a     b
   <num> <num>
1:  -0.2   0.5
2:  -0.3   0.3

data.table filtering

x
       a     b
   <num> <num>
1: -0.20   0.5
2:  0.10   0.0
3:  0.25  -0.2
4:  0.00   0.1
5: -0.30   0.3

Instead we could do all the values where “a” is less than “b”.

x[a < b, ]
       a     b
   <num> <num>
1:  -0.2   0.5
2:   0.0   0.1
3:  -0.3   0.3

data.table filtering

x
       a     b
   <num> <num>
1: -0.20   0.5
2:  0.10   0.0
3:  0.25  -0.2
4:  0.00   0.1
5: -0.30   0.3

In fact, we can drop the comma, as data.table will assume that only one item is always the filtering argument.

x[a < b]
       a     b
   <num> <num>
1:  -0.2   0.5
2:   0.0   0.1
3:  -0.3   0.3

tidyverse filtering equivalent

Here is a comparison of the data.table and tidyverse syntaxes side-by-side.

x[a < 0]
       a     b
   <num> <num>
1:  -0.2   0.5
2:  -0.3   0.3

Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':

    between, first, last
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
x |>
  filter(a < 0)
# A tibble: 2 × 2
      a     b
  <dbl> <dbl>
1  -0.2   0.5
2  -0.3   0.3

. . .

Pretty similar, but data.table is shorter.

data.table column subsetting

Remember: data.frame subsetting uses “i” for rows, and “j” for columns. data.table follows this.

. . .

dt <- data.table(a = rnorm(5), b = rnorm(5), c = rnorm(5))
dt[, a]
[1] -0.4476028  0.4165121 -1.3462013  0.7440658 -0.6615837

. . .

For two columns, pass a list:

dt[, list(a, b)]
            a           b
        <num>       <num>
1: -0.4476028 -0.04112529
2:  0.4165121 -0.72898296
3: -1.3462013  0.42793992
4:  0.7440658 -0.32325760
5: -0.6615837 -0.87804230

Or .(a,b) as equivalent:

dt[, .(a, b)]
            a           b
        <num>       <num>
1: -0.4476028 -0.04112529
2:  0.4165121 -0.72898296
3: -1.3462013  0.42793992
4:  0.7440658 -0.32325760
5: -0.6615837 -0.87804230

data.table variable declaration

What if we want to add a new column?

. . .

data.table does this using the column subsetting place, “j”.

. . .

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, newcol := a + b]
dt
            a           b     newcol
        <num>       <num>      <num>
1:  0.1762745  1.05950844  1.2357829
2: -0.2409535  0.03222178 -0.2087317
3:  0.5940959  0.96573903  1.5598350
4:  0.2615928 -0.64545535 -0.3838626
5: -0.1991795  0.58154695  0.3823674

. . .

:= is a data.table function representing “defined”. i.e. newcol defined as ‘a’ plus ‘b’.

Also called the “walrus” operator by some people.

data.table variable declaration

Would this work?

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[newcol := a + b]

. . .

No.

data.table will assume we are using the “i” place for rows.

dt[newcol := a + b]
Error in `value[[3L]]()`:
! Operator := detected in i, the first argument inside DT[...], but is only valid in the second argument, j. Most often, this happens when forgetting the first comma (e.g. DT[newvar := 5] instead of DT[ , new_var := 5]). Please double-check the syntax. Run traceback(), and debugger() to get a line number.

data.table variable editing

We can use the exact same syntax to edit current columns.

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, a := 1:5]
dt
       a          b
   <int>      <num>
1:     1 -1.5621630
2:     2 -0.4067241
3:     3 -0.3234394
4:     4  0.4222994
5:     5  1.1477452

data.table variable editing

Or to remove a column,

dt[, b := NULL]
dt
       a
   <int>
1:     1
2:     2
3:     3
4:     4
5:     5

data.table conditional editing

What if we want to remove values of “b” that are less than 0?

dt <- data.table(a = -2:2, b = -2:2)

. . .

We could conditionally declare values of “b” to be NA.

dt[b < 0, b := NA]
dt
       a     b
   <int> <int>
1:    -2    NA
2:    -1    NA
3:     0     0
4:     1     1
5:     2     2

. . .

In tidyverse, the equivalent is somewhat wordy:

dt |> mutate(b = if_else(b < 0, NA, b))

Summarizing with data.table

Remember in tidyverse we could write:

mtcars |> summarize(avg = mean(mpg))
       avg
1 20.09062

. . .

The equivalent in data.table is to pass a list .() to the “j” place:

dt <- as.data.table(mtcars)
dt[, .(avg = mean(mpg))]
        avg
      <num>
1: 20.09062

. . .

Note that I had to make “mtcars” a data.table first.

Summarizing with data.table

Instead, I could simply make the mean a new column.

dt[, avg := mean(mpg)]
dt
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb      avg
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>    <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4 20.09062
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4 20.09062
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1 20.09062
 4:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1 20.09062
 5:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2 20.09062
 6:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1 20.09062
 7:  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4 20.09062
 8:  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2 20.09062
 9:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2 20.09062
10:  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4 20.09062
11:  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4 20.09062
12:  16.4     8 275.8   180  3.07 4.070 17.40     0     0     3     3 20.09062
13:  17.3     8 275.8   180  3.07 3.730 17.60     0     0     3     3 20.09062
14:  15.2     8 275.8   180  3.07 3.780 18.00     0     0     3     3 20.09062
15:  10.4     8 472.0   205  2.93 5.250 17.98     0     0     3     4 20.09062
16:  10.4     8 460.0   215  3.00 5.424 17.82     0     0     3     4 20.09062
17:  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4 20.09062
18:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1 20.09062
19:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2 20.09062
20:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1 20.09062
21:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1 20.09062
22:  15.5     8 318.0   150  2.76 3.520 16.87     0     0     3     2 20.09062
23:  15.2     8 304.0   150  3.15 3.435 17.30     0     0     3     2 20.09062
24:  13.3     8 350.0   245  3.73 3.840 15.41     0     0     3     4 20.09062
25:  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2 20.09062
26:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1 20.09062
27:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2 20.09062
28:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2 20.09062
29:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4 20.09062
30:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6 20.09062
31:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8 20.09062
32:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2 20.09062
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb      avg
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>    <num>

data.table will spread any single value to the length of the table.

Multiple summaries

We can summarize multiple values or columns by simply adding arguments to .()

dt[, .(avg = mean(mpg), med = median(mpg), n = .N)]
        avg   med     n
      <num> <num> <int>
1: 20.09062  19.2    32

. . .

How do we do this by group?

. . .

We could use the filtering spot, “i”, and append each result

dt[cyl == 6,  .(avg = mean(mpg), med = median(mpg), n = .N)]
        avg   med     n
      <num> <num> <int>
1: 19.74286  19.7     7

. . .

or…

Summarizing by group

We can use the “hidden” third square bracket argument: by=

dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = cyl]
     cyl      avg   med     n
   <num>    <num> <num> <int>
1:     6 19.74286  19.7     7
2:     4 26.66364  26.0    11
3:     8 15.10000  15.2    14

. . .

We can pass multiple columns to by using the short list: .()

dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = .(cyl, gear)]
     cyl  gear    avg   med     n
   <num> <num>  <num> <num> <int>
1:     6     4 19.750 20.10     4
2:     4     4 26.925 25.85     8
3:     6     3 19.750 19.75     2
4:     8     3 15.050 15.20    12
5:     4     3 21.500 21.50     1
6:     4     5 28.200 28.20     2
7:     8     5 15.400 15.40     2
8:     6     5 19.700 19.70     1

data.table argument summary

dt[

i ,

j ,

by = ]

. . .

filtering rows

filtering columns, new columns, summarize

group by columns

. . .

tidyverse equivalent verbs

filter()

select(), mutate(), summarize()

groub_by()

Comparison

mtcars |>
  filter(hp > 100) |>
  group_by(cyl, gear) |>
  summarize(m = mean(mpg))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups`
argument.
# A tibble: 7 × 3
# Groups:   cyl [3]
    cyl  gear     m
  <dbl> <dbl> <dbl>
1     4     4  21.4
2     4     5  30.4
3     6     3  19.8
4     6     4  19.8
5     6     5  19.7
6     8     3  15.0
7     8     5  15.4
dt[hp > 100, .(m = mean(mpg)), by = .(cyl, gear)]
     cyl  gear     m
   <num> <num> <num>
1:     6     4 19.75
2:     6     3 19.75
3:     8     3 15.05
4:     4     5 30.40
5:     8     5 15.40
6:     6     5 19.70
7:     4     4 21.40

. . .

data.table syntax is more concise, but harder to read.

Which should I use?

The main answer:

  • personal preference (which syntax do you prefer?)

. . .

Second answer:

  • data.table is much faster

. . .

  • especially for large datasets

Speed comparison

A simulation exercise by mdharris makes this clear.

“site_count” is a measure of the simulation size (i.e. size of the data.frames). Then some costly, but not absurd calculations are performed on the data.

Why is data.table faster?

  1. optimized for speed

. . .

  1. Modifies “by reference”

. . .

tidyverse functions instead “copy-on-modify”.

Modify by reference vs. copy

Copy-on-modify

  • Creates a copy of the data
  • Modifies the copy
  • Returns the copy

. . .

Modify-in-place (modify “by reference”)

  • References the data
  • Modifies the original data

. . .

Avoids duplicating data, which is costly for your computer’s memory, especially with large datasets.

Modify by reference vs. copy

You may have noticed that I had to add an extra call to print data.tables after modifying them.

dt <- data.table(x = 1:2)
dt[, y := c("a", "b")]
dt
       x      y
   <int> <char>
1:     1      a
2:     2      b

. . .

This is because the line dt[, y := c("a", "b")] doesn’t return anything.

. . .

Because it edited the data in place, not by creating a copy.

Dangers of editing by reference

Modifying by reference can be dangerous if you are not careful.

. . .

dt <- data.table(a = 1:2, b = 3:4)
dt2 <- dt
dt2
       a     b
   <int> <int>
1:     1     3
2:     2     4

. . .

What happens to “dt” if we remove a column from “dt2”?

. . .

dt2[, b := NULL]
dt
       a
   <int>
1:     1
2:     2

. . .

“dt” and “dt2” reference the same underlying data.

Dangers of editing by reference

If you truly want a new copy of a data.table, use

dt2 <- copy(dt)

Even Faster!

Keys

Keys provide some organization to your data.

. . .

This makes it much faster to later:

  • filter

  • group by

  • merge

. . .

data.table lets you add keys in a variety of ways.

data.table(x = c("a", "a", "b"), y = 1:3, key = "x")
dt <- as.data.table(df, key = "x")
setkey(dt, "x")

. . .

You can also pass a vector of keys c("x", "y") to use.

What are keys?

Keys provide an index for the data.

. . .

Think of this like a phone book.

. . .

If you wanted to look my phone number up in the phone book,

  • you would search under “D” for DeHaven

  • you would not search the entire phone book!

. . .

When we tell data.table to use some columns as a key, it presorts and indexes the data on those columns.

Merging with data.table

For data.table all types of merges are handled by one function:

dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(x = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by = "x")
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5

. . .

By default, merge() does an “inner” join (matches in both datasets).

Merging with data.table

A “left” join,

merge(dt, dt2, by = "x", all.x = TRUE)
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      b     2    NA
3:      c     3     5

A “right” join,

merge(dt, dt2, by = "x", all.y = TRUE)
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5
3:      d    NA     6

Or an “outer” join

merge(dt, dt2, by = "x", all = TRUE)
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      b     2    NA
3:      c     3     5
4:      d    NA     6

Non-matching column names

We can also be explicit about which columns to use from “x” and from “y”,

dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(XX = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by.x = "x", by.y = "XX")
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5

. . .

But it’s probably better to clean up your column names before hand.

data.table pivots

A package tidyfast which offers

  • dt_pivot_longer()
  • dt_pivot_wider()

to match the syntax of tidyr but with the speed of data.table.

tidyverse or data.table?

tidyverse or data.table?

In practice, I (and many people) use a mix of both.

. . .

I personally prefer data.tables syntax for filtering, modifying columns, and summarizing.

. . .

If you really want the speed of data.table and the syntax of dplyr

dtplyr

…there is a special package that combines the two!

dtplyr allows you to use all of the tidyverse verbs:

  • mutate()
  • filter()
  • groupby()

. . .

But behind the scenes runs everything using data.table.

A note on pipes

You can use pipes with data.table as well.

. . .

dt <- data.table(x = 1:2)
dt[, y := 3:4]
dt[, z := 5:6]
dt
       x     y     z
   <int> <int> <int>
1:     1     3     5
2:     2     4     6
dt <- data.table(x = 1:2)
dt |>
  _[, y := 3:4] |>
  _[, z := 5:6]
dt
       x     y     z
   <int> <int> <int>
1:     1     3     5
2:     2     4     6

. . .

The underscore character _ is known as a “placeholder”,

and represents the object being piped.

. . .

A bit less useful because the data.table syntax is already very short.

Summary

  • renv
    • Separate libraries
    • snapshot() packages to lockfiles
  • Introduced to data.table
    • Using square bracket positions dt[i, j, by=]
    • Fast and concise
    • Keys
    • merge()
    • melt() and dcast()

Live Coding Example

  • Loading data.table

Coding Exercise

  • Install and load data.table in a temporary workspace
  • Create a data.table with columns a and b of length 10 with random values
  • Add new column c which is the sum of a and b
  • Add a new column d which is the product of a and b when a is greater than 0