R Data Wrangling: `data.table`

Using packages in R for data science

Matthew DeHaven

March 4, 2026

Course Home Page

R environment details

R Library

R by default has one library of packages for all projects.

R Library

You can see where this library lives by calling,

.libPaths()

Notice there are actually two library paths returned!

“User” Library

For any packages you install

“System” Library

For default packages when base R is installed

R Library

Installing packages puts them in User Library. Calling library() loads a package.

The Problems

There are two problems with R’s default package setup.

All projects share the same library

Updating a package for one project updates it for all

It is not easy to record and recreate the exact packages used for a project

These are obviously related, which is why renv solves both.

`renv` Separates Project Libraries

renv::init() makes isolated libraries for each project.

`renv` Cache

In order to avoid duplicate copies and downloads, renv stores a cache of all the packages it installs behind the scenes.

`renv` lockfile

The lockfile keeps tracks of what packages are used in a project.

It does not default to everything in the Project Library.

You may have some packages you use just for development that are not needed for all of your code to run.
- An example of this is the languageserver package

`renv` Record Packages to the Lockfile

In order to record a new package to the lockfile, run

renv::snapshot()

This causes renv to search all of your project’s R files, see what packages are being used, and save the current version to the lockfile.

`renv` Status

To see if your lockfile is up-to-date, you can run

renv::status()

If everything is consistent, you will see this:

`renv` Status

To see if your lockfile is up-to-date, you can run

renv::status()

If everything a package is not tracked, but used, you will see:

`renv` Restore

If you open a project with a renv lockfile and want to recreate the package environment, run

renv::restore()

You will see this in the next assignment.

`renv` Summary

`data.table`

Is another package focused on data science.

It’s goals:

speed

brevity

no dependencies

In contrast to the tidyverse which is focused on readability.

Installing `data.table`

Again, we use renv to install data.table for any project we are using.

renv::install("data.table")

Then we can load the package,

library(data.table)

`data.table` objects

data.tables are a new object to replace data.frame.

Similar to tibbles, they have nicer default printing

data.frame(
  x = rnorm(200),
  y = rnorm(200)
)

               x           y
1    0.391037051  0.70672486
2   -0.797844018  1.27603101
3   -0.359482436  2.33528973
4   -0.237163332  0.19471902
5    0.690523500 -0.42378531
6    0.839320394  0.24693601
7    0.869473313  0.94007589
8    0.834238550 -1.45206334
9    0.704444826  0.31517017
10   1.857178038  0.34484943
11  -0.346629841 -0.20960703
12  -1.348117741 -0.43333636
13  -0.053063625  0.03998982
14  -0.837661905  1.70693872
15   0.235160981  0.29858022
16  -0.040019721  0.82758645
17   0.598917692  0.70845359
18   0.838430045  0.95846162
19  -0.217081609 -1.44102669
20   0.239927008  0.80564436
21  -0.683584208 -1.50964917
22  -0.452412301 -1.15454320
23  -0.546887431  0.03654041
24   1.143485195  0.58753254
25   0.655092187  1.25765603
26  -2.008748761 -0.71647941
27  -1.124809212 -1.46744876
28  -0.797140116  0.07858638
29  -0.902630700  1.19251838
30  -0.331246041  0.67247946
31  -0.498869023 -0.25941124
32  -1.291164502 -1.58580904
33  -0.185443174  1.71772606
34  -1.172406757  1.36257921
35   0.538753382 -0.32156718
36   1.754851154  1.04548331
37   0.571242475 -0.45254506
38   1.183745583  2.35805959
39  -0.071524635  0.72039114
40   2.213903828 -0.34642400
41   0.561433330  1.42152888
42   0.919708149  0.96938829
43   0.308319404 -1.73063926
44  -1.760492315  0.46158476
45  -1.310156433  0.77645704
46   0.769059596 -0.99961363
47   0.796739376 -0.53309785
48   0.018045191  0.71305882
49   0.939513320 -1.20582769
50  -0.764112200 -1.43367324
51  -1.262773849 -0.34772437
52   0.921321685  0.04391214
53  -1.294393423  0.74933623
54  -0.427052355 -0.13573626
55  -0.174637060 -1.42435498
56  -0.890281369  0.06967650
57  -1.214959957 -0.62391607
58  -0.623642438  1.81151848
59  -1.091010228  1.55703434
60  -0.917971935 -0.05189142
61  -1.321473628 -0.58969509
62  -1.873376647 -0.22787171
63  -0.659244271 -0.38643994
64  -0.116121593  1.27464443
65   0.178359454  0.86823984
66   0.622661575  0.32359083
67  -0.455876998 -0.51074616
68   1.232376107 -2.17389704
69   0.221093507 -1.57654844
70   0.317650421  2.99307244
71  -0.305121279  1.24824365
72   0.556446787  1.28195029
73   0.388784702 -0.58731265
74  -0.267789441 -0.32330193
75  -2.021060315 -2.42633102
76  -0.720577375  1.12797879
77  -0.807193919 -0.34797189
78   2.299893710 -1.69520745
79  -1.738654737 -0.78621392
80   0.078680302 -0.91597025
81   1.443552918 -0.12737549
82   0.215272448  0.71294022
83   1.690477443 -1.22851185
84   1.621934221  0.12748025
85  -0.138061040  1.71642865
86  -2.882525491 -0.50187316
87  -0.243444676  0.67322895
88  -2.195537979  1.90042596
89  -0.333587368  0.27050373
90   0.481517047 -0.99939382
91  -0.248646870  1.25953137
92  -0.067594891 -0.47346150
93   0.020474914  0.21328221
94   0.280744742  1.22446837
95   0.410260199  0.32041896
96   1.736610774 -0.67747606
97   0.949427209  1.85425533
98   1.054886829 -0.59644089
99  -0.485546005 -0.55949133
100 -0.066194377 -0.13316839
101  0.440781190 -2.31940301
102  1.771135378  1.42397182
103 -0.658486220 -0.53762804
104 -0.067216106 -0.37809847
105  0.868385292 -1.15369975
106 -0.167888490  1.13914743
107 -1.375105702  0.22484152
108 -0.805698093 -0.94410990
109  0.921195168 -0.60047808
110  0.638187056 -1.17519173
111 -1.322092027  0.61497970
112 -0.432022064  0.63432908
113  1.777501765  0.41680344
114  0.613973826 -0.88787392
115  0.393691140  0.32113845
116  0.761477884 -0.68333482
117 -0.503996768 -0.12435248
118 -0.381777068 -0.51950232
119 -3.570718958 -0.27007480
120  0.530932302 -1.63695068
121  0.938263515 -0.81525206
122  1.082399860 -0.40777629
123  0.840228605 -0.49733226
124 -0.660535840  0.45173782
125 -1.248263882 -0.36138639
126  1.730976261  0.13739870
127  0.483036157 -1.43151615
128  2.770251935 -0.81930598
129 -0.304846767 -0.44239061
130  0.989892767 -1.38892317
131 -1.282289285 -2.74581696
132 -0.071703881  0.89594907
133 -0.778305059  0.34888951
134 -0.711945238  0.22716615
135  0.477687178 -0.95322393
136  0.472164586  0.48337998
137  1.128484233 -0.23044453
138  1.463336419 -2.18832832
139 -1.726400828 -1.51130749
140  1.354573268  0.22307906
141 -1.563540941 -0.95865902
142  0.784883639 -0.26277337
143  0.447038299 -1.74509347
144  1.649248379  1.40564849
145 -0.024732417  0.31576528
146  0.374618014 -2.00527330
147  0.520441272  1.04045184
148  0.826118850 -0.11604969
149 -1.543415491  0.46384648
150 -0.357275864  1.23515919
151  0.579270150 -1.46637172
152  0.645195093  1.21648840
153 -0.957784807  0.31853750
154  0.017303839  0.72973881
155  0.881239459  0.90288465
156 -0.500219374 -0.19918925
157 -0.624313537  0.83157341
158  0.818231680  0.31046142
159  0.106953803  0.75397339
160 -0.576022561 -0.40330651
161  0.379822421 -2.10829582
162 -0.530939451  0.57972020
163  0.303890483 -0.31051805
164  0.313468805  1.04280470
165 -0.003517994 -0.22219718
166  0.630670561  2.36656695
167  0.314355195 -0.07397635
168  0.464017744  0.24959355
169  0.099589434  0.69112747
170 -1.554300553  0.96817679
171  1.350635667 -1.57128908
172  0.075102924 -0.02203584
173  0.202248756  0.58609342
174  0.986742644 -0.27794561
175 -1.243596077 -1.48023371
176  1.106496769  0.71884759
177 -0.757080542  1.10360372
178 -0.854065201  0.74137858
179 -0.745752376 -0.87978273
180  1.303124623  0.59804648
181  0.660346988  0.06912345
182  0.620740677 -0.19366783
183 -0.248433483  0.47915353
184 -0.425737118  0.78810226
185  0.380041465  0.20236721
186  0.278776077  0.33435084
187  0.570234569 -0.65076270
188  0.964952193 -0.23768627
189  1.581634958 -1.43782422
190  0.578149797  1.19945221
191  1.236180775 -0.55117243
192  0.563292243  0.53213623
193  0.901014214  0.49518964
194  0.816465456  0.74667707
195 -0.723993713 -1.12572641
196  1.153471919  1.79943287
197  0.163978362 -0.27611366
198  0.354244275 -0.02716657
199  0.383443888 -0.92011170
200  1.269924411  0.82574740

data.table(
  x = rnorm(200),
  y = rnorm(200)
)

               x           y
           <num>       <num>
  1: -0.84011814 -1.40068384
  2:  0.03326073 -0.06354209
  3:  1.27120495 -0.77549786
  4:  0.48363083 -0.53904015
  5: -0.09653324 -0.25541796
 ---                        
196: -2.37699653 -0.60439492
197:  0.78147175 -1.62531533
198: -0.34816658  1.14836532
199:  0.31235246 -0.36572910
200:  0.54380139  0.23651600

`data.table` square brackets

Remember, from base R, than we can subset vectors, matrices, and data.frames using [i,j].

This is true for data.table as well.

x <- data.table(a = rnorm(10), b = rnorm(10))
x[1,2]

           b
       <num>
1: 0.2726407

`data.table` square brackets

data.table takes square brackets a huge step further, by allowing

filtering in the ‘i’ place

new variable declaration in the ‘j’ place

We will look at each of these in turn.

`data.table` filtering

x <- data.table(a = c(-0.2, 0.1, 0.25, 0, -0.3), b = c(0.5, 0, -0.2, 0.1, 0.3))
x

       a     b
   <num> <num>
1: -0.20   0.5
2:  0.10   0.0
3:  0.25  -0.2
4:  0.00   0.1
5: -0.30   0.3

What if we want all the values where “a” is less than 0?

x[a < 0, ]

       a     b
   <num> <num>
1:  -0.2   0.5
2:  -0.3   0.3

`data.table` filtering

       a     b
   <num> <num>
1: -0.20   0.5
2:  0.10   0.0
3:  0.25  -0.2
4:  0.00   0.1
5: -0.30   0.3

Instead we could do all the values where “a” is less than “b”.

x[a < b, ]

       a     b
   <num> <num>
1:  -0.2   0.5
2:   0.0   0.1
3:  -0.3   0.3

`data.table` filtering

       a     b
   <num> <num>
1: -0.20   0.5
2:  0.10   0.0
3:  0.25  -0.2
4:  0.00   0.1
5: -0.30   0.3

In fact, we can drop the comma, as data.table will assume that only one item is always the filtering argument.

x[a < b]

       a     b
   <num> <num>
1:  -0.2   0.5
2:   0.0   0.1
3:  -0.3   0.3

`tidyverse` filtering equivalent

Here is a comparison of the data.table and tidyverse syntaxes side-by-side.

x[a < 0]

       a     b
   <num> <num>
1:  -0.2   0.5
2:  -0.3   0.3

x |>
  filter(a < 0)

# A tibble: 2 × 2
      a     b
  <dbl> <dbl>
1  -0.2   0.5
2  -0.3   0.3

Pretty similar, but data.table is shorter.

`data.table` column subsetting

Remember: data.frame subsetting uses “i” for rows, and “j” for columns. data.table follows this.

dt <- data.table(a = rnorm(5), b = rnorm(5), c = rnorm(5))
dt[, a]

[1]  0.3426756  0.3168033 -0.3181542 -1.4532720  0.1263276

For two columns, pass a list:

dt[, list(a, b)]

            a          b
        <num>      <num>
1:  0.3426756 -0.1775514
2:  0.3168033  1.0238900
3: -0.3181542 -1.0213965
4: -1.4532720  1.4607505
5:  0.1263276 -1.6070249

Or .(a,b) as equivalent:

dt[, .(a, b)]

            a          b
        <num>      <num>
1:  0.3426756 -0.1775514
2:  0.3168033  1.0238900
3: -0.3181542 -1.0213965
4: -1.4532720  1.4607505
5:  0.1263276 -1.6070249

`data.table` variable declaration

What if we want to add a new column?

data.table does this using the column subsetting place, “j”.

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, newcol := a + b]
dt

             a          b     newcol
         <num>      <num>      <num>
1: -0.68924040 -1.7041279 -2.3933683
2: -0.43337993  1.3902757  0.9568958
3: -0.09609694 -0.5080765 -0.6041734
4:  0.70287679  0.3760391  1.0789159
5:  1.03602400 -1.8491853 -0.8131613

:= is a data.table function representing “defined”.

Also called the “walrus” operator by some people.

`data.table` variable declaration

Would this work?

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[newcol := a + b]

No.

data.table will assume we are using the “i” place for rows.

dt[newcol := a + b]

Error in `value[[3L]]()`:
! Operator := detected in i, the first argument inside DT[...], but is only valid in the second argument, j. Most often, this happens when forgetting the first comma (e.g. DT[newvar := 5] instead of DT[ , new_var := 5]). Please double-check the syntax. Run traceback(), and debugger() to get a line number.

`data.table` variable editing

We can use the exact same syntax to edit current columns.

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, a := 1:5]
dt

       a          b
   <int>      <num>
1:     1 -1.1186931
2:     2  0.8109287
3:     3  0.6727535
4:     4 -0.4505330
5:     5 -0.3421106

`data.table` variable editing

Or to remove a column,

dt[, b := NULL]
dt

`data.table` conditional editing

What if we want to remove values of “b” that are less than 0?

dt <- data.table(a = -2:2, b = -2:2)

We could conditionally declare values of “b” to be NA.

dt[b < 0, b := NA]
dt

       a     b
   <int> <int>
1:    -2    NA
2:    -1    NA
3:     0     0
4:     1     1
5:     2     2

In tidyverse, the equivalent is somewhat wordy:

dt |> mutate(b = if_else(b < 0, NA, b))

Summarizing with `data.table`

Remember in tidyverse we could write:

mtcars |> summarize(avg = mean(mpg))

       avg
1 20.09062

The equivalent in data.table is to pass a list .() to the “j” place:

dt <- as.data.table(mtcars)
dt[, .(avg = mean(mpg))]

        avg
      <num>
1: 20.09062

Note that I had to make “mtcars” a data.table first.

Summarizing with `data.table`

Instead, I could simply make the mean a new column.

dt[, avg := mean(mpg)]
dt

      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb      avg
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>    <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4 20.09062
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4 20.09062
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1 20.09062
 4:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1 20.09062
 5:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2 20.09062
 6:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1 20.09062
 7:  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4 20.09062
 8:  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2 20.09062
 9:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2 20.09062
10:  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4 20.09062
11:  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4 20.09062
12:  16.4     8 275.8   180  3.07 4.070 17.40     0     0     3     3 20.09062
13:  17.3     8 275.8   180  3.07 3.730 17.60     0     0     3     3 20.09062
14:  15.2     8 275.8   180  3.07 3.780 18.00     0     0     3     3 20.09062
15:  10.4     8 472.0   205  2.93 5.250 17.98     0     0     3     4 20.09062
16:  10.4     8 460.0   215  3.00 5.424 17.82     0     0     3     4 20.09062
17:  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4 20.09062
18:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1 20.09062
19:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2 20.09062
20:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1 20.09062
21:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1 20.09062
22:  15.5     8 318.0   150  2.76 3.520 16.87     0     0     3     2 20.09062
23:  15.2     8 304.0   150  3.15 3.435 17.30     0     0     3     2 20.09062
24:  13.3     8 350.0   245  3.73 3.840 15.41     0     0     3     4 20.09062
25:  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2 20.09062
26:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1 20.09062
27:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2 20.09062
28:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2 20.09062
29:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4 20.09062
30:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6 20.09062
31:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8 20.09062
32:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2 20.09062
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb      avg
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>    <num>

data.table will spread any single value to the length of the table.

Multiple summaries

We can summarize multiple values or columns by simply adding arguments to .()

dt[, .(avg = mean(mpg), med = median(mpg), n = .N)]

        avg   med     n
      <num> <num> <int>
1: 20.09062  19.2    32

How do we do this by group?

We could use the filtering spot, “i”, and append each result

dt[cyl == 6,  .(avg = mean(mpg), med = median(mpg), n = .N)]

        avg   med     n
      <num> <num> <int>
1: 19.74286  19.7     7

or…

Summarizing by group

We can use the “hidden” third square bracket argument: by=

dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = cyl]

     cyl      avg   med     n
   <num>    <num> <num> <int>
1:     6 19.74286  19.7     7
2:     4 26.66364  26.0    11
3:     8 15.10000  15.2    14

We can pass multiple columns to by using the short list: .()

dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = .(cyl, gear)]

     cyl  gear    avg   med     n
   <num> <num>  <num> <num> <int>
1:     6     4 19.750 20.10     4
2:     4     4 26.925 25.85     8
3:     6     3 19.750 19.75     2
4:     8     3 15.050 15.20    12
5:     4     3 21.500 21.50     1
6:     4     5 28.200 28.20     2
7:     8     5 15.400 15.40     2
8:     6     5 19.700 19.70     1

`data.table` argument summary

dt[

i ,

j ,

by = ]

filtering rows

filtering columns, new columns, summarize

group by columns

tidyverse equivalent verbs

filter()

select(), mutate(), summarize()

groub_by()

Comparison

mtcars |>
  filter(hp > 100) |>
  group_by(cyl, gear) |>
  summarize(m = mean(mpg))

# A tibble: 7 × 3
# Groups:   cyl [3]
    cyl  gear     m
  <dbl> <dbl> <dbl>
1     4     4  21.4
2     4     5  30.4
3     6     3  19.8
4     6     4  19.8
5     6     5  19.7
6     8     3  15.0
7     8     5  15.4

dt[hp > 100, .(m = mean(mpg)), by = .(cyl, gear)]

     cyl  gear     m
   <num> <num> <num>
1:     6     4 19.75
2:     6     3 19.75
3:     8     3 15.05
4:     4     5 30.40
5:     8     5 15.40
6:     6     5 19.70
7:     4     4 21.40

data.table syntax is more concise, but harder to read.

Class Activity

Install and load data.table in a temporary workspace
Create a data.table with columns a and b of length 10 with random values
Add new column c which is the sum of a and b
Add a new column d which is the product of a and b when a is greater than 0

Modify by Reference

Which should I use? `data.table` or `tidyverse`?

The main answer:

personal preference (which syntax do you prefer?)

Second answer:

data.table is much faster

especially for large datasets

Speed comparison

A group that creates DuckDB (a fast database engine) does a regular speed comparison of different data manipulation tools across R, Python, Julia, and other languages:

Database-like ops benchmark

Takeaway:

data.table is much faster than dplyr for data manipulation in R

Why is `data.table` faster?

optimized for speed

Modifies “by reference”

tidyverse functions instead “copy-on-modify”.

Modify by reference vs. copy

Copy-on-modify

Creates a copy of the data
Modifies the copy
Returns the copy

Modify-in-place (modify “by reference”)

References the data
Modifies the original data

Avoids duplicating data, which is costly for your computer’s memory, especially with large datasets.

Modify by reference vs. copy

You may have noticed that I had to add an extra call to print data.tables after modifying them.

dt <- data.table(x = 1:2)
dt[, y := c("a", "b")]
dt

       x      y
   <int> <char>
1:     1      a
2:     2      b

This is because the line dt[, y := c("a", "b")] doesn’t return anything.

Because it edited the data in place, not by creating a copy.

Dangers of editing by reference

Modifying by reference can be dangerous if you are not careful.

dt <- data.table(a = 1:2, b = 3:4)
dt2 <- dt
dt2

       a     b
   <int> <int>
1:     1     3
2:     2     4

What happens to “dt” if we remove a column from “dt2”?

dt2[, b := NULL]
dt

       a
   <int>
1:     1
2:     2

“dt” and “dt2” reference the same underlying data.

Dangers of editing by reference

If you truly want a new copy of a data.table, use

dt2 <- copy(dt)

Converting to a `data.table` by reference

If you have a data.frame and want to convert it to a data.table, you can do this by reference as well.

df <- data.frame(a = 1:2, b = 3:4)
setDT(df)
df

       a     b
   <int> <int>
1:     1     3
2:     2     4

Instead, if you use as.data.table(), this will create a copy of the data.

df <- data.frame(a = 1:2, b = 3:4)
dt <- as.data.table(df)
df

  a b
1 1 3
2 2 4

Even Faster with Keys!

Keys

Keys provide some organization to your data.

This makes it much faster to later:

filter
group by
merge

data.table lets you add keys in a variety of ways.

data.table(x = c("a", "a", "b"), y = 1:3, key = "x")
dt <- as.data.table(df, key = "x")
setkey(dt, "x")

You can also pass a vector of keys c("x", "y") to use.

What are keys?

Keys provide an index for the data.

Think of this like a phone book.

If you wanted to look my phone number up in the phone book,

you would search under “D” for DeHaven
you would not search the entire phone book!

When we tell data.table to use some columns as a key, it presorts and indexes the data on those columns.

Merging with `data.table`

For data.table all types of merges are handled by one function:

dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(x = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by = "x")

Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5

By default, merge() does an “inner” join (matches in both datasets).

Merging with `data.table`

A “left” join,

merge(dt, dt2, by = "x", all.x = TRUE)

Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      b     2    NA
3:      c     3     5

A “right” join,

merge(dt, dt2, by = "x", all.y = TRUE)

Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5
3:      d    NA     6

Or an “outer” join

merge(dt, dt2, by = "x", all = TRUE)

Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      b     2    NA
3:      c     3     5
4:      d    NA     6

Non-matching column names

We can also be explicit about which columns to use from “x” and from “y”,

dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(XX = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by.x = "x", by.y = "XX")

Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5

But it’s probably better to clean up your column names before hand.

`data.table` pivots

A package tidyfast which offers

dt_pivot_longer()
dt_pivot_wider()

to match the syntax of tidyr but with the speed of data.table.

`data.table` read / write functions

data.table has its own functions for reading and writing data, which are much faster than base R’s read.csv() and write.csv(), or tidyverse’s readr::read_csv() and readr::write_csv().

dt <- fread("path/to/input-file.csv")

fwrite(dt, "path/to/output-file.csv")

`tidyverse` or `data.table`?

In practice, I (and many people) use a mix of both.

I personally prefer data.tables syntax for filtering, modifying columns, and summarizing.

If you really want the speed of data.table and the syntax of dplyr…

`dtplyr`

…there is a special package that combines the two!

dtplyr allows you to use all of the tidyverse verbs:

mutate()
filter()
groupby()

But behind the scenes runs everything using data.table.

A note on pipes

You can use pipes with data.table as well.

dt <- data.table(x = 1:2)
dt[, y := 3:4]
dt[, z := 5:6]
dt

       x     y     z
   <int> <int> <int>
1:     1     3     5
2:     2     4     6

dt <- data.table(x = 1:2)
dt |>
  _[, y := 3:4] |>
  _[, z := 5:6]
dt

       x     y     z
   <int> <int> <int>
1:     1     3     5
2:     2     4     6

The underscore character _ is known as a “placeholder”,

and represents the object being piped.

Placeholder for pipes

The placeholder _ is normally used to assign the result of a pipe to a specific named argument in a function.

x = c(1, 2, 3)
x |> append(4)

[1] 1 2 3 4

the append() function has two arguments: x and values.

We can use the placeholder to specify that the piped object should be passed to values instead of x.

x |> append(x = 4, values = _)

[1] 4 1 2 3

Live Coding Example

Speed Comparison:

library(data.table)
library(dplyr)
library(tictoc)

nrows <- 1e7 # 10 million rows

dt <- data.table(
  id    = sample(letters, nrows, replace = TRUE),
  group = sample(1:100, nrows, replace = TRUE),
  value = rnorm(nrows)
)

tb <- tibble(
  id    = dt$id,
  group = dt$group,
  value = dt$value
)

tic("data.table")
dt[group == 1, value := 0]
toc()

tic("dplyr")
tb <- tb |> mutate(value = if_else(group == 1, 0, value))
toc()

dt2 <- copy(dt)
setkey(dt2, "group")

tic("data.table grouped")
dt2[group == 1, value := 0]
toc()

R Data Wrangling: data.table

R environment details

R Library

R Library

R Library

The Problems

renv Separates Project Libraries

renv Cache

renv lockfile

renv Record Packages to the Lockfile

renv Status

renv Status

renv Restore

renv Summary

data.table

data.table

Installing data.table

data.table objects

data.table square brackets

data.table square brackets

data.table filtering

data.table filtering

data.table filtering

tidyverse filtering equivalent

data.table column subsetting

data.table variable declaration

data.table variable declaration

data.table variable editing

data.table variable editing

data.table conditional editing

Summarizing with data.table

Summarizing with data.table

Multiple summaries

Summarizing by group

data.table argument summary

Comparison

Class Activity

Class Activity

Modify by Reference

Which should I use? data.table or tidyverse?

Speed comparison

Why is data.table faster?

Modify by reference vs. copy

Modify by reference vs. copy

Dangers of editing by reference

Dangers of editing by reference

Converting to a data.table by reference

Even Faster with Keys!

Keys

What are keys?

Merging with data.table

Merging with data.table

Non-matching column names

data.table pivots

data.table read / write functions

tidyverse or data.table?

tidyverse or data.table?

dtplyr

A note on pipes

Placeholder for pipes

Live Coding Example

Live Coding Example

R Data Wrangling: `data.table`

`renv` Separates Project Libraries

`renv` Cache

`renv` lockfile

`renv` Record Packages to the Lockfile

`renv` Status

`renv` Status

`renv` Restore

`renv` Summary

`data.table`

`data.table`

Installing `data.table`

`data.table` objects

`data.table` square brackets

`data.table` square brackets

`data.table` filtering

`data.table` filtering

`data.table` filtering

`tidyverse` filtering equivalent

`data.table` column subsetting

`data.table` variable declaration

`data.table` variable declaration

`data.table` variable editing

`data.table` variable editing

`data.table` conditional editing

Summarizing with `data.table`

Summarizing with `data.table`

`data.table` argument summary

Which should I use? `data.table` or `tidyverse`?

Why is `data.table` faster?

Converting to a `data.table` by reference

Merging with `data.table`

Merging with `data.table`

`data.table` pivots

`data.table` read / write functions

`tidyverse` or `data.table`?

`tidyverse` or `data.table`?

`dtplyr`