R Data Wrangling:
data.table

Matthew DeHaven

Course Home Page

2024-01-01

Lecture Summary

  • Talking more about renv
  • data.table

More renv

R Library

R by default has one library of packages for all projects.

[Diagram: a single User Library shared by Project 1, Project 2, and Project 3]

R Library

You can see where this library lives by calling,

.libPaths()

Notice there are actually two library paths returned!

“User” Library

  • For any packages you install

“System” Library

  • For default packages when base R is installed
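
For example, calling .libPaths() might return something like the output below; the exact locations are hypothetical and depend on your operating system and R version.

.libPaths()
[1] "/home/user/R/x86_64-pc-linux-gnu-library/4.3" "/usr/lib/R/library"

The first entry is the user library; the second is the system library.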

R Library

Installing packages puts them into the User Library. Calling library() loads a package.

[Diagram: install.packages() pulls packages from CRAN into the User Library, and library() loads them into each project]

The Problems

There are two problems with R’s default package setup.

  1. All projects share the same library
  • Updating a package for one project updates it for all
  2. It is not easy to record and recreate the exact packages used for a project

These are obviously related, which is why renv solves both.

renv Separates Project Libraries

renv::init() makes isolated libraries for each project.

[Diagram: each project gets its own library: Library 1 for Project 1, Library 2 for Project 2, Library 3 for Project 3]

renv Cache

In order to avoid duplicate copies and downloads, renv stores a cache of all the packages it installs behind the scenes.

[Diagram: the renv cache sits behind each project library (Library 1, 2, 3), which in turn serve Project 1, 2, and 3]

This solves projects changing each other’s packages, but does not yet give us any way to track and restore packages.

renv lockfile

The lockfile keeps track of which packages are used in a project.

By default, it does not include everything in the Project Library.

  • You may have some packages you use just for development that are not needed for all of your code to run.
  • An example of this is the languageserver package

renv Record Packages to the Lockfile

In order to record a new package to the lockfile, run

renv::snapshot()

This causes renv to search all of your project’s R files, see which packages are being used, and save their current versions to the lockfile.

renv Status

To see if your lockfile is up-to-date, you can run

renv::status()

renv will also tell you if everything is up-to-date when you launch a new R terminal.

renv Restore

If you open a project with a renv lockfile and want to recreate the package environment, run

renv::restore()

You will see this in the next assignment.

renv Summary
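
A minimal sketch of the typical renv workflow, pulling the pieces above together (the package installed here is just an example):

renv::init()                   # create an isolated project library and a lockfile
renv::install("data.table")    # install a package into the project library
renv::snapshot()               # record the packages your code uses to renv.lock
renv::status()                 # check whether the lockfile is up to date
renv::restore()                # recreate the environment from renv.lock on another machine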

data.table

data.table

data.table is another package focused on data science.

Its goals:

  • speed
  • brevity
  • no dependencies

This is in contrast to the tidyverse, which prioritizes readability.

Installing data.table

Again, we use renv to install data.table in whichever project we are working on.

renv::install("data.table")

Then we can load the package,

library(data.table)

data.table objects

A data.table is a new class of object that replaces data.frame.

Similar to tibbles, data.tables have nicer default printing:

data.frame(
  x = rnorm(200),
  y = rnorm(200)
)
              x           y
1    0.83377917  0.42855072
2   -0.36077184  0.53474323
3    2.11178136 -0.07476166
4    0.79099643  0.15891206
5   -0.04275800  0.07132163
6    1.23273409 -0.76437363
7   -0.35672042  2.49836674
8    1.54626789  0.51806839
9    0.47547460  0.26209775
10   0.61888324  1.28221455
11  -1.56427897  1.04788785
12  -1.33957690  0.63640091
13  -0.31086400 -0.47595092
14   0.29463141  0.33838549
15  -0.79972988  0.51030725
16  -1.94221358  1.36570376
17   1.41690854  1.90730950
18   0.73980680 -0.64199706
19  -0.70874365 -0.43554910
20  -0.64913039 -1.61062741
21  -0.67103841  0.25141990
22   1.97312733  1.26008509
23   0.97448745 -0.34154539
24  -0.97855331 -0.35580154
25  -2.44145584  1.50798907
26   1.56142993 -0.65696576
27  -0.98832285  1.27780709
28   0.32080586  0.29752028
29  -0.39849697 -0.07838319
30   0.30725461 -1.27512624
31  -0.93404373  0.84386316
32   0.30909515 -1.46539157
33  -0.92918608  1.32002813
34  -0.73848119  1.54514920
35   0.02003996 -1.15617196
36  -0.84559194 -0.62265194
37   1.37588391  0.67335249
38   0.07091679 -1.65400615
39  -1.13626114 -0.09475975
40  -0.27398837  0.73372151
41  -0.16062430  0.45327133
42   0.74115272 -0.07673070
43   0.73017583  1.14002521
44  -0.46366947  0.85328903
45  -0.25627678  0.16746347
46   0.26692027 -1.38073257
47   0.85789370  0.66542867
48  -0.35307531 -0.79459430
49   1.16863021 -0.68762430
50   1.48238157 -1.36783859
51  -1.38829176 -1.56436398
52   0.45506797 -0.65327492
53  -0.95961886  2.16865159
54   0.20679806  0.89304371
55  -0.14924144  0.10380360
56   0.34617227  1.55884582
57  -0.44707571 -1.01702409
58  -0.07038732  1.24338133
59   0.01508632 -1.60860290
60   0.14377797  0.01422270
61  -0.47592212 -2.09510429
62  -0.18489840  0.46341697
63  -2.34252819 -0.43065278
64   0.64184081  0.15917291
65   1.06889406  0.01566283
66   0.11274229  1.38075918
67   0.92797834  0.72433469
68  -0.29506723  0.27852908
69   0.86763874  0.26116636
70   0.86340114  0.77255754
71   0.20557303  1.18846120
72   1.56808601 -2.60856880
73   1.44461676  0.54722415
74  -0.48743038  0.51118354
75  -0.44116934  0.89458143
76  -0.58507781 -1.33086406
77  -0.41085083  1.24373750
78   0.35281164  0.14250036
79   0.43195812 -1.84680797
80   0.40456099  1.13918645
81  -0.92370616  0.21089086
82   0.51476385 -0.99094096
83   0.57675672  1.28927599
84   1.37242625 -0.14106231
85  -1.25967764 -1.01421011
86  -0.17724804 -0.40073987
87   1.48900094 -1.08023394
88   0.31238965 -1.36900298
89   0.33618635 -0.42010586
90  -1.79876349 -1.20585154
91  -0.24026881 -0.04706084
92  -0.14865484  1.13828635
93   0.32793059  0.04332658
94   0.60392365  0.45964180
95  -1.17712154  1.61579177
96  -1.93135084 -1.24312182
97  -1.23281467  0.86054184
98   1.48761383 -0.96615897
99   0.41736954  0.08814179
100 -0.47268737  0.59772186
101  0.67725398  0.10717836
102  0.43898993  0.37018519
103  0.89710000 -2.21956380
104 -0.79911090 -1.35121018
105 -0.85600587  1.76496907
106 -0.19818464 -0.44727041
107  0.87408305 -0.20274831
108  0.84516783 -2.11227752
109  1.28249819  1.49713065
110 -2.01576831  0.56953182
111 -0.36324598  0.30786463
112  0.58714744  1.31490079
113 -0.06159335 -0.57149550
114  0.73098682  0.71580401
115 -0.23360319 -0.21085447
116 -1.64182403 -0.17340268
117 -0.74162728  1.13216050
118 -0.85269725 -0.56400404
119 -0.89855593 -1.15023030
120  1.82738317  0.91358654
121  0.61410585 -0.74057573
122  0.13390337 -1.88425565
123  1.04851015 -0.74130028
124 -0.32335482 -1.63741397
125 -0.65414831  1.38164950
126 -1.19751763 -0.16461730
127  2.04570872 -0.87959895
128 -0.53516445  1.26963165
129 -0.18040332  0.88908948
130 -0.46408498  0.92391859
131 -0.80902063  0.97409543
132  0.85149007 -0.13961062
133 -0.09836902  0.37954484
134 -0.47410787 -2.06438342
135 -0.23860408  1.52237776
136 -0.40153081  1.50593748
137  1.55072645  1.71443410
138 -0.93672793 -0.47321349
139  0.22803765 -1.47116753
140  0.26018802 -0.04272721
141 -0.38394998  1.36349392
142 -0.41859760  0.42184092
143 -0.51270321  1.66551189
144  1.39559290  0.70203129
145 -0.61752100  2.91003234
146  0.70770715  1.59477605
147 -1.00840837 -0.26743837
148  0.92721208 -0.52490848
149 -0.21810987  0.18132720
150 -0.47543538 -1.09812239
151  0.54252766 -0.24747720
152  0.36133737 -0.55865007
153 -0.43866217  1.25028299
154  1.09169396 -1.68635282
155  0.02692408 -1.39889031
156  0.04867265 -0.29793267
157 -0.13255793  0.44633739
158  0.54726402 -0.53147758
159  0.49347359 -1.49471166
160  0.27638636 -0.21632815
161 -3.24236647  0.98446427
162 -0.19978516 -1.22114773
163  1.04976539 -1.58358745
164  0.70393545  0.43180716
165  0.39408693  0.32954214
166 -0.92018846 -0.68299375
167 -0.46088598 -2.47876262
168 -0.86553374  0.63809522
169  0.37565439  0.11511289
170  1.17928135 -0.62563798
171 -0.70017186  1.12360419
172 -0.42803127  0.55387936
173 -3.25087344 -0.33958271
174  0.27046186  0.21100700
175 -0.25663660  0.60874978
176 -0.90820997  0.05266776
177 -0.44563149 -1.11855346
178  1.30859122 -0.27223246
179 -0.44813126 -2.10862740
180 -0.71505106 -0.44094351
181  0.35639409 -0.22461498
182 -0.17278591  0.03807352
183  0.01758829 -1.28705513
184 -1.06117228  0.05136929
185  0.44297244 -0.25948172
186 -1.39253467  0.07712081
187  0.05736367 -1.33846674
188 -0.13134255  0.18815446
189  1.45164219 -0.71646591
190  1.56213431  1.19188564
191  1.74919383  1.81041742
192 -0.90359507 -0.32468805
193 -1.22983233  1.05387887
194  1.22986608 -0.10111660
195  0.14475900  1.64803595
196 -0.34735696 -1.27477799
197  0.93444931  0.11091217
198  1.49584036  0.21367063
199 -0.58907131  0.23027112
200  1.02456852 -0.25140097
data.table(
  x = rnorm(200),
  y = rnorm(200)
)
              x          y
          <num>      <num>
  1:  0.4144923  1.5800321
  2: -0.0475305  1.3920581
  3:  0.0200618 -0.7373673
  4:  0.8364404  2.4736258
  5: -0.7871103  0.7911376
 ---                      
196:  0.3394136 -0.3338586
197: -0.2314435  1.2737760
198: -1.8162878  0.2241776
199:  0.3350380  0.3556842
200:  1.1777340  1.0143951

data.table square brackets

Remember from base R that we can subset vectors, matrices, and data.frames using [i,j].

This is true for data.table as well.

x <- data.table(a = rnorm(10), b = rnorm(10))
x[1,2]
           b
       <num>
1: -0.199981

data.table square brackets

data.table takes square brackets a huge step further, by allowing

  • filtering in the ‘i’ place
  • new variable declaration in the ‘j’ place

We will look at each of these in turn.

data.table filtering

x <- data.table(a = rnorm(10), b = rnorm(10))
x
             a           b
         <num>       <num>
 1: -0.1435757 -0.83542558
 2: -0.7093246  0.73930375
 3:  0.5523134  0.01372678
 4:  1.0869389 -0.56078594
 5: -1.0121181  0.44603375
 6:  0.3343524 -1.63500508
 7: -0.1386005  0.85560953
 8:  1.5839076  0.46814137
 9: -0.9337572  1.02040010
10:  0.4589008 -1.03079198

What if we want all the values where “a” is less than 0?

x[a < 0, ]
            a          b
        <num>      <num>
1: -0.1435757 -0.8354256
2: -0.7093246  0.7393038
3: -1.0121181  0.4460338
4: -0.1386005  0.8556095
5: -0.9337572  1.0204001

data.table filtering

x
             a           b
         <num>       <num>
 1: -0.1435757 -0.83542558
 2: -0.7093246  0.73930375
 3:  0.5523134  0.01372678
 4:  1.0869389 -0.56078594
 5: -1.0121181  0.44603375
 6:  0.3343524 -1.63500508
 7: -0.1386005  0.85560953
 8:  1.5839076  0.46814137
 9: -0.9337572  1.02040010
10:  0.4589008 -1.03079198

Instead we could do all the values where “a” is less than “b”.

x[a < b, ]
            a         b
        <num>     <num>
1: -0.7093246 0.7393038
2: -1.0121181 0.4460338
3: -0.1386005 0.8556095
4: -0.9337572 1.0204001

data.table filtering

x
             a           b
         <num>       <num>
 1: -0.1435757 -0.83542558
 2: -0.7093246  0.73930375
 3:  0.5523134  0.01372678
 4:  1.0869389 -0.56078594
 5: -1.0121181  0.44603375
 6:  0.3343524 -1.63500508
 7: -0.1386005  0.85560953
 8:  1.5839076  0.46814137
 9: -0.9337572  1.02040010
10:  0.4589008 -1.03079198

In fact, we can drop the comma: when only one argument is given, data.table assumes it is the filtering (“i”) argument.

x[a < b]
            a         b
        <num>     <num>
1: -0.7093246 0.7393038
2: -1.0121181 0.4460338
3: -0.1386005 0.8556095
4: -0.9337572 1.0204001

tidyverse filtering equivalent

Here is a comparison of the data.table and tidyverse syntaxes side-by-side.

x[a < 0]
            a          b
        <num>      <num>
1: -0.1435757 -0.8354256
2: -0.7093246  0.7393038
3: -1.0121181  0.4460338
4: -0.1386005  0.8556095
5: -0.9337572  1.0204001
x |>
  filter(a < 0)
# A tibble: 5 × 2
       a      b
   <dbl>  <dbl>
1 -0.144 -0.835
2 -0.709  0.739
3 -1.01   0.446
4 -0.139  0.856
5 -0.934  1.02 

Pretty similar, but data.table is shorter.

data.table column subsetting

Remember that normal table subsetting uses “i” for rows, and “j” for columns.

data.table follows this.

dt <- data.table(a = rnorm(5), b = rnorm(5), c = rnorm(5))
dt[, a]
[1]  1.2169765 -1.1971431 -0.1258604  0.9539744  0.1020553

If you want two columns, you have to pass a list.

dt[, list(a, b)]
            a           b
        <num>       <num>
1:  1.2169765  0.16612719
2: -1.1971431  0.79522672
3: -0.1258604 -0.01885368
4:  0.9539744 -1.12673667
5:  0.1020553  0.48392779

Alternatively, data.table accepts .(a, b) as shorthand for list(a, b).

dt[, .(a, b)]

data.table variable declaration

What if we want to add a new column?

data.table does this using the column subsetting place, “j”.

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, newcol := a + b]
dt
             a          b     newcol
         <num>      <num>      <num>
1:  1.22913930  2.2693465  3.4984858
2: -0.21333148 -0.5114615 -0.7247930
3: -0.62143161  0.0292738 -0.5921578
4: -0.85616921  2.4456923  1.5895231
5:  0.02375024  3.0363877  3.0601380

:= is a data.table operator meaning “defined as”,

i.e. newcol is defined as ‘a’ plus ‘b’.

Some people also call it the “walrus” operator.
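
An aside: := can also define several columns at once by pairing a character vector of names with a list of expressions. A minimal sketch (the column names here are just examples):

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, c("sum_ab", "diff_ab") := .(a + b, a - b)]  # two new columns defined in one call
dt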

data.table variable declaration

Would this work?

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[newcol := a + b]

No.

data.table will assume we are using the “i” place for rows.

dt[newcol := a + b]
Error: Operator := detected in i, the first argument inside DT[...], but is only valid in the second argument, j. Most often, this happens when forgetting the first comma (e.g. DT[newvar := 5] instead of DT[ , new_var := 5]). Please double-check the syntax. Run traceback(), and debugger() to get a line number.

data.table variable editing

We can use the exact same syntax to edit current columns.

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, a := 1:5]
dt
       a          b
   <int>      <num>
1:     1  1.1949701
2:     2 -0.1591262
3:     3  1.4724939
4:     4 -1.2206118
5:     5 -1.3712154

Or to remove a column,

dt[, b := NULL]
dt
       a
   <int>
1:     1
2:     2
3:     3
4:     4
5:     5

data.table conditional editing

What if we want to remove values of “b” that are less than 0?

dt <- data.table(a = -2:2, b = -2:2)

We could conditionally declare values of “b” to be NA.

dt[b < 0, b := NA]
dt
       a     b
   <int> <int>
1:    -2    NA
2:    -1    NA
3:     0     0
4:     1     1
5:     2     2

In tidyverse, the equivalent is somewhat wordy:

dt |> mutate(b = if_else(b < 0, NA, b))

Summarizing with data.table

Remember in tidyverse we could write:

mtcars |> summarize(avg = mean(mpg))
       avg
1 20.09062

The equivalent in data.table is to pass a list .() to the “j” place:

dt <- as.data.table(mtcars)
dt[, .(avg = mean(mpg))]
        avg
      <num>
1: 20.09062

Note that I had to make “mtcars” a data.table first.

Summarizing with data.table

Instead, I could simply make the mean a new column.

dt[, avg := mean(mpg)]
dt
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb      avg
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>    <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4 20.09062
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4 20.09062
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1 20.09062
 4:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1 20.09062
 5:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2 20.09062
 6:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1 20.09062
 7:  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4 20.09062
 8:  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2 20.09062
 9:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2 20.09062
10:  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4 20.09062
11:  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4 20.09062
12:  16.4     8 275.8   180  3.07 4.070 17.40     0     0     3     3 20.09062
13:  17.3     8 275.8   180  3.07 3.730 17.60     0     0     3     3 20.09062
14:  15.2     8 275.8   180  3.07 3.780 18.00     0     0     3     3 20.09062
15:  10.4     8 472.0   205  2.93 5.250 17.98     0     0     3     4 20.09062
16:  10.4     8 460.0   215  3.00 5.424 17.82     0     0     3     4 20.09062
17:  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4 20.09062
18:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1 20.09062
19:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2 20.09062
20:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1 20.09062
21:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1 20.09062
22:  15.5     8 318.0   150  2.76 3.520 16.87     0     0     3     2 20.09062
23:  15.2     8 304.0   150  3.15 3.435 17.30     0     0     3     2 20.09062
24:  13.3     8 350.0   245  3.73 3.840 15.41     0     0     3     4 20.09062
25:  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2 20.09062
26:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1 20.09062
27:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2 20.09062
28:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2 20.09062
29:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4 20.09062
30:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6 20.09062
31:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8 20.09062
32:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2 20.09062
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb      avg

data.table recycles any single value to the length of the table.
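
The same idea works within groups: combining := with the by= argument (introduced below) gives every row the value for its own group. A minimal sketch, with a hypothetical column name, assuming dt is still the mtcars data.table:

dt[, avg_cyl := mean(mpg), by = cyl]  # each row gets the mean mpg of its own cyl group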

Multiple summaries

We can summarize multiple values or columns by simply adding arguments to .()

dt[, .(avg = mean(mpg), med = median(mpg), n = .N)]
        avg   med     n
      <num> <num> <int>
1: 20.09062  19.2    32

How do we do this by group?

We could use the filtering spot, “i”, and append each result

dt[cyl == 6,  .(avg = mean(mpg), med = median(mpg), n = .N)]
        avg   med     n
      <num> <num> <int>
1: 19.74286  19.7     7

or…

Summarizing by group

We can use the “hidden” third square bracket argument: “by=”.

dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = cyl]
     cyl      avg   med     n
   <num>    <num> <num> <int>
1:     6 19.74286  19.7     7
2:     4 26.66364  26.0    11
3:     8 15.10000  15.2    14

We can pass multiple columns to by using the short list: .()

dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = .(cyl, gear)]
     cyl  gear    avg   med     n
   <num> <num>  <num> <num> <int>
1:     6     4 19.750 20.10     4
2:     4     4 26.925 25.85     8
3:     6     3 19.750 19.75     2
4:     8     3 15.050 15.20    12
5:     4     3 21.500 21.50     1
6:     4     5 28.200 28.20     2
7:     8     5 15.400 15.40     2
8:     6     5 19.700 19.70     1

data.table argument summary

dt[i, j, by = ]

  • i: filter rows (tidyverse equivalent: filter())
  • j: select columns, create new columns, summarize (tidyverse equivalents: select(), mutate(), summarize())
  • by =: group by columns (tidyverse equivalent: group_by())

Comparison

mtcars |>
  filter(hp > 100) |>
  group_by(cyl, gear) |>
  summarize(avg_mpg = mean(mpg))
# A tibble: 7 × 3
# Groups:   cyl [3]
    cyl  gear avg_mpg
  <dbl> <dbl>   <dbl>
1     4     4    21.4
2     4     5    30.4
3     6     3    19.8
4     6     4    19.8
5     6     5    19.7
6     8     3    15.0
7     8     5    15.4
dt[hp > 100, .(avg_mpg = mean(mpg)), by = .(cyl, gear)]
     cyl  gear avg_mpg
   <num> <num>   <num>
1:     6     4   19.75
2:     6     3   19.75
3:     8     3   15.05
4:     4     5   30.40
5:     8     5   15.40
6:     6     5   19.70
7:     4     4   21.40

data.table syntax is more concise, but harder to read.

Which should I use?

The main answer:

  • personal preference (which syntax do you prefer?)

Second answer:

  • data.table is much faster
  • especially for large datasets

Speed comparison

A simulation exercise by mdharris makes this clear.

“site_count” is a measure of the simulation size (i.e., the size of the data.frames). Some costly, but not absurd, calculations are then performed on the data.

data.table clearly scales much better with size.
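
That simulation is not reproduced here, but a minimal sketch of the same kind of comparison, on an arbitrary million-row table, looks like this:

library(data.table)
library(dplyr)

n  <- 1e6
df <- data.frame(g = sample(letters, n, replace = TRUE), x = rnorm(n))
dt <- as.data.table(df)

system.time(df |> group_by(g) |> summarize(avg = mean(x)))  # dplyr grouped mean
system.time(dt[, .(avg = mean(x)), by = g])                 # data.table grouped mean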

Why is data.table faster?

  1. It is optimized for speed
  2. It modifies “by reference”

tidyverse functions instead use “copy-on-modify”.

Modify by reference vs. copy

Copy-on-modify

  • Creates a copy of the data
  • Modifies the copy
  • Returns the copy

Modify-in-place (modify “by reference”)

  • References the data
  • Modifies the original data

This avoids duplicating data, which is costly for your computer’s memory, especially with large datasets.

Modify by reference vs. copy

You may have noticed that I had to add an extra call to print data.tables after modifying them.

dt <- data.table(x = 1:2)
dt[, y := c("a", "b")]
dt
       x      y
   <int> <char>
1:     1      a
2:     2      b

This is because the line dt[, y := c("a", "b")] doesn’t print anything:

it edited the data in place rather than creating and returning a copy.

Dangers of editing by reference

Modifying by reference can be dangerous if you are not careful.

dt <- data.table(a = 1:2, b = 3:4)
dt2 <- dt
dt2
       a     b
   <int> <int>
1:     1     3
2:     2     4

What happens to “dt” if we remove a column from “dt2”?

dt2[, b := NULL]
dt
       a
   <int>
1:     1
2:     2

“dt” and “dt2” reference the same underlying data.

If you truly want a new copy, use

dt2 <- copy(dt)

Even Faster!

Keys

Keys provide some organization to your data.

This makes it much faster to later:

  • filter

  • group by

  • merge

data.table lets you add keys in a variety of ways.

data.table(x = c("a", "a", "b"), y = 1:3, key = "x")
dt <- as.data.table(df, key = "x")
setkey(dt, "x")

You can also pass a vector of columns, c("x", "y"), to key on more than one column.

What are keys?

Keys provide an index for the data.

Think of this like a phone book.

If you wanted to look my phone number up in the phone book,

  • you would search under “D” for DeHaven

  • you would not search the entire phone book!

When we tell data.table to use some columns as a key, it presorts and indexes the data on those columns.
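
A minimal sketch of what that buys you: once a key is set, passing key values in the “i” place does a binary search on the index rather than scanning every row.

dt <- data.table(x = c("a", "a", "b"), y = 1:3)
setkey(dt, x)  # sort and index dt by column x
dt["a"]        # binary search on the key; returns the rows where x == "a"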

Merging with data.table

For data.table, all types of merges are handled by a single function, merge():

dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(x = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by = "x")
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5

By default, merge() does an “inner” join (matches in both datasets).

Merging with data.table

Instead, we can force

A “left” join,

merge(dt, dt2, by = "x", all.x = TRUE)
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      b     2    NA
3:      c     3     5

A “right” join,

merge(dt, dt2, by = "x", all.y = TRUE)
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5
3:      d    NA     6

Or an “outer” join

merge(dt, dt2, by = "x", all = TRUE)
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      b     2    NA
3:      c     3     5
4:      d    NA     6

Non-matching column names

We can also be explicit about which columns to use from “x” and from “y”,

dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(XX = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by.x = "x", by.y = "XX")
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5

But it’s probably better to clean up your column names beforehand.
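
One way to do that cleanup with data.table is setnames(), which renames columns by reference; a minimal sketch:

dt  <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(XX = c("a", "c", "d"), z = 4:6)
setnames(dt2, old = "XX", new = "x")  # rename in place, no copy made
merge(dt, dt2, by = "x")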

Pivoting with data.table

data.table pivots

Two main functions:

  • melt() go from wide to long

  • dcast() go from long to wide

We will see how to use each.

Alternatively, the tidyfast package offers dt_pivot_longer() and dt_pivot_wider(), matching the syntax of tidyr with the speed of data.table.

melt

Use to go from wide to long data.

You can provide the “id.vars=” argument, i.e. the columns not to be melted.

dt <- data.table(
  measure = c("unem", "GDP"),
  RI = c(0.05, 100),
  MA = c(0.04, 102),
  CT = c(0.06, 101)
)
dt
   measure    RI     MA     CT
    <char> <num>  <num>  <num>
1:    unem 5e-02   0.04   0.06
2:     GDP 1e+02 102.00 101.00
melt(dt, id.vars = c("measure"))
   measure variable  value
    <char>   <fctr>  <num>
1:    unem       RI   0.05
2:     GDP       RI 100.00
3:    unem       MA   0.04
4:     GDP       MA 102.00
5:    unem       CT   0.06
6:     GDP       CT 101.00

melt

Instead, you can provide the “measure.vars=” argument, i.e. the columns to be melted.

dt <- data.table(
  measure = c("unem", "GDP"),
  RI = c(0.05, 100),
  MA = c(0.04, 102),
  CT = c(0.06, 101)
)
melt(dt, measure.vars = c("RI", "MA", "CT"))
   measure variable  value
    <char>   <fctr>  <num>
1:    unem       RI   0.05
2:     GDP       RI 100.00
3:    unem       MA   0.04
4:     GDP       MA 102.00
5:    unem       CT   0.06
6:     GDP       CT 101.00

Generally there are more measure columns than id columns, so it’s easier to just list the “id.vars=”.

melt

You can overwrite the default “variable” and “value” column names,

long_dt <- melt(dt, id.vars = "measure", value.name = "x", variable.name = "state")
long_dt
   measure  state      x
    <char> <fctr>  <num>
1:    unem     RI   0.05
2:     GDP     RI 100.00
3:    unem     MA   0.04
4:     GDP     MA 102.00
5:    unem     CT   0.06
6:     GDP     CT 101.00

dcast

dcast() has a bit of an odd syntax: it takes the data.table and a “formula”.

The formula is of the form “LHS ~ RHS”,

  • LHS are the columns that remain “long”
  • RHS is the column that gets spread “wide”
long_dt
   measure  state      x
    <char> <fctr>  <num>
1:    unem     RI   0.05
2:     GDP     RI 100.00
3:    unem     MA   0.04
4:     GDP     MA 102.00
5:    unem     CT   0.06
6:     GDP     CT 101.00
dcast(long_dt, measure ~ state)
Key: <measure>
   measure    RI     MA     CT
    <char> <num>  <num>  <num>
1:     GDP 1e+02 102.00 101.00
2:    unem 5e-02   0.04   0.06

tidyverse or data.table?

tidyverse or data.table?

In practice, I (and many people) use a mix of both.

I personally prefer data.table’s syntax for filtering, modifying columns, and summarizing.

But I find melt and dcast much clunkier than the pivot_ functions.

If you really want the speed of data.table and the syntax of dplyr

dtplyr

…there is a special package that combines the two!

dtplyr allows you to use all of the tidyverse verbs:

  • mutate()
  • filter()
  • group_by()

But behind the scenes runs everything using data.table.
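
A minimal sketch of how dtplyr is typically used, assuming it is installed: wrap the data with lazy_dt(), write ordinary dplyr verbs, and collect the result at the end.

library(dtplyr)
library(dplyr)

dt <- lazy_dt(mtcars)  # wrap the data; verbs are now translated into data.table calls
dt |>
  filter(hp > 100) |>
  group_by(cyl, gear) |>
  summarize(avg_mpg = mean(mpg)) |>
  as_tibble()          # forces the translated data.table code to actually run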

A note on pipes

You can use pipes with data.table as well.

dt <- data.table(x = 1:2)
dt[, y := 3:4]
dt[, z := 5:6]
dt
       x     y     z
   <int> <int> <int>
1:     1     3     5
2:     2     4     6
dt <- data.table(x = 1:2)
dt |>
  _[, y := 3:4] |>
  _[, z := 5:6]
dt
       x     y     z
   <int> <int> <int>
1:     1     3     5
2:     2     4     6

The underscore character _ is known as a “placeholder”,

and represents the object being piped.

Pipes are a bit less useful here because data.table syntax is already very short.

Summary

  • renv
    • Separate libraries
    • snapshot() packages to lockfiles
  • Introduced data.table
    • Using square bracket positions dt[i, j, by=]
    • Fast and concise
    • Keys
    • merge()
    • melt() and dcast()

Summary

Pick your own journey: use data.table or the tidyverse, whichever you prefer!

Though I will make you use both on this next assignment…