R Data Wrangling:
data.table

Matthew DeHaven

Course Home Page

2024-01-01

Lecture Summary

  • Talking more about renv
  • data.table

More renv

R Library

R by default has one library of packages for all projects.

[Diagram: a single User Library shared by Project 1, Project 2, and Project 3]

R Library

You can see where this library lives by calling,

.libPaths()

Notice there are actually two library paths returned!

“User” Library

  • For any packages you install

“System” Library

  • For default packages when base R is installed
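For example, on a Linux machine the two paths might look something like this (the paths below are hypothetical; yours will differ):

.libPaths()
#> [1] "/home/user/R/x86_64-pc-linux-gnu-library/4.3"   # user library (hypothetical path)
#> [2] "/usr/lib/R/library"                             # system library (hypothetical path)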

R Library

Installing packages puts them into the User Library. Calling library() loads a package.

[Diagram: install.packages() copies packages from CRAN into the User Library; each project then loads them with library()]

The Problems

There are two problems with R’s default package setup.

  1. All projects share the same library
     • Updating a package for one project updates it for all
  2. It is not easy to record and recreate the exact packages used for a project

These are obviously related, which is why renv solves both.

renv Separates Project Libraries

renv::init() makes isolated libraries for each project.

[Diagram: Library 1 serves Project 1, Library 2 serves Project 2, Library 3 serves Project 3]

renv Cache

In order to avoid duplicate copies and downloads, renv stores a cache of all the packages it installs behind the scenes.

[Diagram: a shared renv Cache feeds Library 1, Library 2, and Library 3, each linked to its own project]

This solves the problem of projects changing each other’s packages, but it does not yet give us any way to track and restore packages.

renv lockfile

The lockfile keeps track of which packages are used in a project.

It does not simply record everything in the Project Library by default.

  • You may have some packages you use just for development that are not needed for all of your code to run.
  • An example of this is the languageserver package

renv Record Packages to the Lockfile

In order to record a new package to the lockfile, run

renv::snapshot()

This causes renv to search all of your project’s R files, see which packages are being used, and save their current versions to the lockfile.

renv Status

To see if your lockfile is up-to-date, you can run

renv::status()

renv will also tell you if everything is up-to-date when you launch a new R terminal.

renv Restore

If you open a project with a renv lockfile and want to recreate the package environment, run

renv::restore()

You will see this in the next assignment.

renv Summary
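Putting the pieces together, a sketch of the typical renv workflow covered above:

renv::init()                  # create an isolated library for the project
renv::install("data.table")   # install a package into the project library
renv::snapshot()              # record the packages your code uses in the lockfile
renv::status()                # check whether the lockfile is up to date
renv::restore()               # recreate the environment from an existing lockfile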

data.table

data.table

data.table is another package focused on data science.

Its goals:

  • speed
  • brevity
  • no dependencies

This is in contrast to the tidyverse, which is focused on readability.

Installing data.table

Again, we use renv to install data.table in any project where we want to use it.

renv::install("data.table")

Then we can load the package,

library(data.table)

data.table objects

data.tables are a new type of object that replaces data.frame.

Similar to tibbles, they have nicer default printing.

data.frame(
  x = rnorm(200),
  y = rnorm(200)
)
                x            y
1    0.2176180759  0.845299920
2    0.8315502429 -1.912088839
3    0.4336599381  1.249713280
4   -0.6064116759 -0.492438762
5    0.1047437378  0.692491473
6    0.3954585758  0.363160549
7   -0.6322579392  0.461267335
8    2.0097489559 -1.794764289
9    0.1238147480 -0.897051101
10   0.4085475701 -0.596935133
11   0.3041320144 -0.961204785
12   1.6307634028  0.122357807
13  -1.4995724509  0.494226730
14  -0.8058411664  0.834436270
15  -0.4607678624  1.065866434
16   0.3728639124  0.898016393
17   0.2873486773 -1.232859638
18   1.5832830672 -0.835078286
19   0.7696143682 -0.714059577
20   1.3027338917  0.754586810
21  -0.5517187078 -1.295293331
22  -0.7377118245  0.311617171
23  -0.0502708149 -1.233521659
24  -0.6336227276 -1.340867827
25   1.1508161177 -0.119841729
26   0.1282261261 -0.006074305
27  -1.3645556345  1.045876379
28  -0.7225296817 -0.609757552
29  -0.0710353360  1.021934442
30   0.4490436085 -0.852037359
31  -0.3453445626  0.392332497
32  -0.2237668796 -1.554617727
33   0.2287305416  0.747754548
34  -0.1431252640 -0.715315296
35   0.8608447455  0.127446580
36  -0.0407475106  1.245841770
37   0.5347260810  0.604684439
38   0.8707140324  1.103927155
39   0.0665660930 -0.473655327
40  -0.8437202918 -0.296073582
41  -0.5167530696 -0.200588433
42   0.1830690735  0.589045879
43  -0.1814497300  1.236846776
44   1.0407452099 -0.369801577
45  -1.1621759916 -0.699248569
46  -0.0706461030 -0.169438587
47   0.4881736124 -1.487003804
48  -0.2875757311 -0.980333901
49  -0.6151910329 -1.289897742
50   0.1479521868  0.131813032
51   0.9285833469  0.154239001
52  -1.4326248346 -0.844985139
53   0.5484043089 -0.990673508
54   0.2310564403 -0.816596765
55  -0.6424217470  1.996500471
56  -0.3262261102  0.306810375
57   0.9434319958  2.521753953
58   0.5484243505 -0.123019321
59   0.4295029437 -1.041834460
60  -0.4339641311 -0.906090005
61  -0.3022943673 -0.720878989
62   1.5578969895 -0.350037612
63  -0.3240115684  0.562773753
64  -0.8786397075  0.284150591
65   1.5131266848 -0.538517829
66   1.4506597254  1.040535004
67   0.1666648344  0.130587574
68  -2.0610876556 -0.395751313
69  -0.3894989597 -0.790612269
70  -0.2244559728  0.616398583
71   0.8438949857  1.393255004
72   1.3669475878  0.365977405
73   0.3479722869  0.816954126
74   0.6983610707 -1.479526273
75  -1.5830269844 -0.633090571
76  -0.0031007704  0.232942847
77  -1.5341464783 -0.021729267
78   0.6597705557  1.115202442
79  -1.3384155751  0.707655326
80   0.4796858415  0.557570314
81   0.7251979941  0.271521812
82   1.4251148491  0.738065918
83   0.9406016120  0.148378969
84  -0.3575855515  0.641216967
85  -0.6777377542  1.852589696
86  -1.3610219367  1.420314426
87   0.0539814614  0.338860244
88  -0.5988386125 -0.084309737
89   1.5994385376  0.351866923
90  -2.4144181291 -0.728598194
91   0.5595766221 -2.389937515
92   0.0972633356  0.504934616
93   0.8174738717 -0.188326693
94  -0.0194549340  0.016072918
95  -0.2639540897 -0.071135151
96   0.7203733441  1.239665018
97   1.6143563937  0.197653511
98   0.8921546644 -0.776755726
99   0.5003449229 -0.420460639
100 -0.7035744803  0.385971926
101 -0.6476566830  0.174015284
102 -0.4324603826  0.553896856
103  0.3151218822  0.529222141
104  1.5386630204 -0.670138704
105  0.3775913719  0.312006778
106 -1.1759612228  0.613993723
107  0.7506937497 -0.109901740
108 -1.3398935087  0.343893655
109 -0.3849629188 -1.600246580
110 -0.4634256708 -0.502375041
111  2.3872909513  0.399459501
112 -2.1184162857  0.937245325
113  1.1100140935 -0.766755258
114 -1.5148303806 -0.218489097
115  1.2057237133 -0.292945891
116  1.4373222907 -0.800390299
117 -1.0837885668  1.120067314
118  0.8953420196 -1.078227278
119 -0.3597390619 -0.349590498
120  1.6101408730  0.477827312
121 -0.3571059713 -1.219136523
122  0.2103909051  1.175544623
123  0.7434510791 -0.687909615
124  0.7884631157 -0.277844921
125  0.8051439795  1.671374901
126 -1.0227864563  0.998307121
127  1.3255777717  0.786876129
128 -0.4262748214  0.788845679
129 -1.3381830514 -0.025603406
130 -0.2886872702 -1.778837381
131  0.0273585255  0.031655419
132  0.0738927619  0.522333715
133  1.4057382752 -1.777153028
134  0.3559898698 -0.327881723
135 -0.1970144183  1.231251409
136  0.8782132330  1.444496074
137 -2.5476176862 -0.366210316
138  1.4173314139  0.781311418
139  0.4800506036  0.032877313
140 -0.3520628645  0.089089469
141  0.4517307219  1.350319879
142 -1.0068895688 -1.295883083
143  0.8896057310  1.251802086
144 -0.0237487748  1.245044351
145 -0.2346514033 -1.129440506
146  0.5333118573  2.037994871
147 -1.0192772118  0.464508929
148  0.3487913416 -0.216491297
149 -0.4091664460 -1.455393331
150 -0.2366783042  0.558416532
151  0.5953597924  1.937447736
152  0.8193235631 -0.373634136
153  0.0737412483 -0.001285201
154  0.3153821430  1.118826940
155  0.4229540634  1.710937613
156  1.8123798460  0.281283542
157 -0.3028493645  1.292680564
158  0.2810965731 -0.056968763
159 -1.5322632907 -0.025268407
160 -0.6603110360 -1.043771853
161  0.0181564449 -0.450405893
162  0.0060277508  1.301952383
163 -0.8765741073 -1.546954587
164  0.3374973999  2.163462794
165  1.5868941296 -0.215335337
166 -0.9869984369 -0.041920017
167 -1.3139744370 -1.274100225
168 -1.1966182645 -1.125421692
169 -1.3411635694  1.093781900
170 -2.0317296438 -0.835090228
171 -1.2795675426 -0.020887600
172 -0.1587601353 -0.410541121
173  0.8078697894  1.013471640
174  0.6711548706 -1.121446590
175 -0.6752803493 -1.611938891
176 -0.1133926579  0.817988591
177 -0.7549755668 -1.645389354
178 -0.4449720950 -0.148669054
179 -0.1603629544  0.303937067
180 -0.4825257439  0.586513872
181  1.7719553903  0.190627776
182  0.2748432332 -1.716396322
183 -0.0589740749  1.190722022
184 -1.8216225913  0.149191449
185 -0.1231964074 -0.519365064
186 -0.0391327213  0.793177325
187 -0.7987688907 -1.411250361
188 -2.3390080938 -0.070518619
189 -0.3773146067 -0.049133419
190 -0.9954655132  0.263681756
191 -0.0002018366 -1.157001243
192 -1.0395385565  0.238082780
193 -0.5684007849 -0.561563040
194 -0.9005807751  0.111836140
195  0.5015315683  1.121351091
196 -1.4971604623  0.347850056
197  0.8480539903  0.646499501
198 -0.0707157342  1.071113214
199  0.3868853709  2.175954241
200  0.5887495941  2.570317001
data.table(
  x = rnorm(200),
  y = rnorm(200)
)
               x          y
           <num>      <num>
  1: -1.00897359 -0.4087155
  2:  2.55870500 -0.2288996
  3:  1.18031640 -2.4433400
  4: -1.76549680 -1.1524490
  5:  2.10837263  1.0708214
 ---                       
196: -0.82412022  0.6333711
197: -1.76409078  1.0891432
198:  0.58533120  0.4220185
199: -0.04761425  0.4647622
200: -0.80502935  0.4626533

data.table square brackets

Remember from base R that we can subset vectors, matrices, and data.frames using [i, j].

This is true for data.table as well.

x <- data.table(a = rnorm(10), b = rnorm(10))
x[1,2]
            b
        <num>
1: -0.1561649

data.table square brackets

data.table takes square brackets a huge step further, by allowing

  • filtering in the ‘i’ place
  • new variable declaration in the ‘j’ place

We will look at each of these in turn.

data.table filtering

x <- data.table(a = rnorm(10), b = rnorm(10))
x
               a          b
           <num>      <num>
 1: -0.319324871 -0.2243540
 2: -1.003274392  0.3469449
 3: -1.064288996 -0.4809599
 4:  0.005162979  0.1318561
 5: -1.822072621  0.3770973
 6: -0.750549664 -0.2684060
 7:  1.282488595 -0.9357365
 8:  0.791203323  0.1288643
 9:  0.507654181  1.6424895
10: -2.061684596  0.7237209

What if we want all the values where “a” is less than 0?

x[a < 0, ]
            a          b
        <num>      <num>
1: -0.3193249 -0.2243540
2: -1.0032744  0.3469449
3: -1.0642890 -0.4809599
4: -1.8220726  0.3770973
5: -0.7505497 -0.2684060
6: -2.0616846  0.7237209

data.table filtering

x
               a          b
           <num>      <num>
 1: -0.319324871 -0.2243540
 2: -1.003274392  0.3469449
 3: -1.064288996 -0.4809599
 4:  0.005162979  0.1318561
 5: -1.822072621  0.3770973
 6: -0.750549664 -0.2684060
 7:  1.282488595 -0.9357365
 8:  0.791203323  0.1288643
 9:  0.507654181  1.6424895
10: -2.061684596  0.7237209

Instead we could do all the values where “a” is less than “b”.

x[a < b, ]
              a          b
          <num>      <num>
1: -0.319324871 -0.2243540
2: -1.003274392  0.3469449
3: -1.064288996 -0.4809599
4:  0.005162979  0.1318561
5: -1.822072621  0.3770973
6: -0.750549664 -0.2684060
7:  0.507654181  1.6424895
8: -2.061684596  0.7237209

data.table filtering

x
               a          b
           <num>      <num>
 1: -0.319324871 -0.2243540
 2: -1.003274392  0.3469449
 3: -1.064288996 -0.4809599
 4:  0.005162979  0.1318561
 5: -1.822072621  0.3770973
 6: -0.750549664 -0.2684060
 7:  1.282488595 -0.9357365
 8:  0.791203323  0.1288643
 9:  0.507654181  1.6424895
10: -2.061684596  0.7237209

In fact, we can drop the comma entirely: when only one argument is given, data.table assumes it is the filtering (“i”) argument.

x[a < b]
              a          b
          <num>      <num>
1: -0.319324871 -0.2243540
2: -1.003274392  0.3469449
3: -1.064288996 -0.4809599
4:  0.005162979  0.1318561
5: -1.822072621  0.3770973
6: -0.750549664 -0.2684060
7:  0.507654181  1.6424895
8: -2.061684596  0.7237209

tidyverse filtering equivalent

Here is a comparison of the data.table and tidyverse syntaxes side-by-side.

x[a < 0]
            a          b
        <num>      <num>
1: -0.3193249 -0.2243540
2: -1.0032744  0.3469449
3: -1.0642890 -0.4809599
4: -1.8220726  0.3770973
5: -0.7505497 -0.2684060
6: -2.0616846  0.7237209
x |>
  filter(a < 0)
# A tibble: 6 × 2
       a      b
   <dbl>  <dbl>
1 -0.319 -0.224
2 -1.00   0.347
3 -1.06  -0.481
4 -1.82   0.377
5 -0.751 -0.268
6 -2.06   0.724

Pretty similar, but data.table is shorter.

data.table column subsetting

Remember that normal table subsetting uses “i” for rows, and “j” for columns.

data.table follows this.

dt <- data.table(a = rnorm(5), b = rnorm(5), c = rnorm(5))
dt[, a]
[1] -0.76882903 -0.03243736  0.15819842 -0.07354196  0.42274581

If you want two columns, you have to pass a list.

dt[, list(a, b)]
             a          b
         <num>      <num>
1: -0.76882903 -0.8134719
2: -0.03243736  1.2999021
3:  0.15819842 -1.7423178
4: -0.07354196  1.0340278
5:  0.42274581  0.5187188

Alternatively, data.table provides .(a, b) as a shorthand for list(a, b).

dt[, .(a, b)]

data.table variable declaration

What if we want to add a new column?

data.table does this using the column subsetting place, “j”.

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, newcol := a + b]
dt
            a           b     newcol
        <num>       <num>      <num>
1:  0.9685731 -0.08073421  0.8878389
2:  0.9084275  0.41180439  1.3202319
3: -0.5015523 -0.64215885 -1.1437111
4: -0.9879745 -1.81370331 -2.8016778
5: -1.8906486  0.03895319 -1.8516954

:= is a data.table operator meaning “defined as”,

i.e. newcol is defined as ‘a’ plus ‘b’.

Some people also call it the “walrus” operator.

data.table variable declaration

Would this work?

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[newcol := a + b]

No.

data.table will assume we are using the “i” place for rows.

dt[newcol := a + b]
Error: Operator := detected in i, the first argument inside DT[...], but is only valid in the second argument, j. Most often, this happens when forgetting the first comma (e.g. DT[newvar := 5] instead of DT[ , new_var := 5]). Please double-check the syntax. Run traceback(), and debugger() to get a line number.

data.table variable editing

We can use the exact same syntax to edit current columns.

dt <- data.table(a = rnorm(5), b = rnorm(5))
dt[, a := 1:5]
dt
       a          b
   <int>      <num>
1:     1  0.7227419
2:     2  0.5990643
3:     3  0.2738779
4:     4 -0.0761493
5:     5 -2.0419961

Or to remove a column,

dt[, b := NULL]
dt
       a
   <int>
1:     1
2:     2
3:     3
4:     4
5:     5

data.table conditional editing

What if we want to remove values of “b” that are less than 0?

dt <- data.table(a = -2:2, b = -2:2)

We could conditionally declare values of “b” to be NA.

dt[b < 0, b := NA]
dt
       a     b
   <int> <int>
1:    -2    NA
2:    -1    NA
3:     0     0
4:     1     1
5:     2     2

In tidyverse, the equivalent is somewhat wordy:

dt |> mutate(b = if_else(b < 0, NA, b))

Summarizing with data.table

Remember in tidyverse we could write:

mtcars |> summarize(avg = mean(mpg))
       avg
1 20.09062

The equivalent in data.table is to pass a list .() to the “j” place:

dt <- as.data.table(mtcars)
dt[, .(avg = mean(mpg))]
        avg
      <num>
1: 20.09062

Note that I had to make “mtcars” a data.table first.

Summarizing with data.table

Instead, I could simply make the mean a new column.

dt[, avg := mean(mpg)]
dt
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb      avg
    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>    <num>
 1:  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4 20.09062
 2:  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4 20.09062
 3:  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1 20.09062
 4:  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1 20.09062
 5:  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2 20.09062
 6:  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1 20.09062
 7:  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4 20.09062
 8:  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2 20.09062
 9:  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2 20.09062
10:  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4 20.09062
11:  17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4 20.09062
12:  16.4     8 275.8   180  3.07 4.070 17.40     0     0     3     3 20.09062
13:  17.3     8 275.8   180  3.07 3.730 17.60     0     0     3     3 20.09062
14:  15.2     8 275.8   180  3.07 3.780 18.00     0     0     3     3 20.09062
15:  10.4     8 472.0   205  2.93 5.250 17.98     0     0     3     4 20.09062
16:  10.4     8 460.0   215  3.00 5.424 17.82     0     0     3     4 20.09062
17:  14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4 20.09062
18:  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1 20.09062
19:  30.4     4  75.7    52  4.93 1.615 18.52     1     1     4     2 20.09062
20:  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1 20.09062
21:  21.5     4 120.1    97  3.70 2.465 20.01     1     0     3     1 20.09062
22:  15.5     8 318.0   150  2.76 3.520 16.87     0     0     3     2 20.09062
23:  15.2     8 304.0   150  3.15 3.435 17.30     0     0     3     2 20.09062
24:  13.3     8 350.0   245  3.73 3.840 15.41     0     0     3     4 20.09062
25:  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2 20.09062
26:  27.3     4  79.0    66  4.08 1.935 18.90     1     1     4     1 20.09062
27:  26.0     4 120.3    91  4.43 2.140 16.70     0     1     5     2 20.09062
28:  30.4     4  95.1   113  3.77 1.513 16.90     1     1     5     2 20.09062
29:  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4 20.09062
30:  19.7     6 145.0   175  3.62 2.770 15.50     0     1     5     6 20.09062
31:  15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8 20.09062
32:  21.4     4 121.0   109  4.11 2.780 18.60     1     1     4     2 20.09062
      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb      avg

data.table recycles any single value to the length of the table.
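As a tiny illustration (the table and column names here are made up, not from the slides):

tiny <- data.table(x = 1:3)
tiny[, one := 1]   # the single value 1 is recycled to fill all 3 rows
tiny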

Multiple summaries

We can summarize multiple values or columns by simply adding arguments to .()

dt[, .(avg = mean(mpg), med = median(mpg), n = .N)]
        avg   med     n
      <num> <num> <int>
1: 20.09062  19.2    32

How do we do this by group?

We could use the filtering spot, “i”, once per group and append the results

dt[cyl == 6,  .(avg = mean(mpg), med = median(mpg), n = .N)]
        avg   med     n
      <num> <num> <int>
1: 19.74286  19.7     7
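Repeating that filter for every group and stacking the results works, but it gets tedious quickly. A sketch of that approach (cylinder values hardcoded for illustration):

rbind(
  dt[cyl == 4, .(cyl = 4, avg = mean(mpg), med = median(mpg), n = .N)],
  dt[cyl == 6, .(cyl = 6, avg = mean(mpg), med = median(mpg), n = .N)],
  dt[cyl == 8, .(cyl = 8, avg = mean(mpg), med = median(mpg), n = .N)]
)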

or…

Summarizing by group

We can use the “hidden” third square bracket argument: “by=”.

dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = cyl]
     cyl      avg   med     n
   <num>    <num> <num> <int>
1:     6 19.74286  19.7     7
2:     4 26.66364  26.0    11
3:     8 15.10000  15.2    14

We can pass multiple columns to by using the short list: .()

dt[, .(avg = mean(mpg), med = median(mpg), n = .N), by = .(cyl, gear)]
     cyl  gear    avg   med     n
   <num> <num>  <num> <num> <int>
1:     6     4 19.750 20.10     4
2:     4     4 26.925 25.85     8
3:     6     3 19.750 19.75     2
4:     8     3 15.050 15.20    12
5:     4     3 21.500 21.50     1
6:     4     5 28.200 28.20     2
7:     8     5 15.400 15.40     2
8:     6     5 19.700 19.70     1

data.table argument summary

dt[i, j, by = ]

  • i: filtering rows (tidyverse equivalent: filter())
  • j: selecting columns, creating new columns, summarizing (tidyverse equivalents: select(), mutate(), summarize())
  • by =: grouping by columns (tidyverse equivalent: group_by())

Comparison

mtcars |>
  filter(hp > 100) |>
  group_by(cyl, gear) |>
  summarize(avg_mpg = mean(mpg))
# A tibble: 7 × 3
# Groups:   cyl [3]
    cyl  gear avg_mpg
  <dbl> <dbl>   <dbl>
1     4     4    21.4
2     4     5    30.4
3     6     3    19.8
4     6     4    19.8
5     6     5    19.7
6     8     3    15.0
7     8     5    15.4
dt[hp > 100, .(avg_mpg = mean(mpg)), by = .(cyl, gear)]
     cyl  gear avg_mpg
   <num> <num>   <num>
1:     6     4   19.75
2:     6     3   19.75
3:     8     3   15.05
4:     4     5   30.40
5:     8     5   15.40
6:     6     5   19.70
7:     4     4   21.40

data.table syntax is more concise, but harder to read.

Which should I use?

The main answer:

  • personal preference (which syntax do you prefer?)

Second answer:

  • data.table is much faster
  • especially for large datasets

Speed comparison

A simulation exercise by mdharris makes this clear.

“site_count” is a measure of the simulation size (i.e. the size of the data.frames). Some costly, but not unreasonable, calculations are then performed on the data.

data.table clearly scales much better with size.
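If you want to check this on your own machine, here is a rough sketch of a comparison in the same spirit (the data size is arbitrary and the timings depend on your hardware; this is not the mdharris simulation itself):

library(data.table)
library(dplyr)

n  <- 1e7
df <- data.frame(g = sample(letters, n, replace = TRUE), x = rnorm(n))
dt <- as.data.table(df)

system.time(df |> group_by(g) |> summarize(avg = mean(x)))   # tidyverse
system.time(dt[, .(avg = mean(x)), by = g])                  # data.table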

Why is data.table faster?

  1. Optimized for speed
  2. Modifies “by reference”

tidyverse functions instead use “copy-on-modify”.

Modify by reference vs. copy

Copy-on-modify

  • Creates a copy of the data
  • Modifies the copy
  • Returns the copy

Modify-in-place (modify “by reference”)

  • References the data
  • Modifies the original data

This avoids duplicating the data, which is costly for your computer’s memory, especially with large datasets.
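A minimal sketch of the difference (mutate() stands in here for any copy-on-modify tidyverse verb):

library(data.table)
library(dplyr)

# Copy-on-modify: mutate() builds and returns a modified copy
df  <- data.frame(x = 1:3)
df2 <- mutate(df, y = x * 2)   # df is untouched; df2 is a new object

# Modify-in-place: := changes dt itself, no reassignment needed
dt <- data.table(x = 1:3)
dt[, y := x * 2]               # dt now has a y column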

Modify by reference vs. copy

You may have noticed that I had to add an extra call to print data.tables after modifying them.

dt <- data.table(x = 1:2)
dt[, y := c("a", "b")]
dt
       x      y
   <int> <char>
1:     1      a
2:     2      b

This is because the line dt[, y := c("a", "b")] doesn’t print anything: it returns its result invisibly.

That is because it edited the data in place rather than creating and returning a copy.

Dangers of editing by reference

Modifying by reference can be dangerous if you are not careful.

dt <- data.table(a = 1:2, b = 3:4)
dt2 <- dt
dt2
       a     b
   <int> <int>
1:     1     3
2:     2     4

What happens to “dt” if we remove a column from “dt2”?

dt2[, b := NULL]
dt
       a
   <int>
1:     1
2:     2

“dt” and “dt2” reference the same underlying data.

If you truly want a new copy, use

dt2 <- copy(dt)

Even Faster!

Keys

Keys provide some organization to your data.

This makes it much faster to later:

  • filter

  • group by

  • merge

data.table lets you add keys in a variety of ways.

data.table(x = c("a", "a", "b"), y = 1:3, key = "x")  # set the key when creating
dt <- as.data.table(df, key = "x")                    # or when converting a data.frame
setkey(dt, "x")                                       # or on an existing data.table

You can also pass a vector of keys c("x", "y") to use.

What are keys?

Keys provide an index for the data.

Think of this like a phone book.

If you wanted to look my phone number up in the phone book,

  • you would search under “D” for DeHaven

  • you would not search the entire phone book!

When we tell data.table to use some columns as a key, it presorts and indexes the data on those columns.
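A minimal sketch of setting a key and filtering on it (the column names and sizes are made up):

dt <- data.table(id = sample(letters, 1e6, replace = TRUE), value = rnorm(1e6))
setkey(dt, id)

dt[id == "d"]   # can use the sorted key instead of scanning every row
dt["d"]         # keyed subset shorthand: rows where the key column equals "d"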

Merging with data.table

For data.table, all types of merges (joins) are handled by one function, merge():

dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(x = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by = "x")
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5

By default, merge() does an “inner” join (matches in both datasets).

Merging with data.table

Instead, we can force

A “left” join,

merge(dt, dt2, by = "x", all.x = TRUE)
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      b     2    NA
3:      c     3     5

A “right” join,

merge(dt, dt2, by = "x", all.y = TRUE)
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5
3:      d    NA     6

Or an “outer” join

merge(dt, dt2, by = "x", all = TRUE)
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      b     2    NA
3:      c     3     5
4:      d    NA     6

Non-matching column names

We can also be explicit about which columns to use from “x” and from “y”,

dt <- data.table(x = c("a", "b", "c"), y = 1:3)
dt2 <- data.table(XX = c("a", "c", "d"), z = 4:6)
merge(dt, dt2, by.x = "x", by.y = "XX")
Key: <x>
        x     y     z
   <char> <int> <int>
1:      a     1     4
2:      c     3     5

But it’s probably better to clean up your column names beforehand.

Pivoting with data.table

data.table pivots

Two main functions:

  • melt() go from wide to long

  • dcast() go from long to wide

We will see how to use each.

There is also a package, tidyfast, which offers dt_pivot_longer() and dt_pivot_wider() to match the syntax of tidyr with the speed of data.table.

melt

Use melt() to go from wide to long data.

You can provide the “id.vars=” argument, i.e. the columns not to be melted.

dt <- data.table(
  measure = c("unem", "GDP"),
  RI = c(0.05, 100),
  MA = c(0.04, 102),
  CT = c(0.06, 101)
)
dt
   measure    RI     MA     CT
    <char> <num>  <num>  <num>
1:    unem 5e-02   0.04   0.06
2:     GDP 1e+02 102.00 101.00
melt(dt, id.vars = c("measure"))
   measure variable  value
    <char>   <fctr>  <num>
1:    unem       RI   0.05
2:     GDP       RI 100.00
3:    unem       MA   0.04
4:     GDP       MA 102.00
5:    unem       CT   0.06
6:     GDP       CT 101.00

melt

Instead, you can provide the “measure.vars=” argument, i.e. the columns to be melted.

dt <- data.table(
  measure = c("unem", "GDP"),
  RI = c(0.05, 100),
  MA = c(0.04, 102),
  CT = c(0.06, 101)
)
melt(dt, measure.vars = c("RI", "MA", "CT"))
   measure variable  value
    <char>   <fctr>  <num>
1:    unem       RI   0.05
2:     GDP       RI 100.00
3:    unem       MA   0.04
4:     GDP       MA 102.00
5:    unem       CT   0.06
6:     GDP       CT 101.00

Generally there are more columns to melt than id columns, so it’s easier to list the “id.vars=”.

melt

You can overwrite the default “variable” and “value” column names,

long_dt <- melt(dt, id.vars = "measure", value.name = "x", variable.name = "state")
long_dt
   measure  state      x
    <char> <fctr>  <num>
1:    unem     RI   0.05
2:     GDP     RI 100.00
3:    unem     MA   0.04
4:     GDP     MA 102.00
5:    unem     CT   0.06
6:     GDP     CT 101.00

dcast

dcast() has a bit of an odd syntax: it takes the data.table and a “formula”.

The formula is of the form “LHS ~ RHS”,

  • LHS are the columns that remain “long”
  • RHS is the column that gets spread “wide”
long_dt
   measure  state      x
    <char> <fctr>  <num>
1:    unem     RI   0.05
2:     GDP     RI 100.00
3:    unem     MA   0.04
4:     GDP     MA 102.00
5:    unem     CT   0.06
6:     GDP     CT 101.00
dcast(long_dt, measure ~ state)
Key: <measure>
   measure    RI     MA     CT
    <char> <num>  <num>  <num>
1:     GDP 1e+02 102.00 101.00
2:    unem 5e-02   0.04   0.06

tidyverse or data.table?

tidyverse or data.table?

In practice, I (and many people) use a mix of both.

I personally prefer data.table’s syntax for filtering, modifying columns, and summarizing.

But I find melt and dcast much clunkier than the pivot_ functions.

If you really want the speed of data.table and the syntax of dplyr

dtplyr

…there is a special package that combines the two!

dtplyr allows you to use all of the tidyverse verbs:

  • mutate()
  • filter()
  • group_by()

But behind the scenes, it runs everything using data.table.
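A minimal sketch of how dtplyr is used (assuming the dtplyr and dplyr packages are installed):

library(dtplyr)
library(dplyr)

ldt <- lazy_dt(mtcars)           # wrap the data so dplyr verbs are translated
ldt |>
  filter(hp > 100) |>
  group_by(cyl) |>
  summarize(avg_mpg = mean(mpg)) |>
  as_tibble()                    # collect the result; the work ran through data.table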

A note on pipes

You can use pipes with data.table as well.

dt <- data.table(x = 1:2)
dt[, y := 3:4]
dt[, z := 5:6]
dt
       x     y     z
   <int> <int> <int>
1:     1     3     5
2:     2     4     6
dt <- data.table(x = 1:2)
dt |>
  _[, y := 3:4] |>
  _[, z := 5:6]
dt
       x     y     z
   <int> <int> <int>
1:     1     3     5
2:     2     4     6

The underscore character _ is known as a “placeholder” and represents the object being piped.

Pipes are a bit less useful here because the data.table syntax is already very short.

Summary

  • renv
    • Separate libraries
    • snapshot() packages to lockfiles
  • Introduced data.table
    • Using square bracket positions dt[i, j, by=]
    • Fast and concise
    • Keys
    • merge()
    • melt() and dcast()

Summary

Pick your own journey: use data.table or tidyverse, whichever you prefer!

Though I will make you use both on this next assignment…