Data wrangling with data.table and the Tidyverse

Common data wrangling operations with both data.table and the Tidyverse.

Data Manipulation
Tidyverse
data.table
R
First Published

May 19, 2022

Summary
This post showcases various ways to accomplish most data wrangling operations, from basic filtering/mutating to pivots and non-equi joins, with both data.table and the Tidyverse (dplyr, tidyr, purrr, stringr).

v1: 2022-05-19

v2: 2022-05-26

  • Improved the section on keys (for ordering & filtering)
  • Adding a section for translations of Tidyr (and other similar packages)
  • Capping tables to display 15 rows max when unfolded
  • Improving table display (stripping, hiding the contents of nested columns, …)

v3: 2022-07-20

  • Updating data.table’s examples of dynamic programming using env
  • Added new entries in processing examples
  • Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
  • Added pivot_wider examples to match the dcast ones in the Pivots section
  • Added some new examples here and there across the Basic Operations section
  • Added an entry for operating inside nested data.frames/data.tables
  • Added a processing example for run-length encoding (i.e. successive event tagging)

v4: 2022-08-05

  • Improved pivot section: example of one-hot encoding (and reverse operation) + better examples of partial pivots with .value
  • Added tidyr::uncount() (row duplication) example.
  • Improved both light & dark themes (code highlight, tables, …)

v5: 2023-03-12

  • Revamped the whole document with grouped tabsets by framework for better readability
  • Revamped the whole Basic Operations section: better structure, reworked examples, …
  • Revamped the whole Joins section: better structure, new examples (e.g. join_by), better explanations, …
  • Updated code to reflect recent updates of the Tidyverse:
    • dplyr (1.1.0): .by, reframe, join_by, consecutive_id, …
    • purrr (1.0.0): list_rbind, list_cbind, …
    • tidyr (1.3.0): updated the separate/separate_rows section to the newer separate_wider/longer_*
  • Updated code to reflect recent updates of data.table (1.14.9): let, DT(), …

Setup


library(here)        # Working directory management
library(pipebind)    # Piping goodies

library(data.table)  # Fast data manipulation (in-RAM)

library(tibble)      # Extending data.frames             (Tidyverse)
library(dplyr)       # Manipulating data.frames - core   (Tidyverse)
library(tidyr)       # Manipulating data.frames - extras (Tidyverse)
library(stringr)     # Manipulating strings              (Tidyverse)
library(purrr)       # Manipulating lists                (Tidyverse)
library(lubridate)   # Manipulating date/time            (Tidyverse)

library(broom)       # Tidying up models output          (Tidymodels)

data.table::setDTthreads(parallel::detectCores(logical = FALSE))
Applying a custom theme to all gt tables
#-----------------------#
####🔺gt knit_prints ####
#-----------------------#

library(knitr)
library(gt)

knit_print.grouped_df <- function(x, options, ...) {
  if ("grouped_df" %in% class(x)) x <- ungroup(x)
  
  cl <- intersect(class(x), c("data.table", "data.frame"))[1]
  nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
  is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
  
  cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
  cat("<summary>\n")
  cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
  cat("</summary>\n<br>\n")
  print(gt::as_raw_html(style_table(x, nrows)))
  cat("</details>\n\n")
}

registerS3method("knit_print", "grouped_df", knit_print.grouped_df)

knit_print.data.frame <- function(x, options, ...) {
  cl <- intersect(class(x), c("data.table", "data.frame"))[1]
  nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
  is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
  
  cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
  cat("<summary>\n")
  cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
  cat("</summary>\n<br>\n")
  print(gt::as_raw_html(style_table(x, nrows)))
  cat("</details>\n\n")
}

registerS3method("knit_print", "data.frame", knit_print.data.frame)

1 Basic Operations


data.table general syntax:

DT[row selector (filter/sort), col selector (select/mutate/summarize/reframe/rename), modifiers (group/join by)]

Data

MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]

1.1 Arrange / Order

1.1.1 Basic ordering

mtcars |> arrange(desc(cyl))
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21 6 160 110 3.9 2.62 16.46 0 1 4 4
[ omitted 17 entries ]
mtcars |> arrange(desc(cyl), gear)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
[ omitted 17 entries ]
MT[order(-cyl)]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21 6 160 110 3.9 2.62 16.46 0 1 4 4
[ omitted 17 entries ]
MT[order(-cyl, gear)]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
[ omitted 17 entries ]
Alternatives
MT[fsort(cyl, decreasing = TRUE)]

setorder(MT, -cyl, gear)[]

setorderv(MT, c("cyl", "gear"), c(-1 ,1))[]

Ordering on a character column

IRIS[chorder(Species)]
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]

1.1.2 Ordering with keys

  • Keys physically reorders the dataset within the RAM (by reference)
    • No memory is used for sorting (other than marking which columns is the key)
  • The dataset is marked with an attribute “sorted”
  • The dataset is always sorted in ascending order, with NA first
  • Using keyby instead of by when grouping will set the grouping factors as keys
Tip

See this SO post for more information on keys.

setkey(MT, cyl, gear)

setkeyv(MT, c("cyl", "gear"))

MT
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
[ omitted 17 entries ]

To see over which keys (if any) the dataset is currently ordered:

haskey(MT)

[1] TRUE

key(MT)

[1] “cyl” “gear”

Warning

Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.

1.1.3 Ordering with (secondary) indices

  • setindex creates an index for the provided columns, but doesn’t physically reorder the dataset in RAM.
  • It computes the ordering vector of the dataset’s rows according to the provided columns in an additional attribute called index
setindex(MT, cyl, gear)

setindexv(MT, c("cyl", "gear"))

MT
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

We can see the additional index attribute added to the data.table:

[1] "names"             "row.names"         "class"            
[4] ".internal.selfref" "index"            

We can get the currently used indices with:

[1] “cyl__gear”

Adding a new index doesn’t remove a previously existing one:

setindex(MT, hp)

indices(MT)

[1] “cyl__gear” “hp”

We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will be using to group or subset by frequently !

1.2 Subset / Filter

1.2.1 Basic filtering

mtcars |> filter(cyl >= 6 & disp < 180)
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
iris |> filter(Species %in% c("setosa"))
data.frame [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]
MT[cyl >= 6 & disp < 180]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
IRIS[Species %chin% c("setosa")]
data.table [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]

For non-regex character filtering, use %chin% (which is a character-optimized version of %in%)

1.2.2 Filter based on a range

mtcars |> filter(between(disp, 200, 300))
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
MT[disp %between% c(200, 300)]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3