Data wrangling with data.table and the Tidyverse

Common data wrangling operations with both data.table and the Tidyverse.

Categories: Data Manipulation, Tidyverse, data.table, R

First published: May 19, 2022

Summary
This post showcases various ways to accomplish most data wrangling operations, from basic filtering/mutating to pivots and non-equi joins, with both data.table and the Tidyverse (dplyr, tidyr, purrr, stringr).

v1: 2022-05-19

v2: 2022-05-26

  • Improved the section on keys (for ordering & filtering)
  • Added a section for translations of Tidyr (and other similar packages)
  • Capped tables to display 15 rows max when unfolded
  • Improved table display (stripping, hiding the contents of nested columns, …)

v3: 2022-07-20

  • Updated data.table’s examples of dynamic programming using env
  • Added new entries in processing examples
  • Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
  • Added pivot_wider examples to match the dcast ones in the Pivots section
  • Added some new examples here and there across the Basic Operations section
  • Added an entry for operating inside nested data.frames/data.tables
  • Added a processing example for run-length encoding (i.e. successive event tagging)

v4: 2022-08-05

  • Improved pivot section: example of one-hot encoding (and reverse operation) + better examples of partial pivots with .value
  • Added tidyr::uncount() (row duplication) example.
  • Improved both light & dark themes (code highlight, tables, …)

v5: 2023-03-12

  • Revamped the whole document with grouped tabsets by framework for better readability
  • Revamped the whole Basic Operations section: better structure, reworked examples, …
  • Revamped the whole Joins section: better structure, new examples (e.g. join_by), better explanations, …
  • Updated code to reflect recent updates of the Tidyverse:
    • dplyr (1.1.0): .by, reframe, join_by, consecutive_id, …
    • purrr (1.0.0): list_rbind, list_cbind, …
    • tidyr (1.3.0): updated the separate/separate_rows section to the newer separate_wider/longer_*
  • Updated code to reflect recent updates of data.table (1.14.9): let, DT(), …

Setup


library(here)        # Working directory management
library(pipebind)    # Piping goodies

library(data.table)  # Fast data manipulation (in-RAM)

library(tibble)      # Extending data.frames             (Tidyverse)
library(dplyr)       # Manipulating data.frames - core   (Tidyverse)
library(tidyr)       # Manipulating data.frames - extras (Tidyverse)
library(stringr)     # Manipulating strings              (Tidyverse)
library(purrr)       # Manipulating lists                (Tidyverse)
library(lubridate)   # Manipulating date/time            (Tidyverse)

library(broom)       # Tidying up models output          (Tidymodels)

data.table::setDTthreads(parallel::detectCores(logical = FALSE))

Applying a custom theme to all gt tables:
#-----------------------#
####🔺gt knit_prints ####
#-----------------------#

library(knitr)
library(gt)

knit_print.grouped_df <- function(x, options, ...) {
  if ("grouped_df" %in% class(x)) x <- ungroup(x)
  
  cl <- intersect(class(x), c("data.table", "data.frame"))[1]
  nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
  is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
  
  cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
  cat("<summary>\n")
  cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
  cat("</summary>\n<br>\n")
  print(gt::as_raw_html(style_table(x, nrows)))
  cat("</details>\n\n")
}

registerS3method("knit_print", "grouped_df", knit_print.grouped_df)

knit_print.data.frame <- function(x, options, ...) {
  cl <- intersect(class(x), c("data.table", "data.frame"))[1]
  nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
  is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
  
  cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
  cat("<summary>\n")
  cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
  cat("</summary>\n<br>\n")
  print(gt::as_raw_html(style_table(x, nrows)))
  cat("</details>\n\n")
}

registerS3method("knit_print", "data.frame", knit_print.data.frame)

1 Basic Operations


data.table general syntax:

DT[row selector (filter/sort), col selector (select/mutate/summarize/reframe/rename), modifiers (group/join by)]
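
As a minimal sketch of that general form (the object name DT and the chosen columns are only illustrative, not part of the original examples), the three slots can be combined in a single call:

```r
library(data.table)

DT <- as.data.table(mtcars)  # illustrative copy of the built-in dataset

# i  : keep the cars with at least 6 cylinders
# j  : compute summaries (mean mpg and number of rows)
# by : return one row per value of gear
res <- DT[cyl >= 6, .(mean_mpg = mean(mpg), n = .N), by = gear]
res
```

Reading the call left to right gives "take these rows, compute this, grouped by that", which is the order used throughout this section.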

Data

MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]

1.1 Arrange / Order

1.1.1 Basic ordering

mtcars |> arrange(desc(cyl))
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21 6 160 110 3.9 2.62 16.46 0 1 4 4
[ omitted 17 entries ]
mtcars |> arrange(desc(cyl), gear)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
[ omitted 17 entries ]
MT[order(-cyl)]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21 6 160 110 3.9 2.62 16.46 0 1 4 4
[ omitted 17 entries ]
MT[order(-cyl, gear)]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
[ omitted 17 entries ]
Alternatives
MT[order(cyl, decreasing = TRUE)] # fsort() returns sorted values, not row positions, so it can't be used in i

setorder(MT, -cyl, gear)[]

setorderv(MT, c("cyl", "gear"), c(-1, 1))[]
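
To make the difference between order() in i and setorder() concrete, here is a small sketch (the names DT, sorted_copy and in_place are illustrative). It assumes nothing beyond data.table itself: order() inside [ ] returns a new sorted table, while setorder() reorders its argument by reference.

```r
library(data.table)

DT <- as.data.table(mtcars)

sorted_copy <- DT[order(-cyl, gear)]  # returns a new, sorted data.table
in_place <- copy(DT)                  # copy() so that DT itself stays untouched
setorder(in_place, -cyl, gear)        # reorders in_place by reference, no assignment needed

# Both approaches end up with the same rows in the same order
same_rows <- all.equal(sorted_copy, in_place, check.attributes = FALSE)
same_rows
```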

Ordering on a character column

IRIS[chorder(Species)]
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]

1.1.2 Ordering with keys

  • Keys physically reorder the dataset in RAM (by reference)
    • No copy of the dataset is made for sorting (the only addition is the attribute marking which columns form the key)
  • The dataset is marked with an attribute “sorted”
  • The dataset is always sorted in ascending order, with NA first
  • Using keyby instead of by when grouping will set the grouping columns as the key of the result
Tip

See this SO post for more information on keys.

setkey(MT, cyl, gear)

setkeyv(MT, c("cyl", "gear"))

MT
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
[ omitted 17 entries ]

To see which keys (if any) the dataset is currently ordered by:

haskey(MT)

[1] TRUE

key(MT)

[1] "cyl"  "gear"
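
The point about keyby can be checked with a short sketch (DT, by_res and keyby_res are illustrative names). It compares the same aggregation done with by and with keyby:

```r
library(data.table)

DT <- as.data.table(mtcars)

by_res    <- DT[, .(mean_mpg = mean(mpg)), by = cyl]     # groups in order of first appearance
keyby_res <- DT[, .(mean_mpg = mean(mpg)), keyby = cyl]  # groups sorted, result keyed on cyl

key(by_res)     # no key on the result
key(keyby_res)  # the grouping column is the key of the result
```

Note that the key is set on the returned table, not on DT itself.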

Warning

Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.

1.1.3 Ordering with (secondary) indices

  • setindex creates an index for the provided columns, but doesn’t physically reorder the dataset in RAM.
  • It stores the ordering vector of the dataset’s rows, computed from the provided columns, in an additional attribute called index.
setindex(MT, cyl, gear)

setindexv(MT, c("cyl", "gear"))

MT
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

We can see the additional index attribute added to the data.table:

names(attributes(MT))

[1] "names"             "row.names"         "class"            
[4] ".internal.selfref" "index"            

We can get the currently used indices with:

indices(MT)

[1] "cyl__gear"

Adding a new index doesn’t remove a previously existing one:

setindex(MT, hp)

indices(MT)

[1] "cyl__gear" "hp"

We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will frequently group or subset by!
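
As a small, hedged sketch of the behaviour described in this subsection (DT is an illustrative name), we can confirm that an index is recorded without touching the row order, and that it is distinct from a key:

```r
library(data.table)

DT <- as.data.table(mtcars)
setindex(DT, cyl, gear)

idx <- indices(DT)                          # name of the stored index
has_key <- haskey(DT)                       # an index is not a key
unchanged <- identical(DT$cyl, mtcars$cyl)  # rows keep their original order

# A subset on the indexed columns can reuse the stored ordering
sub <- DT[.(6, 4), on = c("cyl", "gear")]
sub
```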

1.2 Subset / Filter

1.2.1 Basic filtering

mtcars |> filter(cyl >= 6 & disp < 180)
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
iris |> filter(Species %in% c("setosa"))
data.frame [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]
MT[cyl >= 6 & disp < 180]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
IRIS[Species %chin% c("setosa")]
data.table [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]

For exact (non-regex) matching on character columns, use %chin%, a version of %in% optimized for character vectors.
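
A quick sketch of that equivalence (the vectors species and wanted are illustrative): on character input, %chin% returns the same logical vector as %in%.

```r
library(data.table)

species <- as.character(iris$Species)  # %chin% expects character vectors, not factors
wanted <- c("setosa", "virginica")

same <- identical(species %chin% wanted, species %in% wanted)
same
sum(species %chin% wanted)  # number of matching rows
```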

1.2.2 Filter based on a range

mtcars |> filter(between(disp, 200, 300))
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
MT[disp %between% c(200, 300)]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
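
Both dplyr::between() and data.table's %between% include the bounds by default. The sketch below (with an illustrative vector x) shows this for data.table, along with the incbounds argument that excludes them. The function is called with its namespace prefix because dplyr and data.table both export a between().

```r
library(data.table)

x <- c(199, 200, 250, 300, 301)

incl <- data.table::between(x, 200, 300)                     # bounds included (default)
excl <- data.table::between(x, 200, 300, incbounds = FALSE)  # bounds excluded
op   <- x %between% c(200, 300)                              # operator form, bounds included

rbind(incl, excl, op)
```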

1.2.3 Filter with a pattern

mtcars |> filter(str_detect(disp, "^\\d{3}\\."))
data.frame [9 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
MT[disp %like% "^\\d{3}\\."]
data.table [9 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
Variants
IRIS[Species %flike% "set"] # Fixed (not regex)

IRIS[Species %ilike% "Set"] # Ignore case

IRIS[Species %plike% "(?=set)"] # Perl-like regex
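
To see how the variants differ, here is a minimal sketch on an illustrative character vector x (not part of the original data). The second pair of calls shows why the fixed variant matters when the pattern contains regex metacharacters.

```r
library(data.table)

x <- c("setosa", "Setosa", "a.b", "axb")

like_cs  <- x %like% "^set"   # regex, case-sensitive
like_ci  <- x %ilike% "^set"  # regex, case-insensitive
like_rx  <- x %like% "a.b"    # "." is a regex wildcard, so "axb" matches too
like_fix <- x %flike% "a.b"   # fixed string: "." is taken literally

rbind(like_cs, like_ci, like_rx, like_fix)
```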

1.2.4 Filter on row number (slicing)

mtcars |> slice(1) # slice_head(n = 1)
data.frame [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
mtcars |> slice(n()) # slice_tail(n = 1)
data.frame [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

Slice a random sample of rows:

mtcars |> slice_sample(n = 5)
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15 8 301 335 3.54 3.57 14.6 0 1 5 8
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
MT[1]
data.table [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
MT[.N]
data.table [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

Slice a random sample of rows:

MT[sample(.N, 5)]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4

1.2.5 Filter distinct/unique rows

mtcars |> distinct(mpg, hp, .keep_all = TRUE)
data.frame [31 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
[ omitted 16 entries ]

Number of unique rows/values

n_distinct(mtcars$gear)

[1] 3

unique(MT, by = c("mpg", "hp")) # cols = other_cols_to_keep
data.table [31 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
[ omitted 16 entries ]

Number of unique rows/values

uniqueN(MT, by = "gear")

[1] 3
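
The two functions can be cross-checked against each other. In this sketch (DT and dedup are illustrative names), uniqueN() with a by argument returns the number of rows that unique() would keep for the same columns:

```r
library(data.table)

DT <- as.data.table(mtcars)

dedup <- unique(DT, by = c("mpg", "hp"))           # first row of each (mpg, hp) pair, all columns kept
n_rows <- nrow(dedup)
n_uniq <- uniqueN(DT, by = c("mpg", "hp"))         # same count, without building the table
n_dup  <- anyDuplicated(dedup, by = c("mpg", "hp")) # 0 when no duplicate is left

c(n_rows = n_rows, n_uniq = n_uniq, n_dup = n_dup)
```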

1.2.6 Filter by keys

When keys or indices are defined, we can filter based on them, which is often a lot faster.

Tip

We do not even need to specify the column name we are filtering on: the values are matched to the key columns in order.

setkey(MT, cyl)

MT[.(6)] # Equivalent to MT[cyl == 6]
data.table [7 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
setkey(MT, cyl, gear)

MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]
data.table [4 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
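
The equivalences stated in the comments above can be verified with a short sketch (DT, by_key and by_expr are illustrative names):

```r
library(data.table)

DT <- as.data.table(mtcars)
setkey(DT, cyl, gear)

by_key  <- DT[.(6, 4)]               # values matched to the key columns, in order
by_expr <- DT[cyl == 6 & gear == 4]  # explicit logical filter

same <- all.equal(by_key, by_expr, check.attributes = FALSE)
same

# With a single value, only the first key column (cyl) is matched
n_first_key <- nrow(DT[.(6)])
n_first_key
```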

1.2.7 Filter by indices

To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn’t already exist).

IRIS["setosa", on = "Species"]
data.table [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]

Since the time to compute the secondary indices is quite small, we don’t have to use setindex, unless the task involves repeated subsetting on the same columns.

Tip

When passing several values to match with on, the nomatch = NULL argument drops the combinations that do not exist in the original data (here cyl == 5) instead of returning them as rows filled with NA.

MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]
data.table [12 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
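
To see what nomatch = NULL changes, this sketch runs the same subset with and without it (DT, with_na and without_na are illustrative names). With the default, the unmatched combination comes back as a row whose non-join columns are NA.

```r
library(data.table)

DT <- as.data.table(mtcars)

with_na    <- DT[.(4:6, 4), on = c("cyl", "gear")]                  # default: unmatched rows kept
without_na <- DT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]  # unmatched rows dropped

c(with_na = nrow(with_na), without_na = nrow(without_na))

# The extra row is the (cyl = 5, gear = 4) pair, which has no match in the data
unmatched <- with_na[is.na(mpg)]
unmatched
```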

1.2.8 Filtering on multiple columns

Filtering with one function taking multiple columns:

f_dat <- \(d) with(d, gear > cyl) # Function taking the data and comparing fixed columns

f_dyn <- \(x, y) x > y # Function taking dynamic columns and comparing them
cols <- c("gear", "cyl")

Manually:

mtcars |> filter(f_dyn(gear, cyl))
data.frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Dynamically:

Taking column names:

mtcars |> filter(f_dyn(!!!syms(cols)))
data.frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Taking the data:

mtcars |> filter(f_dat(pick(everything()))) # pick() replaces cur_data(), deprecated in dplyr 1.1.0
data.frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Manually:

MT[f_dyn(gear, cyl),]
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Dynamically:

Taking column names:

MT[do.call(f_dyn, args), env = list(args = as.list(cols))] # exec(f_dyn, !!!args)
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Taking the data:

MT[f_dat(MT),] # Can't use .SD in i
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

In two steps:

We can’t use .SD in the i clause of a data.table, but we can bypass that constraint by doing the operation in two steps:
- Obtaining a logical vector stating whether each row of the table matches the conditions
- Filtering the original table based on that vector

MT[MT[, f_dat(.SD)]]
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Combining multiple filtering functions:

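This function flags values whose decimal part has 2 or more digits, and we’re going to call it on multiple columns. Note that whole numbers with 2 or more digits (e.g. qsec = 18) also pass the test, since they have no decimal point to strip: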

decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 2
cols <- c("drat", "wt", "qsec")
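
Before applying decp to whole columns, it can help to run it on a few hand-picked values. This sketch repeats the definition so that it is self-contained; the test vector x is illustrative.

```r
library(stringr)

# Same definition as above: keep what follows the decimal point and count its characters
decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 2

x <- c(3.9, 3.85, 2.875, 5, 18)
res <- decp(x)
setNames(res, x)
```

Here 3.85 and 2.875 are flagged because of their decimals, 3.9 and 5 are not, and 18 is flagged only because the string "18" has no decimal point to strip and is itself two characters long.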

Manually:

mtcars |> filter(decp(drat) & decp(wt) & decp(qsec))
data.frame [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

Dynamically:

mtcars |> filter(if_all(all_of(cols), decp))
data.frame [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

Manually:

MT[decp(drat) & decp(wt) & decp(qsec), ]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

Dynamically:

MT[Reduce(`&`, lapply(mget(cols), decp)), ]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
Alternatives
MT[Reduce(`&`, lapply(MT[, ..cols], decp)), ]

MT[Reduce(`&`, lapply(v1, decp)), env = list(v1 = as.list(cols))]

In two steps:

MT[MT[, Reduce(`&`, lapply(.SD, decp)), .SDcols = cols]]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

1.3 Rename

Note

setnames changes column names in place (by reference), which is why the data.table examples here work on a copy(MT).

Manually:

mtcars |> rename(CYL = cyl, MPG = mpg)
data.frame [32 x 11]
MPG CYL disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Dynamically:

mtcars |> rename_with(\(c) toupper(c), .cols = matches("^d"))
data.frame [32 x 11]
mpg cyl DISP hp DRAT wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Manually:

setnames(copy(MT), c("cyl", "mpg"), c("CYL", "MPG"))[]
data.table [32 x 11]
MPG CYL disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Dynamically:

setnames(copy(MT), grep("^d", colnames(MT)), toupper)[]
data.table [32 x 11]
mpg cyl DISP hp DRAT wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
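
The note about setnames working in place can be illustrated with a short sketch (DT and renamed are illustrative names). It shows that wrapping the table in copy() protects the original, while calling setnames directly modifies it without any assignment:

```r
library(data.table)

DT <- as.data.table(mtcars)

renamed <- setnames(copy(DT), "cyl", "CYL")  # works on a copy: DT keeps its names
before <- "cyl" %in% names(DT)

setnames(DT, "cyl", "CYL")                   # no assignment: DT itself is modified
after <- "CYL" %in% names(DT)

c(before = before, after = after)
```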

1.4 Select

1.4.1 Basic selection

MT |> select(matches("cyl|disp"))
data.table [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]



Remove a column:

mtcars |> select(!cyl) # select(-cyl)
data.frame [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
MT[, .(mpg, disp)]
data.table [32 x 2]
mpg disp
21 160
21 160
22.8 108
21.4 258
18.7 360
18.1 225
14.3 360
24.4 146.7
22.8 140.8
19.2 167.6
17.8 167.6
16.4 275.8
17.3 275.8
15.2 275.8
10.4 472
[ omitted 17 entries ]
Alternatives
MT[, .SD, .SDcols = c("mpg", "disp")]

MT[, .SD, .SDcols = patterns("mpg|disp")]

Remove a column:

MT[, !"cyl"] # MT[, -"cyl"]
data.table [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

In-place:

copy(MT)[, cyl := NULL][]
data.table [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Select & Extract:

mtcars |> pull(disp)
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

Select & Rename:

mtcars |> select(dispp = disp)
data.frame [32 x 1]
dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]

Select & Extract:

MT[, disp]
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

Select & Rename:

MT[, .(dispp = disp)]
data.table [32 x 1]
dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]

1.4.2 Dynamic selection

1.4.2.1 By name

cols <- c("cyl", "disp")
mtcars |> select(all_of(cols)) # select(!!cols)
data.frame [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]



Removing a column:

mtcars |> select(!all_of(cols)) # select(-all_of(cols))
data.frame [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
MT[, ..cols]
data.table [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]
Alternatives
MT[, mget(cols)] # Retired

MT[, cols, with = FALSE] # Retired

MT[, .SD, .SDcols = cols]

MT[, j, env = list(j = as.list(cols))]

Removing a column:

MT[, !..cols]
data.table [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
Alternatives
MT[, .SD, .SDcols = !cols]

MT[, -j, env = list(j = I(cols))]

In-place:

copy(MT)[, (cols) := NULL][]
data.table [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

1.4.2.2 By pattern

mtcars |> select(-matches("^d"))
data.frame [32 x 9]
mpg cyl hp wt qsec vs am gear carb
21 6 110 2.62 16.46 0 1 4 4
21 6 110 2.875 17.02 0 1 4 4
22.8 4 93 2.32 18.61 1 1 4 1
21.4 6 110 3.215 19.44 1 0 3 1
18.7 8 175 3.44 17.02 0 0 3 2
18.1 6 105 3.46 20.22 1 0 3 1
14.3 8 245 3.57 15.84 0 0 3 4
24.4 4 62 3.19 20 1 0 4 2
22.8 4 95 3.15 22.9 1 0 4 2
19.2 6 123 3.44 18.3 1 0 4 4
17.8 6 123 3.44 18.9 1 0 4 4
16.4 8 180 4.07 17.4 0 0 3 3
17.3 8 180 3.73 17.6 0 0 3 3
15.2 8 180 3.78 18 0 0 3 3
10.4 8 205 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
mtcars |> select(where(\(x) all(x != 0))) # Only keep columns where no value == 0
data.frame [32 x 9]
mpg cyl disp hp drat wt qsec gear carb
21 6 160 110 3.9 2.62 16.46 4 4
21 6 160 110 3.9 2.875 17.02 4 4
22.8 4 108 93 3.85 2.32 18.61 4 1
21.4 6 258 110 3.08 3.215 19.44 3 1
18.7 8 360 175 3.15 3.44 17.02 3 2
18.1 6 225 105 2.76 3.46 20.22 3 1
14.3 8 360 245 3.21 3.57 15.84 3 4
24.4 4 146.7 62 3.69 3.19 20 4 2
22.8 4 140.8 95 3.92 3.15 22.9 4 2
19.2 6 167.6 123 3.92 3.44 18.3 4 4
17.8 6 167.6 123 3.92 3.44 18.9 4 4
16.4 8 275.8 180 3.07 4.07 17.4 3 3
17.3 8 275.8 180 3.07 3.73 17.6 3 3
15.2 8 275.8 180 3.07 3.78 18 3 3
10.4 8 472 205 2.93 5.25 17.98 3 4
[ omitted 17 entries ]
MT[, .SD, .SDcols = !patterns("^d")]
data.table [32 x 9]
mpg cyl hp wt qsec vs am gear carb
21 6 110 2.62 16.46 0 1 4 4
21 6 110 2.875 17.02 0 1 4 4
22.8 4 93 2.32 18.61 1 1 4 1
21.4 6 110 3.215 19.44 1 0 3 1
18.7 8 175 3.44 17.02 0 0 3 2
18.1 6 105 3.46 20.22 1 0 3 1
14.3 8 245 3.57 15.84 0 0 3 4
24.4 4 62 3.19 20 1 0 4 2
22.8 4 95 3.15 22.9 1 0 4 2
19.2 6 123 3.44 18.3 1 0 4 4
17.8 6 123 3.44 18.9 1 0 4 4
16.4 8 180 4.07 17.4 0 0 3 3
17.3 8 180 3.73 17.6 0 0 3 3
15.2 8 180 3.78 18 0 0 3 3
10.4 8 205 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
MT[, .SD, .SDcols = \(x) all(x != 0)] # Only keep columns where no value == 0
data.table [32 x 9]
mpg cyl disp hp drat wt qsec gear carb
21 6 160 110 3.9 2.62 16.46 4 4
21 6 160 110 3.9 2.875 17.02 4 4
22.8 4 108 93 3.85 2.32 18.61 4 1
21.4 6 258 110 3.08 3.215 19.44 3 1
18.7 8 360 175 3.15 3.44 17.02 3 2
18.1 6 225 105 2.76 3.46 20.22 3 1
14.3 8 360 245 3.21 3.57 15.84 3 4
24.4 4 146.7 62 3.69 3.19 20 4 2
22.8 4 140.8 95 3.92 3.15 22.9 4 2
19.2 6 167.6 123 3.92 3.44 18.3 4 4
17.8 6 167.6 123 3.92 3.44 18.9 4 4
16.4 8 275.8 180 3.07 4.07 17.4 3 3
17.3 8 275.8 180 3.07 3.73 17.6 3 3
15.2 8 275.8 180 3.07 3.78 18 3 3
10.4 8 472 205 2.93 5.25 17.98 3 4
[ omitted 17 entries ]
Alternatives
copy(MT)[, grep("^d", colnames(MT)) := NULL][] # In place (column deletion)

MT[, MT[, sapply(.SD, \(x) all(x != 0))], with = FALSE]

1.4.2.3 By column type

iris |> select(where(\(x) !is.numeric(x)))
data.frame [150 x 1]
Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]
IRIS[, .SD, .SDcols = !is.numeric]
data.table [150 x 1]
Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]

1.5 Mutate / Transmute

data.table can mutate in 2 ways:
- Using = (inside .() or list()) creates a new data.table with the new columns only (like dplyr::transmute)
- Using := (or let) modifies the current data.table in place (like dplyr::mutate)

The result of the expression modifying a column should have the same length as the original column (or group).
If a single value is provided with :=, it will be recycled to the whole column/group.

If the number of values provided is smaller than the length of the original column/group:
- With := or let, an error will be raised, asking to specify manually how to recycle the values.
- With =, it will behave like dplyr::summarize (if a grouping has been specified).
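
The recycling rules above can be checked on a small table. This is a minimal sketch, assuming a reasonably recent data.table version; the toy table `DT` and the column names `g`, `x`, `y` and `total` are made up for illustration and are not part of the examples in this post.

```r
library(data.table)

# Hypothetical toy table (not used elsewhere in this post)
DT <- data.table(g = c("a", "a", "b", "b"), x = 1:4)

# A single value supplied to `:=` is recycled to the whole column
DT1 <- copy(DT)[, y := 0L]

# A shorter vector supplied to `:=` raises an error instead of being recycled
err <- tryCatch(
  copy(DT)[, y := 1:2],
  error = function(e) conditionMessage(e)
)

# Recycling has to be spelled out explicitly, e.g. with rep()
DT2 <- copy(DT)[, y := rep(1:2, length.out = .N)]

# With `=` (inside `.()`), returning fewer values per group acts as a summary
S <- DT[, .(total = sum(x)), by = g]
```

Here `err` holds the error message raised by the second call, and `S` has one row per value of `g`.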

1.5.1 Basic transmute

Only keeping the transformed columns.

mtcars |> transmute(cyl = cyl * 2)
data.frame [32 x 1]
cyl
12
12
8
12
16
12
16
8
8
12
12
16
16
16
16
[ omitted 17 entries ]
MT[, .(cyl = cyl * 2)]
data.table [32 x 1]
cyl
12
12
8
12
16
12
16
8
8
12
12
16
16
16
16
[ omitted 17 entries ]

Transmute & Extract:

MT[, (cyl = cyl * 2)]
 [1] 12 12  8 12 16 12 16  8  8 12 12 16 16 16 16 16 16  8  8  8  8 16 16 16 16
[26]  8  8  8 16 12 16  8

1.5.2 Basic mutate

Modifies the transformed column in-place and keeps every other column as-is.

mtcars |> mutate(cyl = 200)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 4 4
21 200 160 110 3.9 2.875 17.02 0 1 4 4
22.8 200 108 93 3.85 2.32 18.61 1 1 4 1
21.4 200 258 110 3.08 3.215 19.44 1 0 3 1
18.7 200 360 175 3.15 3.44 17.02 0 0 3 2
18.1 200 225 105 2.76 3.46 20.22 1 0 3 1
14.3 200 360 245 3.21 3.57 15.84 0 0 3 4
24.4 200 146.7 62 3.69 3.19 20 1 0 4 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 200 275.8 180 3.07 3.78 18 0 0 3 3
10.4 200 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
mtcars |> mutate(cyl = 200, gear = 5)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 5 4
21 200 160 110 3.9 2.875 17.02 0 1 5 4
22.8 200 108 93 3.85 2.32 18.61 1 1 5 1
21.4 200 258 110 3.08 3.215 19.44 1 0 5 1
18.7 200 360 175 3.15 3.44 17.02 0 0 5 2
18.1 200 225 105 2.76 3.46 20.22 1 0 5 1
14.3 200 360 245 3.21 3.57 15.84 0 0 5 4
24.4 200 146.7 62 3.69 3.19 20 1 0 5 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 5 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 5 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 5 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 5 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 5 3
15.2 200 275.8 180 3.07 3.78 18 0 0 5 3
10.4 200 472 205 2.93 5.25 17.98 0 0 5 4
[ omitted 17 entries ]


mtcars |> mutate(mean_cyl = mean(cyl, na.rm = TRUE))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_cyl
21 6 160 110 3.9 2.62 16.46 0 1 4 4 6.188
21 6 160 110 3.9 2.875 17.02 0 1 4 4 6.188
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 6.188
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.188
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 6.188
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6.188
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 6.188
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 6.188
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 6.188
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6.188
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 6.188
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 6.188
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6.188
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 6.188
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 6.188
[ omitted 17 entries ]
mtcars |> mutate(gear_plus = lead(gear))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb gear_plus
21 6 160 110 3.9 2.62 16.46 0 1 4 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 3
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3
[ omitted 17 entries ]
copy(MT)[, cyl := 200][]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 4 4
21 200 160 110 3.9 2.875 17.02 0 1 4 4
22.8 200 108 93 3.85 2.32 18.61 1 1 4 1
21.4 200 258 110 3.08 3.215 19.44 1 0 3 1
18.7 200 360 175 3.15 3.44 17.02 0 0 3 2
18.1 200 225 105 2.76 3.46 20.22 1 0 3 1
14.3 200 360 245 3.21 3.57 15.84 0 0 3 4
24.4 200 146.7 62 3.69 3.19 20 1 0 4 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 200 275.8 180 3.07 3.78 18 0 0 3 3
10.4 200 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, let(cyl = 200, gear = 5)][]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 5 4
21 200 160 110 3.9 2.875 17.02 0 1 5 4
22.8 200 108 93 3.85 2.32 18.61 1 1 5 1
21.4 200 258 110 3.08 3.215 19.44 1 0 5 1
18.7 200 360 175 3.15 3.44 17.02 0 0 5 2
18.1 200 225 105 2.76 3.46 20.22 1 0 5 1
14.3 200 360 245 3.21 3.57 15.84 0 0 5 4
24.4 200 146.7 62 3.69 3.19 20 1 0 5 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 5 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 5 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 5 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 5 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 5 3
15.2 200 275.8 180 3.07 3.78 18 0 0 5 3
10.4 200 472 205 2.93 5.25 17.98 0 0 5 4
[ omitted 17 entries ]
Alternatives
copy(MT)[, `:=`(cyl = 200, gear = 5)][]

copy(MT)[, c("cyl", "gear") := .(200, 5)][]
copy(MT)[, mean_cyl := mean(cyl, na.rm = TRUE)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_cyl
21 6 160 110 3.9 2.62 16.46 0 1 4 4 6.188
21 6 160 110 3.9 2.875 17.02 0 1 4 4 6.188
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 6.188
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.188
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 6.188
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6.188
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 6.188
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 6.188
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 6.188
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6.188
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 6.188
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 6.188
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6.188
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 6.188
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 6.188
[ omitted 17 entries ]
copy(MT)[, gearplus := shift(gear, 1, type = "lead")][] # lead, lag, cyclic
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb gearplus
21 6 160 110 3.9 2.62 16.46 0 1 4 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 3
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3
[ omitted 17 entries ]
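
The data.table examples above wrap `MT` in `copy()` because `:=` modifies its target by reference. The sketch below shows the difference on a throwaway table; `DT`, `alias` and `snapshot` are illustrative names only.

```r
library(data.table)

# Hypothetical toy table
DT <- data.table(a = 1:3)

# Plain assignment does not copy: both names point to the same table
alias <- DT
alias[, b := a * 2L]   # DT gains column `b` as well

# copy() creates an independent table
snapshot <- copy(DT)
snapshot[, z := 0L]    # DT is left untouched

names(DT)        # "a" "b"
names(snapshot)  # "a" "b" "z"
```

Without `copy()`, each `:=` example would permanently alter `MT` and change the results of the examples that follow it.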

1.5.3 Dynamic trans/mutate

LHS <- "mean_mpg"
RHS <- "mpg"
mtcars |> mutate({{LHS}} := mean(mpg))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
mtcars |> mutate("{LHS}" := mean(.data[[RHS]]))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
mtcars |> mutate({{LHS}} := cur_data()[[RHS]] |> mean()) # cur_data() is deprecated since dplyr 1.1.0: prefer pick()
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
mtcars |> mutate({{LHS}} := pick({{ RHS }}) |> unlist() |> mean())
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
copy(MT)[, (LHS) := mean(mpg)][] # (LHS) <=> c(LHS)
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
copy(MT)[, j := mean(mpg), env = list(j = LHS)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
copy(MT)[, c(LHS) := mean(get(RHS))][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
copy(MT)[, x := mean(y), env = list(x = LHS, y = RHS)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]

1.5.4 Conditional trans/mutate

Mutate every row based on one or more conditions:

One condition:

mtcars |> mutate(Size = if_else(cyl >= 6, "BIG", "small", missing = "Unk"))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb Size
21 6 160 110 3.9 2.62 16.46 0 1 4 4 BIG
21 6 160 110 3.9 2.875 17.02 0 1 4 4 BIG
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 small
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 BIG
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 BIG
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 BIG
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 BIG
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 small
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 small
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 BIG
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 BIG
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 BIG
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 BIG
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 BIG
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 BIG
[ omitted 17 entries ]

Multiple conditions:

mtcars |> mutate(Size = case_when(
  cyl %between% c(2,4) ~ "small",
  cyl %between% c(4,8) ~ "BIG",
  .default = "Unk"
))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb Size
21 6 160 110 3.9 2.62 16.46 0 1 4 4 BIG
21 6 160 110 3.9 2.875 17.02 0 1 4 4 BIG
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 small
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 BIG
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 BIG
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 BIG
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 BIG
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 small
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 small
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 BIG
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 BIG
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 BIG
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 BIG
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 BIG
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 BIG
[ omitted 17 entries ]

Mutate only rows meeting conditions:

mtcars |> mutate(BIG = case_when(am == 1 ~ cyl >= 6))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb BIG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 TRUE
21 6 160 110 3.9 2.875 17.02 0 1 4 4 TRUE
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 FALSE
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 NA
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 NA
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 NA
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 NA
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 NA
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 NA
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 NA
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 NA
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 NA
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 NA
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 NA
[ omitted 17 entries ]

Mutate every row based on one or more conditions:

One condition:

copy(MT)[, Size := fifelse(cyl >= 6, "BIG", "small", na = "Unk")][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb Size
21 6 160 110 3.9 2.62 16.46 0 1 4 4 BIG
21 6 160 110 3.9 2.875 17.02 0 1 4 4 BIG
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 small
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 BIG
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 BIG
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 BIG
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 BIG
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 small
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 small
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 BIG
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 BIG
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 BIG
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 BIG
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 BIG
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 BIG
[ omitted 17 entries ]

Multiple conditions:

copy(MT)[, Size := fcase(
  cyl %between% c(2,4), "small", 
  cyl %between% c(4,8), "BIG",
  default = "Unk"
)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb Size
21 6 160 110 3.9 2.62 16.46 0 1 4 4 BIG
21 6 160 110 3.9 2.875 17.02 0 1 4 4 BIG
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 small
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 BIG
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 BIG
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 BIG
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 BIG
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 small
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 small
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 BIG
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 BIG
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 BIG
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 BIG
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 BIG
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 BIG
[ omitted 17 entries ]
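
In the `fcase()` call above, the two ranges overlap at `cyl == 4`, and the printed result tags those rows as "small". The sketch below isolates that behaviour on a plain vector: conditions are evaluated in order and the first one that is TRUE wins. The vector `cyl` is a made-up input that includes a value (10) outside both ranges.

```r
library(data.table)

# Hypothetical input: 4 sits on the boundary of both ranges, 10 matches neither
cyl <- c(4, 6, 8, 10)

size <- fcase(
  cyl %between% c(2, 4), "small",  # checked first; bounds are inclusive
  cyl %between% c(4, 8), "BIG",    # only reached when the first test is not TRUE
  default = "Unk"                  # used when no condition is TRUE
)

size
```

Swapping the first two condition/value pairs would tag `cyl == 4` as "BIG" instead.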

Mutate only rows meeting conditions:

copy(MT)[am == 1, BIG := cyl >= 6][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb BIG
21 6 160 110 3.9 2.62 16.46 0 1 4 4 TRUE
21 6 160 110 3.9 2.875 17.02 0 1 4 4 TRUE
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 FALSE
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 NA
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 NA
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 NA
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 NA
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 NA
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 NA
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 NA
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 NA
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 NA
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 NA
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 NA
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 NA
[ omitted 17 entries ]

1.5.5 Complex trans/mutate

1.5.5.1 Column-wise operations

new <- c("min_mpg", "min_disp")
old <- c("mpg", "disp")

Apply one function to multiple columns:

mtcars |> mutate(across(c("mpg", "disp"), min, .names = "min_{col}"))
data.frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1
[ omitted 17 entries ]

As a transmute:

mtcars |> transmute(across(c("mpg", "disp"), min, .names = "min_{col}"))
data.frame [32 x 2]
min_mpg min_disp
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
10.4 71.1
[ omitted 17 entries ]



Dynamically:

mtcars |> mutate(across(all_of(old), min, .names = "min_{col}"))
data.frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1
[ omitted 17 entries ]
copy(MT)[
    , c("min_mpg", "min_disp") := lapply(.SD, min), .SDcols = c("mpg", "disp")
  ][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1
[ omitted 17 entries ]
copy(MT)[, c("min_mpg", "min_disp") := lapply(.(mpg, disp), min)][]

As a transmute:

A second step is needed to add min_ before the names:

(MT[, lapply(.SD[, .(mpg, disp)], min)] |> bind(d, setnames(d, names(d), \(x) paste0("min_", x))))[]
data.table [1 x 2]
min_mpg min_disp
10.4 71.1
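
The same two-step idea (compute, then rename) can be sketched without a pipe helper. This assumes `MT` is simply `as.data.table(mtcars)` and reuses the `old` vector defined at the start of this section; `res` is an illustrative name.

```r
library(data.table)

# Assumption: MT is the data.table version of mtcars
MT <- as.data.table(mtcars)
old <- c("mpg", "disp")

# Step 1: compute the minimum of each selected column (one-row result)
res <- MT[, lapply(.SD, min), .SDcols = old]

# Step 2: rename by reference, prefixing each name with "min_"
setnames(res, old, paste0("min_", old))

res[]
```

As in the example above, the result has a single row, whereas `dplyr::transmute()` repeats the value on every row.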

Dynamically:

copy(MT)[, c(new) := lapply(mget(old), min)][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1
[ omitted 17 entries ]
copy(MT)[, c(new) := lapply(x, min), env = list(x = as.list(old))][]

Apply multiple functions to one or multiple columns:

col <- "mpg"
cols <- c("mpg", "disp")
mtcars |> mutate(min_mpg = min(mpg), max_mpg = max(mpg))
data.frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9
[ omitted 17 entries ]
mtcars |> mutate(across(mpg, list(min = min, max = max), .names = "{fn}_{col}"))
data.frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9
[ omitted 17 entries ]

Multiple columns:

mtcars |> mutate(across(matches("mpg|disp"), list(min = min, max = max), .names = "{fn}_{col}"))
data.frame [32 x 15]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg min_disp max_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9 71.1 472
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9 71.1 472
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9 71.1 472
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9 71.1 472
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9 71.1 472
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9 71.1 472
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9 71.1 472
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9 71.1 472
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9 71.1 472
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9 71.1 472
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9 71.1 472
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9 71.1 472
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9 71.1 472
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9 71.1 472
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9 71.1 472
[ omitted 17 entries ]
mtcars |> mutate(across(all_of(cols), list(min = \(x) min(x), max = \(x) max(x)), .names = "{fn}_{col}"))
data.frame [32 x 15]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg min_disp max_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9 71.1 472
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9 71.1 472
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9 71.1 472
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9 71.1 472
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9 71.1 472
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9 71.1 472
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9 71.1 472
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9 71.1 472
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9 71.1 472
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9 71.1 472
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9 71.1 472
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9 71.1 472
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9 71.1 472
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9 71.1 472
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9 71.1 472
[ omitted 17 entries ]
copy(MT)[, let(min_mpg = min(mpg), max_mpg = max(mpg))][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9
[ omitted 17 entries ]
copy(MT)[, c("min_mpg", "max_mpg") := .(min(mpg), max(mpg))][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg max_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 33.9
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 33.9
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 33.9
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 33.9
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 33.9
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 33.9
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 33.9
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 33.9
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 33.9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 33.9
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 33.9
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 33.9
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 33.9
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 33.9
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 33.9
[ omitted 17 entries ]
Alternatives
copy(MT)[, c("min_mpg", "max_mpg") := 
           lapply(.(mpg), \(x) list(min(x), max(x))) |> do.call(rbind, args = _)
        ][]

copy(MT)[, c("min_mpg", "max_mpg") := 
           lapply(.(get(col)), \(x) list(min(x), max(x))) |> unlist(recursive = FALSE)
        ][]

Multiple columns:

copy(MT)[, c("min_mpg", "min_disp", "max_mpg", "max_disp") := 
           lapply(.SD, \(x) list(min(x), max(x))) |> do.call(rbind, args = _), 
         .SDcols = cols][]
data.table [32 x 15]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp max_mpg max_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1 33.9 472
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1 33.9 472
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1 33.9 472
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1 33.9 472
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1 33.9 472
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1 33.9 472
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1 33.9 472
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1 33.9 472
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1 33.9 472
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1 33.9 472
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1 33.9 472
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1 33.9 472
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1 33.9 472
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1 33.9 472
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1 33.9 472
[ omitted 17 entries ]
copy(MT)[, outer(c("min", "max"), cols, str_c, sep = "_") |> t() |> as.vector() := 
           lapply(.SD, \(x) list(min(x), max(x))) |> do.call(rbind, args = _), 
         .SDcols = cols][]
data.table [32 x 15]
mpg cyl disp hp drat wt qsec vs am gear carb min_mpg min_disp max_mpg max_disp
21 6 160 110 3.9 2.62 16.46 0 1 4 4 10.4 71.1 33.9 472
21 6 160 110 3.9 2.875 17.02 0 1 4 4 10.4 71.1 33.9 472
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 10.4 71.1 33.9 472
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 10.4 71.1 33.9 472
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 10.4 71.1 33.9 472
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 10.4 71.1 33.9 472
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 10.4 71.1 33.9 472
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 10.4 71.1 33.9 472
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 10.4 71.1 33.9 472
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10.4 71.1 33.9 472
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 10.4 71.1 33.9 472
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 10.4 71.1 33.9 472
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 10.4 71.1 33.9 472
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 10.4 71.1 33.9 472
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 10.4 71.1 33.9 472
[ omitted 17 entries ]
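The vector of new names in the last example has to line up with the order in which do.call(rbind, ...) lays out the results. Here is a minimal base R sketch of that alignment, assuming cols is c("mpg", "disp") (as the outputs above suggest) and using paste in place of str_c to avoid the stringr dependency:

```r
cols <- c("mpg", "disp")  # assumed value of `cols` in this section

# outer() builds a 2 x 2 matrix: one row per function, one column per variable
nm <- outer(c("min", "max"), cols, paste, sep = "_")

# Transposing before flattening gives: min_mpg, min_disp, max_mpg, max_disp
new_names <- nm |> t() |> as.vector()

# rbind() stacks one row per variable, so the list-matrix is read column by
# column: all the minima first, then all the maxima
stats <- lapply(mtcars[cols], \(x) list(min(x), max(x))) |> do.call(rbind, args = _)

setNames(unlist(stats), new_names)
```

Without the t() step the names would come out as min_mpg, max_mpg, min_disp, max_disp and would no longer match the values.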

1.5.5.2 Row-wise operations

Apply one function to multiple columns (row-wise):

mtcars |> rowwise() |> mutate(rsum = sum(c_across(where(is.numeric)))) |> ungroup()
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb rsum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 728.56
[ omitted 17 entries ]
mtcars |> mutate(rsum = pmap_dbl(across(where(is.numeric)), \(...) sum(c(...))))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb rsum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 728.56
[ omitted 17 entries ]

Hybrid base R-Tidyverse:

mtcars |> mutate(rsum = apply(across(where(is.numeric)), 1, sum))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb rsum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 728.56
[ omitted 17 entries ]
mtcars |> mutate(rsum = rowSums(across(where(is.numeric))))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb rsum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 728.56
[ omitted 17 entries ]
copy(MT)[, rsum := rowSums(.SD), .SDcols = is.numeric][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb rsum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 728.56
[ omitted 17 entries ]
copy(MT)[, rsum := apply(.SD, 1, sum), .SDcols = is.numeric][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb rsum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 728.56
[ omitted 17 entries ]
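The two data.table variants return the same values: rowSums() does the summation in compiled code, whereas apply() calls the R function once per row. A quick check, assuming MT is simply as.data.table(mtcars):

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: how MT was created earlier in the post

a <- copy(MT)[, rsum := rowSums(.SD), .SDcols = is.numeric]
b <- copy(MT)[, rsum := apply(.SD, 1, sum), .SDcols = is.numeric]

# Compare the two row-wise sums
all.equal(a$rsum, b$rsum)
```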

Apply multiple functions to multiple columns (row-wise):

mtcars |> 
  mutate(pmap_dfr(across(where(is.numeric)), \(...) list(mean = mean(c(...)), sum = sum(c(...)))))
data.frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb mean sum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 29.907 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 29.981 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 23.598 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 38.74 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 53.665 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 35.049 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 59.72 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 24.635 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 27.234 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 31.86 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 31.787 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 46.431 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 46.5 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 46.35 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 66.233 728.56
[ omitted 17 entries ]
Alternatives
mtcars |> 
  mutate(
    pmap(across(where(is.numeric)), \(...) list(mean = mean(c(...)), sum = sum(c(...)))) |> 
      bind_rows()
  )

Hybrid base R-Tidyverse:

mtcars |> 
  mutate(apply(across(where(is.numeric)), 1, \(x) list(mean = mean(x), sum = sum(x))) |> bind_rows())
data.frame [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb mean sum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 29.907 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 29.981 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 23.598 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 38.74 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 53.665 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 35.049 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 59.72 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 24.635 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 27.234 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 31.86 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 31.787 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 46.431 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 46.5 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 46.35 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 66.233 728.56
[ omitted 17 entries ]
copy(MT)[, c("rmean", "rsum") := 
           apply(.SD, 1, \(x) list(mean(x), sum(x))) |> rbindlist(), 
         .SDcols = is.numeric][]
data.table [32 x 13]
mpg cyl disp hp drat wt qsec vs am gear carb rmean rsum
21 6 160 110 3.9 2.62 16.46 0 1 4 4 29.907 328.98
21 6 160 110 3.9 2.875 17.02 0 1 4 4 29.981 329.795
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 23.598 259.58
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 38.74 426.135
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 53.665 590.31
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 35.049 385.54
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 59.72 656.92
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 24.635 270.98
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 27.234 299.57
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 31.86 350.46
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 31.787 349.66
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 46.431 510.74
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 46.5 511.5
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 46.35 509.85
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 66.233 728.56
[ omitted 17 entries ]

Run an arbitrary block of code, wrapped in braces, inside j:

MT[, {
    print(summary(mpg))
    x <- cyl + gear
    .(RN = 1:.N, CG = x)
  }
]
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.40 15.43 19.20 20.09 22.80 33.90
data.table [32 x 2]
RN CG
1 10
2 10
3 8
4 9
5 11
6 9
7 11
8 8
9 8
10 10
11 10
12 11
13 11
14 11
15 11
[ omitted 17 entries ]
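The same brace pattern also works per group. The sketch below (assuming MT is as.data.table(mtcars), with a hypothetical intermediate variable rng) shows that intermediate objects are only a means to an end: the value of the last expression is what ends up in the result.

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: how MT was created earlier in the post

res <- MT[, {
    rng <- range(mpg)                      # intermediate value, not returned
    .(spread = rng[2] - rng[1], n = .N)    # last expression: becomes the result
  },
  by = cyl
]
res
```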

1.6 Group / Aggregate

Note

The examples in this section only apply a grouping, without any further computation (on the data.table side, .SD simply returns all the columns as is).

cols <- c("cyl", "disp")
cols_missing <- c("cyl", "disp", "missing_col")

1.6.1 Basic grouping

mtcars |> group_by(cyl, gear)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Dynamic grouping:

mtcars |> group_by(across(all_of(cols)))
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Use any_of if you expect some columns to be missing in the data.

mtcars |> group_by(across(any_of(cols_missing)))
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
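The difference between the two selection helpers can be sketched as follows: any_of() ignores names that are absent from the data, while all_of() is strict and raises an error. The variables lenient and strict are illustrative names only.

```r
library(dplyr)

cols_missing <- c("cyl", "disp", "missing_col")

# any_of() silently drops "missing_col": only cyl and disp are used for grouping
lenient <- mtcars |> group_by(across(any_of(cols_missing))) |> group_vars()
lenient

# all_of() errors as soon as one of the names is not found in the data
strict <- tryCatch(
  mtcars |> group_by(across(all_of(cols_missing))),
  error = \(e) "error"
)
strict
```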
MT[, .SD, by = .(cyl, gear)]
data.table [32 x 11]
cyl gear mpg disp hp drat wt qsec vs am carb
6 4 21 160 110 3.9 2.62 16.46 0 1 4
6 4 21 160 110 3.9 2.875 17.02 0 1 4
6 4 19.2 167.6 123 3.92 3.44 18.3 1 0 4
6 4 17.8 167.6 123 3.92 3.44 18.9 1 0 4
4 4 22.8 108 93 3.85 2.32 18.61 1 1 1
4 4 24.4 146.7 62 3.69 3.19 20 1 0 2
4 4 22.8 140.8 95 3.92 3.15 22.9 1 0 2
4 4 32.4 78.7 66 4.08 2.2 19.47 1 1 1
4 4 30.4 75.7 52 4.93 1.615 18.52 1 1 2
4 4 33.9 71.1 65 4.22 1.835 19.9 1 1 1
4 4 27.3 79 66 4.08 1.935 18.9 1 1 1
4 4 21.4 121 109 4.11 2.78 18.6 1 1 2
6 3 21.4 258 110 3.08 3.215 19.44 1 0 1
6 3 18.1 225 105 2.76 3.46 20.22 1 0 1
8 3 18.7 360 175 3.15 3.44 17.02 0 0 2
[ omitted 17 entries ]

Dynamic grouping:

MT[, .SD, by = cols]
data.table [32 x 11]
cyl disp mpg hp drat wt qsec vs am gear carb
6 160 21 110 3.9 2.62 16.46 0 1 4 4
6 160 21 110 3.9 2.875 17.02 0 1 4 4
4 108 22.8 93 3.85 2.32 18.61 1 1 4 1
6 258 21.4 110 3.08 3.215 19.44 1 0 3 1
8 360 18.7 175 3.15 3.44 17.02 0 0 3 2
8 360 14.3 245 3.21 3.57 15.84 0 0 3 4
6 225 18.1 105 2.76 3.46 20.22 1 0 3 1
4 146.7 24.4 62 3.69 3.19 20 1 0 4 2
4 140.8 22.8 95 3.92 3.15 22.9 1 0 4 2
6 167.6 19.2 123 3.92 3.44 18.3 1 0 4 4
6 167.6 17.8 123 3.92 3.44 18.9 1 0 4 4
8 275.8 16.4 180 3.07 4.07 17.4 0 0 3 3
8 275.8 17.3 180 3.07 3.73 17.6 0 0 3 3
8 275.8 15.2 180 3.07 3.78 18 0 0 3 3
8 472 10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

To handle potentially missing columns:

MT[, .SD, by = intersect(cols_missing, colnames(MT))]
data.table [32 x 11]
cyl disp mpg hp drat wt qsec vs am gear carb
6 160 21 110 3.9 2.62 16.46 0 1 4 4
6 160 21 110 3.9 2.875 17.02 0 1 4 4
4 108 22.8 93 3.85 2.32 18.61 1 1 4 1
6 258 21.4 110 3.08 3.215 19.44 1 0 3 1
8 360 18.7 175 3.15 3.44 17.02 0 0 3 2
8 360 14.3 245 3.21 3.57 15.84 0 0 3 4
6 225 18.1 105 2.76 3.46 20.22 1 0 3 1
4 146.7 24.4 62 3.69 3.19 20 1 0 4 2
4 140.8 22.8 95 3.92 3.15 22.9 1 0 4 2
6 167.6 19.2 123 3.92 3.44 18.3 1 0 4 4
6 167.6 17.8 123 3.92 3.44 18.9 1 0 4 4
8 275.8 16.4 180 3.07 4.07 17.4 0 0 3 3
8 275.8 17.3 180 3.07 3.73 17.6 0 0 3 3
8 275.8 15.2 180 3.07 3.78 18 0 0 3 3
8 472 10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

1.6.2 Current group info

mtcars |> 
  group_by(cyl) |> 
  filter(cur_group_id() == 1) |> # To only keep one plot
  group_walk(\(d, g) with(d, plot(hp, mpg, main = paste("Cyl:", g$cyl))))

Use the .BY special symbol (a list holding the values of the current group) to get the name of the current group:

MT[, with(.SD, plot(hp, mpg, main = paste("Cyl:", .BY))), keyby = cyl]
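To see what .BY holds without drawing anything, here is a small sketch (assuming MT is as.data.table(mtcars)) that builds the plot title as a column instead. Since .BY is a list with one element per grouping variable, the value can also be pulled out by name:

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: how MT was created earlier in the post

# One row per group: the label that the plot title would use, and the group size
res <- MT[, .(label = paste("Cyl:", .BY$cyl), n = .N), keyby = cyl]
res
```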

1.7 Row numbers & indices

1.7.1 Adding row or group indices

.I: Row indices
.N: Number of rows

.GRP: Group indices
.NGRP: Number of groups
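A minimal sketch showing the four symbols side by side in a grouped query, assuming MT is as.data.table(mtcars) and a data.table version recent enough to provide .NGRP (1.13.0 or later, as far as I know):

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: how MT was created earlier in the post

res <- MT[, .(
    first_row = .I[1],  # row index (in MT) of the first row of the group
    n         = .N,     # number of rows in the group
    grp       = .GRP,   # index of the group
    n_grp     = .NGRP   # total number of groups
  ),
  by = cyl
]
res
```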

1.7.1.1 Adding row indices:

mtcars |> mutate(I = row_number())
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 2
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 5
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 7
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 8
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 11
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 12
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 13
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 14
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 15
[ omitted 17 entries ]
copy(MT)[ , I := .I][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 2
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 5
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 7
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 8
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 9
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 10
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 11
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 12
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 13
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 14
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 15
[ omitted 17 entries ]

1.7.1.2 Adding group indices:

Adding group indices (one index per group):

mtcars |> summarize(GRP = cur_group_id(), .by = cyl)
data.frame [3 x 2]
cyl GRP
6 1
4 2
8 3

Mutate instead of summarize:

mtcars |> mutate(GRP = cur_group_id(), .by = cyl)
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb GRP
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 3
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3
[ omitted 17 entries ]

Adding row numbers within each group:

mtcars |> mutate(I_GRP = row_number(), .by = gear)
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I_GRP
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 2
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 5
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 7
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 5
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 7
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 8
[ omitted 17 entries ]

Adding group indices (one index per group):

MT[, .GRP, by = cyl]
data.table [3 x 2]
cyl GRP
6 1
4 2
8 3

Mutate instead of summarize:

copy(MT)[, GRP := .GRP, by = cyl][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb GRP
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 3
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 1
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3
[ omitted 17 entries ]

Adding row numbers within each group:

copy(MT)[, I_GRP := 1:.N, by = gear][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I_GRP
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 2
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 5
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 7
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 5
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 7
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 8
[ omitted 17 entries ]
copy(MT)[, I_GRP := rowid(gear)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I_GRP
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
21 6 160 110 3.9 2.875 17.02 0 1 4 4 2
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 5
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 7
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 5
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 7
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 8
[ omitted 17 entries ]
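rowid() numbers the rows within each distinct value of its argument, so it needs no by clause. A quick check that it matches the grouped counter, assuming MT is as.data.table(mtcars) and using seq_len(.N) as a stand-in for 1:.N:

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: how MT was created earlier in the post

a <- copy(MT)[, I_GRP := seq_len(.N), by = gear]  # counter computed per group
b <- copy(MT)[, I_GRP := rowid(gear)]             # same counter, no grouping needed

identical(a$I_GRP, b$I_GRP)
```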

1.7.2 Filtering based on row numbers (slicing)

1.7.2.1 Extracting a specific row

mtcars |> dplyr::first()
data.frame [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
mtcars |> dplyr::last()
data.frame [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
mtcars |> dplyr::nth(5)
data.frame [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
MT[1,] # data.table::first(MT)
data.table [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
MT[.N,] # data.table::last(MT)
data.table [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
MT[5,]
data.table [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2

1.7.2.2 Slicing rows

tail(mtcars, 10)
data.frame [10 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
mtcars |> slice((n()-9):n())
data.frame [10 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
mtcars |> slice_tail(n = 10)
data.frame [10 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
tail(MT, 10)
data.table [10 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
MT[(.N-9):.N]
data.table [10 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
MT[MT[, .I[(.N-9):.N]]] # Gets the last 10 rows' indices and filters based on them
data.table [10 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

1.7.2.3 Slicing groups

Random sample by group:

mtcars |> slice_sample(n = 5, by = cyl)
data.frame [15 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
15 8 301 335 3.54 3.57 14.6 0 1 5 8
10.4 8 460 215 3 5.424 17.82 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4

Filter groups by condition:

mtcars |> filter(n() >= 8, .by = cyl)
data.frame [25 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
[ omitted 10 entries ]
mtcars |> group_by(cyl) |> group_modify(\(d, g) if (nrow(d) >= 8) d else data.frame())
data.frame [25 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
4 26 120.3 91 4.43 2.14 16.7 0 1 5 2
4 30.4 95.1 113 3.77 1.513 16.9 1 1 5 2
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2
8 18.7 360 175 3.15 3.44 17.02 0 0 3 2
8 14.3 360 245 3.21 3.57 15.84 0 0 3 4
8 16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
8 17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
[ omitted 10 entries ]

Random sample by group:

MT[, .SD[sample(.N, 5)], keyby = cyl]
data.table [15 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6
6 21 160 110 3.9 2.875 17.02 0 1 4 4
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1
6 19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
8 15.2 275.8 180 3.07 3.78 18 0 0 3 3
8 10.4 472 205 2.93 5.25 17.98 0 0 3 4
8 18.7 360 175 3.15 3.44 17.02 0 0 3 2
8 17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
8 15.5 318 150 2.76 3.52 16.87 0 0 3 2

Filter groups by condition:

MT[, if(.N >= 8) .SD, by = cyl]
data.table [25 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
4 26 120.3 91 4.43 2.14 16.7 0 1 5 2
4 30.4 95.1 113 3.77 1.513 16.9 1 1 5 2
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2
8 18.7 360 175 3.15 3.44 17.02 0 0 3 2
8 14.3 360 245 3.21 3.57 15.84 0 0 3 4
8 16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
8 17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
[ omitted 10 entries ]
MT[, .SD[.N >= 8], by = cyl]
data.table [25 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
4 26 120.3 91 4.43 2.14 16.7 0 1 5 2
4 30.4 95.1 113 3.77 1.513 16.9 1 1 5 2
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2
8 18.7 360 175 3.15 3.44 17.02 0 0 3 2
8 14.3 360 245 3.21 3.57 15.84 0 0 3 4
8 16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
8 17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
[ omitted 10 entries ]
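
The two data.table forms above build .SD for every group. Another option is to collect the row indices of the qualifying groups with .I and subset once. This is only a sketch, and it assumes MT is simply mtcars converted to a data.table; idx and big_groups are illustrative names.

```r
library(data.table)

# Assumption: MT mirrors the table used in the examples above
MT <- as.data.table(mtcars)

# Row numbers (in MT) of every row belonging to a group of at least 8 rows.
# .N >= 8 is a single TRUE/FALSE per group, so .I is kept whole or dropped whole.
idx <- MT[, .I[.N >= 8], by = cyl]$V1

# Subset once, outside of the grouping
big_groups <- MT[idx]
dim(big_groups)
```

The rows come back grouped by cyl, in the order the groups first appear.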

1.7.3 Extracting row indices

1.7.3.1 Getting the row numbers of specific observations:

Row number of the first and last observation of each group:

mtcars |> reframe(I = cur_group_rows()[c(1, n())], .by = cyl)
data.frame [6 x 2]
cyl I
6 1
6 30
4 3
4 32
8 5
8 31

… while keeping all other columns:

mtcars |> mutate(I = row_number()) |> slice(c(1, n()), .by = cyl)
data.frame [6 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb I
21 6 160 110 3.9 2.62 16.46 0 1 4 4 1
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6 30
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 32
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 5
15 8 301 335 3.54 3.57 14.6 0 1 5 8 31

Row number of the first and last observation of each group:

MT[, .I[c(1, .N)], by = cyl]
data.table [6 x 2]
cyl V1
6 1
6 30
4 3
4 32
8 5
8 31

… while keeping all other columns:

copy(MT)[, I := .I][, .SD[c(1, .N)], by = cyl]
data.table [6 x 12]
cyl mpg disp hp drat wt qsec vs am gear carb I
6 21 160 110 3.9 2.62 16.46 0 1 4 4 1
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 30
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1 3
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2 32
8 18.7 360 175 3.15 3.44 17.02 0 0 3 2 5
8 15 301 335 3.54 3.57 14.6 0 1 5 8 31

1.7.3.2 Extracting row indices after filtering


Extracting row numbers in the original dataset:

mtcars |> mutate(I = row_number()) |> filter(gear == 4) |> pull(I)

[1] 1 2 3 8 9 10 11 18 19 20 26 32

Extracting row numbers in the new dataset (after filtering):

mtcars |> filter(gear == 4) |> mutate(I = row_number()) |> pull(I)

[1] 1 2 3 4 5 6 7 8 9 10 11 12

Warning

When rows are filtered in i, .I (used in j) gives the row numbers of the filtered result, not those of the original table.

Extracting row numbers in the original dataset:

MT[, .I[gear == 4]]

[1] 1 2 3 8 9 10 11 18 19 20 26 32

Extracting row numbers in the new dataset (after filtering):

MT[gear == 4, .I]

[1] 1 2 3 4 5 6 7 8 9 10 11 12
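
To get the original row numbers while still filtering in i, data.table also has the which argument. A minimal sketch, assuming MT is mtcars converted to a data.table (rows_i and rows_j are illustrative names):

```r
library(data.table)

# Assumption: MT mirrors the table used in the examples above
MT <- as.data.table(mtcars)

# which = TRUE returns the row numbers matched by i instead of the rows themselves
rows_i <- MT[gear == 4, which = TRUE]

# Same information as indexing .I inside j
rows_j <- MT[, .I[gear == 4]]

identical(rows_i, rows_j)
```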

1.8 Relocate

1.8.1 Basic reordering

mtcars |> relocate(cyl, .after = last_col())
data.frame [32 x 11]
mpg disp hp drat wt qsec vs am gear carb cyl
21 160 110 3.9 2.62 16.46 0 1 4 4 6
21 160 110 3.9 2.875 17.02 0 1 4 4 6
22.8 108 93 3.85 2.32 18.61 1 1 4 1 4
21.4 258 110 3.08 3.215 19.44 1 0 3 1 6
18.7 360 175 3.15 3.44 17.02 0 0 3 2 8
18.1 225 105 2.76 3.46 20.22 1 0 3 1 6
14.3 360 245 3.21 3.57 15.84 0 0 3 4 8
24.4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2 4
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4 6
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4 6
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3 8
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3 8
15.2 275.8 180 3.07 3.78 18 0 0 3 3 8
10.4 472 205 2.93 5.25 17.98 0 0 3 4 8
[ omitted 17 entries ]


Relocate a new column (mutate + relocate):

mtcars |> mutate(GRP = cur_group_id(), .by = cyl, .before = 1)
data.frame [32 x 12]
GRP mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
1 21 6 160 110 3.9 2.875 17.02 0 1 4 4
2 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
3 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
1 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
3 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
2 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
2 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
1 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
1 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
3 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
3 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
3 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
3 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
setcolorder(copy(MT), "cyl", after = last(colnames(MT)))[]
data.table [32 x 11]
mpg disp hp drat wt qsec vs am gear carb cyl
21 160 110 3.9 2.62 16.46 0 1 4 4 6
21 160 110 3.9 2.875 17.02 0 1 4 4 6
22.8 108 93 3.85 2.32 18.61 1 1 4 1 4
21.4 258 110 3.08 3.215 19.44 1 0 3 1 6
18.7 360 175 3.15 3.44 17.02 0 0 3 2 8
18.1 225 105 2.76 3.46 20.22 1 0 3 1 6
14.3 360 245 3.21 3.57 15.84 0 0 3 4 8
24.4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2 4
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4 6
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4 6
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3 8
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3 8
15.2 275.8 180 3.07 3.78 18 0 0 3 3 8
10.4 472 205 2.93 5.25 17.98 0 0 3 4 8
[ omitted 17 entries ]
setcolorder(copy(MT), c(setdiff(colnames(MT), "cyl"), "cyl"))[]
data.table [32 x 11]
mpg disp hp drat wt qsec vs am gear carb cyl
21 160 110 3.9 2.62 16.46 0 1 4 4 6
21 160 110 3.9 2.875 17.02 0 1 4 4 6
22.8 108 93 3.85 2.32 18.61 1 1 4 1 4
21.4 258 110 3.08 3.215 19.44 1 0 3 1 6
18.7 360 175 3.15 3.44 17.02 0 0 3 2 8
18.1 225 105 2.76 3.46 20.22 1 0 3 1 6
14.3 360 245 3.21 3.57 15.84 0 0 3 4 8
24.4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2 4
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4 6
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4 6
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3 8
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3 8
15.2 275.8 180 3.07 3.78 18 0 0 3 3 8
10.4 472 205 2.93 5.25 17.98 0 0 3 4 8
[ omitted 17 entries ]

Relocate a new column (mutate + relocate):

setcolorder(copy(MT)[ , GRP := .GRP, by = cyl], "GRP")[]
data.table [32 x 12]
GRP mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
1 21 6 160 110 3.9 2.875 17.02 0 1 4 4
2 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
1 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
3 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
1 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
3 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
2 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
2 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
1 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
1 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
3 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
3 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
3 15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
3 10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

1.8.2 Reordering by column names

mtcars |> select(sort(tidyselect::peek_vars()))
data.frame [32 x 11]
am carb cyl disp drat gear hp mpg qsec vs wt
1 4 6 160 3.9 4 110 21 16.46 0 2.62
1 4 6 160 3.9 4 110 21 17.02 0 2.875
1 1 4 108 3.85 4 93 22.8 18.61 1 2.32
0 1 6 258 3.08 3 110 21.4 19.44 1 3.215
0 2 8 360 3.15 3 175 18.7 17.02 0 3.44
0 1 6 225 2.76 3 105 18.1 20.22 1 3.46
0 4 8 360 3.21 3 245 14.3 15.84 0 3.57
0 2 4 146.7 3.69 4 62 24.4 20 1 3.19
0 2 4 140.8 3.92 4 95 22.8 22.9 1 3.15
0 4 6 167.6 3.92 4 123 19.2 18.3 1 3.44
0 4 6 167.6 3.92 4 123 17.8 18.9 1 3.44
0 3 8 275.8 3.07 3 180 16.4 17.4 0 4.07
0 3 8 275.8 3.07 3 180 17.3 17.6 0 3.73
0 3 8 275.8 3.07 3 180 15.2 18 0 3.78
0 4 8 472 2.93 3 205 10.4 17.98 0 5.25
[ omitted 17 entries ]
mtcars |> select(carb, sort(tidyselect::peek_vars()))
data.frame [32 x 11]
carb am cyl disp drat gear hp mpg qsec vs wt
4 1 6 160 3.9 4 110 21 16.46 0 2.62
4 1 6 160 3.9 4 110 21 17.02 0 2.875
1 1 4 108 3.85 4 93 22.8 18.61 1 2.32
1 0 6 258 3.08 3 110 21.4 19.44 1 3.215
2 0 8 360 3.15 3 175 18.7 17.02 0 3.44
1 0 6 225 2.76 3 105 18.1 20.22 1 3.46
4 0 8 360 3.21 3 245 14.3 15.84 0 3.57
2 0 4 146.7 3.69 4 62 24.4 20 1 3.19
2 0 4 140.8 3.92 4 95 22.8 22.9 1 3.15
4 0 6 167.6 3.92 4 123 19.2 18.3 1 3.44
4 0 6 167.6 3.92 4 123 17.8 18.9 1 3.44
3 0 8 275.8 3.07 3 180 16.4 17.4 0 4.07
3 0 8 275.8 3.07 3 180 17.3 17.6 0 3.73
3 0 8 275.8 3.07 3 180 15.2 18 0 3.78
4 0 8 472 2.93 3 205 10.4 17.98 0 5.25
[ omitted 17 entries ]
setcolorder(copy(MT), sort(colnames(MT)))[]
data.table [32 x 11]
am carb cyl disp drat gear hp mpg qsec vs wt
1 4 6 160 3.9 4 110 21 16.46 0 2.62
1 4 6 160 3.9 4 110 21 17.02 0 2.875
1 1 4 108 3.85 4 93 22.8 18.61 1 2.32
0 1 6 258 3.08 3 110 21.4 19.44 1 3.215
0 2 8 360 3.15 3 175 18.7 17.02 0 3.44
0 1 6 225 2.76 3 105 18.1 20.22 1 3.46
0 4 8 360 3.21 3 245 14.3 15.84 0 3.57
0 2 4 146.7 3.69 4 62 24.4 20 1 3.19
0 2 4 140.8 3.92 4 95 22.8 22.9 1 3.15
0 4 6 167.6 3.92 4 123 19.2 18.3 1 3.44
0 4 6 167.6 3.92 4 123 17.8 18.9 1 3.44
0 3 8 275.8 3.07 3 180 16.4 17.4 0 4.07
0 3 8 275.8 3.07 3 180 17.3 17.6 0 3.73
0 3 8 275.8 3.07 3 180 15.2 18 0 3.78
0 4 8 472 2.93 3 205 10.4 17.98 0 5.25
[ omitted 17 entries ]
setcolorder(copy(MT), c("carb", sort(setdiff(colnames(MT), "carb"))))[]
data.table [32 x 11]
carb am cyl disp drat gear hp mpg qsec vs wt
4 1 6 160 3.9 4 110 21 16.46 0 2.62
4 1 6 160 3.9 4 110 21 17.02 0 2.875
1 1 4 108 3.85 4 93 22.8 18.61 1 2.32
1 0 6 258 3.08 3 110 21.4 19.44 1 3.215
2 0 8 360 3.15 3 175 18.7 17.02 0 3.44
1 0 6 225 2.76 3 105 18.1 20.22 1 3.46
4 0 8 360 3.21 3 245 14.3 15.84 0 3.57
2 0 4 146.7 3.69 4 62 24.4 20 1 3.19
2 0 4 140.8 3.92 4 95 22.8 22.9 1 3.15
4 0 6 167.6 3.92 4 123 19.2 18.3 1 3.44
4 0 6 167.6 3.92 4 123 17.8 18.9 1 3.44
3 0 8 275.8 3.07 3 180 16.4 17.4 0 4.07
3 0 8 275.8 3.07 3 180 17.3 17.6 0 3.73
3 0 8 275.8 3.07 3 180 15.2 18 0 3.78
4 0 8 472 2.93 3 205 10.4 17.98 0 5.25
[ omitted 17 entries ]

1.9 Summarize/Reframe

With data.table, summarizing is done with = inside .() (an alias for list()) in j, instead of :=. j takes an expression returning a list of values, each usually shorter than the original column (or group). Only the computed columns are kept, along with any grouping columns (like a transmute).
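
The difference between summarizing and mutating shows up in the shape of the result. The sketch below assumes MT is mtcars converted to a data.table; smry and mutd are illustrative names.

```r
library(data.table)

# Assumption: MT mirrors the table used in the examples above
MT <- as.data.table(mtcars)

# Summarize: = inside .() returns a new table with one row per group
smry <- MT[, .(mean_mpg = mean(mpg)), by = cyl]

# Mutate: := adds the column by reference and keeps every row
# (copy() is used so that MT itself is left untouched)
mutd <- copy(MT)[, mean_mpg := mean(mpg), by = cyl]

dim(smry)
dim(mutd)
```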

1.9.1 Basic summary

mtcars |> summarize(mean_cyl = mean(cyl))
data.frame [1 x 1]
mean_cyl
6.188
MT[, .(mean_cyl = mean(cyl))]
data.table [1 x 1]
mean_cyl
6.188

1.9.2 Grouped summary

With .by, dplyr::summarize keeps the groups in the order they first appear in the data:

mtcars |> summarize(N = n(), .by = cyl)
data.frame [3 x 2]
cyl N
6 7
4 11
8 14

To order by the grouping factor, use group_by() instead of .by:

mtcars |> group_by(cyl) |> summarize(N = n())
data.frame [3 x 2]
cyl N
4 11
6 7
8 14

By default, data.table keeps the order the groups originally appear in:

MT[, .N, by = cyl]
data.table [3 x 2]
cyl N
6 7
4 11
8 14

To order by the grouping factor, use keyby instead of by:

MT[, .N, keyby = cyl]
data.table [3 x 2]
cyl N
4 11
6 7
8 14
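
keyby does more than sort: it also sets a key on the result. The sketch below compares it with ordering after by, assuming MT is mtcars converted to a data.table (by_key and by_order are illustrative names).

```r
library(data.table)

# Assumption: MT mirrors the table used in the examples above
MT <- as.data.table(mtcars)

by_key   <- MT[, .N, keyby = cyl]
by_order <- MT[, .N, by = cyl][order(cyl)]

# Same groups and counts, in the same order...
identical(by_key$cyl, by_order$cyl)
identical(by_key$N, by_order$N)

# ...but only the keyby result carries a key
key(by_key)
key(by_order)
```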

Grouped on a temporary variable:

mtcars |> group_by(cyl > 6) |> summarize(N = n())
data.frame [2 x 2]
cyl > 6 N
FALSE 18
TRUE 14
MT[, .N, by = .(cyl > 6)]
data.table [2 x 2]
cyl N
FALSE 18
TRUE 14
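
Both frameworks let you name the temporary grouping variable, which avoids the auto-generated column names shown above. A small sketch, assuming MT is mtcars converted to a data.table; big_engine is an illustrative name.

```r
library(data.table)
library(dplyr)

# Assumption: MT mirrors the table used in the examples above
MT <- as.data.table(mtcars)

# dplyr: name the expression inside group_by()
res_dplyr <- mtcars |> group_by(big_engine = cyl > 6) |> summarize(N = n())

# data.table: name the expression inside .()
res_dt <- MT[, .N, by = .(big_engine = cyl > 6)]

names(res_dplyr)
names(res_dt)
```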

1.9.3 Column-wise summary

1.9.3.1 Apply one function to multiple columns:

mtcars |> summarize(across(everything(), mean), .by = cyl)
data.frame [3 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
6 19.743 183.314 122.286 3.586 3.117 17.977 0.571 0.429 3.857 3.429
4 26.664 105.136 82.636 4.071 2.286 19.137 0.909 0.727 4.091 1.545
8 15.1 353.1 209.214 3.229 3.999 16.772 0 0.143 3.286 3.5

By column type:

mtcars |> summarize(across(where(is.double), mean), .by = cyl)
data.frame [3 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
6 19.743 183.314 122.286 3.586 3.117 17.977 0.571 0.429 3.857 3.429
4 26.664 105.136 82.636 4.071 2.286 19.137 0.909 0.727 4.091 1.545
8 15.1 353.1 209.214 3.229 3.999 16.772 0 0.143 3.286 3.5

By matching column names:

mtcars |> summarize(across(matches("^d"), mean), .by = cyl)
data.frame [3 x 3]
cyl disp drat
6 183.314 3.586
4 105.136 4.071
8 353.1 3.229
MT[, lapply(.SD, mean), by = cyl]
data.table [3 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
6 19.743 183.314 122.286 3.586 3.117 17.977 0.571 0.429 3.857 3.429
4 26.664 105.136 82.636 4.071 2.286 19.137 0.909 0.727 4.091 1.545
8 15.1 353.1 209.214 3.229 3.999 16.772 0 0.143 3.286 3.5

By column type:

MT[, lapply(.SD[, -"cyl"], mean), by = cyl, .SDcols = is.double]
data.table [3 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
6 19.743 183.314 122.286 3.586 3.117 17.977 0.571 0.429 3.857 3.429
4 26.664 105.136 82.636 4.071 2.286 19.137 0.909 0.727 4.091 1.545
8 15.1 353.1 209.214 3.229 3.999 16.772 0 0.143 3.286 3.5

By matching column names:

MT[, lapply(.SD, mean), by = cyl, .SDcols = patterns("^d")]
data.table [3 x 3]
cyl disp drat
6 183.314 3.586
4 105.136 4.071
8 353.1 3.229

1.9.3.2 Applying multiple functions to one column:

mtcars |> summarize(mean(mpg), sd(mpg), .by = cyl)
data.frame [3 x 3]
cyl mean(mpg) sd(mpg)
6 19.743 1.454
4 26.664 4.51
8 15.1 2.56



With column names:

mtcars |> summarize(mean = mean(mpg), sd = sd(mpg), .by = cyl)
data.frame [3 x 3]
cyl mean sd
6 19.743 1.454
4 26.664 4.51
8 15.1 2.56
mtcars |> summarize(across(mpg, list(mean = mean, sd = sd), .names = "{fn}"), .by = cyl)
data.frame [3 x 3]
cyl mean sd
6 19.743 1.454
4 26.664 4.51
8 15.1 2.56
MT[, .(mean(mpg), sd(mpg)), by = cyl]
data.table [3 x 3]
cyl V1 V2
6 19.743 1.454
4 26.664 4.51
8 15.1 2.56
MT[, lapply(.(mpg), \(x) list(mean(x), sd(x))) |> rbindlist(), by = cyl]
data.table [3 x 3]
cyl V1 V2
6 19.743 1.454
4 26.664 4.51
8 15.1 2.56

With column names:

MT[, .(mean = mean(mpg), sd = sd(mpg)), by = cyl]
data.table [3 x 3]
cyl mean sd
6 19.743 1.454
4 26.664 4.51
8 15.1 2.56
MT[, lapply(.SD, \(x) list(mean = mean(x), sd = sd(x))) |> rbindlist(), by = cyl, .SDcols = "mpg"]
data.table [3 x 3]
cyl mean sd
6 19.743 1.454
4 26.664 4.51
8 15.1 2.56

1.9.3.3 Apply multiple functions to multiple columns:

Note

Depending on the output we want (i.e. having the function’s output as columns or rows), we can either provide a list of functions to apply (list_of_fns), or a function returning a list (fn_returning_list).

cols <- c("mpg", "hp")

list_of_fns <- list(mean = \(x) mean(x), sd = \(x) sd(x))

fn_returning_list <- \(x) list(mean = mean(x), sd = sd(x))

One column per function, one row per variable:

reframe(mtcars, map_dfr(pick(all_of(cols)), fn_returning_list, .id = "Var"), .by = cyl)
data.frame [6 x 4]
cyl Var mean sd
6 mpg 19.743 1.454
6 hp 122.286 24.26
4 mpg 26.664 4.51
4 hp 82.636 20.935
8 mpg 15.1 2.56
8 hp 209.214 50.977
Alternatives
reframe(mtcars, map(pick(all_of(cols)), fn_returning_list) |> bind_rows(.id = "Var"), .by = cyl)

One column per variable, one row per function:

reframe(mtcars, map_dfr(list_of_fns, \(f) map(pick(all_of(cols)), f), .id = "Fn"), .by = cyl)
data.frame [6 x 4]
cyl Fn mpg hp
6 mean 19.743 122.286
6 sd 1.454 24.26
4 mean 26.664 82.636
4 sd 4.51 20.935
8 mean 15.1 209.214
8 sd 2.56 50.977

One column per function/variable combination:

summarize(mtcars, across(all_of(cols), list_of_fns, .names = "{col}.{fn}"), .by = cyl)
data.frame [3 x 5]
cyl mpg.mean mpg.sd hp.mean hp.sd
6 19.743 1.454 122.286 24.26
4 26.664 4.51 82.636 20.935
8 15.1 2.56 209.214 50.977

One column per function, one row per variable:

MT[, lapply(.SD, fn_returning_list) |> rbindlist(idcol = "Var"), by = cyl, .SDcols = cols]
data.table [6 x 4]
cyl Var mean sd
6 mpg 19.743 1.454
6 hp 122.286 24.26
4 mpg 26.664 4.51
4 hp 82.636 20.935
8 mpg 15.1 2.56
8 hp 209.214 50.977



One column per variable, one row per function:

MT[, lapply(list_of_fns, \(f) lapply(.SD, f)) |> rbindlist(idcol = "Fn"), by = cyl, .SDcols = cols]
data.table [6 x 4]
cyl Fn mpg hp
6 mean 19.743 122.286
6 sd 1.454 24.26
4 mean 26.664 82.636
4 sd 4.51 20.935
8 mean 15.1 209.214
8 sd 2.56 50.977

One column per function/variable combination:

MT[, lapply(.SD, fn_returning_list) |> unlist(recursive = FALSE), by = cyl, .SDcols = cols]
data.table [3 x 5]
cyl mpg.mean mpg.sd hp.mean hp.sd
6 19.743 1.454 122.286 24.26
4 26.664 4.51 82.636 20.935
8 15.1 2.56 209.214 50.977

Different column order & naming scheme:

MT[, 
  lapply(list_of_fns, \(f) lapply(.SD, f)) |> 
    unlist(recursive = FALSE),
  by = cyl, .SDcols = cols
]
data.table [3 x 5]
cyl mean.mpg mean.hp sd.mpg sd.hp
6 19.743 122.286 1.454 24.26
4 26.664 82.636 4.51 20.935
8 15.1 209.214 2.56 50.977

Using dcast (see next section for more on pivots):

dcast(MT, cyl ~ ., fun.agg = list_of_fns, value.var = cols) # list(mean, sd)
data.table [3 x 5]
cyl mpg_mean hp_mean mpg_sd hp_sd
4 26.664 82.636 4.51 20.935
6 19.743 122.286 1.454 24.26
8 15.1 209.214 2.56 50.977

2 Pivots


2.1 Melt / Longer

Data:

(fam1 <- as.data.frame(FAM1))
data.frame [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
(fam2 <- as.data.frame(FAM2))
data.frame [5 x 8]
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA

2.1.1 Basic Melt/Longer

Tip

data.table::melt does partial argument matching and thus accepts shortened versions of its arguments. E.g.: variable.name <=> variable (or var), value.name <=> value (or val), measure.vars <=> measure, id.vars <=> id, pattern <=> pat, …
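
The shortened argument names can be checked against the full ones. The sketch below uses a hypothetical toy table (DT) standing in for FAM1, with dates stored as plain strings to keep it self-contained.

```r
library(data.table)

# Hypothetical toy table, standing in for FAM1
DT <- data.table(
  family_id  = 1:2,
  dob_child1 = c("1998-11-26", "1996-06-22"),
  dob_child2 = c("2000-01-29", NA)
)

# Full argument names
full <- melt(DT, id.vars = "family_id", measure.vars = patterns("^dob_"),
             variable.name = "child", value.name = "dob")

# Shortened names, resolved by partial argument matching
short <- melt(DT, id = "family_id", measure = patterns("^dob_"),
              var = "child", val = "dob")

all.equal(full, short)
```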

One group of columns –> single value column

pivot_longer(FAM1, cols = matches("dob_"), names_to = "variable")
data.frame [15 x 4]
family_id age_mother variable value
1 30 dob_child1 1998-11-26
1 30 dob_child2 2000-01-29
1 30 dob_child3 NA
2 27 dob_child1 1996-06-22
2 27 dob_child2 NA
2 27 dob_child3 NA
3 26 dob_child1 2002-07-11
3 26 dob_child2 2004-04-05
3 26 dob_child3 2007-09-02
4 32 dob_child1 2004-10-10
4 32 dob_child2 2009-08-27
4 32 dob_child3 2012-07-21
5 29 dob_child1 2000-12-05
5 29 dob_child2 2005-02-28
5 29 dob_child3 NA
melt(FAM1, measure.vars = c("dob_child1", "dob_child2", "dob_child3"))
data.table [15 x 4]
family_id age_mother variable value
1 30 dob_child1 1998-11-26
2 27 dob_child1 1996-06-22
3 26 dob_child1 2002-07-11
4 32 dob_child1 2004-10-10
5 29 dob_child1 2000-12-05
1 30 dob_child2 2000-01-29
2 27 dob_child2 NA
3 26 dob_child2 2004-04-05
4 32 dob_child2 2009-08-27
5 29 dob_child2 2005-02-28
1 30 dob_child3 NA
2 27 dob_child3 NA
3 26 dob_child3 2007-09-02
4 32 dob_child3 2012-07-21
5 29 dob_child3 NA
melt(FAM1, measure = patterns("^dob_"))
data.table [15 x 4]
family_id age_mother variable value
1 30 dob_child1 1998-11-26
2 27 dob_child1 1996-06-22
3 26 dob_child1 2002-07-11
4 32 dob_child1 2004-10-10
5 29 dob_child1 2000-12-05
1 30 dob_child2 2000-01-29
2 27 dob_child2 NA
3 26 dob_child2 2004-04-05
4 32 dob_child2 2009-08-27
5 29 dob_child2 2005-02-28
1 30 dob_child3 NA
2 27 dob_child3 NA
3 26 dob_child3 2007-09-02
4 32 dob_child3 2012-07-21
5 29 dob_child3 NA

One group of columns –> multiple value columns

# No direct equivalent
melt(FAM1, measure = patterns(child1 = "child1$", child2 = "child2$|child3$"))
data.table [10 x 5]
family_id age_mother variable child1 child2
1 30 1 1998-11-26 2000-01-29
2 27 1 1996-06-22 NA
3 26 1 2002-07-11 2004-04-05
4 32 1 2004-10-10 2009-08-27
5 29 1 2000-12-05 2005-02-28
1 30 2 NA NA
2 27 2 NA NA
3 26 2 NA 2007-09-02
4 32 2 NA 2012-07-21
5 29 2 NA NA

2.1.2 Merging multiple yes/no columns

Melting multiple presence/absence columns into a single variable:

Data:

(MOVIES_WIDE <- as.data.table(movies_wide))
data.table [3 x 4]
ID action adventure animation
1 1 0 0
2 1 1 0
3 1 1 1
pivot_longer(
    movies_wide, -ID, names_to = "Genre", 
    values_transform = \(x) ifelse(x == 0, NA, x), values_drop_na = TRUE
  ) |> select(-value)
data.frame [6 x 2]
ID Genre
1 action
2 action
2 adventure
3 action
3 adventure
3 animation
melt(MOVIES_WIDE, id.vars = "ID", var = "Genre")[value != 0][order(ID), -"value"]
data.table [6 x 2]
ID Genre
1 action
2 action
2 adventure
3 action
3 adventure
3 animation

2.1.3 Partial pivot

Multiple groups of columns –> Multiple value columns

Using .value:

Tip

The .value special identifier lets you do a “half” pivot: the name parts matched by .value become the names of the value columns instead of being listed as rows.

pivot_longer(fam2, matches("^dob|^gender"), names_to = c(".value", "child"), names_sep = "_child")
data.frame [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
1 30 2 2000-01-29 2
1 30 3 NA NA
2 27 1 1996-06-22 2
2 27 2 NA NA
2 27 3 NA NA
3 26 1 2002-07-11 2
3 26 2 2004-04-05 2
3 26 3 2007-09-02 1
4 32 1 2004-10-10 1
4 32 2 2009-08-27 1
4 32 3 2012-07-21 1
5 29 1 2000-12-05 2
5 29 2 2005-02-28 1
5 29 3 NA NA

Using patterns:

melt(FAM2, measure = patterns("^dob", "^gender"), val = c("dob", "gender"), var = "child")
data.table [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA

Manually:

colA <- str_subset(colnames(FAM2), "^dob")
colB <- str_subset(colnames(FAM2), "^gender")

melt(FAM2, measure = list(colA, colB), val = c("dob", "gender"), var = "child")
data.table [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA
Alternatives
melt(FAM2, measure = list(a, b), val = c("dob", "gender"), var = "child") |> 
  substitute2(env = list(a = I(str_subset(colnames(FAM2), "^dob")), b = I(str_subset(colnames(FAM2), "^gender")))) |> eval()

Using measure and value.name:

melt(FAM2, measure = measure(value.name, child = \(x) as.integer(x), sep = "_child"))
data.table [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA
Alternatives
melt(FAM2, measure = measurev(list(value.name = NULL, child = as.integer), pat = "(.*)_child(\\d)"))

2.2 Dcast / Wider

General idea:
- Pivot around the combination of id.vars (LHS of the formula)
- The measure.vars (RHS of the formula) are the ones whose values become column names
- The value.var are the ones the values are taken from to fill the new columns
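
The three roles can be seen on a tiny hypothetical table (long, with columns id, name and score, all made up for illustration):

```r
library(data.table)

# Hypothetical long table
long <- data.table(
  id    = c(1, 1, 2, 2),
  name  = c("a", "b", "a", "b"),
  score = c(10, 20, 30, 40)
)

# LHS (id): one row per value
# RHS (name): one new column per value
# value.var (score): the values used to fill the new columns
wide <- dcast(long, id ~ name, value.var = "score")
wide
```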

Data:

(fam1l <- as.data.frame(FAM1L))
data.frame [15 x 4]
family_id age_mother variable value
1 30 dob_child1 1998-11-26
2 27 dob_child1 1996-06-22
3 26 dob_child1 2002-07-11
4 32 dob_child1 2004-10-10
5 29 dob_child1 2000-12-05
1 30 dob_child2 2000-01-29
2 27 dob_child2 NA
3 26 dob_child2 2004-04-05
4 32 dob_child2 2009-08-27
5 29 dob_child2 2005-02-28
1 30 dob_child3 NA
2 27 dob_child3 NA
3 26 dob_child3 2007-09-02
4 32 dob_child3 2012-07-21
5 29 dob_child3 NA
(fam2l <- as.data.frame(FAM2L))
data.frame [15 x 5]
family_id age_mother child dob gender
1 30 1 1998-11-26 1
2 27 1 1996-06-22 2
3 26 1 2002-07-11 2
4 32 1 2004-10-10 1
5 29 1 2000-12-05 2
1 30 2 2000-01-29 2
2 27 2 NA NA
3 26 2 2004-04-05 2
4 32 2 2009-08-27 1
5 29 2 2005-02-28 1
1 30 3 NA NA
2 27 3 NA NA
3 26 3 2007-09-02 1
4 32 3 2012-07-21 1
5 29 3 NA NA

2.2.1 Basic Dcast/Wider

pivot_wider(fam1l, id_cols = c("family_id", "age_mother"), names_from = "variable")
data.frame [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
dcast(FAM1L, family_id + age_mother ~ variable)
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA

Using all the columns as IDs:

pivot_wider(fam1l, names_from = variable)
data.frame [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
Note

By default, id_cols = everything()

FAM1L |> dcast(... ~ variable)
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
Note

... <=> “every unused column”

Multiple value columns –> Multiple groups of columns:

pivot_wider(
  fam2l, id_cols = c("family_id", "age_mother"), values_from = c("dob", "gender"), 
  names_from = "child", names_sep = "_child"
)
data.frame [5 x 8]
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA
dcast(FAM2L, family_id + age_mother ~ child, value.var = c("dob", "gender"), sep = "_child")
data.table [5 x 8]
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA
dcast(FAM2L, ... ~ child, value.var = c("dob", "gender"), sep = "_child")
data.table [5 x 8]
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA

Dynamic names in the formula:

var_name <- "variable"

id_vars <- c("family_id", "age_mother")
pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = {{ var_name }})
data.frame [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA


Multiple dynamic names:

pivot_wider(fam1l, id_cols = all_of(id_vars), names_from = variable)
data.frame [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA


dcast(FAM1L, family_id + age_mother ~ base::get(var_name))
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
dcast(FAM1L, family_id + age_mother ~ x) |> substitute2(env = list(x = var_name)) |> eval()
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA

Multiple dynamic names:

dcast(FAM1L, str_c(str_c(id_vars, collapse = " + "), " ~ variable"))
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
dcast(FAM1L, x + y ~ variable) |> substitute2(env = list(x = id_vars[1], y = id_vars[2])) |> eval()
data.table [5 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA

2.2.2 Renaming (prefix/suffix) the columns

pivot_wider(fam1l, names_from = variable, values_from = value, names_prefix = "Attr: ")
data.frame [5 x 5]
family_id age_mother Attr: dob_child1 Attr: dob_child2 Attr: dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
pivot_wider(fam1l, names_from = variable, values_from = value, names_glue = "Attr: {variable}")
data.frame [5 x 5]
family_id age_mother Attr: dob_child1 Attr: dob_child2 Attr: dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA
dcast(FAM1L, family_id + age_mother ~ paste0("Attr: ", variable))
data.table [5 x 5]
family_id age_mother Attr: dob_child1 Attr: dob_child2 Attr: dob_child3
1 30 1998-11-26 2000-01-29 NA
2 27 1996-06-22 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28 NA

2.2.3 Unused combinations

Warning

The logic is inverted between tidyr (id_expand / names_expand: what to keep) and data.table (drop: what to drop):

pivot_wider(fam1l, names_from = variable, values_from = value, id_expand = T, names_expand = F)
data.frame [25 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 26 NA NA NA
1 27 NA NA NA
1 29 NA NA NA
1 30 1998-11-26 2000-01-29 NA
1 32 NA NA NA
2 26 NA NA NA
2 27 1996-06-22 NA NA
2 29 NA NA NA
2 30 NA NA NA
2 32 NA NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
3 27 NA NA NA
3 29 NA NA NA
3 30 NA NA NA
3 32 NA NA NA
[ omitted 10 entries ]
dcast(FAM1L, family_id + age_mother ~ variable, drop = c(FALSE, TRUE)) # (drop_LHS, drop_RHS)
data.table [25 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
1 26 NA NA NA
1 27 NA NA NA
1 29 NA NA NA
1 30 1998-11-26 2000-01-29 NA
1 32 NA NA NA
2 26 NA NA NA
2 27 1996-06-22 NA NA
2 29 NA NA NA
2 30 NA NA NA
2 32 NA NA NA
3 26 2002-07-11 2004-04-05 2007-09-02
3 27 NA NA NA
3 29 NA NA NA
3 30 NA NA NA
3 32 NA NA NA
[ omitted 10 entries ]
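
The inverted logic is easier to see on a table small enough to check by hand. This sketch uses a hypothetical two-row table (long) in which only two of the four possible id1/id2 combinations are present.

```r
library(data.table)
library(tidyr)

# Hypothetical data: combinations (1, "y") and (2, "x") are absent
long <- data.frame(
  id1  = c(1, 2),
  id2  = c("x", "y"),
  name = c("a", "b"),
  v    = c(10, 20)
)

# tidyr: opt in to the missing id combinations
w_tidyr <- pivot_wider(long, names_from = name, values_from = v, id_expand = TRUE)

# data.table: opt out of dropping them
w_dt <- dcast(as.data.table(long), id1 + id2 ~ name, value.var = "v", drop = FALSE)

nrow(w_tidyr)
nrow(w_dt)
```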

2.2.4 Subsetting

fam1l |> filter(value >= lubridate::ymd(20030101)) |> 
  pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")
data.frame [3 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
4 32 2004-10-10 2009-08-27 2012-07-21
3 26 NA 2004-04-05 2007-09-02
5 29 NA 2005-02-28 NA
Warning

AFAIK, pivot_wider can’t do this on its own.

dcast(FAM1L, family_id + age_mother ~ variable, subset = .(value >= lubridate::ymd(20030101)))
data.table [3 x 5]
family_id age_mother dob_child1 dob_child2 dob_child3
3 26 NA 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 NA 2005-02-28 NA

2.2.5 Aggregating

In data.table, leaving the RHS of the formula empty (.), i.e. not specifying the column that holds the names, yields a single column named . that counts the values falling under each LHS combination (the default aggregation is length()).

(pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = length)
  |> mutate(length = apply(pick(matches("_child")), 1, \(x) sum(x))) 
  |> select(-matches("^dob_"))
)
data.frame [5 x 3]
family_id age_mother length
1 30 3
2 27 3
3 26 3
4 32 3
5 29 3
dcast(FAM1L, family_id + age_mother ~ .)
data.table [5 x 3]
family_id age_mother .
1 30 3
2 27 3
3 26 3
4 32 3
5 29 3
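
Spelling out the aggregation function makes the default explicit. The sketch below uses a hypothetical table (long); note that length() counts NA values too.

```r
library(data.table)

# Hypothetical long table: three values for id 1 (one of them NA), two for id 2
long <- data.table(
  id    = c(1, 1, 1, 2, 2),
  name  = c("a", "b", "c", "a", "b"),
  value = c(5, NA, 7, 1, 2)
)

# Empty RHS with an explicit aggregation function: one count per id
counts <- dcast(long, id ~ ., fun.aggregate = length, value.var = "value")
counts
```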

Customizing the default behavior (length()) using the fun.aggregate (<=> fun.agg or fun) argument:

Here, we count the number of children for each combination of (family_id + age_mother) by summing all non-NA values:

(pivot_wider(
    fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x)
  ) 
  |> mutate(child_count = apply(pick(matches("_child")), 1, \(x) sum(x)))
  |> select(-matches("^dob_"))
)
data.frame [5 x 3]
family_id age_mother child_count
1 30 2
2 27 1
3 26 3
4 32 3
5 29 2
Alternatives
(pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x))
  |> mutate(child_count = pmap_int(pick(matches("_child")), \(...) sum(...)))
  |> select(-matches("^dob_"))
)

(pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x))
  |> rowwise()
  |> mutate(child_count = sum(c_across(matches("_child"))))
  |> ungroup()
  |> select(-matches("^dob_"))
)
(dcast(FAM1L, family_id + age_mother ~ ., fun = \(x) sum(!is.na(x))) |> setnames(".", "child_count"))
data.table [5 x 3]
family_id age_mother child_count
1 30 2
2 27 1
3 26 3
4 32 3
5 29 2

Applying multiple fun.agg:

Data:

(DTL <- data.table(
    id1 = sample(5, 20, TRUE), 
    id2 = sample(2, 20, TRUE), 
    group = sample(letters[1:2], 20, TRUE), 
    v1 = runif(20), 
    v2 = 1L
  )
)
data.table [20 x 5]
id1 id2 group v1 v2
2 1 a 0.002 1
3 2 b 0.432 1
3 2 a 0.434 1
5 2 a 0.621 1
5 2 a 0.66 1
5 1 b 0.868 1
5 1 b 0.113 1
3 1 b 0.638 1
2 2 b 0.055 1
3 2 b 0.277 1
3 2 a 0.539 1
1 2 a 0.629 1
5 2 b 0.567 1
2 1 a 0.146 1
5 1 a 0.855 1
[ omitted 5 entries ]
  • Multiple aggregation functions applied to one variable:
(pivot_wider(
    DTL, id_cols = c("id1", "id2"), names_from = "group", values_from = "v1",
    names_glue = "{.value}_{.name}", names_vary = "slowest", names_sort = TRUE,
    values_fn = \(x) tibble("sum" = sum(x), "mean" = mean(x))
  ) 
  |> unnest(cols = starts_with("v1"), names_sep = "_")
)
data.frame [9 x 6]
id1 id2 v1_a_sum v1_a_mean v1_b_sum v1_b_mean
2 1 0.148 0.074 NA NA
3 2 0.973 0.486 0.709 0.354
5 2 1.28 0.64 1.159 0.58
5 1 1.536 0.512 0.981 0.491
3 1 NA NA 0.638 0.638
2 2 NA NA 0.055 0.055
1 2 0.629 0.629 NA NA
4 1 0.793 0.793 NA NA
4 2 NA NA 0.361 0.361

  • Multiple aggregation functions applied to multiple variables (all combinations):
(DTL |> pivot_wider(
    id_cols = c("id1", "id2"), names_from = "group", names_vary = "slowest", names_sort = TRUE,
    values_from = c("v1", "v2"), values_fn = \(x) tibble("sum" = sum(x), "mean" = mean(x))
  ) 
  |> unnest(cols = matches("^v1|^v2"), names_sep = "_")
)
data.frame [9 x 10]
id1 id2 v1_a_sum v1_a_mean v2_a_sum v2_a_mean v1_b_sum v1_b_mean v2_b_sum v2_b_mean
2 1 0.148 0.074 2 1 NA NA NA NA
3 2 0.973 0.486 2 1 0.709 0.354 2 1
5 2 1.28 0.64 2 1 1.159 0.58 2 1
5 1 1.536 0.512 3 1 0.981 0.491 2 1
3 1 NA NA NA NA 0.638 0.638 1 1
2 2 NA NA NA NA 0.055 0.055 1 1
1 2 0.629 0.629 1 1 NA NA NA NA
4 1 0.793 0.793 1 1 NA NA NA NA
4 2 NA NA NA NA 0.361 0.361 1 1

  • Multiple aggregation functions applied to multiple variables (one-to-one):
# Not possible with pivot_wider AFAIK
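One possible workaround, sketched here on made-up data (dtl_toy and the summary column names are illustrative assumptions), is to aggregate each variable with its own function first, and only then pivot the summaries:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one id column instead of two
dtl_toy <- tibble(
  id    = c(1, 1, 2, 2),
  group = c("a", "b", "a", "a"),
  v1    = c(1, 2, 3, 4),
  v2    = c(10, 20, 30, 50)
)

one_to_one <- dtl_toy |>
  # Step 1: sum for v1, mean for v2, per (id, group)
  summarise(v1_sum = sum(v1), v2_mean = mean(v2), .by = c(id, group)) |>
  # Step 2: spread the groups into columns
  pivot_wider(names_from = group, values_from = c(v1_sum, v2_mean))

print(one_to_one)
```

This is not a single pivot_wider call, so it does not contradict the comment above; it only shows that the same table can be reached in two steps.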
  • Multiple aggregation functions applied to one variable:
dcast(DTL, id1 + id2 ~ group, fun = list(sum, mean), value.var = "v1")
data.table [9 x 6]
id1 id2 v1_sum_a v1_sum_b v1_mean_a v1_mean_b
1 2 0.629 0 0.629 NaN
2 1 0.148 0 0.074 NaN
2 2 0 0.055 NaN 0.055
3 1 0 0.638 NaN 0.638
3 2 0.973 0.709 0.486 0.354
4 1 0.793 0 0.793 NaN
4 2 0 0.361 NaN 0.361
5 1 1.536 0.981 0.512 0.491
5 2 1.28 1.159 0.64 0.58

  • Multiple aggregation functions applied to multiple variables (all combinations):
dcast(DTL, id1 + id2 ~ group, fun = list(sum, mean), value.var = c("v1", "v2"))
data.table [9 x 10]
id1 id2 v1_sum_a v1_sum_b v2_sum_a v2_sum_b v1_mean_a v1_mean_b v2_mean_a v2_mean_b
1 2 0.629 0 1 0 0.629 NaN 1 NaN
2 1 0.148 0 2 0 0.074 NaN 1 NaN
2 2 0 0.055 0 1 NaN 0.055 NaN 1
3 1 0 0.638 0 1 NaN 0.638 NaN 1
3 2 0.973 0.709 2 2 0.486 0.354 1 1
4 1 0.793 0 1 0 0.793 NaN 1 NaN
4 2 0 0.361 0 1 NaN 0.361 NaN 1
5 1 1.536 0.981 3 2 0.512 0.491 1 1
5 2 1.28 1.159 2 2 0.64 0.58 1 1

  • Multiple aggregation functions applied to multiple variables (one-to-one):

Here, we apply sum to v1 (for both group a & b), and mean to v2 (for both group a & b)

dcast(DTL, id1 + id2 ~ group, fun = list(sum, mean), value.var = list("v1", "v2"))
data.table [9 x 6]
id1 id2 v1_sum_a v1_sum_b v2_mean_a v2_mean_b
1 2 0.629 0 1 NaN
2 1 0.148 0 1 NaN
2 2 0 0.055 NaN 1
3 1 0 0.638 NaN 1
3 2 0.973 0.709 1 1
4 1 0.793 0 1 NaN
4 2 0 0.361 NaN 1
5 1 1.536 0.981 1 1
5 2 1.28 1.159 1 1

2.2.6 One-hot encoding

Making each level of a variable into a presence/absence column:

movies_long
data.frame [6 x 3]
ID Genre OtherCol
1 action 0.768
2 action 0.145
2 adventure 0.749
3 action 0.975
3 adventure 0.381
3 animation 0.09
pivot_wider(
  movies_long, names_from = "Genre", values_from = "Genre", 
  values_fn = \(x) !is.na(x), values_fill = FALSE
)
data.frame [6 x 5]
ID OtherCol action adventure animation
1 0.768 TRUE FALSE FALSE
2 0.145 TRUE FALSE FALSE
2 0.749 FALSE TRUE FALSE
3 0.975 TRUE FALSE FALSE
3 0.381 FALSE TRUE FALSE
3 0.09 FALSE FALSE TRUE
dcast(MOVIES_LONG, ... ~ Genre, value.var = "Genre", fun = \(x) !is.na(x), fill = FALSE)
data.table [6 x 5]
ID OtherCol action adventure animation
1 0.768 TRUE FALSE FALSE
2 0.145 TRUE FALSE FALSE
2 0.749 FALSE TRUE FALSE
3 0.09 FALSE FALSE TRUE
3 0.381 FALSE TRUE FALSE
3 0.975 TRUE FALSE FALSE

3 Joins


3.1 Mutating Joins

The purpose of mutating joins is to add columns/information from one table to another, by matching their rows.

Data:

(CITIES <- as.data.table(cities))
data.table [10 x 3]
city_id city country_id
1 Barcelona 9
2 Bergen 8
3 Bern 10
4 Helsinki 4
5 Linz 1
6 Punaauia 6
7 Queenstown 7
8 Rouen 5
9 Sosua 3
10 Trondheim 8
(COUNTRIES <- as.data.table(countries))
data.table [9 x 2]
country_id country
1 Austria
2 Canada
3 Dominican Republic
4 Finland
5 France
6 French Polynesia
7 New-Zealand
8 Norway
9 Spain

3.1.1 Left/Right Join

Both left & right joins append the columns of one table to those of another, in the order they are given (i.e. columns of the first table will appear first in the result). However, how rows are matched (and how the ones not finding a match are handled) depends on the type of join:
- Left joins match on the rows of the first (left) table. Unmatched rows from the left table will be kept, but not the right’s.
- Right joins match on the rows of the second (right) table. Unmatched rows from the right table will be kept, but not the left’s.

Example

To find out which country each city belongs to, we’re going to merge countries into cities.

Here, we want to add data to the cities table by matching each city to a country (by their country_id). The ideal output would have the columns of cities first, and keep all rows from cities, even if unmatched: thus we will use a left join.

  • As a left join:
left_join(cities, countries, by = "country_id", multiple = "all")
data.frame [10 x 4]
city_id city country_id country
1 Barcelona 9 Spain
2 Bergen 8 Norway
3 Bern 10 NA
4 Helsinki 4 Finland
5 Linz 1 Austria
6 Punaauia 6 French Polynesia
7 Queenstown 7 New-Zealand
8 Rouen 5 France
9 Sosua 3 Dominican Republic
10 Trondheim 8 Norway
data.table natively only supports right joins

It filters the rows of the first table by those of the second (FIRST[SECOND]), but only keeps the unmatched rows from the second table.

The normal output of the join
CITIES[COUNTRIES, on = .(country_id)]
data.table [10 x 4]
city_id city country_id country
5 Linz 1 Austria
NA NA 2 Canada
9 Sosua 3 Dominican Republic
4 Helsinki 4 Finland
8 Rouen 5 France
6 Punaauia 6 French Polynesia
7 Queenstown 7 New-Zealand
2 Bergen 8 Norway
10 Trondheim 8 Norway
1 Barcelona 9 Spain

The unmatched rows from countries were kept, but not the ones from cities. Here are two possible workarounds:

Inverting the two tables (countries first), and then inverting the order of the columns in the result:

COUNTRIES[CITIES, .(city_id, city, country_id, country), on = .(country_id)]
data.table [10 x 4]
city_id city country_id country
1 Barcelona 9 Spain
2 Bergen 8 Norway
3 Bern 10 NA
4 Helsinki 4 Finland
5 Linz 1 Austria
6 Punaauia 6 French Polynesia
7 Queenstown 7 New-Zealand
8 Rouen 5 France
9 Sosua 3 Dominican Republic
10 Trondheim 8 Norway

Adding the columns of countries (in-place) to cities during the join:

copy(CITIES)[COUNTRIES, c("country_id", "country") := list(i.country_id, i.country), on = .(country_id)][]
data.table [10 x 4]
city_id city country_id country
1 Barcelona 9 Spain
2 Bergen 8 Norway
3 Bern 10 NA
4 Helsinki 4 Finland
5 Linz 1 Austria
6 Punaauia 6 French Polynesia
7 Queenstown 7 New-Zealand
8 Rouen 5 France
9 Sosua 3 Dominican Republic
10 Trondheim 8 Norway
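A third option, shown here as a small sketch on toy tables (CITIES_TOY and COUNTRIES_TOY are made up for the example), is data.table's merge() method with all.x = TRUE. Note that merge() sorts the result by the join column and puts that column first:

```r
library(data.table)

# Hypothetical toy tables: Bern has no matching country, Canada has no city
CITIES_TOY <- data.table(
  city_id    = 1:3,
  city       = c("Barcelona", "Bergen", "Bern"),
  country_id = c(9L, 8L, 10L)
)
COUNTRIES_TOY <- data.table(
  country_id = c(8L, 9L, 2L),
  country    = c("Norway", "Spain", "Canada")
)

# all.x = TRUE keeps every row of the first table (left join)
left <- merge(CITIES_TOY, COUNTRIES_TOY, by = "country_id", all.x = TRUE)
print(left)
```

Similarly, all.y = TRUE would give a right join, and all = TRUE a full join.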

We could accomplish a similar result with a right join by inverting the order of the tables. But the order of the columns in the result will be less ideal (countries first):

  • As a right join:
right_join(countries, cities, by = "country_id", multiple = "all")
data.frame [10 x 4]
country_id country city_id city
1 Austria 5 Linz
3 Dominican Republic 9 Sosua
4 Finland 4 Helsinki
5 France 8 Rouen
6 French Polynesia 6 Punaauia
7 New-Zealand 7 Queenstown
8 Norway 2 Bergen
8 Norway 10 Trondheim
9 Spain 1 Barcelona
10 NA 3 Bern
COUNTRIES[CITIES, on = .(country_id)][order(country_id)]
data.table [10 x 4]
country_id country city_id city
1 Austria 5 Linz
3 Dominican Republic 9 Sosua
4 Finland 4 Helsinki
5 France 8 Rouen
6 French Polynesia 6 Punaauia
7 New-Zealand 7 Queenstown
8 Norway 2 Bergen
8 Norway 10 Trondheim
9 Spain 1 Barcelona
10 NA 3 Bern

3.1.2 Full Join

Fully merges the two tables, keeping the unmatched rows from both tables.

full_join(cities, countries, by = join_by(country_id))
data.frame [11 x 4]
city_id city country_id country
1 Barcelona 9 Spain
2 Bergen 8 Norway
3 Bern 10 NA
4 Helsinki 4 Finland
5 Linz 1 Austria
6 Punaauia 6 French Polynesia
7 Queenstown 7 New-Zealand
8 Rouen 5 France
9 Sosua 3 Dominican Republic
10 Trondheim 8 Norway
NA NA 2 Canada
merge(CITIES, COUNTRIES, by = "country_id", all = TRUE)[order(city_id), .(city_id, city, country_id, country)]
data.table [11 x 4]
city_id city country_id country
1 Barcelona 9 Spain
2 Bergen 8 Norway
3 Bern 10 NA
4 Helsinki 4 Finland
5 Linz 1 Austria
6 Punaauia 6 French Polynesia
7 Queenstown 7 New-Zealand
8 Rouen 5 France
9 Sosua 3 Dominican Republic
10 Trondheim 8 Norway
NA NA 2 Canada

3.1.3 Cross Join

Generating all combinations of the IDs of both tables.

cross_join(select(cities, city), select(countries, country))
data.frame [90 x 2]
city country
Barcelona Austria
Barcelona Canada
Barcelona Dominican Republic
Barcelona Finland
Barcelona France
Barcelona French Polynesia
Barcelona New-Zealand
Barcelona Norway
Barcelona Spain
Bergen Austria
Bergen Canada
Bergen Dominican Republic
Bergen Finland
Bergen France
Bergen French Polynesia
[ omitted 75 entries ]
CJ(city = CITIES[, city], country = COUNTRIES[, country])
data.table [90 x 2]
city country
Barcelona Austria
Barcelona Canada
Barcelona Dominican Republic
Barcelona Finland
Barcelona France
Barcelona French Polynesia
Barcelona New-Zealand
Barcelona Norway
Barcelona Spain
Bergen Austria
Bergen Canada
Bergen Dominican Republic
Bergen Finland
Bergen France
Bergen French Polynesia
[ omitted 75 entries ]

3.1.4 Inner Join

Merges the columns of both tables and only returns the rows that matched between both tables (no unmatched rows are kept).

inner_join(countries, cities, by = "country_id", multiple = "all")
data.frame [9 x 4]
country_id country city_id city
1 Austria 5 Linz
3 Dominican Republic 9 Sosua
4 Finland 4 Helsinki
5 France 8 Rouen
6 French Polynesia 6 Punaauia
7 New-Zealand 7 Queenstown
8 Norway 2 Bergen
8 Norway 10 Trondheim
9 Spain 1 Barcelona
COUNTRIES[CITIES, on = .(country_id), nomatch = NULL]
data.table [9 x 4]
country_id country city_id city
9 Spain 1 Barcelona
8 Norway 2 Bergen
4 Finland 4 Helsinki
1 Austria 5 Linz
6 French Polynesia 6 Punaauia
7 New-Zealand 7 Queenstown
5 France 8 Rouen
3 Dominican Republic 9 Sosua
8 Norway 10 Trondheim

3.1.5 Self join

Merging the table with itself. Typically used on graph-type data represented as a flat table (e.g. hierarchies).

Data:

data.frame [5 x 4]
id first_name last_name manager_id
1 Maisy Bloom NA
2 Caine Farrow 1
3 Waqar Jarvis 2
4 Lacey-Mai Rahman 2
5 Merryn French 3

The goal here is to find everyone’s direct manager (their “n+1”) by joining the table to itself:

left_join(hiera, hiera, by = join_by(manager_id == id))
data.frame [5 x 7]
id first_name.x last_name.x manager_id first_name.y last_name.y manager_id.y
1 Maisy Bloom NA NA NA NA
2 Caine Farrow 1 Maisy Bloom NA
3 Waqar Jarvis 2 Caine Farrow 1
4 Lacey-Mai Rahman 2 Caine Farrow 1
5 Merryn French 3 Waqar Jarvis 2
HIERA[HIERA, on = .(manager_id = id), nomatch = NULL]
data.table [4 x 7]
id first_name last_name manager_id i.first_name i.last_name i.manager_id
2 Caine Farrow 1 Maisy Bloom NA
3 Waqar Jarvis 2 Caine Farrow 1
4 Lacey-Mai Rahman 2 Caine Farrow 1
5 Merryn French 3 Waqar Jarvis 2

3.2 Filtering Joins

Used to filter one table (left) based on another (right): it will only keep the columns from the left table and will either keep (semi join) or discard (anti join) the rows where IDs match between both tables.

3.2.1 Semi join

Note

Similar to an inner join, but it only keeps the columns of the first table (no information is added) and never duplicates its rows, even when a row has several matches.

Here, it will filter countries to only keep the countries having a matching country_id in the cities table.

semi_join(countries, cities, by = join_by(country_id))
data.frame [8 x 2]
country_id country
1 Austria
3 Dominican Republic
4 Finland
5 France
6 French Polynesia
7 New-Zealand
8 Norway
9 Spain
COUNTRIES[country_id %in% CITIES[, unique(country_id)]]
data.table [8 x 2]
country_id country
1 Austria
3 Dominican Republic
4 Finland
5 France
6 French Polynesia
7 New-Zealand
8 Norway
9 Spain
Alternatives
fsetdiff(COUNTRIES, COUNTRIES[!CITIES, on = "country_id"])

COUNTRIES[!eval(COUNTRIES[!CITIES, on = .(country_id)])]
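To make the difference with an inner join concrete, here is a small sketch on toy tables (countries_toy and cities_toy are made up for the example): a country with two cities appears twice after an inner join, but only once after a semi join.

```r
library(dplyr)

# Hypothetical toy tables: Norway matches two cities
countries_toy <- tibble(country_id = c(1L, 8L), country = c("Austria", "Norway"))
cities_toy    <- tibble(
  city       = c("Linz", "Bergen", "Trondheim"),
  country_id = c(1L, 8L, 8L)
)

# Inner join: one row per match, columns of both tables
inner <- inner_join(countries_toy, cities_toy, by = "country_id", multiple = "all")

# Semi join: rows of the first table that have at least one match, its columns only
semi <- semi_join(countries_toy, cities_toy, by = "country_id")

print(inner)
print(semi)
```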

3.2.2 Anti join

Here, it will filter countries to only keep the countries having no matching country_id in the cities table.

anti_join(countries, cities, by = join_by(country_id))
data.frame [1 x 2]
country_id country
2 Canada
COUNTRIES[!CITIES, on = .(country_id)]
data.table [1 x 2]
country_id country
2 Canada
Alternatives
COUNTRIES[fsetdiff(COUNTRIES[, .(country_id)], CITIES[, .(country_id)])]

3.3 Non-equi joins

Non-equi joins are joins where the conditions used to match rows are no longer strict equalities between the tables’ ID columns.

We can divide non-equi joins between:
- Inequality joins: a general inequality condition between IDs, which can result in multiple matches.
- Rolling joins: only keep the match that minimizes the distance between the IDs (i.e. the closest to perfect equality).
- Overlap joins: matching to all values within a range.

Tip

Please refer to this page of the second edition of R4DS for more detailed explanations.

Data:

Events:

data.table [3 x 4]
e.id event e.start e.end
1 Alice’s graduation 2023-06-05 10:00:00 2023-06-05 13:00:00
2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00
3 Alice & Mark’s wedding 2023-06-07 13:00:00 2023-06-07 18:00:00

Strikes:

data.table [4 x 4]
s.id strike_motive s.start s.end
1 Not enough cheese 2023-06-05 11:00:00 2023-06-05 20:00:00
2 Not enough wine 2023-06-05 14:00:00 2023-06-05 16:00:00
3 Life’s too expensive 2023-06-08 09:00:00 2023-06-08 20:00:00
4 Our team lost some sport event 2023-07-05 16:00:00 2023-07-05 22:00:00

3.3.1 Inequality join

Inequality joins are joins (left, right, inner, …) that use inequalities (<, <=, >=, or >) to specify the matching criteria.

Warning

The condition has to be a simple inequality between existing columns: it cannot be an arbitrary function (e.g. date.x <= min(date.y) * 2 will not work).

  • For each event, which strikes occurred (finished) before the event ?
inner_join(events, strikes, join_by(e.start >= s.end))
data.frame [2 x 8]
e.id event e.start e.end s.id strike_motive s.start s.end
3 Alice & Mark’s wedding 2023-06-07 13:00:00 2023-06-07 18:00:00 1 Not enough cheese 2023-06-05 11:00:00 2023-06-05 20:00:00
3 Alice & Mark’s wedding 2023-06-07 13:00:00 2023-06-07 18:00:00 2 Not enough wine 2023-06-05 14:00:00 2023-06-05 16:00:00
EVENTS[STRIKES, on = .(e.start >= s.end), nomatch = NULL]
data.table [2 x 7]
e.id event e.start e.end s.id strike_motive s.start
3 Alice & Mark’s wedding 2023-06-05 20:00:00 2023-06-07 18:00:00 1 Not enough cheese 2023-06-05 11:00:00
3 Alice & Mark’s wedding 2023-06-05 16:00:00 2023-06-07 18:00:00 2 Not enough wine 2023-06-05 14:00:00
Caution

When specifying an equality or inequality condition, data.table will merge the two columns: only one will remain, with the values of the second column and the name of the first. Here, e.start will have the values of s.end (which will be removed).

I’m not sure if this is a bug or not.
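Whatever its status, one way to keep both original columns is to select them explicitly in j with the x. and i. prefixes, which refer to the columns of the first and second table respectively. A minimal sketch on toy tables (X_TOY, I_TOY and their columns are made up for the example):

```r
library(data.table)

# Hypothetical toy tables
X_TOY <- data.table(x_id = 1:2, x_start = c(10, 20))
I_TOY <- data.table(i_id = 1:2, i_end = c(5, 15))

# Default output: x_start is overwritten with the values of i_end
merged <- X_TOY[I_TOY, on = .(x_start >= i_end), nomatch = NULL]

# Explicit selection: x.x_start and i.i_end keep the original values of each table
kept <- X_TOY[
  I_TOY, on = .(x_start >= i_end), nomatch = NULL,
  .(x_id, x_start = x.x_start, i_id, i_end = i.i_end)
]

print(merged)
print(kept)
```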

A handy use case for inequality joins is to avoid duplicates when generating combinations of items in cross joins:

Data:

data.frame [3 x 2]
id name
1 Alice
2 Mark
3 John

All permutations: with duplicates (order matters)

cross_join(people, people)
data.frame [9 x 4]
id.x name.x id.y name.y
1 Alice 1 Alice
1 Alice 2 Mark
1 Alice 3 John
2 Mark 1 Alice
2 Mark 2 Mark
2 Mark 3 John
3 John 1 Alice
3 John 2 Mark
3 John 3 John

All combinations: without duplicates (order doesn’t matter)

inner_join(people, people, join_by(id < id))
data.frame [3 x 4]
id.x name.x id.y name.y
1 Alice 2 Mark
1 Alice 3 John
2 Mark 3 John
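A possible data.table counterpart, given as a sketch (PEOPLE_TOY is rebuilt here so the example is self-contained, and the output column names are chosen to mimic the dplyr result): since data.table merges the join columns, the x. and i. prefixes are used in j to recover both sides.

```r
library(data.table)

PEOPLE_TOY <- data.table(id = 1:3, name = c("Alice", "Mark", "John"))

# Self join on x.id < i.id: each unordered pair appears exactly once
combos <- PEOPLE_TOY[
  PEOPLE_TOY, on = .(id < id), nomatch = NULL,
  .(id.x = x.id, name.x = x.name, id.y = i.id, name.y = i.name)
]

print(combos)
```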

3.3.2 Rolling joins

Rolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, we get the one where the IDs are the closest to equality.

  • Which strike started the soonest after the beginning of an event ?
inner_join(events, strikes, join_by(closest(e.start <= s.start)))
data.frame [3 x 8]
e.id event e.start e.end s.id strike_motive s.start s.end
1 Alice’s graduation 2023-06-05 10:00:00 2023-06-05 13:00:00 1 Not enough cheese 2023-06-05 11:00:00 2023-06-05 20:00:00
2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00 2 Not enough wine 2023-06-05 14:00:00 2023-06-05 16:00:00
3 Alice & Mark’s wedding 2023-06-07 13:00:00 2023-06-07 18:00:00 3 Life’s too expensive 2023-06-08 09:00:00 2023-06-08 20:00:00

  • Which strike ended the soonest before the start of an event ?
inner_join(events, strikes, join_by(closest(e.start >= s.end)))
data.frame [1 x 8]
e.id event e.start e.end s.id strike_motive s.start s.end
3 Alice & Mark’s wedding 2023-06-07 13:00:00 2023-06-07 18:00:00 1 Not enough cheese 2023-06-05 11:00:00 2023-06-05 20:00:00
  • Which strike started the soonest after the beginning of an event ?
EVENTS[STRIKES, on = .(e.start == s.start), roll = "nearest"
     ][, .SD[which.min(abs(e.start - e.end))], by = "e.id"]
data.table [3 x 7]
e.id event e.start e.end s.id strike_motive s.end
1 Alice’s graduation 2023-06-05 11:00:00 2023-06-05 13:00:00 1 Not enough cheese 2023-06-05 20:00:00
2 John’s birthday 2023-06-05 14:00:00 2023-06-05 22:00:00 2 Not enough wine 2023-06-05 16:00:00
3 Alice & Mark’s wedding 2023-06-08 09:00:00 2023-06-07 18:00:00 3 Life’s too expensive 2023-06-08 20:00:00
Note

Using the roll argument relaxes the equality constraint of the join (here, e.start == s.start).


  • Which strike ended the soonest before the start of an event ?
EVENTS[STRIKES, on = .(e.start == s.end), roll = -Inf
      ][, .SD[which.min(abs(e.start - e.end))], by = "e.id"]
data.table [1 x 7]
e.id event e.start e.end s.id strike_motive s.start
3 Alice & Mark’s wedding 2023-06-05 20:00:00 2023-06-07 18:00:00 1 Not enough cheese 2023-06-05 11:00:00
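The direction of the roll can be hard to picture, so here is a minimal sketch on toy tables (OBS and QUERY are made up for the example) comparing the three settings used most often: roll = TRUE carries the last observation forward, roll = -Inf carries the next observation backward, and roll = "nearest" picks the closest one in either direction.

```r
library(data.table)

# Hypothetical toy tables: observations at times 1, 5, 10; lookups at times 4 and 6
OBS   <- data.table(time = c(1, 5, 10), obs = c("a", "b", "c"))
QUERY <- data.table(time = c(4, 6))

locf    <- OBS[QUERY, on = "time", roll = TRUE]       # previous observation
nocb    <- OBS[QUERY, on = "time", roll = -Inf]       # next observation
nearest <- OBS[QUERY, on = "time", roll = "nearest"]  # closest observation

print(locf)
print(nocb)
print(nearest)
```

In all three results the time column holds the values of QUERY, not those of OBS.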

3.3.3 Overlap joins

dplyr provides three helper functions to make it easier to work with intervals:
- between(x, y_min, y_max) <=> x >= y_min, x <= y_max: a value of the first table is within a given range of the second
- within(x_min, x_max, y_min, y_max) <=> x_min >= y_min, x_max <= y_max: the ranges of the first table are contained within the second’s
- overlaps(x_min, x_max, y_min, y_max) <=> x_min <= y_max, x_max >= y_min: the two ranges overlap partially or totally, in any direction

  • Between: Which events had a strike starting in the two hours before the beginning of the event ?
Tip

First, we need to create the new “2 hours before the beginning of the event” column, since we cannot use arbitrary expressions in join_by() (e.g. we cannot do between(s.start, e.start - hours(2), e.start)).

events2 <- mutate(events, e.start_minus2 = e.start - hours(2))
inner_join(strikes, events2, join_by(between(s.start, e.start_minus2, e.start))) |> 
  select(colnames(events), colnames(strikes)) # Re-ordering the columns
data.frame [1 x 8]
e.id event e.start e.end s.id strike_motive s.start s.end
2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00 1 Not enough cheese 2023-06-05 11:00:00 2023-06-05 20:00:00
Note

By default, the value to match needs to be from the first table, and the range it falls within needs to be from the second table. Depending on the column order we need, this can force us to reorder the columns post-join (as in the above example).

This can be alleviated by manually specifying which table each column comes from, using x$col and y$col (x referring to the first table, y to the second).

inner_join(events2, strikes, join_by(between(y$s.start, x$e.start_minus2, x$e.start))) |> 
  select(-e.start_minus2)
data.frame [1 x 8]
e.id event e.start e.end s.id strike_motive s.start s.end
2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00 1 Not enough cheese 2023-06-05 11:00:00 2023-06-05 20:00:00

Manually:

inner_join(events2, strikes, join_by(e.start_minus2 <= s.start, e.start >= s.start)) |> 
  select(-e.start_minus2)
data.frame [1 x 8]
e.id event e.start e.end s.id strike_motive s.start s.end
2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00 1 Not enough cheese 2023-06-05 11:00:00 2023-06-05 20:00:00

  • Within: Which strikes occurred entirely within the period of an event ?
inner_join(strikes, events, join_by(within(s.start, s.end, e.start, e.end)), multiple = "all") |> 
  select(colnames(events), colnames(strikes)) # Re-ordering the columns
data.frame [1 x 8]
e.id event e.start e.end s.id strike_motive s.start s.end
2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00 2 Not enough wine 2023-06-05 14:00:00 2023-06-05 16:00:00
Note

As before, within() requires the first range to be within the second by default, meaning the first table must be the one with the smaller range. Using x$col and y$col resolves the issue of column order.

inner_join(events, strikes, join_by(within(y$s.start, y$s.end, x$e.start, x$e.end)), multiple = "all")
data.frame [1 x 8]
e.id event e.start e.end s.id strike_motive s.start s.end
2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00 2 Not enough wine 2023-06-05 14:00:00 2023-06-05 16:00:00

Manually:

inner_join(events, strikes, join_by(e.start <= s.start, e.end >= s.end), multiple = "all")
data.frame [1 x 8]
e.id event e.start e.end s.id strike_motive s.start s.end
2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00 2 Not enough wine 2023-06-05 14:00:00 2023-06-05 16:00:00

  • Overlaps: Which events overlap with each-other ?
inner_join(events, events, join_by(e.id < e.id, overlaps(e.start, e.end, e.start, e.end)))
data.frame [1 x 8]
e.id.x event.x e.start.x e.end.x e.id.y event.y e.start.y e.end.y
1 Alice’s graduation 2023-06-05 10:00:00 2023-06-05 13:00:00 2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00

Manually:

inner_join(events, events, join_by(e.id < e.id, e.start <= e.end, e.end >= e.start))
data.frame [1 x 8]
e.id.x event.x e.start.x e.end.x e.id.y event.y e.start.y e.end.y
1 Alice’s graduation 2023-06-05 10:00:00 2023-06-05 13:00:00 2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00


  • Between: Which events had a strike starting in the two hours before the beginning of the event ?
copy(EVENTS)[, e.start_minus2 := e.start - hours(2)
           ][STRIKES, on = .(e.start_minus2 <= s.start, e.start >= s.start), nomatch = NULL
           ][, -"e.start_minus2"]
data.table [1 x 7]
e.id event e.start e.end s.id strike_motive s.end
2 John’s birthday 2023-06-05 11:00:00 2023-06-05 22:00:00 1 Not enough cheese 2023-06-05 20:00:00

  • Within: Which strikes occurred entirely within the period of an event ?
EVENTS[STRIKES, on = .(e.start <= s.start, e.end >= s.end), nomatch = NULL]
data.table [1 x 6]
e.id event e.start e.end s.id strike_motive
2 John’s birthday 2023-06-05 14:00:00 2023-06-05 16:00:00 2 Not enough wine

  • Overlaps: Which events overlap with each-other ?
EVENTS[EVENTS, on = .(e.id < e.id, e.start <= e.end, e.end >= e.start), nomatch = NULL]
data.table [1 x 5]
e.id event e.start e.end i.event
2 Alice’s graduation 2023-06-05 22:00:00 2023-06-05 12:00:00 John’s birthday
setkey(EVENTS, e.start, e.end)

foverlaps(EVENTS, EVENTS, type = "any", mult = "first", nomatch = NULL)[e.id != i.e.id]
data.table [1 x 8]
e.id event e.start e.end i.e.id i.event i.e.start i.e.end
1 Alice’s graduation 2023-06-05 10:00:00 2023-06-05 13:00:00 2 John’s birthday 2023-06-05 12:00:00 2023-06-05 22:00:00

4 Tidyr & Others


4.1 Remove NA

tidyr::drop_na(IRIS, Species)
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]
tidyr::drop_na(IRIS, matches("Sepal"))
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]
na.omit(IRIS, cols = "Species")
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]
na.omit(IRIS, cols = str_subset(colnames(IRIS), "Sepal"))
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]

4.2 Unite

Combine multiple columns into a single one:

mtcars |> tidyr::unite("x", gear, carb, sep = "_")
data.frame [32 x 10]
mpg cyl disp hp drat wt qsec vs am x
21 6 160 110 3.9 2.62 16.46 0 1 4_4
21 6 160 110 3.9 2.875 17.02 0 1 4_4
22.8 4 108 93 3.85 2.32 18.61 1 1 4_1
21.4 6 258 110 3.08 3.215 19.44 1 0 3_1
18.7 8 360 175 3.15 3.44 17.02 0 0 3_2
18.1 6 225 105 2.76 3.46 20.22 1 0 3_1
14.3 8 360 245 3.21 3.57 15.84 0 0 3_4
24.4 4 146.7 62 3.69 3.19 20 1 0 4_2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4_2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4_4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4_4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3_3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3_3
15.2 8 275.8 180 3.07 3.78 18 0 0 3_3
10.4 8 472 205 2.93 5.25 17.98 0 0 3_4
[ omitted 17 entries ]
copy(MT)[, x := paste(gear, carb, sep = "_")][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb x
21 6 160 110 3.9 2.62 16.46 0 1 4 4 4_4
21 6 160 110 3.9 2.875 17.02 0 1 4 4 4_4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 4_1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3_1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3_2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3_1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 3_4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4_2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 4_2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 4_4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 4_4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3_3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3_3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3_3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3_4
[ omitted 17 entries ]
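Unlike tidyr::unite(), the data.table version above keeps the gear and carb columns. If the source columns should be dropped as well, they can be removed by reference in a chained step. A small sketch on a toy table (MT_TOY is made up for the example):

```r
library(data.table)

# Hypothetical toy table with the two columns to unite
MT_TOY <- data.table(mpg = c(21, 22.8), gear = c(4, 4), carb = c(4, 1))

united <- copy(MT_TOY)[
  , x := paste(gear, carb, sep = "_")    # build the united column
][, c("gear", "carb") := NULL][]         # then drop its inputs

print(united)
```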

4.3 Separate / Extract

4.3.1 Separate wider (extract)

(MT.ext <- MT[, .(x = str_c(gear, carb, sep = "_"))])
data.table [32 x 1]
x
4_4
4_4
4_1
3_1
3_2
3_1
3_4
4_2
4_2
4_4
4_4
3_3
3_3
3_3
3_4
[ omitted 17 entries ]

Based on a delimiter:

MT.ext |> separate_wider_delim(x, delim = "_", names = c("gear", "carb"))
data.frame [32 x 2]
gear carb
4 4
4 4
4 1
3 1
3 2
3 1
3 4
4 2
4 2
4 4
4 4
3 3
3 3
3 3
3 4
[ omitted 17 entries ]

Based on a regex:

MT.ext |> separate_wider_regex(x, patterns = c(gear = "\\d{1}", "_", carb = "\\d{1}"))
data.frame [32 x 2]
gear carb
4 4
4 4
4 1
3 1
3 2
3 1
3 4
4 2
4 2
4 4
4 4
3 3
3 3
3 3
3 4
[ omitted 17 entries ]

Based on position:

MT.ext |> separate_wider_position(x, widths  = c(gear = 1, delim = 1, carb = 1))
data.frame [32 x 3]
gear delim carb
4 _ 4
4 _ 4
4 _ 1
3 _ 1
3 _ 2
3 _ 1
3 _ 4
4 _ 2
4 _ 2
4 _ 4
4 _ 4
3 _ 3
3 _ 3
3 _ 3
3 _ 4
[ omitted 17 entries ]
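If the delimiter column is not wanted, tidyr lets us leave the corresponding width unnamed: unnamed components are matched but not included in the output. A minimal sketch on a toy data frame (ext_toy is made up for the example):

```r
library(tidyr)

# Hypothetical toy input with the same "gear_carb" layout
ext_toy <- data.frame(x = c("4_4", "3_1"))

# The middle width is unnamed, so the "_" character is consumed but not returned
split_toy <- separate_wider_position(ext_toy, x, widths = c(gear = 1, 1, carb = 1))

print(split_toy)
```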
Note

separate_wider_* supersedes both extract and separate.

Old syntax
tidyr::separate(MT.ext, x, into = c("gear", "carb"), sep = "_", remove = TRUE)

tidyr::extract(MT.ext, x, into = c("gear", "carb"), regex = "(.*)_(.*)", remove = TRUE)

Based on a delimiter:

copy(MT.ext)[, c("gear", "carb") := tstrsplit(x, "_", fixed = TRUE)][] 
data.table [32 x 3]
x gear carb
4_4 4 4
4_4 4 4
4_1 4 1
3_1 3 1
3_2 3 2
3_1 3 1
3_4 3 4
4_2 4 2
4_2 4 2
4_4 4 4
4_4 4 4
3_3 3 3
3_3 3 3
3_3 3 3
3_4 3 4
[ omitted 17 entries ]

Based on a regex:

copy(MT.ext)[, c("gear", "carb") := str_extract_all(x, "\\d") |> list_transpose()][]
data.table [32 x 3]
x gear carb
4_4 4 4
4_4 4 4
4_1 4 1
3_1 3 1
3_2 3 2
3_1 3 1
3_4 3 4
4_2 4 2
4_2 4 2
4_4 4 4
4_4 4 4
3_3 3 3
3_3 3 3
3_3 3 3
3_4 3 4
[ omitted 17 entries ]

4.3.2 Separate longer/rows

Separating a row into multiple rows, duplicating the rest of the values.

Data

(SP <- data.table(
  val = c(1,"2,3",4), 
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"), origin = "1970-01-01")
  )
)
data.table [3 x 2]
val date
1 2020-01-01
2,3 2020-01-02
4 2020-01-03

Based on a delimiter:

SP |> separate_longer_delim(val, delim = ",")
data.frame [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

Based on position:

SP |> separate_longer_position(val, width = 1) |> filter(val != ",")
data.frame [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03
Warning

separate_longer_* now supersedes separate_rows

Old syntax
SP |> separate_rows(val, sep = ",", convert = TRUE)

Solution 1:

copy(SP)[, c(V1 = strsplit(val, ",", fixed = TRUE), .SD), by = val][, let(val = V1, V1 = NULL)][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

Solution 2:

SP[, strsplit(val, ",", fixed = TRUE), by = val][SP, on = "val"][, let(val = V1, V1 = NULL)][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

Solution 3:

(With type conversion)

SP[, unlist(tstrsplit(val, ",", type.convert = TRUE)), by = val][SP, on = "val"][, let(val = V1, V1 = NULL)][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

Solution 4:

copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))][, val := strsplit(val, ","), by = val][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

(With type conversion)

copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))
       ][, val := strsplit(val, ","), by = val
       ][, val := utils::type.convert(val, as.is = TRUE)][]
data.table [4 x 2]
val date
1 2020-01-01
2 2020-01-02
3 2020-01-02
4 2020-01-03

4.4 Duplicates

4.4.1 Duplicated rows

4.4.1.1 Only keeping duplicated rows

mtcars |> filter(n() > 1, .by = c(mpg, hp))
data.frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
MT[, if(.N > 1) .SD, by = .(mpg, hp)]
data.table [2 x 11]
mpg hp cyl disp drat wt qsec vs am gear carb
21 110 6 160 3.9 2.62 16.46 0 1 4 4
21 110 6 160 3.9 2.875 17.02 0 1 4 4

4.4.1.2 Removing duplicated rows

Note

This is different from distinct/unique, which will keep one of the duplicated rows of each group.

This removes all groups which have duplicated rows.

mtcars |> filter(n() == 1, .by = c(mpg, hp))
data.frame [30 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
[ omitted 15 entries ]
Alternatives
# More convoluted

mtcars |> filter(n() > 1, .by = c(mpg, hp)) |> anti_join(mtcars, y = _)
MT[, if(.N == 1) .SD, by = .(mpg, hp)]
data.table [30 x 11]
mpg hp cyl disp drat wt qsec vs am gear carb
22.8 93 4 108 3.85 2.32 18.61 1 1 4 1
21.4 110 6 258 3.08 3.215 19.44 1 0 3 1
18.7 175 8 360 3.15 3.44 17.02 0 0 3 2
18.1 105 6 225 2.76 3.46 20.22 1 0 3 1
14.3 245 8 360 3.21 3.57 15.84 0 0 3 4
24.4 62 4 146.7 3.69 3.19 20 1 0 4 2
22.8 95 4 140.8 3.92 3.15 22.9 1 0 4 2
19.2 123 6 167.6 3.92 3.44 18.3 1 0 4 4
17.8 123 6 167.6 3.92 3.44 18.9 1 0 4 4
16.4 180 8 275.8 3.07 4.07 17.4 0 0 3 3
17.3 180 8 275.8 3.07 3.73 17.6 0 0 3 3
15.2 180 8 275.8 3.07 3.78 18 0 0 3 3
10.4 205 8 472 2.93 5.25 17.98 0 0 3 4
10.4 215 8 460 3 5.424 17.82 0 0 3 4
14.7 230 8 440 3.23 5.345 17.42 0 0 3 4
[ omitted 15 entries ]
Alternatives
# More convoluted

MT[!MT[, if(.N > 1) .SD, by = .(mpg, hp)], on = colnames(MT)]

fsetdiff(MT, setcolorder(MT[, if(.N > 1) .SD, by = .(mpg, hp)], colnames(MT)))
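
Both operations (keeping only duplicated rows, or dropping every group that has duplicates) can also be expressed with a single logical flag instead of grouping. The sketch below assumes `MT` is simply `as.data.table(mtcars)`; the names `keys` and `in_dup_group` are hypothetical:

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: MT is mtcars as a data.table
keys <- c("mpg", "hp")

# TRUE for every member of a duplicated group, not only the later occurrences
in_dup_group <- duplicated(MT, by = keys) | duplicated(MT, by = keys, fromLast = TRUE)

only_dups <- MT[in_dup_group]   # only the duplicated rows
no_dups   <- MT[!in_dup_group]  # every group with duplicates removed
```

Unlike the grouped versions, this keeps the original column and row order.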

4.4.2 Duplicated values (per row)

(DUPED <- data.table(
    A = c("A1", "A2", "B3", "A4"), 
    B = c("B1", "B2", "B3", "B4"), 
    C = c("A1", "C2", "D3", "C4"), 
    D = c("A1", "D2", "D3", "D4")
  )
)
data.table [4 x 4]
A B C D
A1 B1 A1 A1
A2 B2 C2 D2
B3 B3 D3 D3
A4 B4 C4 D4
mutate(DUPED, Repeats = apply(
    pick(everything()), 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", ")
  )
)
data.table [4 x 5]
A B C D Repeats
A1 B1 A1 A1 A1
A2 B2 C2 D2
B3 B3 D3 D3 B3, D3
A4 B4 C4 D4
copy(DUPED)[
  , Repeats := apply(.SD, 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", "))
  ][]
data.table [4 x 5]
A B C D Repeats
A1 B1 A1 A1 A1
A2 B2 C2 D2
B3 B3 D3 D3 B3, D3
A4 B4 C4 D4

With duplication counter:

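dup_counts <- function(v) {
  # Run-length encode the repeated values; +1 accounts for the first occurrence
  rles <- as.data.table(unclass(rle(v[which(duplicated(v))])))[, lengths := lengths + 1]
  # Format each run as "value (count)"
  paste(apply(rles, 1, \(r) paste0(r[2], " (", r[1], ")")), collapse = ", ")
}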
DUPED |> mutate(Repeats = apply(pick(everything()), 1, \(r) dup_counts(r)))
data.table [4 x 5]
A B C D Repeats
A1 B1 A1 A1 A1 (3)
A2 B2 C2 D2
B3 B3 D3 D3 B3 (2), D3 (2)
A4 B4 C4 D4
DUPED[, Repeats := apply(.SD, 1, \(r) dup_counts(r))][]
data.table [4 x 5]
A B C D Repeats
A1 B1 A1 A1 A1 (3)
A2 B2 C2 D2
B3 B3 D3 D3 B3 (2), D3 (2)
A4 B4 C4 D4
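
Because rle() only merges consecutive equal values, dup_counts() relies on repeated values ending up next to each other. A possible variant based on table() counts values regardless of their position. This is only a sketch: the name `dup_counts_tab` is hypothetical, and it lists values in sorted order rather than in order of appearance:

```r
# Hypothetical alternative to dup_counts(), based on table()
dup_counts_tab <- function(v) {
  counts <- table(v)            # occurrences of each distinct value
  counts <- counts[counts > 1]  # keep only the values that repeat
  if (length(counts) == 0L) return("")
  paste0(names(counts), " (", as.integer(counts), ")", collapse = ", ")
}

dup_counts_tab(c("A1", "B1", "A1", "A1"))
dup_counts_tab(c("A1", "B1", "A1", "B1", "A1"))  # interleaved repeats
```

It can be dropped into the same `apply(.SD, 1, ...)` call as dup_counts().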

4.5 Expand & Complete

Here, the entry for person B in year 2010 is missing, and we want to fill it in:

(CAR <- data.table(
    year = c(2010,2011,2012,2013,2014,2015,2011,2012,2013,2014,2015), 
    person = c("A","A","A","A","A","A", "B","B","B","B","B"),
    car = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes", "Citroen","Citroen", "Citroen", "Toyota", "Toyota")
  )
)
data.table [11 x 3]
year person car
2010 A BMW
2011 A BMW
2012 A AUDI
2013 A AUDI
2014 A AUDI
2015 A Mercedes
2011 B Citroen
2012 B Citroen
2013 B Citroen
2014 B Toyota
2015 B Toyota

4.5.1 Expand

tidyr::expand(CAR, person, year)
data.frame [12 x 2]
person year
A 2010
A 2011
A 2012
A 2013
A 2014
A 2015
B 2010
B 2011
B 2012
B 2013
B 2014
B 2015
CJ(CAR$person, CAR$year, unique = TRUE)
data.table [12 x 2]
V1 V2
A 2010
A 2011
A 2012
A 2013
A 2014
A 2015
B 2010
B 2011
B 2012
B 2013
B 2014
B 2015

4.5.2 Complete

Joins the original dataset with the expanded one:

CAR |> tidyr::complete(person, year)
data.frame [12 x 3]
person year car
A 2010 BMW
A 2011 BMW
A 2012 AUDI
A 2013 AUDI
A 2014 AUDI
A 2015 Mercedes
B 2010 NA
B 2011 Citroen
B 2012 Citroen
B 2013 Citroen
B 2014 Toyota
B 2015 Toyota
CAR[CJ(person, year, unique = TRUE), on = .(person, year)]
data.table [12 x 3]
year person car
2010 A BMW
2011 A BMW
2012 A AUDI
2013 A AUDI
2014 A AUDI
2015 A Mercedes
2010 B NA
2011 B Citroen
2012 B Citroen
2013 B Citroen
2014 B Toyota
2015 B Toyota
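
A self-contained sketch of the expand + complete pattern, with two small additions that are assumptions for illustration: the CJ() arguments are named (so the grid gets proper column names instead of V1/V2), and the missing entry is filled with a placeholder value ("Unknown"):

```r
library(data.table)

CAR <- data.table(
  year   = c(2010, 2011, 2012, 2013, 2014, 2015, 2011, 2012, 2013, 2014, 2015),
  person = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  car    = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes",
             "Citroen", "Citroen", "Citroen", "Toyota", "Toyota")
)

# Expand: every person x year combination, with named columns
grid <- CJ(person = CAR$person, year = CAR$year, unique = TRUE)

# Complete: join the original data onto the grid (missing combinations get NA)
completed <- CAR[grid, on = .(person, year)]

# Fill the gaps with a placeholder (illustrative value, not from the original post)
completed[is.na(car), car := "Unknown"]
```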

4.6 Uncount

Duplicating aggregated rows to get back the un-aggregated version.

Data

cols <- c("Mild", "Moderate", "Severe")

dat_agg
data.frame [10 x 6]
ID Site Domain Mild Moderate Severe
1 23 A1 4 0 0
2 27 A1 0 1 1
3 28 A1 0 1 0
4 29 A1 0 0 1
5 31 A1 0 1 0
6 33 A1 0 1 1
7 41 A1 3 0 1
8 48 A1 0 2 4
9 64 A1 1 0 0
10 66 A1 1 0 0
dat_agg |> 
  pivot_longer(cols = all_of(cols), names_to = "Severity", values_to = "Count") |> 
  uncount(Count) |> 
  mutate(ID_new = row_number(), .after = "ID") |>
  pivot_wider(
    names_from = "Severity", values_from = "Severity", 
    values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill = 0
  )
data.frame [23 x 7]
ID ID_new Site Domain Mild Moderate Severe
1 1 23 A1 1 0 0
1 2 23 A1 1 0 0
1 3 23 A1 1 0 0
1 4 23 A1 1 0 0
2 5 27 A1 0 1 0
2 6 27 A1 0 0 1
3 7 28 A1 0 1 0
4 8 29 A1 0 0 1
5 9 31 A1 0 1 0
6 10 33 A1 0 1 0
6 11 33 A1 0 0 1
7 12 41 A1 1 0 0
7 13 41 A1 1 0 0
7 14 41 A1 1 0 0
7 15 41 A1 0 0 1
[ omitted 8 entries ]

Solution 1:

(melt(DAT_AGG, measure.vars = cols, variable.name = "Severity", value.name = "Count")
  [rep(1:.N, Count)][, ID_new := .I] 
  |> dcast(... ~ Severity, value.var = "Severity", fun.agg = \(x) ifelse(is.na(x), 0, 1), fill = 0)
  |> _[, -"Count"]
)
data.table [23 x 7]
ID Site Domain ID_new Mild Moderate Severe
1 23 A1 1 1 0 0
1 23 A1 2 1 0 0
1 23 A1 3 1 0 0
1 23 A1 4 1 0 0
2 27 A1 10 0 1 0
2 27 A1 16 0 0 1
3 28 A1 11 0 1 0
4 29 A1 17 0 0 1
5 31 A1 12 0 1 0
6 33 A1 13 0 1 0
6 33 A1 18 0 0 1
7 41 A1 19 0 0 1
7 41 A1 5 1 0 0
7 41 A1 6 1 0 0
7 41 A1 7 1 0 0
[ omitted 8 entries ]

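Solution 2 (duplicates each row once per counted event and caps the counts at 1, without splitting the severities into separate one-hot rows):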

DAT_AGG[Reduce(`c`, sapply(mget(cols), \(x) rep(1:.N, x)))
      ][, (cols) := lapply(.SD, \(x) ifelse(x > 1, 1, x)), .SDcols = cols
      ][order(ID)]
data.table [23 x 6]
ID Site Domain Mild Moderate Severe
1 23 A1 1 0 0
1 23 A1 1 0 0
1 23 A1 1 0 0
1 23 A1 1 0 0
2 27 A1 0 1 1
2 27 A1 0 1 1
3 28 A1 0 1 0
4 29 A1 0 0 1
5 31 A1 0 1 0
6 33 A1 0 1 1
6 33 A1 0 1 1
7 41 A1 1 0 1
7 41 A1 1 0 1
7 41 A1 1 0 1
7 41 A1 1 0 1
[ omitted 8 entries ]
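
The core of "uncounting" is repeating each row index as many times as its count. A minimal sketch on a small made-up table (the table `AGG` and its values are hypothetical, chosen only to keep the example short):

```r
library(data.table)

# Hypothetical aggregated data: number of events per ID and severity
AGG <- data.table(ID = 1:3, Mild = c(2L, 0L, 1L), Severe = c(0L, 1L, 1L))

# Long format: one row per ID x severity, with its count
long <- melt(AGG, id.vars = "ID", variable.name = "Severity", value.name = "Count")

# Repeat each row `Count` times (rows with a count of 0 disappear), then drop the counter
events <- long[rep(seq_len(.N), Count)][, Count := NULL][order(ID)]
```

The result has one row per individual event, which is what tidyr::uncount() produces on the long table.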

4.7 List / Unlist

For columns that contain a simple vector/list of values (of the same type, without structure).

4.7.1 One listed column

Single ID (grouping) column:

Data:

MT_LIST
data.table [3 x 2]
cyl mpg
4 <numeric [11]>
6 <numeric [7]>
8 <numeric [14]>
mt_list |> unnest(cols = mpg)
data.frame [32 x 2]
cyl mpg
6 21
6 21
6 21.4
6 18.1
6 19.2
6 17.8
6 19.7
4 22.8
4 24.4
4 22.8
4 32.4
4 30.4
4 33.9
4 21.5
4 27.3
[ omitted 17 entries ]
MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]
data.table [32 x 2]
cyl mpg
4 22.8
4 24.4
4 22.8
4 32.4
4 30.4
4 33.9
4 21.5
4 27.3
4 26
4 30.4
4 21.4
6 21
6 21
6 21.4
6 18.1
[ omitted 17 entries ]

Alternative that avoids grouping: first repeat each row as many times as its list has elements (growing the data.table to its unlisted number of rows), then overwrite the column with the unlisted values:

MT_LIST[rep(MT_LIST[, .I], lengths(mpg))][, mpg := unlist(MT_LIST$mpg)][]
data.table [32 x 2]
cyl mpg
4 22.8
4 24.4
4 22.8
4 32.4
4 30.4
4 33.9
4 21.5
4 27.3
4 26
4 30.4
4 21.4
6 21
6 21
6 21.4
6 18.1
[ omitted 17 entries ]
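
Since the construction of MT_LIST is not shown, here is one way such a table could be built and then unlisted again. This is a sketch assuming `MT` is `as.data.table(mtcars)`:

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: MT is mtcars as a data.table

# List: collapse the mpg values of each cyl group into a single list cell
MT_LIST <- MT[, .(mpg = list(mpg)), keyby = cyl]

# Unlist: expand each list cell back into one row per value
unlisted <- MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]
```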

Multiple ID (grouping) columns:

Data:

mt_list2
data.frame [8 x 3]
cyl gear mpg
6 4 <numeric [4]>
4 4 <numeric [8]>
6 3 <numeric [2]>
8 3 <numeric [12]>
4 3 <numeric [1]>
4 5 <numeric [2]>
8 5 <numeric [2]>
6 5 <numeric [1]>
mt_list2 |> unnest(cols = mpg) # group_by(cyl, gear) is optional
data.frame [32 x 3]
cyl gear mpg
6 4 21
6 4 21
6 4 19.2
6 4 17.8
4 4 22.8
4 4 24.4
4 4 22.8
4 4 32.4
4 4 30.4
4 4 33.9
4 4 27.3
4 4 21.4
6 3 21.4
6 3 18.1
8 3 18.7
[ omitted 17 entries ]

Solution 1:

MT_LIST2[, .(mpg = unlist(mpg)), by = setdiff(colnames(MT_LIST2), 'mpg')]
data.table [32 x 3]
cyl gear mpg
4 3 21.5
4 4 22.8
4 4 24.4
4 4 22.8
4 4 32.4
4 4 30.4
4 4 33.9
4 4 27.3
4 4 21.4
4 5 26
4 5 30.4
6 3 21.4
6 3 18.1
6 4 21
6 4 21
[ omitted 17 entries ]

Solution 2:

MT_LIST2[rep(MT_LIST2[, .I], lengths(mpg))][, mpg := unlist(MT_LIST2$mpg)][]
data.table [32 x 3]
cyl gear mpg
4 3 21.5
4 4 22.8
4 4 24.4
4 4 22.8
4 4 32.4
4 4 30.4
4 4 33.9
4 4 27.3
4 4 21.4
4 5 26
4 5 30.4
6 3 21.4
6 3 18.1
6 4 21
6 4 21
[ omitted 17 entries ]

4.7.2 Multiple listed columns

Data:

mt_list_mult
data.frame [8 x 4]
cyl gear mpg disp
6 4 <numeric [4]> <numeric [4]>
4 4 <numeric [8]> <numeric [8]>
6 3 <numeric [2]> <numeric [2]>
8 3 <numeric [12]> <numeric [12]>
4 3 <numeric [1]> <numeric [1]>
4 5 <numeric [2]> <numeric [2]>
8 5 <numeric [2]> <numeric [2]>
6 5 <numeric [1]> <numeric [1]>
mt_list_mult |> unnest(cols = c(mpg, disp)) # group_by(cyl, gear) is optional
data.frame [32 x 4]
cyl gear mpg disp
6 4 21 160
6 4 21 160
6 4 19.2 167.6
6 4 17.8 167.6
4 4 22.8 108
4 4 24.4 146.7
4 4 22.8 140.8
4 4 32.4 78.7
4 4 30.4 75.7
4 4 33.9 71.1
4 4 27.3 79
4 4 21.4 121
6 3 21.4 258
6 3 18.1 225
8 3 18.7 360
[ omitted 17 entries ]
MT_LIST_MULT[, lapply(.SD, \(c) unlist(c)), by = setdiff(colnames(MT_LIST_MULT), c("mpg", "disp"))]
data.table [32 x 4]
cyl gear mpg disp
4 3 21.5 120.1
4 4 22.8 108
4 4 24.4 146.7
4 4 22.8 140.8
4 4 32.4 78.7
4 4 30.4 75.7
4 4 33.9 71.1
4 4 27.3 79
4 4 21.4 121
4 5 26 120.3
4 5 30.4 95.1
6 3 21.4 258
6 3 18.1 225
6 4 21 160
6 4 21 160
[ omitted 17 entries ]
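
The same round trip with several list columns, as a self-contained sketch (assuming `MT` is `as.data.table(mtcars)`; `.SDcols` is used here to name the list columns explicitly instead of computing the grouping columns with setdiff()):

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: MT is mtcars as a data.table

# List two columns at once, per (cyl, gear) group
MT_LIST_MULT <- MT[, .(mpg = list(mpg), disp = list(disp)), by = .(cyl, gear)]

# Unlist both columns in parallel; values stay paired row by row
flat <- MT_LIST_MULT[, lapply(.SD, unlist), by = .(cyl, gear), .SDcols = c("mpg", "disp")]
```

This only works as intended when the list cells of a given row all have the same length, which is the case here by construction.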

4.8 Nest / Unnest

For columns that contain a data.table/data.frame (with multiple columns, structured).

4.8.1 One nested column

Nesting

mtcars |> tidyr::nest(data = -cyl) # Data is inside tibbles
data.frame [3 x 2]
cyl data
6 <tbl_df [7 x 10]>
4 <tbl_df [11 x 10]>
8 <tbl_df [14 x 10]>
Alternatives
mtcars |> nest_by(cyl) |> ungroup() # Data is inside vctrs_list_of. Returns a rowwise() df

Nesting while keeping the grouping variable inside the nested tables:

mtcars |> tidyr::nest(data = everything(), .by = cyl)
data.frame [3 x 2]
cyl data
6 <tbl_df [7 x 11]>
4 <tbl_df [11 x 11]>
8 <tbl_df [14 x 11]>
MT[, .(data = .(.SD)), keyby = cyl]
data.table [3 x 2]
cyl data
4 <data.table [11 x 10]>
6 <data.table [7 x 10]>
8 <data.table [14 x 10]>

Nesting while keeping the grouping variable inside the nested tables:

MT[, .(data = list(data.table(cyl, .SD))), keyby = cyl]
data.table [3 x 2]
cyl data
4 <data.table [11 x 11]>
6 <data.table [7 x 11]>
8 <data.table [14 x 11]>

Unnesting

Data:

mtcars_nest <- mtcars |> tidyr::nest(data = -cyl)

MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]
mtcars_nest |> unnest(cols = data) |> ungroup()
data.frame [32 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
6 21 160 110 3.9 2.62 16.46 0 1 4 4
6 21 160 110 3.9 2.875 17.02 0 1 4 4
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1
6 19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
6 17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
[ omitted 17 entries ]
MT_NEST[, rbindlist(data), keyby = cyl] # MT_NEST[, do.call(c, data), keyby = cyl]
data.table [32 x 11]
cyl mpg disp hp drat wt qsec vs am gear carb
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1
4 26 120.3 91 4.43 2.14 16.7 0 1 5 2
4 30.4 95.1 113 3.77 1.513 16.9 1 1 5 2
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2
6 21 160 110 3.9 2.62 16.46 0 1 4 4
6 21 160 110 3.9 2.875 17.02 0 1 4 4
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1
[ omitted 17 entries ]
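
A quick way to convince oneself that nesting and unnesting are inverse operations is to run the round trip and compare sizes. A minimal sketch, assuming `MT` is `as.data.table(mtcars)` (the names `MT_FLAT` and `rows_per_group` are hypothetical):

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: MT is mtcars as a data.table

# Nest: one row per cyl, the remaining columns stored as a nested data.table
MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]
rows_per_group <- sapply(MT_NEST$data, nrow)

# Unnest: bind the nested tables back together, keyed by cyl
MT_FLAT <- MT_NEST[, rbindlist(data), keyby = cyl]

# Restore the original column order for easier comparison
setcolorder(MT_FLAT, names(MT))
```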

4.8.2 Multiple nested columns

Nesting:

(mtcars |> nest(data1 = c(mpg, hp), data2 = !c(cyl, gear, mpg, hp), .by = c(cyl, gear)) -> mt_nest_mult)
data.frame [8 x 4]
cyl gear data1 data2
6 4 <tbl_df [4 x 2]> <tbl_df [4 x 7]>
4 4 <tbl_df [8 x 2]> <tbl_df [8 x 7]>
6 3 <tbl_df [2 x 2]> <tbl_df [2 x 7]>
8 3 <tbl_df [12 x 2]> <tbl_df [12 x 7]>
4 3 <tbl_df [1 x 2]> <tbl_df [1 x 7]>
4 5 <tbl_df [2 x 2]> <tbl_df [2 x 7]>
8 5 <tbl_df [2 x 2]> <tbl_df [2 x 7]>
6 5 <tbl_df [1 x 2]> <tbl_df [1 x 7]>
(MT[, .(data1 = .(.SD[, .(mpg, hp)]), data2 = .(.SD[, !c("mpg", "hp")])), by = .(cyl, gear)] -> MT_NEST_MULT)
data.table [8 x 4]
cyl gear data1 data2
6 4 <data.table [4 x 2]> <data.table [4 x 7]>
4 4 <data.table [8 x 2]> <data.table [8 x 7]>
6 3 <data.table [2 x 2]> <data.table [2 x 7]>
8 3 <data.table [12 x 2]> <data.table [12 x 7]>
4 3 <data.table [1 x 2]> <data.table [1 x 7]>
4 5 <data.table [2 x 2]> <data.table [2 x 7]>
8 5 <data.table [2 x 2]> <data.table [2 x 7]>
6 5 <data.table [1 x 2]> <data.table [1 x 7]>

Unnesting:

mt_nest_mult |> unnest(cols = c(data1, data2))
data.frame [32 x 11]
cyl gear mpg hp disp drat wt qsec vs am carb
6 4 21 110 160 3.9 2.62 16.46 0 1 4
6 4 21 110 160 3.9 2.875 17.02 0 1 4
6 4 19.2 123 167.6 3.92 3.44 18.3 1 0 4
6 4 17.8 123 167.6 3.92 3.44 18.9 1 0 4
4 4 22.8 93 108 3.85 2.32 18.61 1 1 1
4 4 24.4 62 146.7 3.69 3.19 20 1 0 2
4 4 22.8 95 140.8 3.92 3.15 22.9 1 0 2
4 4 32.4 66 78.7 4.08 2.2 19.47 1 1 1
4 4 30.4 52 75.7 4.93 1.615 18.52 1 1 2
4 4 33.9 65 71.1 4.22 1.835 19.9 1 1 1
4 4 27.3 66 79 4.08 1.935 18.9 1 1 1
4 4 21.4 109 121 4.11 2.78 18.6 1 1 2
6 3 21.4 110 258 3.08 3.215 19.44 1 0 1
6 3 18.1 105 225 2.76 3.46 20.22 1 0 1
8 3 18.7 175 360 3.15 3.44 17.02 0 0 2
[ omitted 17 entries ]

Using a pattern to specify the columns to unnest:

mt_nest_mult |> unnest(cols = matches("data"))
data.frame [32 x 11]
cyl gear mpg hp disp drat wt qsec vs am carb
6 4 21 110 160 3.9 2.62 16.46 0 1 4
6 4 21 110 160 3.9 2.875 17.02 0 1 4
6 4 19.2 123 167.6 3.92 3.44 18.3 1 0 4
6 4 17.8 123 167.6 3.92 3.44 18.9 1 0 4
4 4 22.8 93 108 3.85 2.32 18.61 1 1 1
4 4 24.4 62 146.7 3.69 3.19 20 1 0 2
4 4 22.8 95 140.8 3.92 3.15 22.9 1 0 2
4 4 32.4 66 78.7 4.08 2.2 19.47 1 1 1
4 4 30.4 52 75.7 4.93 1.615 18.52 1 1 2
4 4 33.9 65 71.1 4.22 1.835 19.9 1 1 1
4 4 27.3 66 79 4.08 1.935 18.9 1 1 1
4 4 21.4 109 121 4.11 2.78 18.6 1 1 2
6 3 21.4 110 258 3.08 3.215 19.44 1 0 1
6 3 18.1 105 225 2.76 3.46 20.22 1 0 1
8 3 18.7 175 360 3.15 3.44 17.02 0 0 2
[ omitted 17 entries ]
MT_NEST_MULT[, c(rbindlist(data1), rbindlist(data2)), keyby = .(cyl, gear)]
data.table [32 x 11]
cyl gear mpg hp disp drat wt qsec vs am carb
4 3 21.5 97 120.1 3.7 2.465 20.01 1 0 1
4 4 22.8 93 108 3.85 2.32 18.61 1 1 1
4 4 24.4 62 146.7 3.69 3.19 20 1 0 2
4 4 22.8 95 140.8 3.92 3.15 22.9 1 0 2
4 4 32.4 66 78.7 4.08 2.2 19.47 1 1 1
4 4 30.4 52 75.7 4.93 1.615 18.52 1 1 2
4 4 33.9 65 71.1 4.22 1.835 19.9 1 1 1
4 4 27.3 66 79 4.08 1.935 18.9 1 1 1
4 4 21.4 109 121 4.11 2.78 18.6 1 1 2
4 5 26 91 120.3 4.43 2.14 16.7 0 1 2
4 5 30.4 113 95.1 3.77 1.513 16.9 1 1 2
6 3 21.4 110 258 3.08 3.215 19.44 1 0 1
6 3 18.1 105 225 2.76 3.46 20.22 1 0 1
6 4 21 110 160 3.9 2.62 16.46 0 1 4
6 4 21 110 160 3.9 2.875 17.02 0 1 4
[ omitted 17 entries ]

Using a pattern to specify the columns to unnest:

MT_NEST_MULT[, 
  do.call(c, unname(lapply(.SD, \(c) rbindlist(c)))), .SDcols = patterns('data'), 
  keyby = .(cyl, gear)
]
data.table [32 x 11]
cyl gear mpg hp disp drat wt qsec vs am carb
4 3 21.5 97 120.1 3.7 2.465 20.01 1 0 1
4 4 22.8 93 108 3.85 2.32 18.61 1 1 1
4 4 24.4 62 146.7 3.69 3.19 20 1 0 2
4 4 22.8 95 140.8 3.92 3.15 22.9 1 0 2
4 4 32.4 66 78.7 4.08 2.2 19.47 1 1 1
4 4 30.4 52 75.7 4.93 1.615 18.52 1 1 2
4 4 33.9 65 71.1 4.22 1.835 19.9 1 1 1
4 4 27.3 66 79 4.08 1.935 18.9 1 1 1
4 4 21.4 109 121 4.11 2.78 18.6 1 1 2
4 5 26 91 120.3 4.43 2.14 16.7 0 1 2
4 5 30.4 113 95.1 3.77 1.513 16.9 1 1 2
6 3 21.4 110 258 3.08 3.215 19.44 1 0 1
6 3 18.1 105 225 2.76 3.46 20.22 1 0 1
6 4 21 110 160 3.9 2.62 16.46 0 1 4
6 4 21 110 160 3.9 2.875 17.02 0 1 4
[ omitted 17 entries ]

4.8.3 Operate on nested/list columns

Data:

mt_nest
data.frame [3 x 2]
cyl data
6 <tbl_df [7 x 10]>
4 <tbl_df [11 x 10]>
8 <tbl_df [14 x 10]>

Creating a new column using the nested data:

Keeping the nested column:

mt_nest |> mutate(sum = sum(unlist(data)), .by = cyl)
data.frame [3 x 3]
cyl data sum
6 <tbl_df [7 x 10]> 2 508.16
4 <tbl_df [11 x 10]> 2 719.233
8 <tbl_df [14 x 10]> 8 516.809

Dropping the nested column:

mt_nest |> summarize(sum = sum(unlist(data)), .by = cyl)
data.frame [3 x 2]
cyl sum
6 2 508.16
4 2 719.233
8 8 516.809

Keeping the nested column:

copy(MT_NEST)[, sum := sapply(data, \(r) sum(r)), by = cyl][]
data.table [3 x 3]
cyl data sum
6 <data.table [7 x 10]> 2 508.16
4 <data.table [11 x 10]> 2 719.233
8 <data.table [14 x 10]> 8 516.809

Dropping the nested column:

MT_NEST[, .(sum = sapply(data, \(r) sum(r))), by = cyl]
data.table [3 x 2]
cyl sum
6 2 508.16
4 2 719.233
8 8 516.809

Creating multiple new columns using the nested data:

linreg <- \(data) lm(mpg ~ hp, data = data) |> broom::tidy()
mt_nest |> group_by(cyl) |> group_modify(\(d, g) linreg(unnest(d, everything()))) |> ungroup()
data.frame [6 x 6]
cyl term estimate std.error statistic p.value
4 (Intercept) 35.983 5.201 6.918 0
4 hp −0.113 0.061 −1.843 0.098
6 (Intercept) 20.674 3.304 6.256 0.002
6 hp −0.008 0.027 −0.286 0.786
8 (Intercept) 18.08 2.988 6.052 0
8 hp −0.014 0.014 −1.025 0.326
MT_NEST[, rbindlist(lapply(data, \(ndt) linreg(ndt))), keyby = cyl][]
data.table [6 x 6]
cyl term estimate std.error statistic p.value
4 (Intercept) 35.983 5.201 6.918 0
4 hp −0.113 0.061 −1.843 0.098
6 (Intercept) 20.674 3.304 6.256 0.002
6 hp −0.008 0.027 −0.286 0.786
8 (Intercept) 18.08 2.988 6.052 0
8 hp −0.014 0.014 −1.025 0.326
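
When only the coefficients are needed, the same per-group regression can be sketched without broom by turning coef() into a list, which data.table spreads across columns. This assumes `MT_NEST` is built as shown earlier; the output column names `intercept` and `slope_hp` are hypothetical:

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: MT is mtcars as a data.table
MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]

# Within each group, `data` is a list of length 1 holding the nested table
coefs <- MT_NEST[, as.list(coef(lm(mpg ~ hp, data = data[[1]]))), keyby = cyl]

# Rename "(Intercept)" and "hp" to friendlier (hypothetical) names
setnames(coefs, c("cyl", "intercept", "slope_hp"))
```

This yields one row per group (wide), whereas the broom::tidy() version yields one row per term (long).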

Operating inside the nested data:

mt_nest |> 
  mutate(data = map(data, \(t) mutate(t, sum = pmap_dbl(pick(everything()), sum)))) |> 
  unnest(data)
data.frame [32 x 12]
cyl mpg disp hp drat wt qsec vs am gear carb sum
6 21 160 110 3.9 2.62 16.46 0 1 4 4 322.98
6 21 160 110 3.9 2.875 17.02 0 1 4 4 323.795
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1 420.135
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1 379.54
6 19.2 167.6 123 3.92 3.44 18.3 1 0 4 4 344.46
6 17.8 167.6 123 3.92 3.44 18.9 1 0 4 4 343.66
6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 373.59
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1 255.58
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2 266.98
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2 295.57
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1 209.85
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2 191.165
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1 202.955
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1 269.775
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1 204.215
[ omitted 17 entries ]
Alternatives
mt_nest |> 
  mutate(across(data, \(ts) map(ts, \(t) mutate(t, sum = apply(pick(everything()), 1, sum))))) |> 
  unnest(data)
Using the nplyr package
library(nplyr)

mt_nest |> 
  nplyr::nest_mutate(data, sum = apply(pick(everything()), 1, sum)) |> 
  unnest(data)
copy(MT_NEST)[, data := lapply(data, \(dt) dt[, sum := apply(.SD, 1, sum)])
            ][, rbindlist(data), keyby = cyl]
data.table [32 x 12]
cyl mpg disp hp drat wt qsec vs am gear carb sum
4 22.8 108 93 3.85 2.32 18.61 1 1 4 1 255.58
4 24.4 146.7 62 3.69 3.19 20 1 0 4 2 266.98
4 22.8 140.8 95 3.92 3.15 22.9 1 0 4 2 295.57
4 32.4 78.7 66 4.08 2.2 19.47 1 1 4 1 209.85
4 30.4 75.7 52 4.93 1.615 18.52 1 1 4 2 191.165
4 33.9 71.1 65 4.22 1.835 19.9 1 1 4 1 202.955
4 21.5 120.1 97 3.7 2.465 20.01 1 0 3 1 269.775
4 27.3 79 66 4.08 1.935 18.9 1 1 4 1 204.215
4 26 120.3 91 4.43 2.14 16.7 0 1 5 2 268.57
4 30.4 95.1 113 3.77 1.513 16.9 1 1 5 2 269.683
4 21.4 121 109 4.11 2.78 18.6 1 1 4 2 284.89
6 21 160 110 3.9 2.62 16.46 0 1 4 4 322.98
6 21 160 110 3.9 2.875 17.02 0 1 4 4 323.795
6 21.4 258 110 3.08 3.215 19.44 1 0 3 1 420.135
6 18.1 225 105 2.76 3.46 20.22 1 0 3 1 379.54
[ omitted 17 entries ]
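
A possible variant of the data.table version, sketched under two assumptions: rowSums() replaces `apply(.SD, 1, sum)` (vectorised over rows), and each nested table is copied before being modified so that the nested originals are left untouched. The column name `total` is hypothetical:

```r
library(data.table)

MT <- as.data.table(mtcars)  # assumption: MT is mtcars as a data.table
MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]

# Add a row-wise total inside each nested table, then unnest in the same step
flat <- MT_NEST[
  , rbindlist(lapply(data, \(dt) copy(dt)[, total := rowSums(.SD)])),
  keyby = cyl
]
```

Because the nested tables do not contain cyl, the total is computed over the ten remaining columns, as in the examples above.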

4.9 Rotate / Transpose

(MT_SUMMARY <- MT[, tidy(summary(mpg)), by = cyl])
data.table [3 x 7]
cyl minimum q1 median mean q3 maximum
6 17.8 18.65 19.7 19.743 21 21.4
4 21.4 22.8 26 26.664 30.4 33.9
8 10.4 14.4 15.2 15.1 16.25 19.2

Using pivots:

MT_SUMMARY |> 
  pivot_longer(!cyl, names_to = "Statistic") |> 
  pivot_wider(id_cols = "Statistic", names_from = "cyl", names_prefix = "Cyl ")
data.frame [6 x 4]
Statistic Cyl 6 Cyl 4 Cyl 8
minimum 17.8 21.4 10.4
q1 18.65 22.8 14.4
median 19.7 26 15.2
mean 19.743 26.664 15.1
q3 21 30.4 16.25
maximum 21.4 33.9 19.2
MT_SUMMARY |> 
  melt(id.vars = "cyl", variable.name = "Statistic") |> 
  dcast(Statistic ~ paste0("Cyl ", cyl))
data.table [6 x 4]
Statistic Cyl 4 Cyl 6 Cyl 8
minimum 21.4 17.8 10.4
q1 22.8 18.65 14.4
median 26 19.7 15.2
mean 26.664 19.743 15.1
q3 30.4 21 16.25
maximum 33.9 21.4 19.2

With dedicated functions:

# No dedicated function exists for this in the Tidyverse, AFAIK
data.table::transpose(MT_SUMMARY, keep.names = "Statistic", make.names = 1)
data.table [6 x 4]
Statistic 6 4 8
minimum 17.8 21.4 10.4
q1 18.65 22.8 14.4
median 19.7 26 15.2
mean 19.743 26.664 15.1
q3 21 30.4 16.25
maximum 21.4 33.9 19.2
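
To make the roles of `keep.names` and `make.names` more concrete, here is a minimal sketch on a small made-up summary table (the table `S` and its column names are hypothetical; the values are taken from the output above):

```r
library(data.table)

# Hypothetical two-group summary
S <- data.table(cyl = c(4, 6), min = c(21.4, 17.8), max = c(33.9, 21.4))

# keep.names: name of the new column holding the old column names
# make.names: column whose values become the new column names
rotated <- transpose(S, keep.names = "Statistic", make.names = "cyl")
```

Here `make.names` is given as a column name rather than a position, which avoids depending on the column order.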

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       Ubuntu 22.04.3 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       Europe/Paris
 date     2024-02-07
 pandoc   3.1.11
 Quarto   1.5.9

─ Packages ───────────────────────────────────────────────────────────────────
 ! package    * version date (UTC) lib source
 P broom      * 1.0.5   2023-06-09 [?] CRAN (R 4.3.0)
 P crayon     * 1.5.2   2022-09-29 [?] CRAN (R 4.3.0)
 P data.table * 1.15.0  2024-01-30 [?] CRAN (R 4.3.1)
 P dplyr      * 1.1.4   2023-11-17 [?] CRAN (R 4.3.1)
 P ggplot2    * 3.4.4   2023-10-12 [?] CRAN (R 4.3.1)
 P gt         * 0.10.0  2023-10-07 [?] CRAN (R 4.3.1)
 P here       * 1.0.1   2020-12-13 [?] CRAN (R 4.3.0)
 P knitr      * 1.44    2023-09-11 [?] CRAN (R 4.3.0)
 P lubridate  * 1.9.3   2023-09-27 [?] CRAN (R 4.3.1)
 P pipebind   * 0.1.2   2023-08-30 [?] CRAN (R 4.3.0)
 P purrr      * 1.0.2   2023-08-10 [?] CRAN (R 4.3.0)
 P stringr    * 1.5.0   2022-12-02 [?] CRAN (R 4.3.0)
 P tibble     * 3.2.1   2023-03-20 [?] CRAN (R 4.3.0)
 P tidyr      * 1.3.0   2023-01-24 [?] CRAN (R 4.3.0)

 [1] /home/mar/Dev/Projects/R/ma-riviere.com/renv/library/R-4.3/x86_64-pc-linux-gnu
 [2] /home/mar/.cache/R/renv/sandbox/R-4.3/x86_64-pc-linux-gnu/9a444a72

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

Citation

BibTeX citation:
@online{rivière2022,
  author = {Rivière, Marc-Aurèle},
  title = {Data Wrangling with Data.table and the {Tidyverse}},
  date = {2022-05-19},
  url = {https://ma-riviere.com/content/code/posts/data.table},
  langid = {en},
  abstract = {This post showcases various ways to accomplish most data
    wrangling operations, from basic filtering/mutating to pivots and
    non-equi joins, with both `data.table` and the Tidyverse (`dplyr`,
    `tidyr`, `purrr`, `stringr`).}
}
For attribution, please cite this work as:
Rivière, M.-A. (2022, May 19). Data wrangling with data.table and the Tidyverse. https://ma-riviere.com/content/code/posts/data.table