Summary

This post showcases various ways to accomplish most data wrangling operations, from basic filtering/mutating to pivots and non-equi joins, with both data.table and the Tidyverse (dplyr, tidyr, purrr, stringr).

Version History

v1: 2022-05-19

v2: 2022-05-26

Improved the section on keys (for ordering & filtering)
Added a section for translations of Tidyr (and other similar packages)
Capped tables to display 15 rows max when unfolded
Improved table display (striping, hiding the contents of nested columns, …)

v3: 2022-07-20

Updated data.table’s examples of dynamic programming using env
Added new entries in processing examples
Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
Added pivot_wider examples to match the dcast ones in the Pivots section
Added some new examples here and there across the Basic Operations section
Added an entry for operating inside nested data.frames/data.tables
Added a processing example for run-length encoding (i.e. successive event tagging)

v4: 2022-08-05

Improved pivot section: example of one-hot encoding (and reverse operation) + better examples of partial pivots with .value
Added tidyr::uncount() (row duplication) example.
Improved both light & dark themes (code highlight, tables, …)

v5: 2023-03-12

Revamped the whole document with grouped tabsets by framework for better readability
Revamped the whole Basic Operations section: better structure, reworked examples, …
Revamped the whole Joins section: better structure, new examples (e.g. join_by), better explanations, …
Updated code to reflect recent updates of the Tidyverse:
- dplyr (1.1.0): .by, reframe, join_by, consecutive_id, …
- purrr (1.0.0): list_rbind, list_cbind, …
- tidyr (1.3.0): updated the separate/separate_rows section to the newer separate_wider/longer_*
Updated code to reflect recent updates of data.table (1.14.9): let, DT(), …

Setup

library(here)        # Working directory management
library(pipebind)    # Piping goodies

library(data.table)  # Fast data manipulation (in-RAM)

library(tibble)      # Extending data.frames             (Tidyverse)
library(dplyr)       # Manipulating data.frames - core   (Tidyverse)
library(tidyr)       # Manipulating data.frames - extras (Tidyverse)
library(stringr)     # Manipulating strings              (Tidyverse)
library(purrr)       # Manipulating lists                (Tidyverse)
library(lubridate)   # Manipulating date/time            (Tidyverse)

library(broom)       # Tidying up models output          (Tidymodels)

data.table::setDTthreads(parallel::detectCores(logical = FALSE))

Quarto/knitr setup

Applying a custom theme to all gt tables

#-----------------------#
####🔺gt knit_prints ####
#-----------------------#

library(knitr)
library(gt)

knit_print.grouped_df <- function(x, options, ...) {
  if ("grouped_df" %in% class(x)) x <- ungroup(x)
  
  cl <- intersect(class(x), c("data.table", "data.frame"))[1]
  nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
  is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
  
  cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
  cat("<summary>\n")
  cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
  cat("</summary>\n<br>\n")
  print(gt::as_raw_html(style_table(x, nrows)))
  cat("</details>\n\n")
}

registerS3method("knit_print", "grouped_df", knit_print.grouped_df)

knit_print.data.frame <- function(x, options, ...) {
  cl <- intersect(class(x), c("data.table", "data.frame"))[1]
  nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
  is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
  
  cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
  cat("<summary>\n")
  cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
  cat("</summary>\n<br>\n")
  print(gt::as_raw_html(style_table(x, nrows)))
  cat("</details>\n\n")
}

registerS3method("knit_print", "data.frame", knit_print.data.frame)

1 Basic Operations

data.table general syntax:

DT[row selector (filter/sort), col selector (select/mutate/summarize/reframe/rename), modifiers (group/join by)]
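As a minimal sketch of how the three slots combine in one call, the example below filters (i), summarizes (j), and groups (by) a small toy table. The table `DT` and its columns `grp` and `val` are made up for illustration and are not part of the datasets used in this post.

```r
library(data.table)

# Toy table (hypothetical data, only for illustration)
DT <- data.table(
  grp = c("a", "a", "b", "b", "b"),
  val = c(1, 5, 2, 8, 10)
)

# i:  keep the rows where val > 1
# j:  count the remaining rows and sum val
# by: produce one result row per value of grp
res <- DT[val > 1, .(n = .N, total = sum(val)), by = grp]
res
```

The filter in i is applied first, so the summary in j only sees the rows that passed it.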

Data

MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]

1.1 Arrange / Order

mtcars |> arrange(desc(cyl))

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21	6	160	110	3.9	2.62	16.46	0	1	4	4
[ omitted 17 entries ]

mtcars |> arrange(desc(cyl), gear)

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
[ omitted 17 entries ]

MT[order(-cyl)]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21	6	160	110	3.9	2.62	16.46	0	1	4	4
[ omitted 17 entries ]

MT[order(-cyl, gear)]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
[ omitted 17 entries ]

Alternatives

MT[order(cyl, decreasing = TRUE)] # fsort() returns sorted values, not row positions, so it can't be used in i

setorder(MT, -cyl, gear)[]

setorderv(MT, c("cyl", "gear"), c(-1, 1))[]

Ordering on a character column

IRIS[chorder(Species)]

data.table [150 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 135 entries ]

1.1.2 Ordering with keys

Keys physically reorder the dataset in RAM (by reference)
- No memory is used for sorting (other than marking which columns are the key)
The dataset is marked with a “sorted” attribute
The dataset is always sorted in ascending order, with NAs first
Using keyby instead of by when grouping will set the grouping columns as keys
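The properties listed above can be checked on a small example. This is only a sketch on a made-up table (`DT`, `id` and `grp` are hypothetical names), not a benchmark:

```r
library(data.table)

# Toy table with an NA in the future key column (hypothetical data)
DT <- data.table(
  id  = c(3L, 1L, NA, 2L),
  grp = c("b", "a", "b", "a")
)

setkey(DT, id)       # Reorders DT by reference, no reassignment needed

DT$id                # NA comes first, then ascending values
attr(DT, "sorted")   # The key is stored in the "sorted" attribute

# keyby groups and sets the grouping column as the key of the result
agg <- DT[, .(n = .N), keyby = grp]
key(agg)
```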

Tip

See this SO post for more information on keys.

setkey(MT, cyl, gear)

setkeyv(MT, c("cyl", "gear"))

MT

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
[ omitted 17 entries ]

To see over which keys (if any) the dataset is currently ordered:

haskey(MT)

[1] TRUE

key(MT)

[1] "cyl"  "gear"

Warning

Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.

1.1.3 Ordering with (secondary) indices

setindex creates an index for the provided columns, but doesn’t physically reorder the dataset in RAM.
It computes the row ordering for the provided columns and stores it in an additional attribute called index.

setindex(MT, cyl, gear)

setindexv(MT, c("cyl", "gear"))

MT

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

We can see the additional index attribute added to the data.table:

names(attributes(MT))

[1] "names"             "row.names"         "class"            
[4] ".internal.selfref" "index"

We can get the currently used indices with:

indices(MT)

[1] "cyl__gear"

Adding a new index doesn’t remove a previously existing one:

setindex(MT, hp)

indices(MT)

[1] "cyl__gear" "hp"

We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will frequently group or subset by.

1.2 Subset / Filter

1.2.1 Basic filtering

Tidyverse
data.table

mtcars |> filter(cyl >= 6 & disp < 180)

data.frame [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6

iris |> filter(Species %in% c("setosa"))

data.frame [50 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 35 entries ]

MT[cyl >= 6 & disp < 180]

data.table [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6

IRIS[Species %chin% c("setosa")]

data.table [50 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 35 entries ]

For non-regex character filtering, use %chin% (which is a character-optimized version of %in%)
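A quick sketch showing that %chin% gives the same answer as %in% on character vectors (the vectors `x` and `targets` are made up for illustration). As far as I know, %chin% only accepts character input, so %in% remains the option for other types:

```r
library(data.table)

# Hypothetical character vector, including a missing value
x <- c("setosa", "virginica", NA, "versicolor")
targets <- c("setosa", "versicolor")

res_chin <- x %chin% targets  # Character-only matching from data.table
res_in   <- x %in% targets    # Base R equivalent

identical(res_chin, res_in)   # Same logical vector either way
```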

1.2.2 Filter based on a range

mtcars |> filter(between(disp, 200, 300))

data.frame [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3

MT[disp %between% c(200, 300)]

data.table [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3

1.2.3 Filter with a pattern

mtcars |> filter(str_detect(disp, "^\\d{3}\\."))

data.frame [9 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2

MT[disp %like% "^\\d{3}\\."]

data.table [9 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2

Variants

IRIS[Species %flike% "set"] # Fixed (not regex)

IRIS[Species %ilike% "Set"] # Ignore case

IRIS[Species %plike% "(?=set)"] # Perl-like regex

1.2.4 Filter on row number (slicing)

Tidyverse
data.table

mtcars |> slice(1) # slice_head(n = 1)

data.frame [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4

mtcars |> slice(n()) # slice_tail(n = 1)

data.frame [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

Slice a random sample of rows:

mtcars |> slice_sample(n = 5)

data.frame [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
15	8	301	335	3.54	3.57	14.6	0	1	5	8
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

MT[1]

data.table [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4

MT[.N]

data.table [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

Slice a random sample of rows:

MT[sample(.N, 5)]

data.table [5 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4

1.2.5 Filter distinct/unique rows

Tidyverse
data.table

mtcars |> distinct(mpg, hp, .keep_all = TRUE)

data.frame [31 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
[ omitted 16 entries ]

Number of unique rows/values

n_distinct(mtcars$gear)

[1] 3

unique(MT, by = c("mpg", "hp")) # cols = other_cols_to_keep

data.table [31 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
[ omitted 16 entries ]

Number of unique rows/values

uniqueN(MT, by = "gear")

[1] 3

1.2.6 Filter by keys

When keys or indices are defined, we can filter based on them, which is often a lot faster.

Tip

We do not even need to specify the column names we are filtering on: the values are matched to the key columns in order.

setkey(MT, cyl)

MT[.(6)] # Equivalent to MT[cyl == 6]

data.table [7 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6

setkey(MT, cyl, gear)

MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]

data.table [4 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4

1.2.7 Filter by indices

To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn’t already exist).

IRIS["setosa", on = "Species"]

data.table [50 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 35 entries ]

Since the time to compute the secondary indices is quite small, we don’t have to use setindex, unless the task involves repeated subsetting on the same columns.

Tip

When using on with multiple values, the nomatch = NULL argument avoids creating combinations that do not exist in the original data (i.e. for cyl == 5 here)

MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]

data.table [12 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
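To make the effect of nomatch visible, the sketch below runs the same lookup with and without nomatch = NULL on a small made-up table (`DT` and its `id` column are hypothetical):

```r
library(data.table)

# Toy table: there is no row with cyl == 5 (hypothetical data)
DT <- data.table(
  cyl  = c(4, 4, 6, 8),
  gear = c(4, 5, 4, 3),
  id   = 1:4
)

# Default behaviour: the unmatched combination (cyl = 5, gear = 4)
# is returned as a row filled with NA for the other columns
with_na <- DT[.(4:6, 4), on = c("cyl", "gear")]

# nomatch = NULL: unmatched combinations are dropped
no_na <- DT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]

with_na
no_na
```

The default therefore behaves like a right join on the lookup values, while nomatch = NULL behaves like an inner join.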

1.2.8 Filtering on multiple columns

Filtering with one function taking multiple columns:

f_dat <- \(d) with(d, gear > cyl) # Function taking the data and comparing fixed columns

f_dyn <- \(x, y) x > y # Function taking dynamic columns and comparing them

cols <- c("gear", "cyl")

Tidyverse
data.table

Manually:

mtcars |> filter(f_dyn(gear, cyl))

data.frame [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

Dynamically:

Taking column names:

mtcars |> filter(f_dyn(!!!syms(cols)))

data.frame [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

Taking the data:

mtcars |> filter(f_dat(pick(everything()))) # cur_data() is deprecated since dplyr 1.1.0

data.frame [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

Manually:

MT[f_dyn(gear, cyl),]

data.table [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

Dynamically:

Taking column names:

MT[do.call(f_dyn, args), env = list(args = as.list(cols))] # exec(f_dyn, !!!args)

data.table [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

Taking the data:

MT[f_dat(MT),] # Can't use .SD in i

data.table [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

In two steps:

We can’t use .SD in the i clause of a data.table

But we can bypass that constraint by doing the operation in two steps:
- Obtaining a logical vector stating whether each row of the table matches the conditions
- Filtering the original table based on that vector

MT[MT[, f_dat(.SD)]]

data.table [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2

Combining multiple filtering functions:

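This function returns TRUE for values that have 2 or more decimal digits, and we’re going to call it on multiple columns. Note that whole numbers with 2 or more digits (e.g. qsec = 18) also pass, because they contain no decimal point for the pattern to strip: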

decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 2

cols <- c("drat", "wt", "qsec")
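To see what decp returns before applying it to whole columns, here is a small check on a few hand-picked values (the input vector is made up for illustration; decp is copied from above so the snippet runs on its own):

```r
library(stringr)

# Same helper as above, repeated to keep the snippet self-contained
decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 2

# 3.9  -> "9"  (1 character)  -> FALSE
# 3.85 -> "85" (2 characters) -> TRUE
# 18   -> "18" (no decimal point, nothing removed) -> TRUE
# 5    -> "5"  (1 character)  -> FALSE
res <- decp(c(3.9, 3.85, 18, 5))
res
```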

Tidyverse
data.table

Manually:

mtcars |> filter(decp(drat) & decp(wt) & decp(qsec))

data.frame [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

Dynamically:

mtcars |> filter(if_all(all_of(cols), decp))

data.frame [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

Manually:

MT[decp(drat) & decp(wt) & decp(qsec), ]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

Dynamically:

MT[Reduce(`&`, lapply(mget(cols), decp)), ]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

Alternatives

MT[Reduce(`&`, lapply(MT[, ..cols], decp)), ]

MT[Reduce(`&`, lapply(v1, decp)), env = list(v1 = as.list(cols))]

In two steps:

MT[MT[, Reduce(`&`, lapply(.SD, decp)), .SDcols = cols]]

data.table [13 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2

1.3 Rename

Note

setnames changes column names in-place
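A short sketch of what in-place means here, on a made-up two-column table (`DT`, `a` and `b` are hypothetical names): setnames modifies the object it receives, which is why the examples below wrap MT in copy() to leave it untouched.

```r
library(data.table)

DT <- data.table(a = 1:2, b = 3:4)

# Renaming a copy leaves the original untouched
renamed_copy <- setnames(copy(DT), "a", "A")
names_after_copy <- names(DT)   # Still "a" "b"
names(renamed_copy)             # "A" "b"

# Renaming the table itself changes it by reference (no assignment needed)
setnames(DT, "b", "B")
names(DT)                       # "a" "B"
```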

Tidyverse
data.table

Manually:

mtcars |> rename(CYL = cyl, MPG = mpg)

data.frame [32 x 11]

MPG	CYL	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Dynamically:

mtcars |> rename_with(\(c) toupper(c), .cols = matches("^d"))

data.frame [32 x 11]

mpg	cyl	DISP	hp	DRAT	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Manually:

setnames(copy(MT), c("cyl", "mpg"), c("CYL", "MPG"))[]

data.table [32 x 11]

MPG	CYL	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Dynamically:

setnames(copy(MT), grep("^d", colnames(MT)), toupper)[]

data.table [32 x 11]

mpg	cyl	DISP	hp	DRAT	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

1.4 Select

1.4.1 Basic selection

Tidyverse
data.table

MT |> select(matches("cyl|disp"))

data.table [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

Remove a column:

mtcars |> select(!cyl) # select(-cyl)

data.frame [32 x 10]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	160	110	3.9	2.62	16.46	0	1	4	4
21	160	110	3.9	2.875	17.02	0	1	4	4
22.8	108	93	3.85	2.32	18.61	1	1	4	1
21.4	258	110	3.08	3.215	19.44	1	0	3	1
18.7	360	175	3.15	3.44	17.02	0	0	3	2
18.1	225	105	2.76	3.46	20.22	1	0	3	1
14.3	360	245	3.21	3.57	15.84	0	0	3	4
24.4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	275.8	180	3.07	3.78	18	0	0	3	3
10.4	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

MT[, .(mpg, disp)]

data.table [32 x 2]

mpg	disp
21	160
21	160
22.8	108
21.4	258
18.7	360
18.1	225
14.3	360
24.4	146.7
22.8	140.8
19.2	167.6
17.8	167.6
16.4	275.8
17.3	275.8
15.2	275.8
10.4	472
[ omitted 17 entries ]

Alternatives

MT[, .SD, .SDcols = c("mpg", "disp")]

MT[, .SD, .SDcols = patterns("mpg|disp")]

Remove a column:

MT[, !"cyl"] # MT[, -"cyl"]

data.table [32 x 10]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	160	110	3.9	2.62	16.46	0	1	4	4
21	160	110	3.9	2.875	17.02	0	1	4	4
22.8	108	93	3.85	2.32	18.61	1	1	4	1
21.4	258	110	3.08	3.215	19.44	1	0	3	1
18.7	360	175	3.15	3.44	17.02	0	0	3	2
18.1	225	105	2.76	3.46	20.22	1	0	3	1
14.3	360	245	3.21	3.57	15.84	0	0	3	4
24.4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	275.8	180	3.07	3.78	18	0	0	3	3
10.4	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

In-place:

copy(MT)[, cyl := NULL][]

data.table [32 x 10]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	160	110	3.9	2.62	16.46	0	1	4	4
21	160	110	3.9	2.875	17.02	0	1	4	4
22.8	108	93	3.85	2.32	18.61	1	1	4	1
21.4	258	110	3.08	3.215	19.44	1	0	3	1
18.7	360	175	3.15	3.44	17.02	0	0	3	2
18.1	225	105	2.76	3.46	20.22	1	0	3	1
14.3	360	245	3.21	3.57	15.84	0	0	3	4
24.4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	275.8	180	3.07	3.78	18	0	0	3	3
10.4	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Tidyverse
data.table

Select & Extract:

mtcars |> pull(disp)

 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

Select & Rename:

mtcars |> select(dispp = disp)

data.frame [32 x 1]

dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]

Select & Extract:

MT[, disp]

 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

Select & Rename:

MT[, .(dispp = disp)]

data.table [32 x 1]

dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]

1.4.2 Dynamic selection

1.4.2.1 By name

cols <- c("cyl", "disp")

Tidyverse
data.table

mtcars |> select(all_of(cols)) # select(!!cols)

data.frame [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

Removing a column:

mtcars |> select(!all_of(cols)) # select(-all_of(cols))

data.frame [32 x 9]

mpg	hp	drat	wt	qsec	vs	am	gear	carb
21	110	3.9	2.62	16.46	0	1	4	4
21	110	3.9	2.875	17.02	0	1	4	4
22.8	93	3.85	2.32	18.61	1	1	4	1
21.4	110	3.08	3.215	19.44	1	0	3	1
18.7	175	3.15	3.44	17.02	0	0	3	2
18.1	105	2.76	3.46	20.22	1	0	3	1
14.3	245	3.21	3.57	15.84	0	0	3	4
24.4	62	3.69	3.19	20	1	0	4	2
22.8	95	3.92	3.15	22.9	1	0	4	2
19.2	123	3.92	3.44	18.3	1	0	4	4
17.8	123	3.92	3.44	18.9	1	0	4	4
16.4	180	3.07	4.07	17.4	0	0	3	3
17.3	180	3.07	3.73	17.6	0	0	3	3
15.2	180	3.07	3.78	18	0	0	3	3
10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

MT[, ..cols]

data.table [32 x 2]

cyl	disp
6	160
6	160
4	108
6	258
8	360
6	225
8	360
4	146.7
4	140.8
6	167.6
6	167.6
8	275.8
8	275.8
8	275.8
8	472
[ omitted 17 entries ]

Alternatives

MT[, mget(cols)] # Retired

MT[, cols, with = FALSE] # Retired

MT[, .SD, .SDcols = cols]

MT[, j, env = list(j = as.list(cols))]
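
As a quick sanity check, the sketch below compares the first three alternatives against the `..cols` form. It assumes `MT` is simply `as.data.table(mtcars)` (the setup chunk is not shown in this section) and leaves out the `env` variant, which needs a recent data.table release.

```r
library(data.table)

# Assumption: MT is a plain data.table copy of mtcars
MT <- as.data.table(mtcars)
cols <- c("cyl", "disp")

# Reference result, using the `..` prefix
ref <- MT[, ..cols]

# Each alternative should return the same two-column data.table
same <- c(
  mget   = isTRUE(all.equal(ref, MT[, mget(cols)])),
  with   = isTRUE(all.equal(ref, MT[, cols, with = FALSE])),
  SDcols = isTRUE(all.equal(ref, MT[, .SD, .SDcols = cols]))
)
same
```

If any entry came back FALSE, the corresponding form would not be a drop-in replacement on that data.table version.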

Removing columns:

MT[, !..cols]

data.table [32 x 9]

mpg	hp	drat	wt	qsec	vs	am	gear	carb
21	110	3.9	2.62	16.46	0	1	4	4
21	110	3.9	2.875	17.02	0	1	4	4
22.8	93	3.85	2.32	18.61	1	1	4	1
21.4	110	3.08	3.215	19.44	1	0	3	1
18.7	175	3.15	3.44	17.02	0	0	3	2
18.1	105	2.76	3.46	20.22	1	0	3	1
14.3	245	3.21	3.57	15.84	0	0	3	4
24.4	62	3.69	3.19	20	1	0	4	2
22.8	95	3.92	3.15	22.9	1	0	4	2
19.2	123	3.92	3.44	18.3	1	0	4	4
17.8	123	3.92	3.44	18.9	1	0	4	4
16.4	180	3.07	4.07	17.4	0	0	3	3
17.3	180	3.07	3.73	17.6	0	0	3	3
15.2	180	3.07	3.78	18	0	0	3	3
10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Alternatives

MT[, .SD, .SDcols = !cols]

MT[, -j, env = list(j = I(cols))]

In-place:

copy(MT)[, (cols) := NULL][]

data.table [32 x 9]

mpg	hp	drat	wt	qsec	vs	am	gear	carb
21	110	3.9	2.62	16.46	0	1	4	4
21	110	3.9	2.875	17.02	0	1	4	4
22.8	93	3.85	2.32	18.61	1	1	4	1
21.4	110	3.08	3.215	19.44	1	0	3	1
18.7	175	3.15	3.44	17.02	0	0	3	2
18.1	105	2.76	3.46	20.22	1	0	3	1
14.3	245	3.21	3.57	15.84	0	0	3	4
24.4	62	3.69	3.19	20	1	0	4	2
22.8	95	3.92	3.15	22.9	1	0	4	2
19.2	123	3.92	3.44	18.3	1	0	4	4
17.8	123	3.92	3.44	18.9	1	0	4	4
16.4	180	3.07	4.07	17.4	0	0	3	3
17.3	180	3.07	3.73	17.6	0	0	3	3
15.2	180	3.07	3.78	18	0	0	3	3
10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

1.4.2.2 By pattern

Tidyverse
data.table

mtcars |> select(-matches("^d"))

data.frame [32 x 9]

mpg	cyl	hp	wt	qsec	vs	am	gear	carb
21	6	110	2.62	16.46	0	1	4	4
21	6	110	2.875	17.02	0	1	4	4
22.8	4	93	2.32	18.61	1	1	4	1
21.4	6	110	3.215	19.44	1	0	3	1
18.7	8	175	3.44	17.02	0	0	3	2
18.1	6	105	3.46	20.22	1	0	3	1
14.3	8	245	3.57	15.84	0	0	3	4
24.4	4	62	3.19	20	1	0	4	2
22.8	4	95	3.15	22.9	1	0	4	2
19.2	6	123	3.44	18.3	1	0	4	4
17.8	6	123	3.44	18.9	1	0	4	4
16.4	8	180	4.07	17.4	0	0	3	3
17.3	8	180	3.73	17.6	0	0	3	3
15.2	8	180	3.78	18	0	0	3	3
10.4	8	205	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

mtcars |> select(where(\(x) all(x != 0))) # Only keep columns where no value == 0

data.frame [32 x 9]

mpg	cyl	disp	hp	drat	wt	qsec	gear	carb
21	6	160	110	3.9	2.62	16.46	4	4
21	6	160	110	3.9	2.875	17.02	4	4
22.8	4	108	93	3.85	2.32	18.61	4	1
21.4	6	258	110	3.08	3.215	19.44	3	1
18.7	8	360	175	3.15	3.44	17.02	3	2
18.1	6	225	105	2.76	3.46	20.22	3	1
14.3	8	360	245	3.21	3.57	15.84	3	4
24.4	4	146.7	62	3.69	3.19	20	4	2
22.8	4	140.8	95	3.92	3.15	22.9	4	2
19.2	6	167.6	123	3.92	3.44	18.3	4	4
17.8	6	167.6	123	3.92	3.44	18.9	4	4
16.4	8	275.8	180	3.07	4.07	17.4	3	3
17.3	8	275.8	180	3.07	3.73	17.6	3	3
15.2	8	275.8	180	3.07	3.78	18	3	3
10.4	8	472	205	2.93	5.25	17.98	3	4
[ omitted 17 entries ]

MT[, .SD, .SDcols = !patterns("^d")]

data.table [32 x 9]

mpg	cyl	hp	wt	qsec	vs	am	gear	carb
21	6	110	2.62	16.46	0	1	4	4
21	6	110	2.875	17.02	0	1	4	4
22.8	4	93	2.32	18.61	1	1	4	1
21.4	6	110	3.215	19.44	1	0	3	1
18.7	8	175	3.44	17.02	0	0	3	2
18.1	6	105	3.46	20.22	1	0	3	1
14.3	8	245	3.57	15.84	0	0	3	4
24.4	4	62	3.19	20	1	0	4	2
22.8	4	95	3.15	22.9	1	0	4	2
19.2	6	123	3.44	18.3	1	0	4	4
17.8	6	123	3.44	18.9	1	0	4	4
16.4	8	180	4.07	17.4	0	0	3	3
17.3	8	180	3.73	17.6	0	0	3	3
15.2	8	180	3.78	18	0	0	3	3
10.4	8	205	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

MT[, .SD, .SDcols = \(x) all(x != 0)] # Only keep columns where no value == 0

data.table [32 x 9]

mpg	cyl	disp	hp	drat	wt	qsec	gear	carb
21	6	160	110	3.9	2.62	16.46	4	4
21	6	160	110	3.9	2.875	17.02	4	4
22.8	4	108	93	3.85	2.32	18.61	4	1
21.4	6	258	110	3.08	3.215	19.44	3	1
18.7	8	360	175	3.15	3.44	17.02	3	2
18.1	6	225	105	2.76	3.46	20.22	3	1
14.3	8	360	245	3.21	3.57	15.84	3	4
24.4	4	146.7	62	3.69	3.19	20	4	2
22.8	4	140.8	95	3.92	3.15	22.9	4	2
19.2	6	167.6	123	3.92	3.44	18.3	4	4
17.8	6	167.6	123	3.92	3.44	18.9	4	4
16.4	8	275.8	180	3.07	4.07	17.4	3	3
17.3	8	275.8	180	3.07	3.73	17.6	3	3
15.2	8	275.8	180	3.07	3.78	18	3	3
10.4	8	472	205	2.93	5.25	17.98	3	4
[ omitted 17 entries ]

Alternatives

copy(MT)[, grep("^d", colnames(MT)) := NULL][] # In place (column deletion)

MT[, MT[, sapply(.SD, \(x) all(x != 0))], with = FALSE]

1.4.2.3 By column type

iris |> select(where(\(x) !is.numeric(x)))

data.frame [150 x 1]

Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]

IRIS[, .SD, .SDcols = !is.numeric]

data.table [150 x 1]

Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]

1.5 Mutate / Transmute

data.table can mutate in two ways:
- Using = inside .() creates a new data.table holding only the new columns (like dplyr::transmute).
- Using := (or let) modifies the current data.table in place (like dplyr::mutate).

The value assigned to a column should have the same length as the original column (or group).
If a single value is provided with :=, it is recycled across the whole column/group.

If several values are provided, but fewer than the length of the original column/group:
- With := or let, an error is raised, asking you to specify explicitly how to recycle the values.
- With =, it behaves like dplyr::summarize (if a grouping has been specified).
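
The rules above can be sketched on a small example. This is only an illustration, assuming `MT` is `as.data.table(mtcars)`; the object names (`transmuted`, `mutated`, ...) are made up for the occasion.

```r
library(data.table)

# Assumption: MT is a plain data.table copy of mtcars
MT <- as.data.table(mtcars)

# `=` inside .(): returns a new table holding only the new column
transmuted <- MT[, .(cyl2 = cyl * 2)]

# `:=` adds the column by reference (copy() keeps MT itself untouched)
mutated <- copy(MT)[, cyl2 := cyl * 2]

# A single value is recycled to every row
recycled <- copy(MT)[, flag := 1L]

# A length mismatch (3 values for 32 rows) raises an error with `:=`
mismatch <- tryCatch(
  copy(MT)[, bad := 1:3],
  error = function(e) "error"
)

# With `=` and a grouping, a shorter result acts like a summary
summarised <- MT[, .(mean_mpg = mean(mpg)), by = cyl]

dim(transmuted)  # one column only
dim(mutated)     # all original columns plus cyl2
dim(summarised)  # one row per cyl value
```

Wrapping the right-hand side in `rep()` is the usual way to make the recycling explicit when it is really intended.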

1.5.1 Basic transmute

Only keeping the transformed columns.

Tidyverse
data.table

mtcars |> transmute(cyl = cyl * 2)

data.frame [32 x 1]

cyl
12
12
8
12
16
12
16
8
8
12
12
16
16
16
16
[ omitted 17 entries ]

MT[, .(cyl = cyl * 2)]

data.table [32 x 1]

cyl
12
12
8
12
16
12
16
8
8
12
12
16
16
16
16
[ omitted 17 entries ]

Transmute & Extract:

MT[, (cyl = cyl * 2)]

 [1] 12 12  8 12 16 12 16  8  8 12 12 16 16 16 16 16 16  8  8  8  8 16 16 16 16
[26]  8  8  8 16 12 16  8

1.5.2 Basic mutate

Modifies the transformed column in-place and keeps every other column as-is.

Tidyverse
data.table

mtcars |> mutate(cyl = 200)

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	200	160	110	3.9	2.62	16.46	0	1	4	4
21	200	160	110	3.9	2.875	17.02	0	1	4	4
22.8	200	108	93	3.85	2.32	18.61	1	1	4	1
21.4	200	258	110	3.08	3.215	19.44	1	0	3	1
18.7	200	360	175	3.15	3.44	17.02	0	0	3	2
18.1	200	225	105	2.76	3.46	20.22	1	0	3	1
14.3	200	360	245	3.21	3.57	15.84	0	0	3	4
24.4	200	146.7	62	3.69	3.19	20	1	0	4	2
22.8	200	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	200	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	200	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	200	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	200	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	200	275.8	180	3.07	3.78	18	0	0	3	3
10.4	200	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

mtcars |> mutate(cyl = 200, gear = 5)

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	200	160	110	3.9	2.62	16.46	0	1	5	4
21	200	160	110	3.9	2.875	17.02	0	1	5	4
22.8	200	108	93	3.85	2.32	18.61	1	1	5	1
21.4	200	258	110	3.08	3.215	19.44	1	0	5	1
18.7	200	360	175	3.15	3.44	17.02	0	0	5	2
18.1	200	225	105	2.76	3.46	20.22	1	0	5	1
14.3	200	360	245	3.21	3.57	15.84	0	0	5	4
24.4	200	146.7	62	3.69	3.19	20	1	0	5	2
22.8	200	140.8	95	3.92	3.15	22.9	1	0	5	2
19.2	200	167.6	123	3.92	3.44	18.3	1	0	5	4
17.8	200	167.6	123	3.92	3.44	18.9	1	0	5	4
16.4	200	275.8	180	3.07	4.07	17.4	0	0	5	3
17.3	200	275.8	180	3.07	3.73	17.6	0	0	5	3
15.2	200	275.8	180	3.07	3.78	18	0	0	5	3
10.4	200	472	205	2.93	5.25	17.98	0	0	5	4
[ omitted 17 entries ]

mtcars |> mutate(mean_cyl = mean(cyl, na.rm = TRUE))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_cyl
21	6	160	110	3.9	2.62	16.46	0	1	4	4	6.188
21	6	160	110	3.9	2.875	17.02	0	1	4	4	6.188
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	6.188
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	6.188
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	6.188
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6.188
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	6.188
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	6.188
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	6.188
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	6.188
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	6.188
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	6.188
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6.188
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	6.188
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	6.188
[ omitted 17 entries ]

mtcars |> mutate(gear_plus = lead(gear))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	gear_plus
21	6	160	110	3.9	2.62	16.46	0	1	4	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	3
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	3
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	3
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	3
[ omitted 17 entries ]

copy(MT)[, cyl := 200][]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	200	160	110	3.9	2.62	16.46	0	1	4	4
21	200	160	110	3.9	2.875	17.02	0	1	4	4
22.8	200	108	93	3.85	2.32	18.61	1	1	4	1
21.4	200	258	110	3.08	3.215	19.44	1	0	3	1
18.7	200	360	175	3.15	3.44	17.02	0	0	3	2
18.1	200	225	105	2.76	3.46	20.22	1	0	3	1
14.3	200	360	245	3.21	3.57	15.84	0	0	3	4
24.4	200	146.7	62	3.69	3.19	20	1	0	4	2
22.8	200	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	200	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	200	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	200	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	200	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	200	275.8	180	3.07	3.78	18	0	0	3	3
10.4	200	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

copy(MT)[, let(cyl = 200, gear = 5)][]

data.table [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	200	160	110	3.9	2.62	16.46	0	1	5	4
21	200	160	110	3.9	2.875	17.02	0	1	5	4
22.8	200	108	93	3.85	2.32	18.61	1	1	5	1
21.4	200	258	110	3.08	3.215	19.44	1	0	5	1
18.7	200	360	175	3.15	3.44	17.02	0	0	5	2
18.1	200	225	105	2.76	3.46	20.22	1	0	5	1
14.3	200	360	245	3.21	3.57	15.84	0	0	5	4
24.4	200	146.7	62	3.69	3.19	20	1	0	5	2
22.8	200	140.8	95	3.92	3.15	22.9	1	0	5	2
19.2	200	167.6	123	3.92	3.44	18.3	1	0	5	4
17.8	200	167.6	123	3.92	3.44	18.9	1	0	5	4
16.4	200	275.8	180	3.07	4.07	17.4	0	0	5	3
17.3	200	275.8	180	3.07	3.73	17.6	0	0	5	3
15.2	200	275.8	180	3.07	3.78	18	0	0	5	3
10.4	200	472	205	2.93	5.25	17.98	0	0	5	4
[ omitted 17 entries ]

Alternatives

copy(MT)[, `:=`(cyl = 200, gear = 5)][]

copy(MT)[, c("cyl", "gear") := .(200, 5)][]

copy(MT)[, mean_cyl := mean(cyl, na.rm = TRUE)][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_cyl
21	6	160	110	3.9	2.62	16.46	0	1	4	4	6.188
21	6	160	110	3.9	2.875	17.02	0	1	4	4	6.188
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	6.188
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	6.188
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	6.188
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6.188
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	6.188
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	6.188
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	6.188
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	6.188
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	6.188
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	6.188
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6.188
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	6.188
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	6.188
[ omitted 17 entries ]

copy(MT)[, gearplus := shift(gear, 1, type = "lead")][] # lead, lag, cyclic

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	gearplus
21	6	160	110	3.9	2.62	16.46	0	1	4	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	3
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	3
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	4
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	3
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	3
[ omitted 17 entries ]

1.5.3 Dynamic trans/mutate

LHS <- "mean_mpg"
RHS <- "mpg"

Tidyverse
data.table

mtcars |> mutate({{LHS}} := mean(mpg))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

mtcars |> mutate("{LHS}" := mean(.data[[RHS]]))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

mtcars |> mutate({{LHS}} := cur_data()[[RHS]] |> mean()) # cur_data() is deprecated since dplyr 1.1.0: prefer pick() (next example)

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

mtcars |> mutate({{LHS}} := pick({{ RHS }}) |> unlist() |> mean())

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

copy(MT)[, (LHS) := mean(mpg)][] # (LHS) <=> c(LHS)

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

copy(MT)[, j := mean(mpg), env = list(j = LHS)][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

copy(MT)[, c(LHS) := mean(get(RHS))][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]

copy(MT)[, x := mean(y), env = list(x = LHS, y = RHS)][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	20.091
21	6	160	110	3.9	2.875	17.02	0	1	4	4	20.091
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	20.091
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	20.091
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	20.091
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	20.091
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	20.091
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	20.091
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	20.091
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	20.091
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	20.091
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	20.091
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	20.091
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	20.091
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	20.091
[ omitted 17 entries ]
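
Dynamic names are mostly useful inside functions, where the column names arrive as strings. Below is a minimal sketch using the `(LHS) :=` and `get()` forms shown above; `add_mean()` is a hypothetical helper written for this illustration, and `MT` is assumed to be `as.data.table(mtcars)`.

```r
library(data.table)

# Hypothetical helper: add the mean of column `rhs` as a new column named `lhs`
add_mean <- function(dt, lhs, rhs) {
  out <- copy(dt)                  # work on a copy so the caller's table is untouched
  out[, (lhs) := mean(get(rhs))]   # (lhs) is evaluated to a name, get() fetches the column
  out[]
}

# Assumption: MT is a plain data.table copy of mtcars
MT <- as.data.table(mtcars)
res <- add_mean(MT, "mean_hp", "hp")
names(res)
```

Dropping the `copy()` would make the helper modify its input by reference, which may or may not be what the caller expects.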

1.5.4 Conditional trans/mutate

Tidyverse
data.table

Mutate every row based on one or more conditions:

One condition:

mtcars |> mutate(Size = if_else(cyl >= 6, "BIG", "small", missing = "Unk"))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	Size
21	6	160	110	3.9	2.62	16.46	0	1	4	4	BIG
21	6	160	110	3.9	2.875	17.02	0	1	4	4	BIG
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	small
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	BIG
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	BIG
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	BIG
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	BIG
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	small
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	small
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	BIG
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	BIG
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	BIG
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	BIG
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	BIG
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	BIG
[ omitted 17 entries ]

Nested conditions:

mtcars |> mutate(Size = case_when(
  between(cyl, 2, 4) ~ "small", # between() instead of data.table's %between%
  between(cyl, 4, 8) ~ "BIG",
  .default = "Unk"
))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	Size
21	6	160	110	3.9	2.62	16.46	0	1	4	4	BIG
21	6	160	110	3.9	2.875	17.02	0	1	4	4	BIG
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	small
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	BIG
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	BIG
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	BIG
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	BIG
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	small
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	small
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	BIG
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	BIG
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	BIG
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	BIG
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	BIG
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	BIG
[ omitted 17 entries ]

Mutate only rows meeting conditions:

mtcars |> mutate(BIG = case_when(am == 1 ~ cyl >= 6))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	BIG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	TRUE
21	6	160	110	3.9	2.875	17.02	0	1	4	4	TRUE
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	FALSE
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	NA
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	NA
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	NA
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	NA
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	NA
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	NA
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	NA
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	NA
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	NA
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	NA
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	NA
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	NA
[ omitted 17 entries ]

Mutate every row based on one or more conditions:

One condition:

copy(MT)[, Size := fifelse(cyl >= 6, "BIG", "small", na = "Unk")][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	Size
21	6	160	110	3.9	2.62	16.46	0	1	4	4	BIG
21	6	160	110	3.9	2.875	17.02	0	1	4	4	BIG
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	small
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	BIG
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	BIG
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	BIG
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	BIG
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	small
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	small
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	BIG
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	BIG
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	BIG
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	BIG
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	BIG
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	BIG
[ omitted 17 entries ]

Nested conditions:

copy(MT)[, Size := fcase(
  cyl %between% c(2,4), "small", 
  cyl %between% c(4,8), "BIG",
  default = "Unk"
)][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	Size
21	6	160	110	3.9	2.62	16.46	0	1	4	4	BIG
21	6	160	110	3.9	2.875	17.02	0	1	4	4	BIG
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	small
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	BIG
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	BIG
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	BIG
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	BIG
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	small
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	small
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	BIG
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	BIG
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	BIG
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	BIG
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	BIG
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	BIG
[ omitted 17 entries ]

Mutate only rows meeting conditions:

copy(MT)[am == 1, BIG := cyl >= 6][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	BIG
21	6	160	110	3.9	2.62	16.46	0	1	4	4	TRUE
21	6	160	110	3.9	2.875	17.02	0	1	4	4	TRUE
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	FALSE
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	NA
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	NA
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	NA
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	NA
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	NA
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	NA
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	NA
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	NA
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	NA
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	NA
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	NA
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	NA
[ omitted 17 entries ]
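
Note that the two ranges used in the multi-condition examples overlap at 4. With `fcase()` (and, as far as I know, with `case_when()` too) the first matching condition wins, which is why 4-cylinder cars end up as "small". A minimal sketch on a made-up vector:

```r
library(data.table)

# Made-up values covering both ranges, their shared bound, and an outlier
cyl <- c(2, 4, 6, 8, 10)

size <- fcase(
  cyl %between% c(2, 4), "small",  # 4 is matched here first...
  cyl %between% c(4, 8), "BIG",    # ...so this branch never gets it
  default = "Unk"                  # anything matching no condition
)
size
# "small" "small" "BIG" "BIG" "Unk"
```

Swapping the two conditions would label 4 as "BIG" instead, so the order of the branches is part of the logic.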

1.5.5 Complex trans/mutate

1.5.5.1 Column-wise operations

new <- c("min_mpg", "min_disp")
old <- c("mpg", "disp")

Apply one function to multiple columns:

Tidyverse
data.table

mtcars |> mutate(across(c("mpg", "disp"), min, .names = "min_{col}"))

data.frame [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1
[ omitted 17 entries ]

As a transmute:

mtcars |> transmute(across(c("mpg", "disp"), min, .names = "min_{col}"))

data.frame [32 x 2]

min_mpg	min_disp
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
10.4	71.1
[ omitted 17 entries ]

Dynamically:

mtcars |> mutate(across(all_of(old), min, .names = "min_{col}"))

data.frame [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1
[ omitted 17 entries ]

copy(MT)[
    , c("min_mpg", "min_disp") := lapply(.SD, min), .SDcols = c("mpg", "disp")
  ][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1
[ omitted 17 entries ]

copy(MT)[, c("min_mpg", "min_disp") := lapply(.(mpg, disp), min)][]

As a transmute:

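A second step is needed to add min_ before the names (note that, unlike dplyr::transmute, the result has a single row because the value is not recycled):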

(MT[, lapply(.SD[, .(mpg, disp)], min)] |> bind(d, setnames(d, names(d), \(x) paste0("min_", x))))[]

data.table [1 x 2]

min_mpg	min_disp
10.4	71.1
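
The renaming step can also be folded into `j` without an extra pipe. This is a sketch of one possible alternative, not the approach used above: it relies on base R's `setNames()` and on `.SDcols`, and assumes `MT` is `as.data.table(mtcars)`.

```r
library(data.table)

# Assumption: MT is a plain data.table copy of mtcars
MT <- as.data.table(mtcars)
old <- c("mpg", "disp")

# lapply() returns a list named after the .SD columns;
# setNames() swaps those names for the prefixed ones before j returns
mins <- MT[, setNames(lapply(.SD, min), paste0("min_", old)), .SDcols = old]
mins
```

Because `j` returns a named list, data.table uses those names directly for the resulting columns.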

Dynamically:

copy(MT)[, c(new) := lapply(mget(old), min)][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1
[ omitted 17 entries ]

copy(MT)[, c(new) := lapply(x, min), env = list(x = as.list(old))][]

Apply multiple functions to one or multiple columns:

col <- "mpg"
cols <- c("mpg", "disp")

Tidyverse
data.table

mtcars |> mutate(min_mpg = min(mpg), max_mpg = max(mpg))

data.frame [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9
[ omitted 17 entries ]

mtcars |> mutate(across(mpg, list(min = min, max = max), .names = "{fn}_{col}"))

data.frame [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9
[ omitted 17 entries ]

Multiple columns:

mtcars |> mutate(across(matches("mpg|disp"), list(min = min, max = max), .names = "{fn}_{col}"))

data.frame [32 x 15]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg	min_disp	max_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9	71.1	472
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9	71.1	472
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9	71.1	472
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9	71.1	472
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9	71.1	472
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9	71.1	472
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9	71.1	472
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9	71.1	472
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9	71.1	472
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9	71.1	472
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9	71.1	472
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9	71.1	472
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9	71.1	472
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9	71.1	472
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9	71.1	472
[ omitted 17 entries ]

mtcars |> mutate(across(all_of(cols), list(min = \(x) min(x), max = \(x) max(x)), .names = "{fn}_{col}"))

data.frame [32 x 15]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg	min_disp	max_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9	71.1	472
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9	71.1	472
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9	71.1	472
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9	71.1	472
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9	71.1	472
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9	71.1	472
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9	71.1	472
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9	71.1	472
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9	71.1	472
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9	71.1	472
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9	71.1	472
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9	71.1	472
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9	71.1	472
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9	71.1	472
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9	71.1	472
[ omitted 17 entries ]

copy(MT)[, let(min_mpg = min(mpg), max_mpg = max(mpg))][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9
[ omitted 17 entries ]

copy(MT)[, c("min_mpg", "max_mpg") := .(min(mpg), max(mpg))][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	max_mpg
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	33.9
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	33.9
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	33.9
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	33.9
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	33.9
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	33.9
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	33.9
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	33.9
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	33.9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	33.9
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	33.9
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	33.9
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	33.9
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	33.9
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	33.9
[ omitted 17 entries ]

Alternatives

copy(MT)[, c("min_mpg", "max_mpg") := 
           lapply(.(mpg), \(x) list(min(x), max(x))) |> do.call(rbind, args = _)
        ][]

copy(MT)[, c("min_mpg", "max_mpg") := 
           lapply(.(get(col)), \(x) list(min(x), max(x))) |> unlist(recursive = FALSE)
        ][]

Multiple columns:

copy(MT)[, c("min_mpg", "min_disp", "max_mpg", "max_disp") := 
           lapply(.SD, \(x) list(min(x), max(x))) |> do.call(rbind, args = _), 
         .SDcols = cols][]

data.table [32 x 15]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp	max_mpg	max_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1	33.9	472
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1	33.9	472
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1	33.9	472
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1	33.9	472
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1	33.9	472
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1	33.9	472
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1	33.9	472
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1	33.9	472
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1	33.9	472
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1	33.9	472
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1	33.9	472
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1	33.9	472
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1	33.9	472
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1	33.9	472
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1	33.9	472
[ omitted 17 entries ]

copy(MT)[, outer(c("min", "max"), cols, str_c, sep = "_") |> t() |> as.vector() := 
           lapply(.SD, \(x) list(min(x), max(x))) |> do.call(rbind, args = _), 
         .SDcols = cols][]

data.table [32 x 15]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	min_mpg	min_disp	max_mpg	max_disp
21	6	160	110	3.9	2.62	16.46	0	1	4	4	10.4	71.1	33.9	472
21	6	160	110	3.9	2.875	17.02	0	1	4	4	10.4	71.1	33.9	472
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	10.4	71.1	33.9	472
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	10.4	71.1	33.9	472
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	10.4	71.1	33.9	472
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	10.4	71.1	33.9	472
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	10.4	71.1	33.9	472
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	10.4	71.1	33.9	472
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	10.4	71.1	33.9	472
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10.4	71.1	33.9	472
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	10.4	71.1	33.9	472
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	10.4	71.1	33.9	472
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	10.4	71.1	33.9	472
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	10.4	71.1	33.9	472
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	10.4	71.1	33.9	472
[ omitted 17 entries ]
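The name-building expression above can look cryptic. Here is a minimal base R sketch of why the generated names line up with the computed values. It uses paste() instead of str_c() to stay dependency-free, and the object names vals and nms are purely illustrative:

```r
cols <- c("mpg", "disp")

# Values: one list(min, max) per column, stacked row-wise into a 2 x 2 list-matrix
# (rows = variables, columns = functions)
vals <- lapply(mtcars[cols], \(x) list(min(x), max(x))) |> do.call(rbind, args = _)

# Names: outer() returns rows = functions and columns = variables, so t() flips it
# to the same layout as `vals` before flattening it column by column
nms <- outer(c("min", "max"), cols, paste, sep = "_") |> t() |> as.vector()

# Both objects are read in column-major order, so names and values pair up
setNames(unlist(vals), nms)
```

Without the t() step, the names would come out as min_mpg, max_mpg, min_disp, max_disp and would no longer match the order of the values.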

1.5.5.2 Row-wise operations

Apply one function to multiple columns (row-wise):

Tidyverse
data.table

mtcars |> rowwise() |> mutate(rsum = sum(c_across(where(is.numeric)))) |> ungroup()

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	rsum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	728.56
[ omitted 17 entries ]

mtcars |> mutate(rsum = pmap_dbl(across(where(is.numeric)), \(...) sum(c(...))))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	rsum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	728.56
[ omitted 17 entries ]

Hybrid base R-Tidyverse:

mtcars |> mutate(rsum = apply(across(where(is.numeric)), 1, sum))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	rsum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	728.56
[ omitted 17 entries ]

mtcars |> mutate(rsum = rowSums(across(where(is.numeric))))

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	rsum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	728.56
[ omitted 17 entries ]

copy(MT)[, rsum := rowSums(.SD), .SDcols = is.numeric][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	rsum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	728.56
[ omitted 17 entries ]

copy(MT)[, rsum := apply(.SD, 1, sum), .SDcols = is.numeric][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	rsum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	728.56
[ omitted 17 entries ]
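As a quick sanity check, the different row-wise sums can be computed side by side and compared. This is only a sketch: it assumes MT is a data.table version of mtcars (as in the rest of this post), the via_* column names are made up for the illustration, and the Reduce() variant is an extra alternative not shown above:

```r
library(data.table)

# Assumption: MT is a data.table version of mtcars
MT <- as.data.table(mtcars)

res <- copy(MT)[, c("via_rowSums", "via_apply", "via_Reduce") := .(
  rowSums(.SD),        # vectorised row sums
  apply(.SD, 1, sum),  # calls sum() once per row
  Reduce(`+`, .SD)     # adds the columns together, one after the other
), .SDcols = is.numeric]

# All three columns should hold the same values (up to floating point error)
res[, isTRUE(all.equal(via_rowSums, via_apply)) && isTRUE(all.equal(via_rowSums, via_Reduce))]
```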

Apply multiple functions to multiple columns (row-wise):

Tidyverse
data.table

mtcars |> 
  mutate(pmap_dfr(across(where(is.numeric)), \(...) list(mean = mean(c(...)), sum = sum(c(...)))))

data.frame [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean	sum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	29.907	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	29.981	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	23.598	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	38.74	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	53.665	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	35.049	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	59.72	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	24.635	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	27.234	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	31.86	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	31.787	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	46.431	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	46.5	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	46.35	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	66.233	728.56
[ omitted 17 entries ]

Alternatives

mtcars |> 
  mutate(
    pmap(across(where(is.numeric)), \(...) list(mean = mean(c(...)), sum = sum(c(...)))) |> 
      bind_rows()
  )

Hybrid base R-Tidyverse:

mtcars |> 
  mutate(apply(across(where(is.numeric)), 1, \(x) list(mean = mean(x), sum = sum(x))) |> bind_rows())

data.frame [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	mean	sum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	29.907	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	29.981	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	23.598	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	38.74	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	53.665	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	35.049	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	59.72	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	24.635	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	27.234	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	31.86	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	31.787	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	46.431	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	46.5	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	46.35	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	66.233	728.56
[ omitted 17 entries ]

copy(MT)[, c("rmean", "rsum") := 
           apply(.SD, 1, \(x) list(mean(x), sum(x))) |> rbindlist(), 
         .SDcols = is.numeric][]

data.table [32 x 13]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	rmean	rsum
21	6	160	110	3.9	2.62	16.46	0	1	4	4	29.907	328.98
21	6	160	110	3.9	2.875	17.02	0	1	4	4	29.981	329.795
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	23.598	259.58
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	38.74	426.135
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	53.665	590.31
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	35.049	385.54
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	59.72	656.92
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	24.635	270.98
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	27.234	299.57
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	31.86	350.46
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	31.787	349.66
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	46.431	510.74
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	46.5	511.5
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	46.35	509.85
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	66.233	728.56
[ omitted 17 entries ]

Run an anonymous block of code (wrapped in { }) inside the data.table's j:

MT[, {
    print(summary(mpg))
    x <- cyl + gear
    .(RN = 1:.N, CG = x)
  }
]

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 

data.table [32 x 2]

RN	CG
1	10
2	10
3	8
4	9
5	11
6	9
7	11
8	8
9	8
10	10
11	10
12	11
13	11
14	11
15	11
[ omitted 17 entries ]
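The same braces pattern can be combined with a grouping. A minimal sketch, assuming MT is a data.table version of mtcars (the names rng and MT_ranges are hypothetical): intermediate objects created inside the block stay local to it, and only the final list becomes the result.

```r
library(data.table)

# Assumption: MT is a data.table version of mtcars
MT <- as.data.table(mtcars)

MT_ranges <- MT[, {
    rng <- range(mpg)  # intermediate object, local to the block
    .(mpg_min = rng[1], mpg_max = rng[2], n = .N)
  },
  keyby = cyl
]

MT_ranges
```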

1.6 Group / Aggregate

Note

The examples listed below apply a grouping but perform no further computation (the data.table versions use .SD to simply return all columns as is).

cols <- c("cyl", "disp")
cols_missing <- c("cyl", "disp", "missing_col")

1.6.1 Basic grouping

Tidyverse
data.table

mtcars |> group_by(cyl, gear)

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Dynamic grouping:

mtcars |> group_by(across(all_of(cols)))

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

Use any_of() instead of all_of() if some of the columns may be missing from the data:

mtcars |> group_by(across(any_of(cols_missing)))

data.frame [32 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]
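To make the difference between the two selection helpers concrete, here is a small sketch (the object names kept and strict are only illustrative): any_of() silently skips names that are absent from the data, whereas all_of() is strict and raises an error.

```r
library(dplyr)

cols_missing <- c("cyl", "disp", "missing_col")

# any_of(): the missing column is ignored, only the existing ones are used for grouping
kept <- mtcars |> group_by(across(any_of(cols_missing))) |> group_vars()
kept

# all_of(): the missing column triggers an error, caught here for the demonstration
strict <- tryCatch(
  mtcars |> group_by(across(all_of(cols_missing))),
  error = \(e) "error"
)
strict
```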

MT[, .SD, by = .(cyl, gear)]

data.table [32 x 11]

cyl	gear	mpg	disp	hp	drat	wt	qsec	vs	am	carb
6	4	21	160	110	3.9	2.62	16.46	0	1	4
6	4	21	160	110	3.9	2.875	17.02	0	1	4
6	4	19.2	167.6	123	3.92	3.44	18.3	1	0	4
6	4	17.8	167.6	123	3.92	3.44	18.9	1	0	4
4	4	22.8	108	93	3.85	2.32	18.61	1	1	1
4	4	24.4	146.7	62	3.69	3.19	20	1	0	2
4	4	22.8	140.8	95	3.92	3.15	22.9	1	0	2
4	4	32.4	78.7	66	4.08	2.2	19.47	1	1	1
4	4	30.4	75.7	52	4.93	1.615	18.52	1	1	2
4	4	33.9	71.1	65	4.22	1.835	19.9	1	1	1
4	4	27.3	79	66	4.08	1.935	18.9	1	1	1
4	4	21.4	121	109	4.11	2.78	18.6	1	1	2
6	3	21.4	258	110	3.08	3.215	19.44	1	0	1
6	3	18.1	225	105	2.76	3.46	20.22	1	0	1
8	3	18.7	360	175	3.15	3.44	17.02	0	0	2
[ omitted 17 entries ]

Dynamic grouping:

MT[, .SD, by = cols]

data.table [32 x 11]

cyl	disp	mpg	hp	drat	wt	qsec	vs	am	gear	carb
6	160	21	110	3.9	2.62	16.46	0	1	4	4
6	160	21	110	3.9	2.875	17.02	0	1	4	4
4	108	22.8	93	3.85	2.32	18.61	1	1	4	1
6	258	21.4	110	3.08	3.215	19.44	1	0	3	1
8	360	18.7	175	3.15	3.44	17.02	0	0	3	2
8	360	14.3	245	3.21	3.57	15.84	0	0	3	4
6	225	18.1	105	2.76	3.46	20.22	1	0	3	1
4	146.7	24.4	62	3.69	3.19	20	1	0	4	2
4	140.8	22.8	95	3.92	3.15	22.9	1	0	4	2
6	167.6	19.2	123	3.92	3.44	18.3	1	0	4	4
6	167.6	17.8	123	3.92	3.44	18.9	1	0	4	4
8	275.8	16.4	180	3.07	4.07	17.4	0	0	3	3
8	275.8	17.3	180	3.07	3.73	17.6	0	0	3	3
8	275.8	15.2	180	3.07	3.78	18	0	0	3	3
8	472	10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

To handle potentially missing columns:

MT[, .SD, by = intersect(cols_missing, colnames(MT))]

data.table [32 x 11]

cyl	disp	mpg	hp	drat	wt	qsec	vs	am	gear	carb
6	160	21	110	3.9	2.62	16.46	0	1	4	4
6	160	21	110	3.9	2.875	17.02	0	1	4	4
4	108	22.8	93	3.85	2.32	18.61	1	1	4	1
6	258	21.4	110	3.08	3.215	19.44	1	0	3	1
8	360	18.7	175	3.15	3.44	17.02	0	0	3	2
8	360	14.3	245	3.21	3.57	15.84	0	0	3	4
6	225	18.1	105	2.76	3.46	20.22	1	0	3	1
4	146.7	24.4	62	3.69	3.19	20	1	0	4	2
4	140.8	22.8	95	3.92	3.15	22.9	1	0	4	2
6	167.6	19.2	123	3.92	3.44	18.3	1	0	4	4
6	167.6	17.8	123	3.92	3.44	18.9	1	0	4	4
8	275.8	16.4	180	3.07	4.07	17.4	0	0	3	3
8	275.8	17.3	180	3.07	3.73	17.6	0	0	3	3
8	275.8	15.2	180	3.07	3.78	18	0	0	3	3
8	472	10.4	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

1.6.2 Current group info

Tidyverse
data.table

mtcars |> 
  group_by(cyl) |> 
  filter(cur_group_id() == 1) |> # To only keep one plot
  group_walk(\(d, g) with(d, plot(hp, mpg, main = paste("Cyl:", g$cyl))))

Use the .BY special symbol (a list holding the current group's values) to get the current group name:

MT[, with(.SD, plot(hp, mpg, main = paste("Cyl:", .BY))), keyby = cyl]
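Since plots are not easy to inspect, the same group information can also be checked in a table. A minimal sketch, assuming MT is a data.table version of mtcars (the label column and object names are made up for the example): .BY is a named list with one element per grouping column, and dplyr's cur_group() returns a one-row tibble of the current group's keys.

```r
library(data.table)
library(dplyr)

# Assumption: MT is a data.table version of mtcars
MT <- as.data.table(mtcars)

# data.table: access the current group's value through .BY
dt_labels <- MT[, .(label = paste("Cyl:", .BY$cyl), n = .N), keyby = cyl]

# dplyr: access the current group's value through cur_group()
df_labels <- mtcars |>
  summarize(label = paste("Cyl:", cur_group()$cyl), n = n(), .by = cyl) |>
  arrange(cyl)

dt_labels
```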

1.7 Row numbers & indices

1.7.1 Adding row or group indices

.I: Row indices
.N: Number of rows

.GRP: Group indices
.NGRP: Number of groups
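A minimal sketch showing the four data.table symbols side by side in a grouped call, assuming MT is a data.table version of mtcars (the output column names are made up for the example):

```r
library(data.table)

# Assumption: MT is a data.table version of mtcars
MT <- as.data.table(mtcars)

symbols <- MT[, .(
  first_row = .I[1],  # row index (in MT) of the group's first row
  n_rows    = .N,     # number of rows in the group
  grp       = .GRP,   # index of the group
  n_grps    = .NGRP   # total number of groups
), by = cyl]

symbols
```

Note that with by, groups are numbered in their order of first appearance in the data (here 6, 4, 8).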

1.7.1.1 Adding row indices:

mtcars |> mutate(I = row_number())

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	2
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	4
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	5
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	7
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	8
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	11
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	12
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	13
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	14
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	15
[ omitted 17 entries ]

copy(MT)[ , I := .I][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	2
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	4
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	5
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	6
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	7
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	8
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	9
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	10
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	11
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	12
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	13
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	14
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	15
[ omitted 17 entries ]

1.7.1.2 Adding group indices:

Tidyverse
data.table

Adding group indices (same index for each group):

mtcars |> summarize(GRP = cur_group_id(), .by = cyl)

data.frame [3 x 2]

cyl	GRP
6	1
4	2
8	3

Mutate instead of summarize:

mtcars |> mutate(GRP = cur_group_id(), .by = cyl)

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	GRP
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	3
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	3
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	3
[ omitted 17 entries ]

Adding row numbers within each group:

mtcars |> mutate(I_GRP = row_number(), .by = gear)

data.frame [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I_GRP
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	2
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	5
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	6
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	7
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	5
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	7
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	8
[ omitted 17 entries ]

Adding group indices (same index for each group):

MT[, .GRP, by = cyl]

data.table [3 x 2]

cyl	GRP
6	1
4	2
8	3

Mutate instead of summarize:

copy(MT)[, GRP := .GRP, by = cyl][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	GRP
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	2
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	3
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	3
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	1
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	1
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	3
[ omitted 17 entries ]

Adding row numbers within each group:

copy(MT)[, I_GRP := 1:.N, by = gear][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I_GRP
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	2
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	5
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	6
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	7
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	5
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	7
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	8
[ omitted 17 entries ]

copy(MT)[, I_GRP := rowid(gear)][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I_GRP
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
21	6	160	110	3.9	2.875	17.02	0	1	4	4	2
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	5
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	6
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	7
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	5
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	6
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	7
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	8
[ omitted 17 entries ]
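The group indices shown above follow the order in which the groups first appear in the data (6, 4, 8), because the examples use by (data.table) and .by (dplyr). A short sketch, assuming MT is a data.table version of mtcars, of how the numbering changes when the groups are sorted instead (keyby and group_by()); the object names are only illustrative:

```r
library(data.table)
library(dplyr)

# Assumption: MT is a data.table version of mtcars
MT <- as.data.table(mtcars)

# Groups numbered in order of first appearance: cyl = 6, 4, 8
dt_seen   <- MT[, .(GRP = .GRP), by = cyl]
tidy_seen <- mtcars |> summarize(GRP = cur_group_id(), .by = cyl)

# Groups numbered in sorted order: cyl = 4, 6, 8
dt_sorted   <- MT[, .(GRP = .GRP), keyby = cyl]
tidy_sorted <- mtcars |> group_by(cyl) |> summarize(GRP = cur_group_id())
```

This matters when the index is later used as an identifier: the same group can get a different number depending on the grouping method.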

1.7.2 Filtering based on row numbers (slicing)

1.7.2.1 Extracting a specific row

Tidyverse
data.table

mtcars |> dplyr::first()

data.frame [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4

mtcars |> dplyr::last()

data.frame [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

mtcars |> dplyr::nth(5)

data.frame [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2

MT[1,] # data.table::first(MT)

data.table [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4

MT[.N,] # data.table::last(MT)

data.table [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

MT[5,]

data.table [1 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2

1.7.2.2 Slicing rows

Tidyverse
data.table

tail(mtcars, 10)

data.frame [10 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

mtcars |> slice((n()-9):n())

data.frame [10 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

mtcars |> slice_tail(n = 10)

data.frame [10 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

tail(MT, 10)

data.table [10 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

MT[(.N-9):.N]

data.table [10 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

MT[MT[, .I[(.N-9):.N]]] # Gets the last 10 rows' indices and filters based on them

data.table [10 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
15	8	301	335	3.54	3.57	14.6	0	1	5	8
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2
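The .I-based pattern above becomes more useful once a grouping is involved. A minimal sketch, assuming MT is a data.table version of mtcars (idx and last2 are hypothetical names), that keeps the last two rows of each cyl group:

```r
library(data.table)

# Assumption: MT is a data.table version of mtcars
MT <- as.data.table(mtcars)

# Inner call: row indices (relative to MT) of the last 2 rows of each group,
# returned in the default column V1
idx <- MT[, .I[(.N - 1):.N], by = cyl]$V1

# Outer call: subset MT on those indices; sort() restores the original row order
last2 <- MT[sort(idx)]

last2
```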

1.7.2.3 Slicing groups

Tidyverse
data.table

Random sample by group:

mtcars |> slice_sample(n = 5, by = cyl)

data.frame [15 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
15	8	301	335	3.54	3.57	14.6	0	1	5	8
10.4	8	460	215	3	5.424	17.82	0	0	3	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4

Filter groups by condition:

mtcars |> filter(n() >= 8, .by = cyl)

data.frame [25 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1
30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
[ omitted 10 entries ]

mtcars |> group_by(cyl) |> group_modify(\(d,g) if (nrow(d) >= 8) d else data.frame())

data.frame [25 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
4	26	120.3	91	4.43	2.14	16.7	0	1	5	2
4	30.4	95.1	113	3.77	1.513	16.9	1	1	5	2
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2
8	18.7	360	175	3.15	3.44	17.02	0	0	3	2
8	14.3	360	245	3.21	3.57	15.84	0	0	3	4
8	16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
8	17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
[ omitted 10 entries ]

Random sample by group:

MT[, .SD[sample(.N, 5)], keyby = cyl]

data.table [15 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6
6	21	160	110	3.9	2.875	17.02	0	1	4	4
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1
6	19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
8	15.2	275.8	180	3.07	3.78	18	0	0	3	3
8	10.4	472	205	2.93	5.25	17.98	0	0	3	4
8	18.7	360	175	3.15	3.44	17.02	0	0	3	2
8	17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
8	15.5	318	150	2.76	3.52	16.87	0	0	3	2

Filter groups by condition:

MT[, if(.N >= 8) .SD, by = cyl]

data.table [25 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
4	26	120.3	91	4.43	2.14	16.7	0	1	5	2
4	30.4	95.1	113	3.77	1.513	16.9	1	1	5	2
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2
8	18.7	360	175	3.15	3.44	17.02	0	0	3	2
8	14.3	360	245	3.21	3.57	15.84	0	0	3	4
8	16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
8	17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
[ omitted 10 entries ]

MT[, .SD[.N >= 8], by = cyl]

data.table [25 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
4	26	120.3	91	4.43	2.14	16.7	0	1	5	2
4	30.4	95.1	113	3.77	1.513	16.9	1	1	5	2
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2
8	18.7	360	175	3.15	3.44	17.02	0	0	3	2
8	14.3	360	245	3.21	3.57	15.84	0	0	3	4
8	16.4	275.8	180	3.07	4.07	17.4	0	0	3	3
8	17.3	275.8	180	3.07	3.73	17.6	0	0	3	3
[ omitted 10 entries ]

1.7.3 Extracting row indices

1.7.3.1 Getting the row numbers of specific observations:

Tidyverse
data.table

Row number of the first and last observation of each group:

mtcars |> reframe(I = cur_group_rows()[c(1, n())], .by = cyl)

data.frame [6 x 2]

cyl	I
6	1
6	30
4	3
4	32
8	5
8	31

… while keeping all other columns:

mtcars |> mutate(I = row_number()) |> slice(c(1, n()), .by = cyl)

data.frame [6 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	I
21	6	160	110	3.9	2.62	16.46	0	1	4	4	1
19.7	6	145	175	3.62	2.77	15.5	0	1	5	6	30
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	3
21.4	4	121	109	4.11	2.78	18.6	1	1	4	2	32
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	5
15	8	301	335	3.54	3.57	14.6	0	1	5	8	31

Row number of the first and last observation of each group:

MT[, .I[c(1, .N)], by = cyl]

data.table [6 x 2]

cyl	V1
6	1
6	30
4	3
4	32
8	5
8	31

… while keeping all other columns:

copy(MT)[, I := .I][, .SD[c(1, .N)], by = cyl]

data.table [6 x 12]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	I
6	21	160	110	3.9	2.62	16.46	0	1	4	4	1
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6	30
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1	3
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2	32
8	18.7	360	175	3.15	3.44	17.02	0	0	3	2	5
8	15	301	335	3.54	3.57	14.6	0	1	5	8	31

1.7.3.2 Extracting row indices after filtering

Tidyverse
data.table

Extracting row numbers in the original dataset:

mtcars |> mutate(I = row_number()) |> filter(gear == 4) |> pull(I)

[1] 1 2 3 8 9 10 11 18 19 20 26 32

Extracting row numbers in the new dataset (after filtering):

mtcars |> filter(gear == 4) |> mutate(I = row_number()) |> pull(I)

[1] 1 2 3 4 5 6 7 8 9 10 11 12

Warning

.I gives the vector of row numbers after any subsetting/filtering done in i has been applied. To get row numbers from the original dataset, filter inside j instead.

Extracting row numbers in the original dataset:

MT[, .I[gear == 4]]

[1] 1 2 3 8 9 10 11 18 19 20 26 32

Extracting row numbers in the new dataset (after filtering):

MT[gear == 4, .I]

[1] 1 2 3 4 5 6 7 8 9 10 11 12
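As a complementary sketch (not one of the original examples): data.table's which = TRUE argument returns the row numbers matched by i directly, which is another way to get positions in the original dataset. MT is assumed to be mtcars converted with as.data.table(), as elsewhere in this post.

```r
library(data.table)

# Assumption: MT is mtcars as a data.table, as in the rest of the post
MT <- as.data.table(mtcars)

# which = TRUE returns the row numbers of MT matched by i (j is not evaluated)
idx_which <- MT[gear == 4, which = TRUE]

# Filtering .I inside j also yields row numbers in the original table
idx_j <- MT[, .I[gear == 4]]

all(idx_which == idx_j)
```

Both approaches avoid the pitfall described in the warning, since neither evaluates .I after the filter in i.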

1.8 Relocate

1.8.1 Basic reordering

Tidyverse
data.table

mtcars |> relocate(cyl, .after = last_col())

data.frame [32 x 11]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	cyl
21	160	110	3.9	2.62	16.46	0	1	4	4	6
21	160	110	3.9	2.875	17.02	0	1	4	4	6
22.8	108	93	3.85	2.32	18.61	1	1	4	1	4
21.4	258	110	3.08	3.215	19.44	1	0	3	1	6
18.7	360	175	3.15	3.44	17.02	0	0	3	2	8
18.1	225	105	2.76	3.46	20.22	1	0	3	1	6
14.3	360	245	3.21	3.57	15.84	0	0	3	4	8
24.4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2	4
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4	6
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4	6
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3	8
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3	8
15.2	275.8	180	3.07	3.78	18	0	0	3	3	8
10.4	472	205	2.93	5.25	17.98	0	0	3	4	8
[ omitted 17 entries ]

Relocate a new column (mutate + relocate):

mtcars |> mutate(GRP = cur_group_id(), .by = cyl, .before = 1)

data.frame [32 x 12]

GRP	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
1	21	6	160	110	3.9	2.62	16.46	0	1	4	4
1	21	6	160	110	3.9	2.875	17.02	0	1	4	4
2	22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
1	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
3	18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
1	18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
3	14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
2	24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
2	22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
1	19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
1	17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
3	16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
3	17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
3	15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
3	10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

setcolorder(copy(MT), "cyl", after = last(colnames(MT)))[]

data.table [32 x 11]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	cyl
21	160	110	3.9	2.62	16.46	0	1	4	4	6
21	160	110	3.9	2.875	17.02	0	1	4	4	6
22.8	108	93	3.85	2.32	18.61	1	1	4	1	4
21.4	258	110	3.08	3.215	19.44	1	0	3	1	6
18.7	360	175	3.15	3.44	17.02	0	0	3	2	8
18.1	225	105	2.76	3.46	20.22	1	0	3	1	6
14.3	360	245	3.21	3.57	15.84	0	0	3	4	8
24.4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2	4
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4	6
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4	6
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3	8
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3	8
15.2	275.8	180	3.07	3.78	18	0	0	3	3	8
10.4	472	205	2.93	5.25	17.98	0	0	3	4	8
[ omitted 17 entries ]

setcolorder(copy(MT), c(setdiff(colnames(MT), "cyl"), "cyl"))[]

data.table [32 x 11]

mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	cyl
21	160	110	3.9	2.62	16.46	0	1	4	4	6
21	160	110	3.9	2.875	17.02	0	1	4	4	6
22.8	108	93	3.85	2.32	18.61	1	1	4	1	4
21.4	258	110	3.08	3.215	19.44	1	0	3	1	6
18.7	360	175	3.15	3.44	17.02	0	0	3	2	8
18.1	225	105	2.76	3.46	20.22	1	0	3	1	6
14.3	360	245	3.21	3.57	15.84	0	0	3	4	8
24.4	146.7	62	3.69	3.19	20	1	0	4	2	4
22.8	140.8	95	3.92	3.15	22.9	1	0	4	2	4
19.2	167.6	123	3.92	3.44	18.3	1	0	4	4	6
17.8	167.6	123	3.92	3.44	18.9	1	0	4	4	6
16.4	275.8	180	3.07	4.07	17.4	0	0	3	3	8
17.3	275.8	180	3.07	3.73	17.6	0	0	3	3	8
15.2	275.8	180	3.07	3.78	18	0	0	3	3	8
10.4	472	205	2.93	5.25	17.98	0	0	3	4	8
[ omitted 17 entries ]

Relocate a new column (mutate + relocate):

setcolorder(copy(MT)[ , GRP := .GRP, by = cyl], "GRP")[]

data.table [32 x 12]

GRP	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
1	21	6	160	110	3.9	2.62	16.46	0	1	4	4
1	21	6	160	110	3.9	2.875	17.02	0	1	4	4
2	22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
1	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
3	18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
1	18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
3	14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
2	24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
2	22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
1	19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
1	17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
3	16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
3	17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
3	15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
3	10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
[ omitted 17 entries ]

1.8.2 Reordering by column names

Tidyverse
data.table

mtcars |> select(sort(tidyselect::peek_vars()))

data.frame [32 x 11]

am	carb	cyl	disp	drat	gear	hp	mpg	qsec	vs	wt
1	4	6	160	3.9	4	110	21	16.46	0	2.62
1	4	6	160	3.9	4	110	21	17.02	0	2.875
1	1	4	108	3.85	4	93	22.8	18.61	1	2.32
0	1	6	258	3.08	3	110	21.4	19.44	1	3.215
0	2	8	360	3.15	3	175	18.7	17.02	0	3.44
0	1	6	225	2.76	3	105	18.1	20.22	1	3.46
0	4	8	360	3.21	3	245	14.3	15.84	0	3.57
0	2	4	146.7	3.69	4	62	24.4	20	1	3.19
0	2	4	140.8	3.92	4	95	22.8	22.9	1	3.15
0	4	6	167.6	3.92	4	123	19.2	18.3	1	3.44
0	4	6	167.6	3.92	4	123	17.8	18.9	1	3.44
0	3	8	275.8	3.07	3	180	16.4	17.4	0	4.07
0	3	8	275.8	3.07	3	180	17.3	17.6	0	3.73
0	3	8	275.8	3.07	3	180	15.2	18	0	3.78
0	4	8	472	2.93	3	205	10.4	17.98	0	5.25
[ omitted 17 entries ]

mtcars |> select(carb, sort(tidyselect::peek_vars()))

data.frame [32 x 11]

carb	am	cyl	disp	drat	gear	hp	mpg	qsec	vs	wt
4	1	6	160	3.9	4	110	21	16.46	0	2.62
4	1	6	160	3.9	4	110	21	17.02	0	2.875
1	1	4	108	3.85	4	93	22.8	18.61	1	2.32
1	0	6	258	3.08	3	110	21.4	19.44	1	3.215
2	0	8	360	3.15	3	175	18.7	17.02	0	3.44
1	0	6	225	2.76	3	105	18.1	20.22	1	3.46
4	0	8	360	3.21	3	245	14.3	15.84	0	3.57
2	0	4	146.7	3.69	4	62	24.4	20	1	3.19
2	0	4	140.8	3.92	4	95	22.8	22.9	1	3.15
4	0	6	167.6	3.92	4	123	19.2	18.3	1	3.44
4	0	6	167.6	3.92	4	123	17.8	18.9	1	3.44
3	0	8	275.8	3.07	3	180	16.4	17.4	0	4.07
3	0	8	275.8	3.07	3	180	17.3	17.6	0	3.73
3	0	8	275.8	3.07	3	180	15.2	18	0	3.78
4	0	8	472	2.93	3	205	10.4	17.98	0	5.25
[ omitted 17 entries ]

setcolorder(copy(MT), sort(colnames(MT)))[]

data.table [32 x 11]

am	carb	cyl	disp	drat	gear	hp	mpg	qsec	vs	wt
1	4	6	160	3.9	4	110	21	16.46	0	2.62
1	4	6	160	3.9	4	110	21	17.02	0	2.875
1	1	4	108	3.85	4	93	22.8	18.61	1	2.32
0	1	6	258	3.08	3	110	21.4	19.44	1	3.215
0	2	8	360	3.15	3	175	18.7	17.02	0	3.44
0	1	6	225	2.76	3	105	18.1	20.22	1	3.46
0	4	8	360	3.21	3	245	14.3	15.84	0	3.57
0	2	4	146.7	3.69	4	62	24.4	20	1	3.19
0	2	4	140.8	3.92	4	95	22.8	22.9	1	3.15
0	4	6	167.6	3.92	4	123	19.2	18.3	1	3.44
0	4	6	167.6	3.92	4	123	17.8	18.9	1	3.44
0	3	8	275.8	3.07	3	180	16.4	17.4	0	4.07
0	3	8	275.8	3.07	3	180	17.3	17.6	0	3.73
0	3	8	275.8	3.07	3	180	15.2	18	0	3.78
0	4	8	472	2.93	3	205	10.4	17.98	0	5.25
[ omitted 17 entries ]

setcolorder(copy(MT), c("carb", sort(setdiff(colnames(MT), "carb"))))[]

data.table [32 x 11]

carb	am	cyl	disp	drat	gear	hp	mpg	qsec	vs	wt
4	1	6	160	3.9	4	110	21	16.46	0	2.62
4	1	6	160	3.9	4	110	21	17.02	0	2.875
1	1	4	108	3.85	4	93	22.8	18.61	1	2.32
1	0	6	258	3.08	3	110	21.4	19.44	1	3.215
2	0	8	360	3.15	3	175	18.7	17.02	0	3.44
1	0	6	225	2.76	3	105	18.1	20.22	1	3.46
4	0	8	360	3.21	3	245	14.3	15.84	0	3.57
2	0	4	146.7	3.69	4	62	24.4	20	1	3.19
2	0	4	140.8	3.92	4	95	22.8	22.9	1	3.15
4	0	6	167.6	3.92	4	123	19.2	18.3	1	3.44
4	0	6	167.6	3.92	4	123	17.8	18.9	1	3.44
3	0	8	275.8	3.07	3	180	16.4	17.4	0	4.07
3	0	8	275.8	3.07	3	180	17.3	17.6	0	3.73
3	0	8	275.8	3.07	3	180	15.2	18	0	3.78
4	0	8	472	2.93	3	205	10.4	17.98	0	5.25
[ omitted 17 entries ]

1.9 Summarize/Reframe

With data.table, summarizing is done with the = operator inside .() (or list()) in j, rather than with := (which mutates). The j expression should return a list of values shorter than the original column (or group) size. By default, only the computed columns (plus any grouping columns) are kept, like a transmute.
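A minimal sketch of the difference between summarizing with = inside .() and mutating with :=, assuming MT is mtcars as a data.table (the mean_mpg column name is only illustrative):

```r
library(data.table)

# Assumption: MT is mtcars as a data.table, as in the rest of the post
MT <- as.data.table(mtcars)

# Summarize: `=` inside .() returns a new table with one row per group
summarized <- MT[, .(mean_mpg = mean(mpg)), by = cyl]

# Mutate: `:=` adds the column by reference and keeps every row
# (copy() avoids modifying MT itself)
mutated <- copy(MT)[, mean_mpg := mean(mpg), by = cyl]

dim(summarized)  # 3 rows, 2 columns (cyl, mean_mpg)
dim(mutated)     # 32 rows, 12 columns
```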

1.9.1 Basic summary

mtcars |> summarize(mean_cyl = mean(cyl))

data.frame [1 x 1]

mean_cyl
6.188

MT[, .(mean_cyl = mean(cyl))]

data.table [1 x 1]

mean_cyl
6.188

1.9.2 Grouped summary

Tidyverse
data.table

With .by, dplyr::summarize keeps the groups in their order of first appearance:

mtcars |> summarize(N = n(), .by = cyl)

data.frame [3 x 2]

cyl	N
6	7
4	11
8	14

To order by the grouping factor, use group_by() instead of .by:

mtcars |> group_by(cyl) |> summarize(N = n())

data.frame [3 x 2]

cyl	N
4	11
6	7
8	14
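If you would rather keep .by, one possible sketch is to sort the result explicitly with arrange() (this assumes dplyr >= 1.1.0, where .by was introduced):

```r
library(dplyr)  # .by requires dplyr >= 1.1.0

# .by keeps groups in order of first appearance, so sort afterwards
res <- mtcars |>
  summarize(N = n(), .by = cyl) |>
  arrange(cyl)

res$cyl
```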

By default, data.table keeps the order the groups originally appear in:

MT[, .N, by = cyl]

data.table [3 x 2]

cyl	N
6	7
4	11
8	14

To order by the grouping factor, use keyby instead of by:

MT[, .N, keyby = cyl]

data.table [3 x 2]

cyl	N
4	11
6	7
8	14
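Besides sorting the output, keyby also sets the grouping column as the key of the result. A small sketch to check this, assuming MT is mtcars as a data.table:

```r
library(data.table)

# Assumption: MT is mtcars as a data.table, as in the rest of the post
MT <- as.data.table(mtcars)

res_by <- MT[, .N, by = cyl]        # groups in order of first appearance
res_keyby <- MT[, .N, keyby = cyl]  # groups sorted, and result keyed by cyl

key(res_by)     # NULL: no key is set
key(res_keyby)  # "cyl"
```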

Grouped on a temporary variable:

mtcars |> group_by(cyl > 6) |> summarize(N = n())

data.frame [2 x 2]

cyl > 6	N
FALSE	18
TRUE	14

MT[, .N, by = .(cyl > 6)]

data.table [2 x 2]

cyl	N
FALSE	18
TRUE	14

1.9.3 Column-wise summary

1.9.3.1 Apply one function to multiple columns:

Tidyverse
data.table

mtcars |> summarize(across(everything(), mean), .by = cyl)

data.frame [3 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
6	19.743	183.314	122.286	3.586	3.117	17.977	0.571	0.429	3.857	3.429
4	26.664	105.136	82.636	4.071	2.286	19.137	0.909	0.727	4.091	1.545
8	15.1	353.1	209.214	3.229	3.999	16.772	0	0.143	3.286	3.5

By column type:

mtcars |> summarize(across(where(is.double), mean), .by = cyl)

data.frame [3 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
6	19.743	183.314	122.286	3.586	3.117	17.977	0.571	0.429	3.857	3.429
4	26.664	105.136	82.636	4.071	2.286	19.137	0.909	0.727	4.091	1.545
8	15.1	353.1	209.214	3.229	3.999	16.772	0	0.143	3.286	3.5

By matching column names:

mtcars |> summarize(across(matches("^d"), mean), .by = cyl)

data.frame [3 x 3]

cyl	disp	drat
6	183.314	3.586
4	105.136	4.071
8	353.1	3.229

MT[, lapply(.SD, mean), by = cyl]

data.table [3 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
6	19.743	183.314	122.286	3.586	3.117	17.977	0.571	0.429	3.857	3.429
4	26.664	105.136	82.636	4.071	2.286	19.137	0.909	0.727	4.091	1.545
8	15.1	353.1	209.214	3.229	3.999	16.772	0	0.143	3.286	3.5

By column type:

MT[, lapply(.SD[, -"cyl"], mean), by = cyl, .SDcols = is.double]

data.table [3 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
6	19.743	183.314	122.286	3.586	3.117	17.977	0.571	0.429	3.857	3.429
4	26.664	105.136	82.636	4.071	2.286	19.137	0.909	0.727	4.091	1.545
8	15.1	353.1	209.214	3.229	3.999	16.772	0	0.143	3.286	3.5

By matching column names:

MT[, lapply(.SD, mean), by = cyl, .SDcols = patterns("^d")]

data.table [3 x 3]

cyl	disp	drat
6	183.314	3.586
4	105.136	4.071
8	353.1	3.229

1.9.3.2 Applying multiple functions to one column:

Tidyverse
data.table

mtcars |> summarize(mean(mpg), sd(mpg), .by = cyl)

data.frame [3 x 3]

cyl	mean(mpg)	sd(mpg)
6	19.743	1.454
4	26.664	4.51
8	15.1	2.56

With column names:

mtcars |> summarize(mean = mean(mpg), sd = sd(mpg), .by = cyl)

data.frame [3 x 3]

cyl	mean	sd
6	19.743	1.454
4	26.664	4.51
8	15.1	2.56

mtcars |> summarize(across(mpg, list(mean = mean, sd = sd), .names = "{fn}"), .by = cyl)

data.frame [3 x 3]

cyl	mean	sd
6	19.743	1.454
4	26.664	4.51
8	15.1	2.56

MT[, .(mean(mpg), sd(mpg)), by = cyl]

data.table [3 x 3]

cyl	V1	V2
6	19.743	1.454
4	26.664	4.51
8	15.1	2.56

MT[, lapply(.(mpg), \(x) list(mean(x), sd(x))) |> rbindlist(), by = cyl]

data.table [3 x 3]

cyl	V1	V2
6	19.743	1.454
4	26.664	4.51
8	15.1	2.56

With column names:

MT[, .(mean = mean(mpg), sd = sd(mpg)), by = cyl]

data.table [3 x 3]

cyl	mean	sd
6	19.743	1.454
4	26.664	4.51
8	15.1	2.56

MT[, lapply(.SD, \(x) list(mean = mean(x), sd = sd(x))) |> rbindlist(), by = cyl, .SDcols = "mpg"]

data.table [3 x 3]

cyl	mean	sd
6	19.743	1.454
4	26.664	4.51
8	15.1	2.56

1.9.3.3 Apply multiple functions to multiple columns:

Note

Depending on the output we want (i.e. having the function’s output as columns or rows), we can either provide a list of functions to apply (list_of_fns), or a function returning a list (fn_returning_list).

cols <- c("mpg", "hp")

list_of_fns <- list(mean = \(x) mean(x), sd = \(x) sd(x))

fn_returning_list <- \(x) list(mean = mean(x), sd = sd(x))

Tidyverse
data.table

One column per function, one row per variable:

reframe(mtcars, map_dfr(pick(all_of(cols)), fn_returning_list, .id = "Var"), .by = cyl)

data.frame [6 x 4]

cyl	Var	mean	sd
6	mpg	19.743	1.454
6	hp	122.286	24.26
4	mpg	26.664	4.51
4	hp	82.636	20.935
8	mpg	15.1	2.56
8	hp	209.214	50.977

Alternatives

reframe(mtcars, map(pick(all_of(cols)), fn_returning_list) |> bind_rows(.id = "Var"), .by = cyl)

One column per variable, one row per function:

reframe(mtcars, map_dfr(list_of_fns, \(f) map(pick(all_of(cols)), f), .id = "Fn"), .by = cyl)

data.frame [6 x 4]

cyl	Fn	mpg	hp
6	mean	19.743	122.286
6	sd	1.454	24.26
4	mean	26.664	82.636
4	sd	4.51	20.935
8	mean	15.1	209.214
8	sd	2.56	50.977

One column per function/variable combination:

summarize(mtcars, across(all_of(cols), list_of_fns, .names = "{col}.{fn}"), .by = cyl)

data.frame [3 x 5]

cyl	mpg.mean	mpg.sd	hp.mean	hp.sd
6	19.743	1.454	122.286	24.26
4	26.664	4.51	82.636	20.935
8	15.1	2.56	209.214	50.977

One column per function, one row per variable:

MT[, lapply(.SD, fn_returning_list) |> rbindlist(idcol = "Var"), by = cyl, .SDcols = cols]

data.table [6 x 4]

cyl	Var	mean	sd
6	mpg	19.743	1.454
6	hp	122.286	24.26
4	mpg	26.664	4.51
4	hp	82.636	20.935
8	mpg	15.1	2.56
8	hp	209.214	50.977

One column per variable, one row per function:

MT[, lapply(list_of_fns, \(f) lapply(.SD, f)) |> rbindlist(idcol = "Fn"), by = cyl, .SDcols = cols]

data.table [6 x 4]

cyl	Fn	mpg	hp
6	mean	19.743	122.286
6	sd	1.454	24.26
4	mean	26.664	82.636
4	sd	4.51	20.935
8	mean	15.1	209.214
8	sd	2.56	50.977

One column per function/variable combination:

MT[, lapply(.SD, fn_returning_list) |> unlist(recursive = FALSE), by = cyl, .SDcols = cols]

data.table [3 x 5]

cyl	mpg.mean	mpg.sd	hp.mean	hp.sd
6	19.743	1.454	122.286	24.26
4	26.664	4.51	82.636	20.935
8	15.1	2.56	209.214	50.977

Different column order & naming scheme:

MT[, 
  lapply(list_of_fns, \(f) lapply(.SD, f)) |> 
    unlist(recursive = FALSE),
  by = cyl, .SDcols = cols
]

data.table [3 x 5]

cyl	mean.mpg	mean.hp	sd.mpg	sd.hp
6	19.743	122.286	1.454	24.26
4	26.664	82.636	4.51	20.935
8	15.1	209.214	2.56	50.977

Using dcast (see next section for more on pivots):

dcast(MT, cyl ~ ., fun.agg = list_of_fns, value.var = cols) # list(mean, sd)

data.table [3 x 5]

cyl	mpg_mean	hp_mean	mpg_sd	hp_sd
4	26.664	82.636	4.51	20.935
6	19.743	122.286	1.454	24.26
8	15.1	209.214	2.56	50.977

2 Pivots

2.1 Melt / Longer

Data:

(fam1 <- as.data.frame(FAM1))

data.frame [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

(fam2 <- as.data.frame(FAM2))

data.frame [5 x 8]

family_id	age_mother	dob_child1	dob_child2	dob_child3	gender_child1	gender_child2	gender_child3
1	30	1998-11-26	2000-01-29	NA	1	2	NA
2	27	1996-06-22	NA	NA	2	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02	2	2	1
4	32	2004-10-10	2009-08-27	2012-07-21	1	1	1
5	29	2000-12-05	2005-02-28	NA	2	1	NA

2.1.1 Basic Melt/Longer

Tip

data.table::melt does partial argument matching and thus accepts shortened versions of its arguments. E.g.: variable.name <=> variable (or var), value.name <=> value (or val), measure.vars <=> measure, id.vars <=> id, pattern <=> pat, …
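To illustrate the tip, here is a small sketch comparing the full and shortened argument names. The toy table DT and its column names are made up for this example:

```r
library(data.table)

# Hypothetical toy data, only for illustration
DT <- data.table(
  id = 1:2,
  dob_child1 = c("1998-11-26", "1996-06-22"),
  dob_child2 = c("2000-01-29", NA)
)
cols <- c("dob_child1", "dob_child2")

# Full argument names
full <- melt(DT, id.vars = "id", measure.vars = cols,
             variable.name = "child", value.name = "dob")

# Shortened names, resolved through partial argument matching
short <- melt(DT, id = "id", measure = cols, var = "child", val = "dob")

isTRUE(all.equal(full, short))
```

Shortened names are convenient interactively, but the full names are probably easier to read in shared code.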

One group of columns –> single value column

Tidyverse
data.table

pivot_longer(fam1, cols = matches("dob_"), names_to = "variable")

data.frame [15 x 4]

family_id	age_mother	variable	value
1	30	dob_child1	1998-11-26
1	30	dob_child2	2000-01-29
1	30	dob_child3	NA
2	27	dob_child1	1996-06-22
2	27	dob_child2	NA
2	27	dob_child3	NA
3	26	dob_child1	2002-07-11
3	26	dob_child2	2004-04-05
3	26	dob_child3	2007-09-02
4	32	dob_child1	2004-10-10
4	32	dob_child2	2009-08-27
4	32	dob_child3	2012-07-21
5	29	dob_child1	2000-12-05
5	29	dob_child2	2005-02-28
5	29	dob_child3	NA

melt(FAM1, measure.vars = c("dob_child1", "dob_child2", "dob_child3"))

data.table [15 x 4]

family_id	age_mother	variable	value
1	30	dob_child1	1998-11-26
2	27	dob_child1	1996-06-22
3	26	dob_child1	2002-07-11
4	32	dob_child1	2004-10-10
5	29	dob_child1	2000-12-05
1	30	dob_child2	2000-01-29
2	27	dob_child2	NA
3	26	dob_child2	2004-04-05
4	32	dob_child2	2009-08-27
5	29	dob_child2	2005-02-28
1	30	dob_child3	NA
2	27	dob_child3	NA
3	26	dob_child3	2007-09-02
4	32	dob_child3	2012-07-21
5	29	dob_child3	NA

melt(FAM1, measure = patterns("^dob_"))

data.table [15 x 4]

family_id	age_mother	variable	value
1	30	dob_child1	1998-11-26
2	27	dob_child1	1996-06-22
3	26	dob_child1	2002-07-11
4	32	dob_child1	2004-10-10
5	29	dob_child1	2000-12-05
1	30	dob_child2	2000-01-29
2	27	dob_child2	NA
3	26	dob_child2	2004-04-05
4	32	dob_child2	2009-08-27
5	29	dob_child2	2005-02-28
1	30	dob_child3	NA
2	27	dob_child3	NA
3	26	dob_child3	2007-09-02
4	32	dob_child3	2012-07-21
5	29	dob_child3	NA

One group of columns –> multiple value columns

Tidyverse
data.table

# No direct equivalent

melt(FAM1, measure = patterns(child1 = "child1$", child2 = "child2$|child3$"))

data.table [10 x 5]

family_id	age_mother	variable	child1	child2
1	30	1	1998-11-26	2000-01-29
2	27	1	1996-06-22	NA
3	26	1	2002-07-11	2004-04-05
4	32	1	2004-10-10	2009-08-27
5	29	1	2000-12-05	2005-02-28
1	30	2	NA	NA
2	27	2	NA	NA
3	26	2	NA	2007-09-02
4	32	2	NA	2012-07-21
5	29	2	NA	NA

2.1.2 Merging multiple yes/no columns

Melting multiple presence/absence columns into a single variable:

Data:

(MOVIES_WIDE <- as.data.table(movies_wide))

data.table [3 x 4]

ID	action	adventure	animation
1	1	0	0
2	1	1	0
3	1	1	1

Tidyverse
data.table

pivot_longer(
    movies_wide, -ID, names_to = "Genre", 
    values_transform = \(x) ifelse(x == 0, NA, x), values_drop_na = TRUE
  ) |> select(-value)

data.frame [6 x 2]

ID	Genre
1	action
2	action
2	adventure
3	action
3	adventure
3	animation

melt(MOVIES_WIDE, id.vars = "ID", var = "Genre")[value != 0][order(ID), -"value"]

data.table [6 x 2]

ID	Genre
1	action
2	action
2	adventure
3	action
3	adventure
3	animation

2.1.3 Partial pivot

Multiple groups of columns –> Multiple value columns

Tidyverse
data.table

Using .value:

Tip

The .value special identifier allows a “half” pivot: the name parts matched by .value become the names of the value columns, instead of being listed as rows in a names column.

pivot_longer(fam2, matches("^dob|^gender"), names_to = c(".value", "child"), names_sep = "_child")

data.frame [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
1	30	2	2000-01-29	2
1	30	3	NA	NA
2	27	1	1996-06-22	2
2	27	2	NA	NA
2	27	3	NA	NA
3	26	1	2002-07-11	2
3	26	2	2004-04-05	2
3	26	3	2007-09-02	1
4	32	1	2004-10-10	1
4	32	2	2009-08-27	1
4	32	3	2012-07-21	1
5	29	1	2000-12-05	2
5	29	2	2005-02-28	1
5	29	3	NA	NA

Using patterns:

melt(FAM2, measure = patterns("^dob", "^gender"), val = c("dob", "gender"), var = "child")

data.table [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

Manually:

colA <- str_subset(colnames(FAM2), "^dob")
colB <- str_subset(colnames(FAM2), "^gender")

melt(FAM2, measure = list(colA, colB), val = c("dob", "gender"), var = "child")

data.table [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

Alternatives

melt(FAM2, measure = list(a, b), val = c("dob", "gender"), var = "child") |> 
  substitute2(env = list(a = I(str_subset(colnames(FAM2), "^dob")), b = I(str_subset(colnames(FAM2), "^gender")))) |> eval()

Using measure and value.name:

melt(FAM2, measure = measure(value.name, child = \(x) as.integer(x), sep = "_child"))

data.table [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

Alternatives

melt(FAM2, measure = measurev(list(value.name = NULL, child = as.integer), pat = "(.*)_child(\\d)"))

2.2 Dcast / Wider

General idea:
- Pivot around the combination of id.vars (LHS of the formula)
- The measure.vars (RHS of the formula) are the ones whose values become column names
- The value.var are the ones the values are taken from to fill the new columns
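The three roles listed above can be sketched on a tiny long table. The data below is made up for illustration and only mimics the shape of FAM1L:

```r
library(data.table)

# Hypothetical long-format data, only for illustration
long <- data.table(
  family_id = c(1, 1, 2, 2),
  variable  = c("dob_child1", "dob_child2", "dob_child1", "dob_child2"),
  value     = c("1998-11-26", "2000-01-29", "1996-06-22", NA)
)

# LHS (family_id): one row per unique value
# RHS (variable):  its values become the new column names
# value.var:       the column whose values fill the new columns
wide <- dcast(long, family_id ~ variable, value.var = "value")

colnames(wide)
```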

Data:

(fam1l <- as.data.frame(FAM1L))

data.frame [15 x 4]

family_id	age_mother	variable	value
1	30	dob_child1	1998-11-26
2	27	dob_child1	1996-06-22
3	26	dob_child1	2002-07-11
4	32	dob_child1	2004-10-10
5	29	dob_child1	2000-12-05
1	30	dob_child2	2000-01-29
2	27	dob_child2	NA
3	26	dob_child2	2004-04-05
4	32	dob_child2	2009-08-27
5	29	dob_child2	2005-02-28
1	30	dob_child3	NA
2	27	dob_child3	NA
3	26	dob_child3	2007-09-02
4	32	dob_child3	2012-07-21
5	29	dob_child3	NA

(fam2l <- as.data.frame(FAM2L))

data.frame [15 x 5]

family_id	age_mother	child	dob	gender
1	30	1	1998-11-26	1
2	27	1	1996-06-22	2
3	26	1	2002-07-11	2
4	32	1	2004-10-10	1
5	29	1	2000-12-05	2
1	30	2	2000-01-29	2
2	27	2	NA	NA
3	26	2	2004-04-05	2
4	32	2	2009-08-27	1
5	29	2	2005-02-28	1
1	30	3	NA	NA
2	27	3	NA	NA
3	26	3	2007-09-02	1
4	32	3	2012-07-21	1
5	29	3	NA	NA

2.2.1 Basic Dcast/Wider

Tidyverse
data.table

pivot_wider(fam1l, id_cols = c("family_id", "age_mother"), names_from = "variable")

data.frame [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

dcast(FAM1L, family_id + age_mother ~ variable)

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

Using all the columns as IDs:

Tidyverse
data.table

pivot_wider(fam1l, names_from = variable)

data.frame [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

Note

By default, id_cols is every column not used in names_from or values_from.

FAM1L |> dcast(... ~ variable)

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

Note

... <=> “every unused column”

Multiple value columns –> Multiple groups of columns:

Tidyverse
data.table

pivot_wider(
  fam2l, id_cols = c("family_id", "age_mother"), values_from = c("dob", "gender"), 
  names_from = "child", names_sep = "_child"
)

data.frame [5 x 8]

family_id	age_mother	dob_child1	dob_child2	dob_child3	gender_child1	gender_child2	gender_child3
1	30	1998-11-26	2000-01-29	NA	1	2	NA
2	27	1996-06-22	NA	NA	2	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02	2	2	1
4	32	2004-10-10	2009-08-27	2012-07-21	1	1	1
5	29	2000-12-05	2005-02-28	NA	2	1	NA

dcast(FAM2L, family_id + age_mother ~ child, value.var = c("dob", "gender"), sep = "_child")

data.table [5 x 8]

family_id	age_mother	dob_child1	dob_child2	dob_child3	gender_child1	gender_child2	gender_child3
1	30	1998-11-26	2000-01-29	NA	1	2	NA
2	27	1996-06-22	NA	NA	2	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02	2	2	1
4	32	2004-10-10	2009-08-27	2012-07-21	1	1	1
5	29	2000-12-05	2005-02-28	NA	2	1	NA

dcast(FAM2L, ... ~ child, value.var = c("dob", "gender"), sep = "_child")

data.table [5 x 8]

family_id	age_mother	dob_child1	dob_child2	dob_child3	gender_child1	gender_child2	gender_child3
1	30	1998-11-26	2000-01-29	NA	1	2	NA
2	27	1996-06-22	NA	NA	2	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02	2	2	1
4	32	2004-10-10	2009-08-27	2012-07-21	1	1	1
5	29	2000-12-05	2005-02-28	NA	2	1	NA

Dynamic names in the formula:

var_name <- "variable"

id_vars <- c("family_id", "age_mother")

Tidyverse
data.table

pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = {{ var_name }})

data.frame [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

Multiple dynamic names:

pivot_wider(fam1l, id_cols = all_of(id_vars), names_from = variable)

data.frame [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

dcast(FAM1L, family_id + age_mother ~ base::get(var_name))

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

dcast(FAM1L, family_id + age_mother ~ x) |> substitute2(env = list(x = var_name)) |> eval()

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

Multiple dynamic names:

dcast(FAM1L, str_c(str_c(id_vars, collapse = " + "), " ~ variable"))

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

dcast(FAM1L, x + y ~ variable) |> substitute2(env = list(x = id_vars[1], y = id_vars[2])) |> eval()

data.table [5 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

2.2.2 Renaming (prefix/suffix) the columns

Tidyverse
data.table

pivot_wider(fam1l, names_from = variable, values_from = value, names_prefix = "Attr: ")

data.frame [5 x 5]

family_id	age_mother	Attr: dob_child1	Attr: dob_child2	Attr: dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

pivot_wider(fam1l, names_from = variable, values_from = value, names_glue = "Attr: {variable}")

data.frame [5 x 5]

family_id	age_mother	Attr: dob_child1	Attr: dob_child2	Attr: dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

dcast(FAM1L, family_id + age_mother ~ paste0("Attr: ", variable))

data.table [5 x 5]

family_id	age_mother	Attr: dob_child1	Attr: dob_child2	Attr: dob_child3
1	30	1998-11-26	2000-01-29	NA
2	27	1996-06-22	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	2000-12-05	2005-02-28	NA

2.2.3 Unused combinations

Warning

The logic is inverted between tidyr (id_expand/names_expand select what to expand) and data.table (drop selects what to drop):

Tidyverse
data.table

pivot_wider(fam1l, names_from = variable, values_from = value, id_expand = TRUE, names_expand = FALSE)

data.frame [25 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	26	NA	NA	NA
1	27	NA	NA	NA
1	29	NA	NA	NA
1	30	1998-11-26	2000-01-29	NA
1	32	NA	NA	NA
2	26	NA	NA	NA
2	27	1996-06-22	NA	NA
2	29	NA	NA	NA
2	30	NA	NA	NA
2	32	NA	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
3	27	NA	NA	NA
3	29	NA	NA	NA
3	30	NA	NA	NA
3	32	NA	NA	NA
[ omitted 10 entries ]

dcast(FAM1L, family_id + age_mother ~ variable, drop = c(FALSE, TRUE)) # (drop_LHS, drop_RHS)

data.table [25 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
1	26	NA	NA	NA
1	27	NA	NA	NA
1	29	NA	NA	NA
1	30	1998-11-26	2000-01-29	NA
1	32	NA	NA	NA
2	26	NA	NA	NA
2	27	1996-06-22	NA	NA
2	29	NA	NA	NA
2	30	NA	NA	NA
2	32	NA	NA	NA
3	26	2002-07-11	2004-04-05	2007-09-02
3	27	NA	NA	NA
3	29	NA	NA	NA
3	30	NA	NA	NA
3	32	NA	NA	NA
[ omitted 10 entries ]
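The correspondence between the two interfaces can be checked on a toy example. This is a sketch on made-up data (the df table and its columns are hypothetical), and it assumes a tidyr version that provides id_expand (1.2.0 or later):

```r
library(data.table)
library(tidyr)

# Hypothetical data: only 2 of the 4 possible (id1, id2) combinations exist
df <- data.frame(
  id1 = c(1, 2), id2 = c("a", "b"),
  name = c("x", "y"), value = c(10, 20)
)

# tidyr: expand the id columns (id_expand = TRUE)
wide_tidyr <- pivot_wider(df, names_from = name, values_from = value,
                          id_expand = TRUE)

# data.table: do not drop unused LHS combinations (drop = c(FALSE, TRUE))
wide_dt <- dcast(as.data.table(df), id1 + id2 ~ name,
                 value.var = "value", drop = c(FALSE, TRUE))

c(nrow(wide_tidyr), nrow(wide_dt))
```

In both results, the combinations absent from the data are filled with NA.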

2.2.4 Subsetting

Tidyverse
data.table

fam1l |> filter(value >= lubridate::ymd(20030101)) |> 
  pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")

data.frame [3 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
4	32	2004-10-10	2009-08-27	2012-07-21
3	26	NA	2004-04-05	2007-09-02
5	29	NA	2005-02-28	NA

Warning

As far as I know, pivot_wider has no equivalent of dcast’s subset argument: the rows have to be filtered before pivoting.

dcast(FAM1L, family_id + age_mother ~ variable, subset = .(value >= lubridate::ymd(20030101)))

data.table [3 x 5]

family_id	age_mother	dob_child1	dob_child2	dob_child3
3	26	NA	2004-04-05	2007-09-02
4	32	2004-10-10	2009-08-27	2012-07-21
5	29	NA	2005-02-28	NA

2.2.5 Aggregating

In data.table, not specifying the column holding the measure vars (i.e. using . as the right-hand side of the formula) results in a single column named . that counts the number of values in each group, because dcast falls back to length() as its default aggregation function.

Tidyverse
data.table

(pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = length)
  |> mutate(length = apply(pick(matches("_child")), 1, \(x) sum(x))) 
  |> select(-matches("^dob_"))
)

data.frame [5 x 3]

family_id	age_mother	length
1	30	3
2	27	3
3	26	3
4	32	3
5	29	3

dcast(FAM1L, family_id + age_mother ~ .)

data.table [5 x 3]

family_id	age_mother	.
1	30	3
2	27	3
3	26	3
4	32	3
5	29	3
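As a small sketch of that default (the toy table below is made up for illustration), spelling out fun.aggregate = length gives the same kind of count. Since length() also counts NA values, this would explain why every family gets 3 above:

```r
library(data.table)

# Hypothetical toy data: family 1 has two rows (one of them NA), family 2 has one
toy <- data.table(family_id = c(1, 1, 2), value = c(10, NA, 30))

# Spelling out the default: length() counts every row of the group, NA included
counts <- dcast(toy, family_id ~ ., value.var = "value", fun.aggregate = length)

counts[["."]] # 2 1
```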

Customizing the default behavior (length()) using the fun.aggregate (<=> fun.agg or fun) argument:

Here, we count the number of children for each combination of (family_id + age_mother) by summing all non-NA values:

Tidyverse
data.table

(pivot_wider(
    fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x)
  ) 
  |> mutate(child_count = apply(pick(matches("_child")), 1, \(x) sum(x)))
  |> select(-matches("^dob_"))
)

data.frame [5 x 3]

family_id	age_mother	child_count
1	30	2
2	27	1
3	26	3
4	32	3
5	29	2

Alternatives

(pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x))
  |> mutate(child_count = pmap_int(pick(matches("_child")), \(...) sum(...)))
  |> select(-matches("^dob_"))
)

(pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x))
  |> rowwise()
  |> mutate(child_count = sum(c_across(matches("_child"))))
  |> ungroup()
  |> select(-matches("^dob_"))
)

(dcast(FAM1L, family_id + age_mother ~ ., fun = \(x) sum(!is.na(x))) |> setnames(".", "child_count"))

data.table [5 x 3]

family_id	age_mother	child_count
1	30	2
2	27	1
3	26	3
4	32	3
5	29	2

Applying multiple fun.agg:

Data:

(DTL <- data.table(
    id1 = sample(5, 20, TRUE), 
    id2 = sample(2, 20, TRUE), 
    group = sample(letters[1:2], 20, TRUE), 
    v1 = runif(20), 
    v2 = 1L
  )
)

data.table [20 x 5]

id1	id2	group	v1	v2
2	1	a	0.002	1
3	2	b	0.432	1
3	2	a	0.434	1
5	2	a	0.621	1
5	2	a	0.66	1
5	1	b	0.868	1
5	1	b	0.113	1
3	1	b	0.638	1
2	2	b	0.055	1
3	2	b	0.277	1
3	2	a	0.539	1
1	2	a	0.629	1
5	2	b	0.567	1
2	1	a	0.146	1
5	1	a	0.855	1
[ omitted 5 entries ]

Tidyverse
data.table

Multiple aggregation functions applied to one variable:

(pivot_wider(
    DTL, id_cols = c("id1", "id2"), names_from = "group", values_from = "v1",
    names_glue = "{.value}_{.name}", names_vary = "slowest", names_sort = TRUE,
    values_fn = \(x) tibble("sum" = sum(x), "mean" = mean(x))
  ) 
  |> unnest(cols = starts_with("v1"), names_sep = "_")
)

data.frame [9 x 6]

id1	id2	v1_a_sum	v1_a_mean	v1_b_sum	v1_b_mean
2	1	0.148	0.074	NA	NA
3	2	0.973	0.486	0.709	0.354
5	2	1.28	0.64	1.159	0.58
5	1	1.536	0.512	0.981	0.491
3	1	NA	NA	0.638	0.638
2	2	NA	NA	0.055	0.055
1	2	0.629	0.629	NA	NA
4	1	0.793	0.793	NA	NA
4	2	NA	NA	0.361	0.361

Multiple aggregation functions applied to multiple variables (all combinations):

(DTL |> pivot_wider(
    id_cols = c("id1", "id2"), names_from = "group", names_vary = "slowest", names_sort = TRUE,
    values_from = c("v1", "v2"), values_fn = \(x) tibble("sum" = sum(x), "mean" = mean(x))
  ) 
  |> unnest(cols = matches("^v1|^v2"), names_sep = "_")
)

data.frame [9 x 10]

id1	id2	v1_a_sum	v1_a_mean	v2_a_sum	v2_a_mean	v1_b_sum	v1_b_mean	v2_b_sum	v2_b_mean
2	1	0.148	0.074	2	1	NA	NA	NA	NA
3	2	0.973	0.486	2	1	0.709	0.354	2	1
5	2	1.28	0.64	2	1	1.159	0.58	2	1
5	1	1.536	0.512	3	1	0.981	0.491	2	1
3	1	NA	NA	NA	NA	0.638	0.638	1	1
2	2	NA	NA	NA	NA	0.055	0.055	1	1
1	2	0.629	0.629	1	1	NA	NA	NA	NA
4	1	0.793	0.793	1	1	NA	NA	NA	NA
4	2	NA	NA	NA	NA	0.361	0.361	1	1

Multiple aggregation functions applied to multiple variables (one-to-one):

# Not possible with pivot_wider AFAIK
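A possible workaround (a sketch, not a pivot_wider feature) is to aggregate each variable with its own function first, and only then pivot the pre-aggregated result. The toy table below is hypothetical:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data
toy <- data.frame(
  id1   = c(1, 1, 1, 2),
  group = c("a", "a", "b", "a"),
  v1    = c(1, 2, 3, 4),
  v2    = c(10, 20, 30, 40)
)

res <- toy |>
  # Step 1: one aggregation function per variable (sum for v1, mean for v2)
  summarise(v1_sum = sum(v1), v2_mean = mean(v2), .by = c(id1, group)) |>
  # Step 2: spread the groups into columns
  pivot_wider(names_from = group, values_from = c(v1_sum, v2_mean))

names(res) # "id1" "v1_sum_a" "v1_sum_b" "v2_mean_a" "v2_mean_b"
```

Unlike the dcast version, group combinations that are absent from the data come out as NA here (rather than 0 / NaN).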

Multiple aggregation functions applied to one variable:

dcast(DTL, id1 + id2 ~ group, fun = list(sum, mean), value.var = "v1")

data.table [9 x 6]

id1	id2	v1_sum_a	v1_sum_b	v1_mean_a	v1_mean_b
1	2	0.629	0	0.629	NaN
2	1	0.148	0	0.074	NaN
2	2	0	0.055	NaN	0.055
3	1	0	0.638	NaN	0.638
3	2	0.973	0.709	0.486	0.354
4	1	0.793	0	0.793	NaN
4	2	0	0.361	NaN	0.361
5	1	1.536	0.981	0.512	0.491
5	2	1.28	1.159	0.64	0.58

Multiple aggregation functions applied to multiple variables (all combinations):

dcast(DTL, id1 + id2 ~ group, fun = list(sum, mean), value.var = c("v1", "v2"))

data.table [9 x 10]

id1	id2	v1_sum_a	v1_sum_b	v2_sum_a	v2_sum_b	v1_mean_a	v1_mean_b	v2_mean_a	v2_mean_b
1	2	0.629	0	1	0	0.629	NaN	1	NaN
2	1	0.148	0	2	0	0.074	NaN	1	NaN
2	2	0	0.055	0	1	NaN	0.055	NaN	1
3	1	0	0.638	0	1	NaN	0.638	NaN	1
3	2	0.973	0.709	2	2	0.486	0.354	1	1
4	1	0.793	0	1	0	0.793	NaN	1	NaN
4	2	0	0.361	0	1	NaN	0.361	NaN	1
5	1	1.536	0.981	3	2	0.512	0.491	1	1
5	2	1.28	1.159	2	2	0.64	0.58	1	1

Multiple aggregation functions applied to multiple variables (one-to-one):

Here, we apply sum to v1 (for both group a & b), and mean to v2 (for both group a & b)

dcast(DTL, id1 + id2 ~ group, fun = list(sum, mean), value.var = list("v1", "v2"))

data.table [9 x 6]

id1	id2	v1_sum_a	v1_sum_b	v2_mean_a	v2_mean_b
1	2	0.629	0	1	NaN
2	1	0.148	0	1	NaN
2	2	0	0.055	NaN	1
3	1	0	0.638	NaN	1
3	2	0.973	0.709	1	1
4	1	0.793	0	1	NaN
4	2	0	0.361	NaN	1
5	1	1.536	0.981	1	1
5	2	1.28	1.159	1	1

2.2.6 One-hot encoding

Making each level of a variable into a presence/absence column:

movies_long

data.frame [6 x 3]

ID	Genre	OtherCol
1	action	0.768
2	action	0.145
2	adventure	0.749
3	action	0.975
3	adventure	0.381
3	animation	0.09

Tidyverse
data.table

pivot_wider(
  movies_long, names_from = "Genre", values_from = "Genre", 
  values_fn = \(x) !is.na(x), values_fill = FALSE
)

data.frame [6 x 5]

ID	OtherCol	action	adventure	animation
1	0.768	TRUE	FALSE	FALSE
2	0.145	TRUE	FALSE	FALSE
2	0.749	FALSE	TRUE	FALSE
3	0.975	TRUE	FALSE	FALSE
3	0.381	FALSE	TRUE	FALSE
3	0.09	FALSE	FALSE	TRUE

dcast(MOVIES_LONG, ... ~ Genre, value.var = "Genre", fun = \(x) !is.na(x), fill = FALSE)

data.table [6 x 5]

ID	OtherCol	action	adventure	animation
1	0.768	TRUE	FALSE	FALSE
2	0.145	TRUE	FALSE	FALSE
2	0.749	FALSE	TRUE	FALSE
3	0.09	FALSE	FALSE	TRUE
3	0.381	FALSE	TRUE	FALSE
3	0.975	TRUE	FALSE	FALSE

3 Joins

3.1 Mutating Joins

The purpose of mutating joins is to add columns/information from one table to another, by matching their rows.

Data:

(CITIES <- as.data.table(cities))

data.table [10 x 3]

city_id	city	country_id
1	Barcelona	9
2	Bergen	8
3	Bern	10
4	Helsinki	4
5	Linz	1
6	Punaauia	6
7	Queenstown	7
8	Rouen	5
9	Sosua	3
10	Trondheim	8

(COUNTRIES <- as.data.table(countries))

data.table [9 x 2]

country_id	country
1	Austria
2	Canada
3	Dominican Republic
4	Finland
5	France
6	French Polynesia
7	New-Zealand
8	Norway
9	Spain

3.1.1 Left/Right Join

Both left & right joins append the columns of one table to those of another, in the order they are given (i.e. columns of the first table will appear first in the result). However, how rows are matched (and how the ones not finding a match are handled) depends on the type of join:
- Left joins match on the rows of the first (left) table. Unmatched rows from the left table will be kept, but not the right’s.
- Right joins match on the rows of the second (right) table. Unmatched rows from the right table will be kept, but not the left’s.
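The difference between the two can be sketched on two tiny made-up tables (left_tbl and right_tbl are hypothetical), each having one key without a match in the other:

```r
library(dplyr)

# Hypothetical toy tables sharing a key column `k`
left_tbl  <- data.frame(k = c(1, 2, 3), a = c("a1", "a2", "a3"))
right_tbl <- data.frame(k = c(2, 3, 4), b = c("b2", "b3", "b4"))

lj <- left_join(left_tbl, right_tbl, by = "k")  # keeps k = 1 (unmatched on the left)
rj <- right_join(left_tbl, right_tbl, by = "k") # keeps k = 4 (unmatched on the right)

lj$k # 1 2 3
rj$k # 2 3 4
```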

Example

To find out which country each city belongs to, we’re going to merge countries into cities.

Here, we want to add data to the cities table by matching each city to a country (by their country_id). The ideal output would have the columns of cities first, and keep all rows from cities, even if unmatched: thus we will use a left join.

As a left join:

Tidyverse
data.table

left_join(cities, countries, by = "country_id", multiple = "all")

data.frame [10 x 4]

city_id	city	country_id	country
1	Barcelona	9	Spain
2	Bergen	8	Norway
3	Bern	10	NA
4	Helsinki	4	Finland
5	Linz	1	Austria
6	Punaauia	6	French Polynesia
7	Queenstown	7	New-Zealand
8	Rouen	5	France
9	Sosua	3	Dominican Republic
10	Trondheim	8	Norway

data.table natively only supports right joins

FIRST[SECOND] looks up each row of the second table in the first: every row of the second table is kept (matched or not), while the unmatched rows of the first table are dropped.

The normal output of the join

CITIES[COUNTRIES, on = .(country_id)]

data.table [10 x 4]

city_id	city	country_id	country
5	Linz	1	Austria
NA	NA	2	Canada
9	Sosua	3	Dominican Republic
4	Helsinki	4	Finland
8	Rouen	5	France
6	Punaauia	6	French Polynesia
7	Queenstown	7	New-Zealand
2	Bergen	8	Norway
10	Trondheim	8	Norway
1	Barcelona	9	Spain

The unmatched rows from countries were kept, but not the ones from cities. Here are two possible workarounds:

Inverting the two tables (countries first), and then inverting the order of the columns in the result:

COUNTRIES[CITIES, .(city_id, city, country_id, country), on = .(country_id)]

data.table [10 x 4]

city_id	city	country_id	country
1	Barcelona	9	Spain
2	Bergen	8	Norway
3	Bern	10	NA
4	Helsinki	4	Finland
5	Linz	1	Austria
6	Punaauia	6	French Polynesia
7	Queenstown	7	New-Zealand
8	Rouen	5	France
9	Sosua	3	Dominican Republic
10	Trondheim	8	Norway

Adding the columns of countries (in-place) to cities during the join:

copy(CITIES)[COUNTRIES, c("country_id", "country") := list(i.country_id, i.country), on = .(country_id)][]

data.table [10 x 4]

city_id	city	country_id	country
1	Barcelona	9	Spain
2	Bergen	8	Norway
3	Bern	10	NA
4	Helsinki	4	Finland
5	Linz	1	Austria
6	Punaauia	6	French Polynesia
7	Queenstown	7	New-Zealand
8	Rouen	5	France
9	Sosua	3	Dominican Republic
10	Trondheim	8	Norway

We could accomplish a similar result with a right join by inverting the order of the tables. But the order of the columns in the result will be less ideal (countries first):

As a right join:

Tidyverse
data.table

right_join(countries, cities, by = "country_id", multiple = "all")

data.frame [10 x 4]

country_id	country	city_id	city
1	Austria	5	Linz
3	Dominican Republic	9	Sosua
4	Finland	4	Helsinki
5	France	8	Rouen
6	French Polynesia	6	Punaauia
7	New-Zealand	7	Queenstown
8	Norway	2	Bergen
8	Norway	10	Trondheim
9	Spain	1	Barcelona
10	NA	3	Bern

COUNTRIES[CITIES, on = .(country_id)][order(country_id)]

data.table [10 x 4]

country_id	country	city_id	city
1	Austria	5	Linz
3	Dominican Republic	9	Sosua
4	Finland	4	Helsinki
5	France	8	Rouen
6	French Polynesia	6	Punaauia
7	New-Zealand	7	Queenstown
8	Norway	2	Bergen
8	Norway	10	Trondheim
9	Spain	1	Barcelona
10	NA	3	Bern

3.1.2 Full Join

Fully merges the two tables, keeping the unmatched rows from both tables.

Tidyverse
data.table

full_join(cities, countries, by = join_by(country_id))

data.frame [11 x 4]

city_id	city	country_id	country
1	Barcelona	9	Spain
2	Bergen	8	Norway
3	Bern	10	NA
4	Helsinki	4	Finland
5	Linz	1	Austria
6	Punaauia	6	French Polynesia
7	Queenstown	7	New-Zealand
8	Rouen	5	France
9	Sosua	3	Dominican Republic
10	Trondheim	8	Norway
NA	NA	2	Canada

merge(CITIES, COUNTRIES, by = "country_id", all = TRUE)[order(city_id), .(city_id, city, country_id, country)]

data.table [11 x 4]

city_id	city	country_id	country
1	Barcelona	9	Spain
2	Bergen	8	Norway
3	Bern	10	NA
4	Helsinki	4	Finland
5	Linz	1	Austria
6	Punaauia	6	French Polynesia
7	Queenstown	7	New-Zealand
8	Rouen	5	France
9	Sosua	3	Dominican Republic
10	Trondheim	8	Norway
NA	NA	2	Canada

3.1.3 Cross Join

Generating all combinations of the IDs of both tables.

Tidyverse
data.table

cross_join(select(cities, city), select(countries, country))

data.frame [90 x 2]

city	country
Barcelona	Austria
Barcelona	Canada
Barcelona	Dominican Republic
Barcelona	Finland
Barcelona	France
Barcelona	French Polynesia
Barcelona	New-Zealand
Barcelona	Norway
Barcelona	Spain
Bergen	Austria
Bergen	Canada
Bergen	Dominican Republic
Bergen	Finland
Bergen	France
Bergen	French Polynesia
[ omitted 75 entries ]

CJ(city = CITIES[, city], country = COUNTRIES[, country])

data.table [90 x 2]

city	country
Barcelona	Austria
Barcelona	Canada
Barcelona	Dominican Republic
Barcelona	Finland
Barcelona	France
Barcelona	French Polynesia
Barcelona	New-Zealand
Barcelona	Norway
Barcelona	Spain
Bergen	Austria
Bergen	Canada
Bergen	Dominican Republic
Bergen	Finland
Bergen	France
Bergen	French Polynesia
[ omitted 75 entries ]

3.1.4 Inner Join

Merges the columns of both tables and only returns the rows that matched between both tables (no unmatched rows are kept).

Tidyverse
data.table

inner_join(countries, cities, by = "country_id", multiple = "all")

data.frame [9 x 4]

country_id	country	city_id	city
1	Austria	5	Linz
3	Dominican Republic	9	Sosua
4	Finland	4	Helsinki
5	France	8	Rouen
6	French Polynesia	6	Punaauia
7	New-Zealand	7	Queenstown
8	Norway	2	Bergen
8	Norway	10	Trondheim
9	Spain	1	Barcelona

COUNTRIES[CITIES, on = .(country_id), nomatch = NULL]

data.table [9 x 4]

country_id	country	city_id	city
9	Spain	1	Barcelona
8	Norway	2	Bergen
4	Finland	4	Helsinki
1	Austria	5	Linz
6	French Polynesia	6	Punaauia
7	New-Zealand	7	Queenstown
5	France	8	Rouen
3	Dominican Republic	9	Sosua
8	Norway	10	Trondheim

3.1.5 Self join

Merging the table with itself. Typically used on graph-type data represented as a flat table (e.g. hierarchies).

Data:

data.frame [5 x 4]

id	first_name	last_name	manager_id
1	Maisy	Bloom	NA
2	Caine	Farrow	1
3	Waqar	Jarvis	2
4	Lacey-Mai	Rahman	2
5	Merryn	French	3

The goal here is to find the identity of everyone’s direct manager (their n+1) by merging the table on itself:

Tidyverse
data.table

left_join(hiera, hiera, by = join_by(manager_id == id))

data.frame [5 x 7]

id	first_name.x	last_name.x	manager_id	first_name.y	last_name.y	manager_id.y
1	Maisy	Bloom	NA	NA	NA	NA
2	Caine	Farrow	1	Maisy	Bloom	NA
3	Waqar	Jarvis	2	Caine	Farrow	1
4	Lacey-Mai	Rahman	2	Caine	Farrow	1
5	Merryn	French	3	Waqar	Jarvis	2

HIERA[HIERA, on = .(manager_id = id), nomatch = NULL]

data.table [4 x 7]

id	first_name	last_name	manager_id	i.first_name	i.last_name	i.manager_id
2	Caine	Farrow	1	Maisy	Bloom	NA
3	Waqar	Jarvis	2	Caine	Farrow	1
4	Lacey-Mai	Rahman	2	Caine	Farrow	1
5	Merryn	French	3	Waqar	Jarvis	2

3.2 Filtering Joins

Used to filter one table (left) based on another (right): it will only keep the columns from the left table and will either keep (semi join) or discard (anti join) the rows where IDs match between both tables.

3.2.1 Semi join

Note

Will keep the same rows of the first table as an inner join, but will only keep the columns of the first table (no information is added) and will not duplicate its rows when several matches exist.

Here, it will filter countries to only keep the countries having a matching country_id in the cities table.
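A minimal sketch of the difference with an inner join, using a made-up subset of the tables above (countries_toy and cities_toy are hypothetical names): Norway matches two cities, so the inner join returns it twice while the semi join returns it once.

```r
library(dplyr)

# Hypothetical subset of the countries / cities tables
countries_toy <- data.frame(country_id = c(8, 9), country = c("Norway", "Spain"))
cities_toy <- data.frame(
  city       = c("Bergen", "Trondheim", "Barcelona"),
  country_id = c(8, 8, 9)
)

inner <- inner_join(countries_toy, cities_toy, by = "country_id", multiple = "all")
semi  <- semi_join(countries_toy, cities_toy, by = "country_id")

nrow(inner) # 3 (Norway appears twice, once per matching city)
nrow(semi)  # 2 (one row per matching country, no added column)
```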

Tidyverse
data.table

semi_join(countries, cities, by = join_by(country_id))

data.frame [8 x 2]

country_id	country
1	Austria
3	Dominican Republic
4	Finland
5	France
6	French Polynesia
7	New-Zealand
8	Norway
9	Spain

COUNTRIES[country_id %in% CITIES[, unique(country_id)]]

data.table [8 x 2]

country_id	country
1	Austria
3	Dominican Republic
4	Finland
5	France
6	French Polynesia
7	New-Zealand
8	Norway
9	Spain

Alternatives

fsetdiff(COUNTRIES, COUNTRIES[!CITIES, on = "country_id"])

COUNTRIES[!eval(COUNTRIES[!CITIES, on = .(country_id)])]

3.2.2 Anti join

Here, it will filter countries to only keep the countries having no matching country_id in the cities table.

Tidyverse
data.table

anti_join(countries, cities, by = join_by(country_id))

data.frame [1 x 2]

country_id	country
2	Canada

COUNTRIES[!CITIES, on = .(country_id)]

data.table [1 x 2]

country_id	country
2	Canada

Alternatives

COUNTRIES[fsetdiff(COUNTRIES[, .(country_id)], CITIES[, .(country_id)])]

3.3 Non-equi joins

Non-equi joins are joins where the conditions to match rows are no longer strict equalities between the tables’ ID columns.

We can divide non-equi joins between:
- Inequality joins: a general inequality condition between IDs, that could result in multiple matches.
- Rolling joins: only keep the match that minimizes the distance between the IDs (i.e. the closest to perfect equality).
- Overlap joins: matching to all values within a range.

Tip

Please refer to this page of the second edition of R4DS for more detailed explanations.

Data:

Events:

data.table [3 x 4]

e.id	event	e.start	e.end
1	Alice’s graduation	2023-06-05 10:00:00	2023-06-05 13:00:00
2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00
3	Alice & Mark’s wedding	2023-06-07 13:00:00	2023-06-07 18:00:00

Strikes:

data.table [4 x 4]

s.id	strike_motive	s.start	s.end
1	Not enough cheese	2023-06-05 11:00:00	2023-06-05 20:00:00
2	Not enough wine	2023-06-05 14:00:00	2023-06-05 16:00:00
3	Life’s too expensive	2023-06-08 09:00:00	2023-06-08 20:00:00
4	Our team lost some sport event	2023-07-05 16:00:00	2023-07-05 22:00:00

3.3.1 Inequality join

Inequality joins are joins (left, right, inner, …) that use inequalities (<, <=, >=, or >) to specify the matching criteria.

Warning

The condition has to be a simple inequality between existing columns: it cannot be an arbitrary function (e.g. date.x <= min(date.y) * 2 will not work).

For each event, which strikes occurred (finished) before the event?

Tidyverse
data.table

inner_join(events, strikes, join_by(e.start >= s.end))

data.frame [2 x 8]

e.id	event	e.start	e.end	s.id	strike_motive	s.start	s.end
3	Alice & Mark’s wedding	2023-06-07 13:00:00	2023-06-07 18:00:00	1	Not enough cheese	2023-06-05 11:00:00	2023-06-05 20:00:00
3	Alice & Mark’s wedding	2023-06-07 13:00:00	2023-06-07 18:00:00	2	Not enough wine	2023-06-05 14:00:00	2023-06-05 16:00:00

EVENTS[STRIKES, on = .(e.start >= s.end), nomatch = NULL]

data.table [2 x 7]

e.id	event	e.start	e.end	s.id	strike_motive	s.start
3	Alice & Mark’s wedding	2023-06-05 20:00:00	2023-06-07 18:00:00	1	Not enough cheese	2023-06-05 11:00:00
3	Alice & Mark’s wedding	2023-06-05 16:00:00	2023-06-07 18:00:00	2	Not enough wine	2023-06-05 14:00:00

Caution

When specifying an equality or inequality condition, data.table will merge the two columns: only one will remain, with the values of the second column and the name of the first. Here, e.start will have the values of s.end (which will be removed).

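This is the expected behavior of data.table joins rather than a bug: the columns of both tables remain accessible in j through the x. and i. prefixes (e.g. x.e.start, i.s.end).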
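For reference, here is a minimal sketch of how both original columns can be kept, assuming data.table's x. / i. column prefixes inside j. The two tables are a hypothetical subset of the Events / Strikes data, reduced to the columns needed:

```r
library(data.table)

# Hypothetical subset of the Events / Strikes tables
EVENTS_toy <- data.table(
  e.id    = c(1L, 3L),
  e.start = as.POSIXct(c("2023-06-05 10:00:00", "2023-06-07 13:00:00"), tz = "UTC")
)
STRIKES_toy <- data.table(
  s.id  = c(1L, 2L),
  s.end = as.POSIXct(c("2023-06-05 20:00:00", "2023-06-05 16:00:00"), tz = "UTC")
)

# x.<col> reads the column from the outer table, i.<col> from the table passed in `i`
res <- EVENTS_toy[
  STRIKES_toy, on = .(e.start >= s.end), nomatch = NULL,
  .(e.id, e.start = x.e.start, s.id, s.end = i.s.end)
]

res$e.id # 3 3 (only the wedding starts after a strike has ended)
```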

A useful use case for inequality joins is to avoid duplicates when generating combinations of items in cross joins:

Data:

data.frame [3 x 2]

id	name
1	Alice
2	Mark
3	John

All permutations: with duplicates (order matters)

cross_join(people, people)

data.frame [9 x 4]

id.x	name.x	id.y	name.y
1	Alice	1	Alice
1	Alice	2	Mark
1	Alice	3	John
2	Mark	1	Alice
2	Mark	2	Mark
2	Mark	3	John
3	John	1	Alice
3	John	2	Mark
3	John	3	John

All combinations: without duplicates (order doesn’t matter)

inner_join(people, people, join_by(id < id))

data.frame [3 x 4]

id.x	name.x	id.y	name.y
1	Alice	2	Mark
1	Alice	3	John
2	Mark	3	John
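A possible data.table counterpart, sketched under the assumption that the people table is rebuilt as below (PEOPLE is a hypothetical name) and that the x. / i. prefixes are used in j to keep both sides of the self-join:

```r
library(data.table)

# Rebuilding the people table shown above
PEOPLE <- data.table(id = 1:3, name = c("Alice", "Mark", "John"))

# Non-equi self-join: keep the pairs where the outer id is smaller than the inner id
combos <- PEOPLE[
  PEOPLE, on = .(id < id), nomatch = NULL,
  .(id.x = x.id, name.x = x.name, id.y = i.id, name.y = i.name)
]

nrow(combos) # 3, i.e. choose(3, 2)
```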

3.3.2 Rolling joins

Rolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, we get the one where the IDs are the closest to equality.
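Before the real examples, the idea can be sketched on made-up numeric data (readings and queries are hypothetical tables): for each query time, we look for the closest reading before it (roll = TRUE) or after it (roll = -Inf), with dplyr's closest() as a point of comparison.

```r
library(data.table)
library(dplyr)

# Hypothetical toy data: readings taken at times 1, 5 and 10, queried at 4, 6 and 11
readings <- data.table(r_time = c(1, 5, 10), reading = c("a", "b", "c"))
queries  <- data.table(q_time = c(4, 6, 11))

# roll = TRUE: last reading at or before each query time (rolled forward)
at_or_before <- readings[queries, on = .(r_time = q_time), roll = TRUE]
# roll = -Inf: first reading at or after each query time (rolled backward)
at_or_after <- readings[queries, on = .(r_time = q_time), roll = -Inf]

at_or_before$reading # "a" "b" "c"
at_or_after$reading  # "b" "c" NA

# dplyr counterpart of roll = TRUE
closest_before <- left_join(
  as.data.frame(queries), as.data.frame(readings),
  join_by(closest(q_time >= r_time))
)
closest_before$reading # "a" "b" "c"
```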

Tidyverse
data.table

Which strike started the soonest after the beginning of an event?

inner_join(events, strikes, join_by(closest(e.start <= s.start)))

data.frame [3 x 8]

e.id	event	e.start	e.end	s.id	strike_motive	s.start	s.end
1	Alice’s graduation	2023-06-05 10:00:00	2023-06-05 13:00:00	1	Not enough cheese	2023-06-05 11:00:00	2023-06-05 20:00:00
2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00	2	Not enough wine	2023-06-05 14:00:00	2023-06-05 16:00:00
3	Alice & Mark’s wedding	2023-06-07 13:00:00	2023-06-07 18:00:00	3	Life’s too expensive	2023-06-08 09:00:00	2023-06-08 20:00:00

Which strike ended the soonest before the start of an event?

inner_join(events, strikes, join_by(closest(e.start >= s.end)))

data.frame [1 x 8]

e.id	event	e.start	e.end	s.id	strike_motive	s.start	s.end
3	Alice & Mark’s wedding	2023-06-07 13:00:00	2023-06-07 18:00:00	1	Not enough cheese	2023-06-05 11:00:00	2023-06-05 20:00:00

Which strike started the soonest after the beginning of an event?

EVENTS[STRIKES, on = .(e.start == s.start), roll = "nearest"
     ][, .SD[which.min(abs(e.start - e.end))], by = "e.id"]

data.table [3 x 7]

e.id	event	e.start	e.end	s.id	strike_motive	s.end
1	Alice’s graduation	2023-06-05 11:00:00	2023-06-05 13:00:00	1	Not enough cheese	2023-06-05 20:00:00
2	John’s birthday	2023-06-05 14:00:00	2023-06-05 22:00:00	2	Not enough wine	2023-06-05 16:00:00
3	Alice & Mark’s wedding	2023-06-08 09:00:00	2023-06-07 18:00:00	3	Life’s too expensive	2023-06-08 20:00:00

Note

Using the roll argument relaxes the equality constraint of the join (here, e.start == s.start).

Which strike ended the soonest before the start of an event?

EVENTS[STRIKES, on = .(e.start == s.end), roll = -Inf
      ][, .SD[which.min(abs(e.start - e.end))], by = "e.id"]

data.table [1 x 7]

e.id	event	e.start	e.end	s.id	strike_motive	s.start
3	Alice & Mark’s wedding	2023-06-05 20:00:00	2023-06-07 18:00:00	1	Not enough cheese	2023-06-05 11:00:00

3.3.3 Overlap joins

Tidyverse
data.table

dplyr helper functions

dplyr provides three helper functions to make it easier to work with intervals:
- between(x, y_min, y_max) <=> x >= y_min, x <= y_max: a value of the first table is within a given range of the second
- within(x_min, x_max, y_min, y_max) <=> x_min >= y_min, x_max <= y_max: the ranges of the first table are contained within the second’s
- overlaps(x_min, x_max, y_min, y_max) <=> x_min <= y_max, x_max >= y_min: the two ranges overlap partially or totally, in any direction
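The three conditions can be written out as plain predicates to check them on a few values. This is only a sketch: the is_* helpers are hypothetical names, not dplyr's implementation, and times are reduced to hours of 2023-06-05.

```r
# Hypothetical helpers mirroring the three conditions listed above
is_between     <- function(x, y_min, y_max) x >= y_min & x <= y_max
is_within      <- function(x_min, x_max, y_min, y_max) x_min >= y_min & x_max <= y_max
is_overlapping <- function(x_min, x_max, y_min, y_max) x_min <= y_max & x_max >= y_min

# Graduation: 10h-13h, birthday: 12h-22h, "Not enough wine" strike: 14h-16h
is_between(11, 10, 12)         # TRUE: the cheese strike (11h) starts in the 2h before the birthday
is_within(14, 16, 12, 22)      # TRUE: the wine strike happens entirely during the birthday
is_overlapping(10, 13, 12, 22) # TRUE: graduation and birthday overlap
is_within(10, 13, 12, 22)      # FALSE: overlapping, but not contained
```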

Between: Which events had a strike starting in the two hours before the beginning of the event?

Tip

First, we need to create the new “2 hours before the beginning of the event” column since we cannot use arbitrary functions in join_by() (e.g. we cannot do between(s.start, e.start - hours(2), e.start))

events2 <- mutate(events, e.start_minus2 = e.start - hours(2))

inner_join(strikes, events2, join_by(between(s.start, e.start_minus2, e.start))) |> 
  select(colnames(events), colnames(strikes)) # Re-ordering the columns

data.frame [1 x 8]

e.id	event	e.start	e.end	s.id	strike_motive	s.start	s.end
2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00	1	Not enough cheese	2023-06-05 11:00:00	2023-06-05 20:00:00

Note

By default, the value to match needs to be from the first table, and the range it falls within needs to be from the second table. Depending on the column order we need, this can force us to reorder the columns post-join (as in the above example).

This can be alleviated by manually specifying which table each column comes from, using x$col and y$col (x referring to the first table).

inner_join(events2, strikes, join_by(between(y$s.start, x$e.start_minus2, x$e.start))) |> 
  select(-e.start_minus2)

data.frame [1 x 8]

e.id	event	e.start	e.end	s.id	strike_motive	s.start	s.end
2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00	1	Not enough cheese	2023-06-05 11:00:00	2023-06-05 20:00:00

Manually:

inner_join(events2, strikes, join_by(e.start_minus2 <= s.start, e.start >= s.start)) |> 
  select(-e.start_minus2)

data.frame [1 x 8]

e.id	event	e.start	e.end	s.id	strike_motive	s.start	s.end
2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00	1	Not enough cheese	2023-06-05 11:00:00	2023-06-05 20:00:00

Within: Which strikes occurred entirely within the period of an event?

inner_join(strikes, events, join_by(within(s.start, s.end, e.start, e.end)), multiple = "all") |> 
  select(colnames(events), colnames(strikes)) # Re-ordering the columns

data.frame [1 x 8]

e.id	event	e.start	e.end	s.id	strike_motive	s.start	s.end
2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00	2	Not enough wine	2023-06-05 14:00:00	2023-06-05 16:00:00

Note

As before, within() requires the first range to be within the second by default, meaning the first table must be the one with the smaller range. Using x$col and y$col resolves the issue of column order.

inner_join(events, strikes, join_by(within(y$s.start, y$s.end, x$e.start, x$e.end)), multiple = "all")

data.frame [1 x 8]

e.id	event	e.start	e.end	s.id	strike_motive	s.start	s.end
2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00	2	Not enough wine	2023-06-05 14:00:00	2023-06-05 16:00:00

Manually:

inner_join(events, strikes, join_by(e.start <= s.start, e.end >= s.end), multiple = "all")

data.frame [1 x 8]

e.id	event	e.start	e.end	s.id	strike_motive	s.start	s.end
2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00	2	Not enough wine	2023-06-05 14:00:00	2023-06-05 16:00:00

Overlaps: Which events overlap with each other?

inner_join(events, events, join_by(e.id < e.id, overlaps(e.start, e.end, e.start, e.end)))

data.frame [1 x 8]

e.id.x	event.x	e.start.x	e.end.x	e.id.y	event.y	e.start.y	e.end.y
1	Alice’s graduation	2023-06-05 10:00:00	2023-06-05 13:00:00	2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00

Manually:

inner_join(events, events, join_by(e.id < e.id, e.start <= e.end, e.end >= e.start))

data.frame [1 x 8]

e.id.x	event.x	e.start.x	e.end.x	e.id.y	event.y	e.start.y	e.end.y
1	Alice’s graduation	2023-06-05 10:00:00	2023-06-05 13:00:00	2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00

Between: Which events had a strike starting in the two hours before the beginning of the event?

copy(EVENTS)[, e.start_minus2 := e.start - hours(2)
           ][STRIKES, on = .(e.start_minus2 <= s.start, e.start >= s.start), nomatch = NULL
           ][, -"e.start_minus2"]

data.table [1 x 7]

e.id	event	e.start	e.end	s.id	strike_motive	s.end
2	John’s birthday	2023-06-05 11:00:00	2023-06-05 22:00:00	1	Not enough cheese	2023-06-05 20:00:00

Within: Which strikes occurred entirely within the period of an event?

EVENTS[STRIKES, on = .(e.start <= s.start, e.end >= s.end), nomatch = NULL]

data.table [1 x 6]

e.id	event	e.start	e.end	s.id	strike_motive
2	John’s birthday	2023-06-05 14:00:00	2023-06-05 16:00:00	2	Not enough wine

Overlaps: Which events overlap with each other?

EVENTS[EVENTS, on = .(e.id < e.id, e.start <= e.end, e.end >= e.start), nomatch = NULL]

data.table [1 x 5]

e.id	event	e.start	e.end	i.event
2	Alice’s graduation	2023-06-05 22:00:00	2023-06-05 12:00:00	John’s birthday

setkey(EVENTS, e.start, e.end)

foverlaps(EVENTS, EVENTS, type = "any", mult = "first", nomatch = NULL)[e.id != i.e.id]

data.table [1 x 8]

e.id	event	e.start	e.end	i.e.id	i.event	i.e.start	i.e.end
1	Alice’s graduation	2023-06-05 10:00:00	2023-06-05 13:00:00	2	John’s birthday	2023-06-05 12:00:00	2023-06-05 22:00:00

4 Tidyr & Others

4.1 Remove NA

Tidyverse
data.table

tidyr::drop_na(IRIS, Species)

data.table [150 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 135 entries ]

tidyr::drop_na(IRIS, matches("Sepal"))

data.table [150 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 135 entries ]

na.omit(IRIS, cols = "Species")

data.table [150 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 135 entries ]

na.omit(IRIS, cols = str_subset(colnames(IRIS), "Sepal"))

data.table [150 x 5]

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
5.4	3.7	1.5	0.2	setosa
4.8	3.4	1.6	0.2	setosa
4.8	3	1.4	0.1	setosa
4.3	3	1.1	0.1	setosa
5.8	4	1.2	0.2	setosa
[ omitted 135 entries ]
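Since iris contains no missing values, the calls above return all 150 rows. A minimal sketch on a made-up table with actual NAs (toy is hypothetical) makes the effect of the column selection visible:

```r
library(data.table)
library(tidyr)

# Hypothetical toy data with one NA in each column
toy <- data.table(a = c(1, NA, 3), b = c("x", "y", NA))

# Only looking at column a: the second row is dropped
nrow(drop_na(toy, a))          # 2
nrow(na.omit(toy, cols = "a")) # 2

# Looking at all columns: only the first row is complete
nrow(drop_na(toy)) # 1
nrow(na.omit(toy)) # 1
```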

4.2 Unite

Combine multiple columns into a single one:

Tidyverse
data.table

mtcars |> tidyr::unite("x", gear, carb, sep = "_")

data.frame [32 x 10]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	x
21	6	160	110	3.9	2.62	16.46	0	1	4_4
21	6	160	110	3.9	2.875	17.02	0	1	4_4
22.8	4	108	93	3.85	2.32	18.61	1	1	4_1
21.4	6	258	110	3.08	3.215	19.44	1	0	3_1
18.7	8	360	175	3.15	3.44	17.02	0	0	3_2
18.1	6	225	105	2.76	3.46	20.22	1	0	3_1
14.3	8	360	245	3.21	3.57	15.84	0	0	3_4
24.4	4	146.7	62	3.69	3.19	20	1	0	4_2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4_2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4_4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4_4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3_3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3_3
15.2	8	275.8	180	3.07	3.78	18	0	0	3_3
10.4	8	472	205	2.93	5.25	17.98	0	0	3_4
[ omitted 17 entries ]

copy(MT)[, x := paste(gear, carb, sep = "_")][]

data.table [32 x 12]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	x
21	6	160	110	3.9	2.62	16.46	0	1	4	4	4_4
21	6	160	110	3.9	2.875	17.02	0	1	4	4	4_4
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1	4_1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	3_1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	3_2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1	3_1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4	3_4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	4_2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2	4_2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4	4_4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	4_4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3	3_3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3	3_3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3	3_3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4	3_4
[ omitted 17 entries ]
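Note that unite() removes the original columns by default while the data.table version above keeps them. A possible way to get the same output, sketched here on a fresh copy of mtcars (MT_toy is a hypothetical name), is to delete them by reference afterwards:

```r
library(data.table)
library(tidyr)

MT_toy <- as.data.table(mtcars)

# Create the united column, then delete the originals by reference
united_dt <- copy(MT_toy)[, x := paste(gear, carb, sep = "_")][, c("gear", "carb") := NULL][]

# tidyr reference: remove = TRUE is the default
united_tidy <- unite(mtcars, "x", gear, carb, sep = "_")

identical(names(united_dt), names(united_tidy)) # TRUE
identical(united_dt$x, united_tidy$x)           # TRUE
```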

4.3 Separate / Extract

4.3.1 Separate wider (extract)

(MT.ext <- MT[, .(x = str_c(gear, carb, sep = "_"))])

data.table [32 x 1]

x
4_4
4_4
4_1
3_1
3_2
3_1
3_4
4_2
4_2
4_4
4_4
3_3
3_3
3_3
3_4
[ omitted 17 entries ]

Tidyverse
data.table

Based on a delimiter:

MT.ext |> separate_wider_delim(x, delim = "_", names = c("gear", "carb"))

data.frame [32 x 2]

gear	carb
4	4
4	4
4	1
3	1
3	2
3	1
3	4
4	2
4	2
4	4
4	4
3	3
3	3
3	3
3	4
[ omitted 17 entries ]

Based on a regex:

MT.ext |> separate_wider_regex(x, patterns = c(gear = "\\d{1}", "_", carb = "\\d{1}"))

data.frame [32 x 2]

gear	carb
4	4
4	4
4	1
3	1
3	2
3	1
3	4
4	2
4	2
4	4
4	4
3	3
3	3
3	3
3	4
[ omitted 17 entries ]

Based on position:

MT.ext |> separate_wider_position(x, widths  = c(gear = 1, delim = 1, carb = 1))

data.frame [32 x 3]

gear	delim	carb
4	_	4
4	_	4
4	_	1
3	_	1
3	_	2
3	_	1
3	_	4
4	_	2
4	_	2
4	_	4
4	_	4
3	_	3
3	_	3
3	_	3
3	_	4
[ omitted 17 entries ]

Note

separate_wider_* supersedes both extract and separate.

Old syntax

tidyr::separate(MT.ext, x, into = c("gear", "carb"), sep = "_", remove = TRUE)

tidyr::extract(MT.ext, x, into = c("gear", "carb"), regex = "(.*)_(.*)", remove = TRUE)

Based on a delimiter:

copy(MT.ext)[, c("gear", "carb") := tstrsplit(x, "_", fixed = TRUE)][]

data.table [32 x 3]

x	gear	carb
4_4	4	4
4_4	4	4
4_1	4	1
3_1	3	1
3_2	3	2
3_1	3	1
3_4	3	4
4_2	4	2
4_2	4	2
4_4	4	4
4_4	4	4
3_3	3	3
3_3	3	3
3_3	3	3
3_4	3	4
[ omitted 17 entries ]

Based on a regex:

copy(MT.ext)[, c("gear", "carb") := str_extract_all(x, "\\d") |> list_transpose()][]

data.table [32 x 3]

x	gear	carb
4_4	4	4
4_4	4	4
4_1	4	1
3_1	3	1
3_2	3	2
3_1	3	1
3_4	3	4
4_2	4	2
4_2	4	2
4_4	4	4
4_4	4	4
3_3	3	3
3_3	3	3
3_3	3	3
3_4	3	4
[ omitted 17 entries ]

4.3.2 Separate longer/rows

Separating a row into multiple rows, duplicating the rest of the values.

Data

(SP <- data.table(
  val = c(1,"2,3",4), 
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"), origin = "1970-01-01")
  )
)

data.table [3 x 2]

val	date
1	2020-01-01
2,3	2020-01-02
4	2020-01-03

Tidyverse
data.table

Based on a delimiter:

SP |> separate_longer_delim(val, delim = ",")

data.frame [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

Based on position:

SP |> separate_longer_position(val, width = 1) |> filter(val != ",")

data.frame [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

Warning

separate_longer_* now supersedes separate_rows.

Old syntax

SP |> separate_rows(val, sep = ",", convert = TRUE)

Solution 1:

copy(SP)[, c(V1 = strsplit(val, ",", fixed = TRUE), .SD), by = val][, let(val = V1, V1 = NULL)][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

Solution 2:

SP[, strsplit(val, ",", fixed = TRUE), by = val][SP, on = "val"][, let(val = V1, V1 = NULL)][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

Solution 3:

(With type conversion)

SP[, unlist(tstrsplit(val, ",", type.convert = TRUE)), by = val][SP, on = "val"][, let(val = V1, V1 = NULL)][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

Solution 4:

copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))][, val := strsplit(val, ","), by = val][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

(With type conversion)

copy(SP)[rep(1:.N, lengths(strsplit(val, ",")))
       ][, val := strsplit(val, ","), by = val
       ][, val := utils::type.convert(val, as.is = TRUE)][]

data.table [4 x 2]

val	date
1	2020-01-01
2	2020-01-02
3	2020-01-02
4	2020-01-03

4.4 Duplicates

4.4.1 Duplicated rows

4.4.1.1 Only keeping duplicated rows

Tidyverse
data.table

mtcars |> filter(n() > 1, .by = c(mpg, hp))

data.frame [2 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
21	6	160	110	3.9	2.62	16.46	0	1	4	4
21	6	160	110	3.9	2.875	17.02	0	1	4	4

MT[, if(.N > 1) .SD, by = .(mpg, hp)]

data.table [2 x 11]

mpg	hp	cyl	disp	drat	wt	qsec	vs	am	gear	carb
21	110	6	160	3.9	2.62	16.46	0	1	4	4
21	110	6	160	3.9	2.875	17.02	0	1	4	4

4.4.1.2 Removing duplicated rows

Note

This is different from distinct/unique, which keep one row from each group of duplicates.

Here, every group that contains duplicated rows is removed entirely.

Tidyverse
data.table

mtcars |> filter(n() == 1, .by = c(mpg, hp))

data.frame [30 x 11]

mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
10.4	8	460	215	3	5.424	17.82	0	0	3	4
14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
[ omitted 15 entries ]

Alternatives

# More convoluted

mtcars |> filter(n() > 1, .by = c(mpg, hp)) |> anti_join(mtcars, y = _)

MT[, if(.N == 1) .SD, by = .(mpg, hp)]

data.table [30 x 11]

mpg	hp	cyl	disp	drat	wt	qsec	vs	am	gear	carb
22.8	93	4	108	3.85	2.32	18.61	1	1	4	1
21.4	110	6	258	3.08	3.215	19.44	1	0	3	1
18.7	175	8	360	3.15	3.44	17.02	0	0	3	2
18.1	105	6	225	2.76	3.46	20.22	1	0	3	1
14.3	245	8	360	3.21	3.57	15.84	0	0	3	4
24.4	62	4	146.7	3.69	3.19	20	1	0	4	2
22.8	95	4	140.8	3.92	3.15	22.9	1	0	4	2
19.2	123	6	167.6	3.92	3.44	18.3	1	0	4	4
17.8	123	6	167.6	3.92	3.44	18.9	1	0	4	4
16.4	180	8	275.8	3.07	4.07	17.4	0	0	3	3
17.3	180	8	275.8	3.07	3.73	17.6	0	0	3	3
15.2	180	8	275.8	3.07	3.78	18	0	0	3	3
10.4	205	8	472	2.93	5.25	17.98	0	0	3	4
10.4	215	8	460	3	5.424	17.82	0	0	3	4
14.7	230	8	440	3.23	5.345	17.42	0	0	3	4
[ omitted 15 entries ]

Alternatives

# More convoluted

MT[!MT[, if(.N > 1) .SD, by = .(mpg, hp)], on = colnames(MT)]

fsetdiff(MT, setcolorder(MT[, if(.N > 1) .SD, by = .(mpg, hp)], colnames(MT)))
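Another option, not taken from the post, is to flag duplicates with data.table's `duplicated()` method, scanning once from the top and once from the bottom (`fromLast = TRUE`). The sketch below assumes `MT` is simply `mtcars` converted to a data.table; `dup_idx`, `dups` and `uniq` are illustrative names:

```r
library(data.table)

# Assumption: MT is mtcars as a data.table
MT <- as.data.table(mtcars)

# TRUE for every row whose (mpg, hp) pair appears more than once
dup_idx <- duplicated(MT, by = c("mpg", "hp")) |
  duplicated(MT, by = c("mpg", "hp"), fromLast = TRUE)

dups <- MT[dup_idx]   # only the duplicated rows
uniq <- MT[!dup_idx]  # all groups with duplicates removed
```

Unlike the grouped versions, this keeps the original column order, since nothing is moved to the front by `by`.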

4.4.2 Duplicated values (per row)

(DUPED <- data.table(
    A = c("A1", "A2", "B3", "A4"), 
    B = c("B1", "B2", "B3", "B4"), 
    C = c("A1", "C2", "D3", "C4"), 
    D = c("A1", "D2", "D3", "D4")
  )
)

data.table [4 x 4]

A	B	C	D
A1	B1	A1	A1
A2	B2	C2	D2
B3	B3	D3	D3
A4	B4	C4	D4

Tidyverse
data.table

mutate(DUPED, Repeats = apply(
    pick(everything()), 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", ")
  )
)

data.table [4 x 5]

A	B	C	D	Repeats
A1	B1	A1	A1	A1
A2	B2	C2	D2
B3	B3	D3	D3	B3, D3
A4	B4	C4	D4

copy(DUPED)[
  , Repeats := apply(.SD, 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", "))
  ][]

data.table [4 x 5]

A	B	C	D	Repeats
A1	B1	A1	A1	A1
A2	B2	C2	D2
B3	B3	D3	D3	B3, D3
A4	B4	C4	D4

With duplication counter:

dup_counts <- function(v) {
  rles <- as.data.table(unclass(rle(v[which(duplicated(v))])))[, lengths := lengths + 1]
  paste(apply(rles, 1, \(r) paste0(r[2], " (", r[1], ")")), collapse = ", ")
}

Tidyverse
data.table

DUPED |> mutate(Repeats = apply(pick(everything()), 1, \(r) dup_counts(r)))

data.table [4 x 5]

A	B	C	D	Repeats
A1	B1	A1	A1	A1 (3)
A2	B2	C2	D2
B3	B3	D3	D3	B3 (2), D3 (2)
A4	B4	C4	D4

copy(DUPED)[, Repeats := apply(.SD, 1, \(r) dup_counts(r))][]

data.table [4 x 5]

A	B	C	D	Repeats
A1	B1	A1	A1	A1 (3)
A2	B2	C2	D2
B3	B3	D3	D3	B3 (2), D3 (2)
A4	B4	C4	D4
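Because `rle()` only merges adjacent repeats, `dup_counts()` relies on identical values ending up next to each other once the first occurrences are dropped. A row such as A, B, A, B, A would list A twice. A possible variant based on `table()` (the name `dup_counts_tbl` is hypothetical) counts each value regardless of position:

```r
# Hypothetical alternative to dup_counts(): counts every value, wherever it sits
dup_counts_tbl <- function(v) {
  # factor() with explicit levels keeps the order of first appearance
  counts <- table(factor(v, levels = unique(v)))
  counts <- counts[counts > 1]          # keep values seen more than once
  if (length(counts) == 0) return("")   # no duplicates in this row
  paste0(names(counts), " (", as.vector(counts), ")", collapse = ", ")
}

dup_counts_tbl(c("A", "B", "A", "B", "A"))
```

It can be dropped into the same `apply(..., 1, ...)` calls in place of `dup_counts()`.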

4.5 Expand & Complete

Here, the entry for person B in year 2010 is missing, and we want to fill it in:

(CAR <- data.table(
    year = c(2010,2011,2012,2013,2014,2015,2011,2012,2013,2014,2015), 
    person = c("A","A","A","A","A","A", "B","B","B","B","B"),
    car = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes", "Citroen","Citroen", "Citroen", "Toyota", "Toyota")
  )
)

data.table [11 x 3]

year	person	car
2 010	A	BMW
2 011	A	BMW
2 012	A	AUDI
2 013	A	AUDI
2 014	A	AUDI
2 015	A	Mercedes
2 011	B	Citroen
2 012	B	Citroen
2 013	B	Citroen
2 014	B	Toyota
2 015	B	Toyota

4.5.1 Expand

Tidyverse
data.table

tidyr::expand(CAR, person, year)

data.frame [12 x 2]

person	year
A	2 010
A	2 011
A	2 012
A	2 013
A	2 014
A	2 015
B	2 010
B	2 011
B	2 012
B	2 013
B	2 014
B	2 015

CJ(CAR$person, CAR$year, unique = TRUE)

data.table [12 x 2]

V1	V2
A	2 010
A	2 011
A	2 012
A	2 013
A	2 014
A	2 015
B	2 010
B	2 011
B	2 012
B	2 013
B	2 014
B	2 015
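The `V1`/`V2` headers appear because the inputs are passed as `CAR$person` and `CAR$year`. A small sketch of naming the `CJ()` arguments explicitly so the result carries meaningful column names (the `CAR` table is rebuilt here so the block runs on its own; `GRID` is an illustrative name):

```r
library(data.table)

CAR <- data.table(
  year   = c(2010, 2011, 2012, 2013, 2014, 2015, 2011, 2012, 2013, 2014, 2015),
  person = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  car    = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes",
             "Citroen", "Citroen", "Citroen", "Toyota", "Toyota")
)

# Named arguments become the column names of the cross join
GRID <- CJ(person = CAR$person, year = CAR$year, unique = TRUE)
GRID
```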

4.5.2 Complete

Joins the original dataset with the expanded one:

Tidyverse
data.table

CAR |> tidyr::complete(person, year)

data.frame [12 x 3]

person	year	car
A	2 010	BMW
A	2 011	BMW
A	2 012	AUDI
A	2 013	AUDI
A	2 014	AUDI
A	2 015	Mercedes
B	2 010	NA
B	2 011	Citroen
B	2 012	Citroen
B	2 013	Citroen
B	2 014	Toyota
B	2 015	Toyota

CAR[CJ(person, year, unique = TRUE), on = .(person, year)]

data.table [12 x 3]

year	person	car
2 010	A	BMW
2 011	A	BMW
2 012	A	AUDI
2 013	A	AUDI
2 014	A	AUDI
2 015	A	Mercedes
2 010	B	NA
2 011	B	Citroen
2 012	B	Citroen
2 013	B	Citroen
2 014	B	Toyota
2 015	B	Toyota
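Completing leaves `NA` in the newly created rows. If a placeholder is preferred, `tidyr::complete()` accepts a `fill` list. The sketch below uses a plain data.frame copy of the data, and the `"Unknown"` label is a hypothetical choice rather than something from the post:

```r
library(tidyr)

car_df <- data.frame(
  year   = c(2010, 2011, 2012, 2013, 2014, 2015, 2011, 2012, 2013, 2014, 2015),
  person = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  car    = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes",
             "Citroen", "Citroen", "Citroen", "Toyota", "Toyota")
)

# fill = list(col = value) replaces the NAs in the completed rows
filled <- complete(car_df, person, year, fill = list(car = "Unknown"))
filled
```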

4.6 Uncount

Duplicating aggregated rows to get back the un-aggregated version.

Data

cols <- c("Mild", "Moderate", "Severe")

dat_agg

data.frame [10 x 6]

ID	Site	Domain	Mild	Moderate	Severe
1	23	A1	4	0	0
2	27	A1	0	1	1
3	28	A1	0	1	0
4	29	A1	0	0	1
5	31	A1	0	1	0
6	33	A1	0	1	1
7	41	A1	3	0	1
8	48	A1	0	2	4
9	64	A1	1	0	0
10	66	A1	1	0	0

Tidyverse
data.table

dat_agg |> 
  pivot_longer(cols = all_of(cols), names_to = "Severity", values_to = "Count") |> 
  uncount(Count) |> 
  mutate(ID_new = row_number(), .after = "ID") |>
  pivot_wider(
    names_from = "Severity", values_from = "Severity", 
    values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill = 0
  )

data.frame [23 x 7]

ID	ID_new	Site	Domain	Mild	Moderate	Severe
1	1	23	A1	1	0	0
1	2	23	A1	1	0	0
1	3	23	A1	1	0	0
1	4	23	A1	1	0	0
2	5	27	A1	0	1	0
2	6	27	A1	0	0	1
3	7	28	A1	0	1	0
4	8	29	A1	0	0	1
5	9	31	A1	0	1	0
6	10	33	A1	0	1	0
6	11	33	A1	0	0	1
7	12	41	A1	1	0	0
7	13	41	A1	1	0	0
7	14	41	A1	1	0	0
7	15	41	A1	0	0	1
[ omitted 8 entries ]

Solution 1:

(melt(DAT_AGG, measure.vars = cols, variable.name = "Severity", value.name = "Count")
  [rep(1:.N, Count)][, ID_new := .I] 
  |> dcast(... ~ Severity, value.var = "Severity", fun.agg = \(x) ifelse(is.na(x), 0, 1), fill = 0)
  |> _[, -"Count"]
)

data.table [23 x 7]

ID	Site	Domain	ID_new	Mild	Moderate	Severe
1	23	A1	1	1	0	0
1	23	A1	2	1	0	0
1	23	A1	3	1	0	0
1	23	A1	4	1	0	0
2	27	A1	10	0	1	0
2	27	A1	16	0	0	1
3	28	A1	11	0	1	0
4	29	A1	17	0	0	1
5	31	A1	12	0	1	0
6	33	A1	13	0	1	0
6	33	A1	18	0	0	1
7	41	A1	19	0	0	1
7	41	A1	5	1	0	0
7	41	A1	6	1	0	0
7	41	A1	7	1	0	0
[ omitted 8 entries ]

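Solution 2:

(Unlike Solution 1, each duplicated row keeps every non-zero severity flag it had, capped at 1, so IDs with several severities are not one-hot encoded.)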

DAT_AGG[Reduce(`c`, sapply(mget(cols), \(x) rep(1:.N, x)))
      ][, (cols) := lapply(.SD, \(x) ifelse(x > 1, 1, x)), .SDcols = cols
      ][order(ID)]

data.table [23 x 6]

ID	Site	Domain	Mild	Moderate	Severe
1	23	A1	1	0	0
1	23	A1	1	0	0
1	23	A1	1	0	0
1	23	A1	1	0	0
2	27	A1	0	1	1
2	27	A1	0	1	1
3	28	A1	0	1	0
4	29	A1	0	0	1
5	31	A1	0	1	0
6	33	A1	0	1	1
6	33	A1	0	1	1
7	41	A1	1	0	1
7	41	A1	1	0	1
7	41	A1	1	0	1
7	41	A1	1	0	1
[ omitted 8 entries ]
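Stripped of the pivoting, the core of both data.table solutions is repeating row indices according to a count column. A minimal sketch on a hypothetical toy table (`AGG`, `id` and `n` are illustrative names):

```r
library(data.table)

# Hypothetical aggregated data: n is the number of times each row should appear
AGG <- data.table(id = 1:3, n = c(2L, 0L, 3L))

# rep() builds the row indices 1, 1, 3, 3, 3; rows with n = 0 disappear
LONG <- AGG[rep(seq_len(.N), n)][, n := NULL][]
LONG
```

This is the data.table counterpart of `tidyr::uncount(AGG, n)`, which also drops the count column by default.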

4.7 List / Unlist

When a column contains a simple vector/list of values (of the same type, without structure)

4.7.1 One listed column

Single ID (grouping) column:

Data:

MT_LIST

data.table [3 x 2]

cyl	mpg
4	<numeric [11]>
6	<numeric [7]>
8	<numeric [14]>

Tidyverse
data.table

mt_list |> unnest(cols = mpg)

data.frame [32 x 2]

cyl	mpg
6	21
6	21
6	21.4
6	18.1
6	19.2
6	17.8
6	19.7
4	22.8
4	24.4
4	22.8
4	32.4
4	30.4
4	33.9
4	21.5
4	27.3
[ omitted 17 entries ]

MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]

data.table [32 x 2]

cyl	mpg
4	22.8
4	24.4
4	22.8
4	32.4
4	30.4
4	33.9
4	21.5
4	27.3
4	26
4	30.4
4	21.4
6	21
6	21
6	21.4
6	18.1
[ omitted 17 entries ]

Alternative that avoids grouping: first repeat each row as many times as its list has elements (growing the data.table back to its original number of rows), then overwrite the list column with its unlisted values:

MT_LIST[rep(MT_LIST[, .I], lengths(mpg))][, mpg := unlist(MT_LIST$mpg)][]

data.table [32 x 2]

cyl	mpg
4	22.8
4	24.4
4	22.8
4	32.4
4	30.4
4	33.9
4	21.5
4	27.3
4	26
4	30.4
4	21.4
6	21
6	21
6	21.4
6	18.1
[ omitted 17 entries ]
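The post does not show how `MT_LIST` was built. One plausible construction, given here as an assumption, wraps the column in `list()` per group. The sketch then checks that unlisting brings back all the original values:

```r
library(data.table)

MT <- as.data.table(mtcars)

# Assumed construction: one row per cyl, with mpg stored as a list column
MT_LIST <- MT[, .(mpg = list(mpg)), keyby = cyl]

# Back to one row per value
FLAT <- MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]
```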

Multiple ID (grouping) columns:

Data:

mt_list2

data.frame [8 x 3]

cyl	gear	mpg
6	4	<numeric [4]>
4	4	<numeric [8]>
6	3	<numeric [2]>
8	3	<numeric [12]>
4	3	<numeric [1]>
4	5	<numeric [2]>
8	5	<numeric [2]>
6	5	<numeric [1]>

Tidyverse
data.table

mt_list2 |> unnest(cols = mpg) # group_by(cyl, gear) is optional

data.frame [32 x 3]

cyl	gear	mpg
6	4	21
6	4	21
6	4	19.2
6	4	17.8
4	4	22.8
4	4	24.4
4	4	22.8
4	4	32.4
4	4	30.4
4	4	33.9
4	4	27.3
4	4	21.4
6	3	21.4
6	3	18.1
8	3	18.7
[ omitted 17 entries ]

Solution 1:

MT_LIST2[, .(mpg = unlist(mpg)), by = setdiff(colnames(MT_LIST2), 'mpg')]

data.table [32 x 3]

cyl	gear	mpg
4	3	21.5
4	4	22.8
4	4	24.4
4	4	22.8
4	4	32.4
4	4	30.4
4	4	33.9
4	4	27.3
4	4	21.4
4	5	26
4	5	30.4
6	3	21.4
6	3	18.1
6	4	21
6	4	21
[ omitted 17 entries ]

Solution 2:

MT_LIST2[rep(MT_LIST2[, .I], lengths(mpg))][, mpg := unlist(MT_LIST2$mpg)][]

data.table [32 x 3]

cyl	gear	mpg
4	3	21.5
4	4	22.8
4	4	24.4
4	4	22.8
4	4	32.4
4	4	30.4
4	4	33.9
4	4	27.3
4	4	21.4
4	5	26
4	5	30.4
6	3	21.4
6	3	18.1
6	4	21
6	4	21
[ omitted 17 entries ]

4.7.2 Multiple listed columns

Data:

mt_list_mult

data.frame [8 x 4]

cyl	gear	mpg	disp
6	4	<numeric [4]>	<numeric [4]>
4	4	<numeric [8]>	<numeric [8]>
6	3	<numeric [2]>	<numeric [2]>
8	3	<numeric [12]>	<numeric [12]>
4	3	<numeric [1]>	<numeric [1]>
4	5	<numeric [2]>	<numeric [2]>
8	5	<numeric [2]>	<numeric [2]>
6	5	<numeric [1]>	<numeric [1]>

Tidyverse
data.table

mt_list_mult |> unnest(cols = c(mpg, disp)) # group_by(cyl, gear) is optional

data.frame [32 x 4]

cyl	gear	mpg	disp
6	4	21	160
6	4	21	160
6	4	19.2	167.6
6	4	17.8	167.6
4	4	22.8	108
4	4	24.4	146.7
4	4	22.8	140.8
4	4	32.4	78.7
4	4	30.4	75.7
4	4	33.9	71.1
4	4	27.3	79
4	4	21.4	121
6	3	21.4	258
6	3	18.1	225
8	3	18.7	360
[ omitted 17 entries ]

MT_LIST_MULT[, lapply(.SD, \(c) unlist(c)), by = setdiff(colnames(MT_LIST_MULT), c("mpg", "disp"))]

data.table [32 x 4]

cyl	gear	mpg	disp
4	3	21.5	120.1
4	4	22.8	108
4	4	24.4	146.7
4	4	22.8	140.8
4	4	32.4	78.7
4	4	30.4	75.7
4	4	33.9	71.1
4	4	27.3	79
4	4	21.4	121
4	5	26	120.3
4	5	30.4	95.1
6	3	21.4	258
6	3	18.1	225
6	4	21	160
6	4	21	160
[ omitted 17 entries ]

4.8 Nest / Unnest

When a column contains a data.table/data.frame (with multiple columns, structured)

4.8.1 One nested column

Nesting

Tidyverse
data.table

mtcars |> tidyr::nest(data = -cyl) # Data is inside tibbles

data.frame [3 x 2]

cyl	data
6	<tbl_df [7 x 10]>
4	<tbl_df [11 x 10]>
8	<tbl_df [14 x 10]>

Alternatives

mtcars |> nest_by(cyl) |> ungroup() # Data is stored as a vctrs list_of. nest_by() returns a rowwise() df, hence the ungroup()

Nesting while keeping the grouping variable inside the nested tables:

mtcars |> tidyr::nest(data = everything(), .by = cyl)

data.frame [3 x 2]

cyl	data
6	<tbl_df [7 x 11]>
4	<tbl_df [11 x 11]>
8	<tbl_df [14 x 11]>

MT[, .(data = .(.SD)), keyby = cyl]

data.table [3 x 2]

cyl	data
4	<data.table [11 x 10]>
6	<data.table [7 x 10]>
8	<data.table [14 x 10]>

Nesting while keeping the grouping variable inside the nested tables:

MT[, .(data = list(data.table(cyl, .SD))), keyby = cyl]

data.table [3 x 2]

cyl	data
4	<data.table [11 x 11]>
6	<data.table [7 x 11]>
8	<data.table [14 x 11]>

Unnesting

Data:

mtcars_nest <- mtcars |> tidyr::nest(data = -cyl)

MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]

Tidyverse
data.table

mtcars_nest |> unnest(cols = data) |> ungroup()

data.frame [32 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
6	21	160	110	3.9	2.62	16.46	0	1	4	4
6	21	160	110	3.9	2.875	17.02	0	1	4	4
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1
6	19.2	167.6	123	3.92	3.44	18.3	1	0	4	4
6	17.8	167.6	123	3.92	3.44	18.9	1	0	4	4
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
[ omitted 17 entries ]

MT_NEST[, rbindlist(data), keyby = cyl] # MT_NEST[, do.call(c, data), keyby = cyl]

data.table [32 x 11]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1
4	26	120.3	91	4.43	2.14	16.7	0	1	5	2
4	30.4	95.1	113	3.77	1.513	16.9	1	1	5	2
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2
6	21	160	110	3.9	2.62	16.46	0	1	4	4
6	21	160	110	3.9	2.875	17.02	0	1	4	4
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1
[ omitted 17 entries ]

4.8.2 Multiple nested columns

Nesting:

Tidyverse
data.table

(mtcars |> nest(data1 = c(mpg, hp), data2 = !c(cyl, gear, mpg, hp), .by = c(cyl, gear)) -> mt_nest_mult)

data.frame [8 x 4]

cyl	gear	data1	data2
6	4	<tbl_df [4 x 2]>	<tbl_df [4 x 7]>
4	4	<tbl_df [8 x 2]>	<tbl_df [8 x 7]>
6	3	<tbl_df [2 x 2]>	<tbl_df [2 x 7]>
8	3	<tbl_df [12 x 2]>	<tbl_df [12 x 7]>
4	3	<tbl_df [1 x 2]>	<tbl_df [1 x 7]>
4	5	<tbl_df [2 x 2]>	<tbl_df [2 x 7]>
8	5	<tbl_df [2 x 2]>	<tbl_df [2 x 7]>
6	5	<tbl_df [1 x 2]>	<tbl_df [1 x 7]>

(MT[, .(data1 = .(.SD[, .(mpg, hp)]), data2 = .(.SD[, !c("mpg", "hp")])), by = .(cyl, gear)] -> MT_NEST_MULT)

data.table [8 x 4]

cyl	gear	data1	data2
6	4	<data.table [4 x 2]>	<data.table [4 x 7]>
4	4	<data.table [8 x 2]>	<data.table [8 x 7]>
6	3	<data.table [2 x 2]>	<data.table [2 x 7]>
8	3	<data.table [12 x 2]>	<data.table [12 x 7]>
4	3	<data.table [1 x 2]>	<data.table [1 x 7]>
4	5	<data.table [2 x 2]>	<data.table [2 x 7]>
8	5	<data.table [2 x 2]>	<data.table [2 x 7]>
6	5	<data.table [1 x 2]>	<data.table [1 x 7]>

Unnesting:

Tidyverse
data.table

mt_nest_mult |> unnest(cols = c(data1, data2))

data.frame [32 x 11]

cyl	gear	mpg	hp	disp	drat	wt	qsec	vs	am	carb
6	4	21	110	160	3.9	2.62	16.46	0	1	4
6	4	21	110	160	3.9	2.875	17.02	0	1	4
6	4	19.2	123	167.6	3.92	3.44	18.3	1	0	4
6	4	17.8	123	167.6	3.92	3.44	18.9	1	0	4
4	4	22.8	93	108	3.85	2.32	18.61	1	1	1
4	4	24.4	62	146.7	3.69	3.19	20	1	0	2
4	4	22.8	95	140.8	3.92	3.15	22.9	1	0	2
4	4	32.4	66	78.7	4.08	2.2	19.47	1	1	1
4	4	30.4	52	75.7	4.93	1.615	18.52	1	1	2
4	4	33.9	65	71.1	4.22	1.835	19.9	1	1	1
4	4	27.3	66	79	4.08	1.935	18.9	1	1	1
4	4	21.4	109	121	4.11	2.78	18.6	1	1	2
6	3	21.4	110	258	3.08	3.215	19.44	1	0	1
6	3	18.1	105	225	2.76	3.46	20.22	1	0	1
8	3	18.7	175	360	3.15	3.44	17.02	0	0	2
[ omitted 17 entries ]

Using a pattern to specify the columns to unnest:

mt_nest_mult |> unnest(cols = matches("data"))

data.frame [32 x 11]

cyl	gear	mpg	hp	disp	drat	wt	qsec	vs	am	carb
6	4	21	110	160	3.9	2.62	16.46	0	1	4
6	4	21	110	160	3.9	2.875	17.02	0	1	4
6	4	19.2	123	167.6	3.92	3.44	18.3	1	0	4
6	4	17.8	123	167.6	3.92	3.44	18.9	1	0	4
4	4	22.8	93	108	3.85	2.32	18.61	1	1	1
4	4	24.4	62	146.7	3.69	3.19	20	1	0	2
4	4	22.8	95	140.8	3.92	3.15	22.9	1	0	2
4	4	32.4	66	78.7	4.08	2.2	19.47	1	1	1
4	4	30.4	52	75.7	4.93	1.615	18.52	1	1	2
4	4	33.9	65	71.1	4.22	1.835	19.9	1	1	1
4	4	27.3	66	79	4.08	1.935	18.9	1	1	1
4	4	21.4	109	121	4.11	2.78	18.6	1	1	2
6	3	21.4	110	258	3.08	3.215	19.44	1	0	1
6	3	18.1	105	225	2.76	3.46	20.22	1	0	1
8	3	18.7	175	360	3.15	3.44	17.02	0	0	2
[ omitted 17 entries ]

MT_NEST_MULT[, c(rbindlist(data1), rbindlist(data2)), keyby = .(cyl, gear)]

data.table [32 x 11]

cyl	gear	mpg	hp	disp	drat	wt	qsec	vs	am	carb
4	3	21.5	97	120.1	3.7	2.465	20.01	1	0	1
4	4	22.8	93	108	3.85	2.32	18.61	1	1	1
4	4	24.4	62	146.7	3.69	3.19	20	1	0	2
4	4	22.8	95	140.8	3.92	3.15	22.9	1	0	2
4	4	32.4	66	78.7	4.08	2.2	19.47	1	1	1
4	4	30.4	52	75.7	4.93	1.615	18.52	1	1	2
4	4	33.9	65	71.1	4.22	1.835	19.9	1	1	1
4	4	27.3	66	79	4.08	1.935	18.9	1	1	1
4	4	21.4	109	121	4.11	2.78	18.6	1	1	2
4	5	26	91	120.3	4.43	2.14	16.7	0	1	2
4	5	30.4	113	95.1	3.77	1.513	16.9	1	1	2
6	3	21.4	110	258	3.08	3.215	19.44	1	0	1
6	3	18.1	105	225	2.76	3.46	20.22	1	0	1
6	4	21	110	160	3.9	2.62	16.46	0	1	4
6	4	21	110	160	3.9	2.875	17.02	0	1	4
[ omitted 17 entries ]

Using a pattern to specify the columns to unnest:

MT_NEST_MULT[, 
  do.call(c, unname(lapply(.SD, \(c) rbindlist(c)))), .SDcols = patterns('data'), 
  keyby = .(cyl, gear)
]

data.table [32 x 11]

cyl	gear	mpg	hp	disp	drat	wt	qsec	vs	am	carb
4	3	21.5	97	120.1	3.7	2.465	20.01	1	0	1
4	4	22.8	93	108	3.85	2.32	18.61	1	1	1
4	4	24.4	62	146.7	3.69	3.19	20	1	0	2
4	4	22.8	95	140.8	3.92	3.15	22.9	1	0	2
4	4	32.4	66	78.7	4.08	2.2	19.47	1	1	1
4	4	30.4	52	75.7	4.93	1.615	18.52	1	1	2
4	4	33.9	65	71.1	4.22	1.835	19.9	1	1	1
4	4	27.3	66	79	4.08	1.935	18.9	1	1	1
4	4	21.4	109	121	4.11	2.78	18.6	1	1	2
4	5	26	91	120.3	4.43	2.14	16.7	0	1	2
4	5	30.4	113	95.1	3.77	1.513	16.9	1	1	2
6	3	21.4	110	258	3.08	3.215	19.44	1	0	1
6	3	18.1	105	225	2.76	3.46	20.22	1	0	1
6	4	21	110	160	3.9	2.62	16.46	0	1	4
6	4	21	110	160	3.9	2.875	17.02	0	1	4
[ omitted 17 entries ]

4.8.3 Operate on nested/list columns

Data:

mt_nest

data.frame [3 x 2]

cyl	data
6	<tbl_df [7 x 10]>
4	<tbl_df [11 x 10]>
8	<tbl_df [14 x 10]>

Creating a new column using the nested data:

Tidyverse
data.table

Keeping the nested column:

mt_nest |> mutate(sum = sum(unlist(data)), .by = cyl)

data.frame [3 x 3]

cyl	data	sum
6	<tbl_df [7 x 10]>	2 508.16
4	<tbl_df [11 x 10]>	2 719.233
8	<tbl_df [14 x 10]>	8 516.809

Dropping the nested column:

mt_nest |> summarize(sum = sum(unlist(data)), .by = cyl)

data.frame [3 x 2]

cyl	sum
6	2 508.16
4	2 719.233
8	8 516.809

Keeping the nested column:

copy(MT_NEST)[, sum := sapply(data, \(r) sum(r)), by = cyl][]

data.table [3 x 3]

cyl	data	sum
6	<data.table [7 x 10]>	2 508.16
4	<data.table [11 x 10]>	2 719.233
8	<data.table [14 x 10]>	8 516.809

Dropping the nested column:

MT_NEST[, .(sum = sapply(data, \(r) sum(r))), by = cyl]

data.table [3 x 2]

cyl	sum
6	2 508.16
4	2 719.233
8	8 516.809

Creating multiple new columns using the nested data:

linreg <- \(data) lm(mpg ~ hp, data = data) |> broom::tidy()

Tidyverse
data.table

mt_nest |> group_by(cyl) |> group_modify(\(d, g) linreg(unnest(d, everything()))) |> ungroup()

data.frame [6 x 6]

cyl	term	estimate	std.error	statistic	p.value
4	(Intercept)	35.983	5.201	6.918	0
4	hp	−0.113	0.061	−1.843	0.098
6	(Intercept)	20.674	3.304	6.256	0.002
6	hp	−0.008	0.027	−0.286	0.786
8	(Intercept)	18.08	2.988	6.052	0
8	hp	−0.014	0.014	−1.025	0.326

MT_NEST[, rbindlist(lapply(data, \(ndt) linreg(ndt))), keyby = cyl][]

data.table [6 x 6]

cyl	term	estimate	std.error	statistic	p.value
4	(Intercept)	35.983	5.201	6.918	0
4	hp	−0.113	0.061	−1.843	0.098
6	(Intercept)	20.674	3.304	6.256	0.002
6	hp	−0.008	0.027	−0.286	0.786
8	(Intercept)	18.08	2.988	6.052	0
8	hp	−0.014	0.014	−1.025	0.326
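When only the coefficients are needed, the same nested layout can be used without broom by turning `coef()` into a one-row list per group. This is a sketch under assumptions: `MT_NEST` is rebuilt from `mtcars`, the `copy(.SD)` is a defensive choice that is not in the post, and `COEFS` is an illustrative name:

```r
library(data.table)

MT <- as.data.table(mtcars)

# Nested table as in the post; copy() is a defensive assumption, not from the post
MT_NEST <- MT[, .(data = list(copy(.SD))), keyby = cyl]

# One row per cyl, one column per model coefficient
COEFS <- MT_NEST[
  , rbindlist(lapply(data, \(d) as.list(coef(lm(mpg ~ hp, data = d))))),
  keyby = cyl
]
COEFS
```

The result is wide (one column per term), whereas `broom::tidy()` returns one row per term.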

Operating inside the nested data:

Tidyverse
data.table

mt_nest |> 
  mutate(data = map(data, \(t) mutate(t, sum = pmap_dbl(pick(everything()), sum)))) |> 
  unnest(data)

data.frame [32 x 12]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	sum
6	21	160	110	3.9	2.62	16.46	0	1	4	4	322.98
6	21	160	110	3.9	2.875	17.02	0	1	4	4	323.795
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1	420.135
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1	379.54
6	19.2	167.6	123	3.92	3.44	18.3	1	0	4	4	344.46
6	17.8	167.6	123	3.92	3.44	18.9	1	0	4	4	343.66
6	19.7	145	175	3.62	2.77	15.5	0	1	5	6	373.59
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1	255.58
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2	266.98
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2	295.57
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1	209.85
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2	191.165
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1	202.955
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1	269.775
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1	204.215
[ omitted 17 entries ]

Alternatives

mt_nest |> 
  mutate(across(data, \(ts) map(ts, \(t) mutate(t, sum = apply(pick(everything()), 1, sum))))) |> 
  unnest(data)

Using the nplyr package

library(nplyr)

mt_nest |> 
  nplyr::nest_mutate(data, sum = apply(pick(everything()), 1, sum)) |> 
  unnest(data)

copy(MT_NEST)[, data := lapply(data, \(dt) dt[, sum := apply(.SD, 1, sum)])
            ][, rbindlist(data), keyby = cyl]

data.table [32 x 12]

cyl	mpg	disp	hp	drat	wt	qsec	vs	am	gear	carb	sum
4	22.8	108	93	3.85	2.32	18.61	1	1	4	1	255.58
4	24.4	146.7	62	3.69	3.19	20	1	0	4	2	266.98
4	22.8	140.8	95	3.92	3.15	22.9	1	0	4	2	295.57
4	32.4	78.7	66	4.08	2.2	19.47	1	1	4	1	209.85
4	30.4	75.7	52	4.93	1.615	18.52	1	1	4	2	191.165
4	33.9	71.1	65	4.22	1.835	19.9	1	1	4	1	202.955
4	21.5	120.1	97	3.7	2.465	20.01	1	0	3	1	269.775
4	27.3	79	66	4.08	1.935	18.9	1	1	4	1	204.215
4	26	120.3	91	4.43	2.14	16.7	0	1	5	2	268.57
4	30.4	95.1	113	3.77	1.513	16.9	1	1	5	2	269.683
4	21.4	121	109	4.11	2.78	18.6	1	1	4	2	284.89
6	21	160	110	3.9	2.62	16.46	0	1	4	4	322.98
6	21	160	110	3.9	2.875	17.02	0	1	4	4	323.795
6	21.4	258	110	3.08	3.215	19.44	1	0	3	1	420.135
6	18.1	225	105	2.76	3.46	20.22	1	0	3	1	379.54
[ omitted 17 entries ]

4.9 Rotate / Transpose

(MT_SUMMARY <- MT[, tidy(summary(mpg)), by = cyl])

data.table [3 x 7]

cyl	minimum	q1	median	mean	q3	maximum
6	17.8	18.65	19.7	19.743	21	21.4
4	21.4	22.8	26	26.664	30.4	33.9
8	10.4	14.4	15.2	15.1	16.25	19.2

Using pivots:

Tidyverse
data.table

MT_SUMMARY |> 
  pivot_longer(!cyl, names_to = "Statistic") |> 
  pivot_wider(id_cols = "Statistic", names_from = "cyl", names_prefix = "Cyl ")

data.frame [6 x 4]

Statistic	Cyl 6	Cyl 4	Cyl 8
minimum	17.8	21.4	10.4
q1	18.65	22.8	14.4
median	19.7	26	15.2
mean	19.743	26.664	15.1
q3	21	30.4	16.25
maximum	21.4	33.9	19.2

MT_SUMMARY |> 
  melt(id.vars = "cyl", variable.name = "Statistic") |> 
  dcast(Statistic ~ paste0("Cyl ", cyl))

data.table [6 x 4]

Statistic	Cyl 4	Cyl 6	Cyl 8
minimum	21.4	17.8	10.4
q1	22.8	18.65	14.4
median	26	19.7	15.2
mean	26.664	19.743	15.1
q3	30.4	21	16.25
maximum	33.9	21.4	19.2

With dedicated functions:

Tidyverse
data.table

# No function exists to do this AFAIK

data.table::transpose(MT_SUMMARY, keep.names = "Statistic", make.names = 1)

data.table [6 x 4]

Statistic	6	4	8
minimum	17.8	21.4	10.4
q1	18.65	22.8	14.4
median	19.7	26	15.2
mean	19.743	26.664	15.1
q3	21	30.4	16.25
maximum	21.4	33.9	19.2

💻 Expand for Session Info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       Ubuntu 22.04.3 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  C.UTF-8
 ctype    C.UTF-8
 tz       Europe/Paris
 date     2024-02-07
 pandoc   3.1.11
 Quarto   1.5.9

─ Packages ───────────────────────────────────────────────────────────────────
 ! package    * version date (UTC) lib source
 P broom      * 1.0.5   2023-06-09 [?] CRAN (R 4.3.0)
 P crayon     * 1.5.2   2022-09-29 [?] CRAN (R 4.3.0)
 P data.table * 1.15.0  2024-01-30 [?] CRAN (R 4.3.1)
 P dplyr      * 1.1.4   2023-11-17 [?] CRAN (R 4.3.1)
 P ggplot2    * 3.4.4   2023-10-12 [?] CRAN (R 4.3.1)
 P gt         * 0.10.0  2023-10-07 [?] CRAN (R 4.3.1)
 P here       * 1.0.1   2020-12-13 [?] CRAN (R 4.3.0)
 P knitr      * 1.44    2023-09-11 [?] CRAN (R 4.3.0)
 P lubridate  * 1.9.3   2023-09-27 [?] CRAN (R 4.3.1)
 P pipebind   * 0.1.2   2023-08-30 [?] CRAN (R 4.3.0)
 P purrr      * 1.0.2   2023-08-10 [?] CRAN (R 4.3.0)
 P stringr    * 1.5.0   2022-12-02 [?] CRAN (R 4.3.0)
 P tibble     * 3.2.1   2023-03-20 [?] CRAN (R 4.3.0)
 P tidyr      * 1.3.0   2023-01-24 [?] CRAN (R 4.3.0)

 [1] /home/mar/Dev/Projects/R/ma-riviere.com/renv/library/R-4.3/x86_64-pc-linux-gnu
 [2] /home/mar/.cache/R/renv/sandbox/R-4.3/x86_64-pc-linux-gnu/9a444a72

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────

Citation

BibTeX citation:

@online{rivière2022,
  author = {Rivière, Marc-Aurèle},
  title = {Data Wrangling with Data.table and the {Tidyverse}},
  date = {2022-05-19},
  url = {https://ma-riviere.com/content/code/posts/data.table},
  langid = {en},
  abstract = {This post showcases various ways to accomplish most data
    wrangling operations, from basic filtering/mutating to pivots and
    non-equi joins, with both `data.table` and the Tidyverse (`dplyr`,
    `tidyr`, `purrr`, `stringr`).}
}

For attribution, please cite this work as:

Rivière, M.-A. (2022, May 19). Data wrangling with data.table and the Tidyverse. https://ma-riviere.com/content/code/posts/data.table