Data wrangling with data.table and the Tidyverse

Common data wrangling operations with both data.table and the Tidyverse.

R
Tidyverse
data.table
First Published

May 19, 2022

Summary
This post showcases various ways to accomplish most data wrangling operations, from basic filtering/mutating to pivots and non-equi joins, with both data.table and the Tidyverse (dplyr, tidyr, purrr, stringr).

v1: 2022-05-19

v2: 2022-05-26

  • Improved the section on keys (for ordering & filtering)
  • Adding a section for translations of Tidyr (and other similar packages)
  • Capping tables to display 15 rows max when unfolded
  • Improving table display (stripping, hiding the contents of nested columns, …)

v3: 2022-07-20

  • Updating data.table’s examples of dynamic programming using env
  • Added new entries in processing examples
  • Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
  • Added pivot_wider examples to match the dcast ones in the Pivots section
  • Added some new examples here and there across the Basic Operations section
  • Added an entry for operating inside nested data.frames/data.tables
  • Added a processing example for run-length encoding (i.e. successive event tagging)

v4: 2022-08-05

  • Improved pivot section: example of one-hot encoding (and reverse operation) + better examples of partial pivots with .value
  • Added tidyr::uncount() (row duplication) example.
  • Improved both light & dark themes (code highlight, tables, …)

v5: 2023-03-12

  • Revamped the whole document with grouped tabsets by framework for better readability
  • Revamped the whole Basic Operations section: better structure, reworked examples, …
  • Revamped the whole Joins section: better structure, new examples (e.g. join_by), better explanations, …
  • Updated code to reflect recent updates of the Tidyverse:
    • dplyr (1.1.0): .by, reframe, join_by, consecutive_id, …
    • purrr (1.0.0): list_rbind, list_cbind, …
    • tidyr (1.3.0): updated the separate/separate_rows section to the newer separate_wider/longer_*
  • Updated code to reflect recent updates of data.table (1.14.9): let, DT(), …

Setup


library(here)        # Working directory management
library(pipebind)    # Piping goodies

library(data.table)  # Fast data manipulation (in-RAM)

library(tibble)      # Extending data.frames             (Tidyverse)
library(dplyr)       # Manipulating data.frames - core   (Tidyverse)
library(tidyr)       # Manipulating data.frames - extras (Tidyverse)
library(stringr)     # Manipulating strings              (Tidyverse)
library(purrr)       # Manipulating lists                (Tidyverse)
library(lubridate)   # Manipulating date/time            (Tidyverse)

library(broom)       # Tidying up models output          (Tidymodels)

data.table::setDTthreads(parallel::detectCores(logical = FALSE))
Applying a custom theme to all gt tables
#-----------------------#
####🔺gt knit_prints ####
#-----------------------#

library(knitr)
library(gt)

knit_print.grouped_df <- function(x, options, ...) {
  if ("grouped_df" %in% class(x)) x <- ungroup(x)
  
  cl <- intersect(class(x), c("data.table", "data.frame"))[1]
  nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
  is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
  
  cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
  cat("<summary>\n")
  cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
  cat("</summary>\n<br>\n")
  print(gt::as_raw_html(style_table(x, nrows)))
  cat("</details>\n\n")
}

registerS3method("knit_print", "grouped_df", knit_print.grouped_df)

knit_print.data.frame <- function(x, options, ...) {
  cl <- intersect(class(x), c("data.table", "data.frame"))[1]
  nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
  is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
  
  cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
  cat("<summary>\n")
  cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
  cat("</summary>\n<br>\n")
  print(gt::as_raw_html(style_table(x, nrows)))
  cat("</details>\n\n")
}

registerS3method("knit_print", "data.frame", knit_print.data.frame)

1 Basic Operations


Tipdata.table general syntax:

DT[row selector (filter/sort), col selector (select/mutate/summarize/reframe/rename), modifiers (group/join by)]

Data

MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]

1.1 Arrange / Order

1.1.1 Basic ordering

mtcars |> arrange(desc(cyl))
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21 6 160 110 3.9 2.62 16.46 0 1 4 4
[ omitted 17 entries ]
mtcars |> arrange(desc(cyl), gear)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
[ omitted 17 entries ]
MT[order(-cyl)]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21 6 160 110 3.9 2.62 16.46 0 1 4 4
[ omitted 17 entries ]
MT[order(-cyl, gear)]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
15.2 8 304 150 3.15 3.435 17.3 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
15 8 301 335 3.54 3.57 14.6 0 1 5 8
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
[ omitted 17 entries ]
Alternatives
MT[fsort(cyl, decreasing = TRUE)]

setorder(MT, -cyl, gear)[]

setorderv(MT, c("cyl", "gear"), c(-1 ,1))[]

Ordering on a character column

IRIS[chorder(Species)]
data.table [150 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 135 entries ]

1.1.2 Ordering with keys

  • Keys physically reorders the dataset within the RAM (by reference)
    • No memory is used for sorting (other than marking which columns is the key)
  • The dataset is marked with an attribute “sorted”
  • The dataset is always sorted in ascending order, with NA first
  • Using keyby instead of by when grouping will set the grouping factors as keys
Tip

See this SO post for more information on keys.

setkey(MT, cyl, gear)

setkeyv(MT, c("cyl", "gear"))

MT
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
[ omitted 17 entries ]

To see over which keys (if any) the dataset is currently ordered:

haskey(MT)

[1] TRUE

key(MT)

[1] “cyl” “gear”

Warning

Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.

1.1.3 Ordering with (secondary) indices

  • setindex creates an index for the provided columns, but doesn’t physically reorder the dataset in RAM.
  • It computes the ordering vector of the dataset’s rows according to the provided columns in an additional attribute called index
setindex(MT, cyl, gear)

setindexv(MT, c("cyl", "gear"))

MT
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

We can see the additional index attribute added to the data.table:

[1] "names"             "row.names"         "class"            
[4] ".internal.selfref" "index"            

We can get the currently used indices with:

[1] “cyl__gear”

Adding a new index doesn’t remove a previously existing one:

setindex(MT, hp)

indices(MT)

[1] “cyl__gear” “hp”

We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will be using to group or subset by frequently !

1.2 Subset / Filter

1.2.1 Basic filtering

mtcars |> filter(cyl >= 6 & disp < 180)
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
iris |> filter(Species %in% c("setosa"))
data.frame [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]
MT[cyl >= 6 & disp < 180]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
IRIS[Species %chin% c("setosa")]
data.table [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]

For non-regex character filtering, use %chin% (which is a character-optimized version of %in%)

1.2.2 Filter based on a range

mtcars |> filter(between(disp, 200, 300))
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
MT[disp %between% c(200, 300)]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3

1.2.3 Filter with a pattern

mtcars |> filter(str_detect(disp, "^\\d{3}\\."))
data.frame [9 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
MT[disp %like% "^\\d{3}\\."]
data.table [9 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
Variants
IRIS[Species %flike% "set"] # Fixed (not regex)

IRIS[Species %ilike% "Set"] # Ignore case

IRIS[Species %plike% "(?=set)"] # Perl-like regex

1.2.4 Filter on row number (slicing)

mtcars |> slice(1) # slice_head(n = 1)
data.frame [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
mtcars |> slice(n()) # slice_tail(n = 1)
data.frame [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

Slice a random sample of rows:

mtcars |> slice_sample(n = 5)
data.frame [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
MT[1]
data.table [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
MT[.N]
data.table [1 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2

Slice a random sample of rows:

MT[sample(.N, 5)]
data.table [5 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
21 6 160 110 3.9 2.875 17.02 0 1 4 4

1.2.5 Filter distinct/unique rows

mtcars |> distinct(mpg, hp, .keep_all = TRUE)
data.frame [31 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
[ omitted 16 entries ]

Number of unique rows/values

n_distinct(mtcars$gear)

[1] 3

unique(MT, by = c("mpg", "hp")) # cols = other_cols_to_keep
data.table [31 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
10.4 8 460 215 3 5.424 17.82 0 0 3 4
[ omitted 16 entries ]

Number of unique rows/values

uniqueN(MT, by = "gear")

[1] 3

1.2.6 Filter by keys

When keys or indices are defined, we can filter based on them, which is often a lot faster.

Tip

We do not even need to specify the column name we are filtering on: the values will be attributed to the keys in order.

setkey(MT, cyl)

MT[.(6)] # Equivalent to MT[cyl == 6]
data.table [7 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
setkey(MT, cyl, gear)

MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]
data.table [4 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4

1.2.7 Filter by indices

To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn’t already exist).

IRIS["setosa", on = "Species"]
data.table [50 x 5]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
[ omitted 35 entries ]

Since the time to compute the secondary indices is quite small, we don’t have to use setindex, unless the task involves repeated subsetting on the same columns.

Tip

When using on with multiple values, the nomatch = NULL argument avoids creating combinations that do not exist in the original data (i.e. for cyl == 5 here)

MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]
data.table [12 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
27.3 4 79 66 4.08 1.935 18.9 1 1 4 1
21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4

1.2.8 Filtering on multiple columns

Filtering with one function taking multiple columns:

f_dat <- \(d) with(d, gear > cyl) # Function taking the data and comparing fix columns

f_dyn <- \(x, y) x > y # Function taking dynamic columns and comparing them
cols <- c("gear", "cyl")

Manually:

mtcars |> filter(f_dyn(gear, cyl))
data.frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Dynamically:

Taking column names:

mtcars |> filter(f_dyn(!!!syms(cols)))
data.frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Taking the data:

mtcars |> filter(f_dat(cur_data()))
data.frame [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Manually:

MT[f_dyn(gear, cyl),]
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Dynamically:

Taking column names:

MT[do.call(f_dyn, args), env = list(args = as.list(cols))] # exec(f_dyn, !!!args)
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Taking the data:

MT[f_dat(MT),] # Can't use .SD in i
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

In two steps:

NoteWe can’t use .SD in the i clause of a data.table

But we can bypass that constraint by doing the operation in two steps:
- Obtaining a vector stating if each row of the table matches or not the conditions
- Filtering the original table based on the vector

MT[MT[, f_dat(.SD)]]
data.table [2 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
26 4 120.3 91 4.43 2.14 16.7 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

Combining multiple filtering functions:

This function filters rows that have 2 or more non-zero decimals, and we’re going to call it on multiple columns:

decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 2
cols <- c("drat", "wt", "qsec")

Manually:

mtcars |> filter(decp(drat) & decp(wt) & decp(qsec))
data.frame [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

Dynamically:

mtcars |> filter(if_all(cols, decp))
data.frame [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

Manually:

MT[decp(drat) & decp(wt) & decp(qsec), ]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

Dynamically:

MT[Reduce(`&`, lapply(mget(cols), decp)), ]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
Alternatives
MT[Reduce(`&`, lapply(MT[, ..cols], decp)), ]

MT[Reduce(`&`, lapply(v1, decp)), env = list(v1 = as.list(cols))]

In two steps:

MT[MT[, Reduce(`&`, lapply(.SD, decp)), .SDcols = cols]]
data.table [13 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
15.5 8 318 150 2.76 3.52 16.87 0 0 3 2
13.3 8 350 245 3.73 3.84 15.41 0 0 3 4
19.2 8 400 175 3.08 3.845 17.05 0 0 3 2

1.3 Rename

Note

setnames changes column names in-place

Manually:

mtcars |> rename(CYL = cyl, MPG = mpg)
data.frame [32 x 11]
MPG CYL disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Dynamically:

mtcars |> rename_with(\(c) toupper(c), .cols = matches("^d"))
data.frame [32 x 11]
mpg cyl DISP hp DRAT wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Manually:

setnames(copy(MT), c("cyl", "mpg"), c("CYL", "MPG"))[]
data.table [32 x 11]
MPG CYL disp hp drat wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Dynamically:

setnames(copy(MT), grep("^d", colnames(MT)), toupper)[]
data.table [32 x 11]
mpg cyl DISP hp DRAT wt qsec vs am gear carb
21 6 160 110 3.9 2.62 16.46 0 1 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

1.4 Select

1.4.1 Basic selection

MT |> select(matches("cyl|disp"))
data.table [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]



Remove a column:

mtcars |> select(!cyl) # select(-cyl)
data.frame [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
MT[, .(mpg, disp)]
data.table [32 x 2]
mpg disp
21 160
21 160
22.8 108
21.4 258
18.7 360
18.1 225
14.3 360
24.4 146.7
22.8 140.8
19.2 167.6
17.8 167.6
16.4 275.8
17.3 275.8
15.2 275.8
10.4 472
[ omitted 17 entries ]
Alternatives
MT[ , .SD, .SDcols = c("mpg", "disp")]

MT[, .SD, .SDcols = patterns("mpg|disp")]

Remove a column:

MT[, !"cyl"] # MT[, -"cyl"]
data.table [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

In-place:

copy(MT)[, cyl := NULL][]
data.table [32 x 10]
mpg disp hp drat wt qsec vs am gear carb
21 160 110 3.9 2.62 16.46 0 1 4 4
21 160 110 3.9 2.875 17.02 0 1 4 4
22.8 108 93 3.85 2.32 18.61 1 1 4 1
21.4 258 110 3.08 3.215 19.44 1 0 3 1
18.7 360 175 3.15 3.44 17.02 0 0 3 2
18.1 225 105 2.76 3.46 20.22 1 0 3 1
14.3 360 245 3.21 3.57 15.84 0 0 3 4
24.4 146.7 62 3.69 3.19 20 1 0 4 2
22.8 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 275.8 180 3.07 3.78 18 0 0 3 3
10.4 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

Select & Extract:

mtcars |> pull(disp)
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

Select & Rename:

mtcars |> select(dispp = disp)
data.frame [32 x 1]
dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]

Select & Extract:

MT[, disp]
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

Select & Rename:

MT[, .(dispp = disp)]
data.table [32 x 1]
dispp
160
160
108
258
360
225
360
146.7
140.8
167.6
167.6
275.8
275.8
275.8
472
[ omitted 17 entries ]

1.4.2 Dynamic selection

1.4.2.1 By name

cols <- c("cyl", "disp")
mtcars |> select(all_of(cols)) # select(!!cols)
data.frame [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]



Removing a column:

mtcars |> select(!{{cols}}) # select(-matches(cols))
data.frame [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
MT[, ..cols]
data.table [32 x 2]
cyl disp
6 160
6 160
4 108
6 258
8 360
6 225
8 360
4 146.7
4 140.8
6 167.6
6 167.6
8 275.8
8 275.8
8 275.8
8 472
[ omitted 17 entries ]
Alternatives
MT[, mget(cols)] # Retired

MT[, cols, with = FALSE] # Retired

MT[, .SD, .SDcols = cols]

MT[, j, env = list(j = as.list(cols))]

Removing a column:

MT[, !..cols]
data.table [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
Alternatives
MT[, .SD, .SDcols = !cols]

MT[, -j, env = list(j = I(cols))]

In-place:

copy(MT)[, (cols) := NULL][]
data.table [32 x 9]
mpg hp drat wt qsec vs am gear carb
21 110 3.9 2.62 16.46 0 1 4 4
21 110 3.9 2.875 17.02 0 1 4 4
22.8 93 3.85 2.32 18.61 1 1 4 1
21.4 110 3.08 3.215 19.44 1 0 3 1
18.7 175 3.15 3.44 17.02 0 0 3 2
18.1 105 2.76 3.46 20.22 1 0 3 1
14.3 245 3.21 3.57 15.84 0 0 3 4
24.4 62 3.69 3.19 20 1 0 4 2
22.8 95 3.92 3.15 22.9 1 0 4 2
19.2 123 3.92 3.44 18.3 1 0 4 4
17.8 123 3.92 3.44 18.9 1 0 4 4
16.4 180 3.07 4.07 17.4 0 0 3 3
17.3 180 3.07 3.73 17.6 0 0 3 3
15.2 180 3.07 3.78 18 0 0 3 3
10.4 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]

1.4.2.2 By pattern

mtcars |> select(-matches("^d"))
data.frame [32 x 9]
mpg cyl hp wt qsec vs am gear carb
21 6 110 2.62 16.46 0 1 4 4
21 6 110 2.875 17.02 0 1 4 4
22.8 4 93 2.32 18.61 1 1 4 1
21.4 6 110 3.215 19.44 1 0 3 1
18.7 8 175 3.44 17.02 0 0 3 2
18.1 6 105 3.46 20.22 1 0 3 1
14.3 8 245 3.57 15.84 0 0 3 4
24.4 4 62 3.19 20 1 0 4 2
22.8 4 95 3.15 22.9 1 0 4 2
19.2 6 123 3.44 18.3 1 0 4 4
17.8 6 123 3.44 18.9 1 0 4 4
16.4 8 180 4.07 17.4 0 0 3 3
17.3 8 180 3.73 17.6 0 0 3 3
15.2 8 180 3.78 18 0 0 3 3
10.4 8 205 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
mtcars |> select(where(\(x) all(x != 0))) # Only keep columns where no value == 0
data.frame [32 x 9]
mpg cyl disp hp drat wt qsec gear carb
21 6 160 110 3.9 2.62 16.46 4 4
21 6 160 110 3.9 2.875 17.02 4 4
22.8 4 108 93 3.85 2.32 18.61 4 1
21.4 6 258 110 3.08 3.215 19.44 3 1
18.7 8 360 175 3.15 3.44 17.02 3 2
18.1 6 225 105 2.76 3.46 20.22 3 1
14.3 8 360 245 3.21 3.57 15.84 3 4
24.4 4 146.7 62 3.69 3.19 20 4 2
22.8 4 140.8 95 3.92 3.15 22.9 4 2
19.2 6 167.6 123 3.92 3.44 18.3 4 4
17.8 6 167.6 123 3.92 3.44 18.9 4 4
16.4 8 275.8 180 3.07 4.07 17.4 3 3
17.3 8 275.8 180 3.07 3.73 17.6 3 3
15.2 8 275.8 180 3.07 3.78 18 3 3
10.4 8 472 205 2.93 5.25 17.98 3 4
[ omitted 17 entries ]
MT[, .SD, .SDcols = !patterns("^d")]
data.table [32 x 9]
mpg cyl hp wt qsec vs am gear carb
21 6 110 2.62 16.46 0 1 4 4
21 6 110 2.875 17.02 0 1 4 4
22.8 4 93 2.32 18.61 1 1 4 1
21.4 6 110 3.215 19.44 1 0 3 1
18.7 8 175 3.44 17.02 0 0 3 2
18.1 6 105 3.46 20.22 1 0 3 1
14.3 8 245 3.57 15.84 0 0 3 4
24.4 4 62 3.19 20 1 0 4 2
22.8 4 95 3.15 22.9 1 0 4 2
19.2 6 123 3.44 18.3 1 0 4 4
17.8 6 123 3.44 18.9 1 0 4 4
16.4 8 180 4.07 17.4 0 0 3 3
17.3 8 180 3.73 17.6 0 0 3 3
15.2 8 180 3.78 18 0 0 3 3
10.4 8 205 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
MT[, .SD, .SDcols = \(x) all(x != 0)] # Only keep columns where no value == 0
data.table [32 x 9]
mpg cyl disp hp drat wt qsec gear carb
21 6 160 110 3.9 2.62 16.46 4 4
21 6 160 110 3.9 2.875 17.02 4 4
22.8 4 108 93 3.85 2.32 18.61 4 1
21.4 6 258 110 3.08 3.215 19.44 3 1
18.7 8 360 175 3.15 3.44 17.02 3 2
18.1 6 225 105 2.76 3.46 20.22 3 1
14.3 8 360 245 3.21 3.57 15.84 3 4
24.4 4 146.7 62 3.69 3.19 20 4 2
22.8 4 140.8 95 3.92 3.15 22.9 4 2
19.2 6 167.6 123 3.92 3.44 18.3 4 4
17.8 6 167.6 123 3.92 3.44 18.9 4 4
16.4 8 275.8 180 3.07 4.07 17.4 3 3
17.3 8 275.8 180 3.07 3.73 17.6 3 3
15.2 8 275.8 180 3.07 3.78 18 3 3
10.4 8 472 205 2.93 5.25 17.98 3 4
[ omitted 17 entries ]
Alternatives
copy(MT)[, grep("^d", colnames(MT)) := NULL][] # In place (column deletion)

MT[, MT[, sapply(.SD, \(x) all(x != 0))], with = FALSE]

1.4.2.3 By column type

iris |> select(where(\(x) !is.numeric(x)))
data.frame [150 x 1]
Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]
IRIS[, .SD, .SDcols = !is.numeric]
data.table [150 x 1]
Species
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
[ omitted 135 entries ]

1.5 Mutate / Transmute

data.table can mutate in 2 ways:
- Using = creates a new DT with the new columns only (like dplyr::transmute)
- Using := (or let) modifies the current dt in place (like dplyr::mutate)

The function modifying a column should be the same size as the original column (or group).
If only one value is provided with :=, it will be recycled to the whole column/group.

If the number of values provided is smaller than the original column/group:
- With := or let, an error will be raised, asking to manually specify how to recycle the values.
- With =, it will behave like dplyr::summarize (if a grouping has been specified).

1.5.1 Basic transmute

Only keeping the transformed columns.

mtcars |> transmute(cyl = cyl * 2)
data.frame [32 x 1]
cyl
12
12
8
12
16
12
16
8
8
12
12
16
16
16
16
[ omitted 17 entries ]
MT[, .(cyl = cyl * 2)]
data.table [32 x 1]
cyl
12
12
8
12
16
12
16
8
8
12
12
16
16
16
16
[ omitted 17 entries ]

Transmute & Extract:

MT[, (cyl = cyl * 2)]
 [1] 12 12  8 12 16 12 16  8  8 12 12 16 16 16 16 16 16  8  8  8  8 16 16 16 16
[26]  8  8  8 16 12 16  8

1.5.2 Basic mutate

Modifies the transformed column in-place and keeps every other column as-is.

mtcars |> mutate(cyl = 200)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 4 4
21 200 160 110 3.9 2.875 17.02 0 1 4 4
22.8 200 108 93 3.85 2.32 18.61 1 1 4 1
21.4 200 258 110 3.08 3.215 19.44 1 0 3 1
18.7 200 360 175 3.15 3.44 17.02 0 0 3 2
18.1 200 225 105 2.76 3.46 20.22 1 0 3 1
14.3 200 360 245 3.21 3.57 15.84 0 0 3 4
24.4 200 146.7 62 3.69 3.19 20 1 0 4 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 200 275.8 180 3.07 3.78 18 0 0 3 3
10.4 200 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
mtcars |> mutate(cyl = 200, gear = 5)
data.frame [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 5 4
21 200 160 110 3.9 2.875 17.02 0 1 5 4
22.8 200 108 93 3.85 2.32 18.61 1 1 5 1
21.4 200 258 110 3.08 3.215 19.44 1 0 5 1
18.7 200 360 175 3.15 3.44 17.02 0 0 5 2
18.1 200 225 105 2.76 3.46 20.22 1 0 5 1
14.3 200 360 245 3.21 3.57 15.84 0 0 5 4
24.4 200 146.7 62 3.69 3.19 20 1 0 5 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 5 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 5 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 5 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 5 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 5 3
15.2 200 275.8 180 3.07 3.78 18 0 0 5 3
10.4 200 472 205 2.93 5.25 17.98 0 0 5 4
[ omitted 17 entries ]


mtcars |> mutate(mean_cyl = mean(cyl, na.rm = TRUE))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_cyl
21 6 160 110 3.9 2.62 16.46 0 1 4 4 6.188
21 6 160 110 3.9 2.875 17.02 0 1 4 4 6.188
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 6.188
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.188
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 6.188
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6.188
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 6.188
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 6.188
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 6.188
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6.188
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 6.188
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 6.188
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6.188
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 6.188
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 6.188
[ omitted 17 entries ]
mtcars |> mutate(gear_plus = lead(gear))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb gear_plus
21 6 160 110 3.9 2.62 16.46 0 1 4 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 3
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3
[ omitted 17 entries ]
copy(MT)[, cyl := 200][]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 4 4
21 200 160 110 3.9 2.875 17.02 0 1 4 4
22.8 200 108 93 3.85 2.32 18.61 1 1 4 1
21.4 200 258 110 3.08 3.215 19.44 1 0 3 1
18.7 200 360 175 3.15 3.44 17.02 0 0 3 2
18.1 200 225 105 2.76 3.46 20.22 1 0 3 1
14.3 200 360 245 3.21 3.57 15.84 0 0 3 4
24.4 200 146.7 62 3.69 3.19 20 1 0 4 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 4 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 4 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 4 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 3 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 3 3
15.2 200 275.8 180 3.07 3.78 18 0 0 3 3
10.4 200 472 205 2.93 5.25 17.98 0 0 3 4
[ omitted 17 entries ]
copy(MT)[, let(cyl = 200, gear = 5)][]
data.table [32 x 11]
mpg cyl disp hp drat wt qsec vs am gear carb
21 200 160 110 3.9 2.62 16.46 0 1 5 4
21 200 160 110 3.9 2.875 17.02 0 1 5 4
22.8 200 108 93 3.85 2.32 18.61 1 1 5 1
21.4 200 258 110 3.08 3.215 19.44 1 0 5 1
18.7 200 360 175 3.15 3.44 17.02 0 0 5 2
18.1 200 225 105 2.76 3.46 20.22 1 0 5 1
14.3 200 360 245 3.21 3.57 15.84 0 0 5 4
24.4 200 146.7 62 3.69 3.19 20 1 0 5 2
22.8 200 140.8 95 3.92 3.15 22.9 1 0 5 2
19.2 200 167.6 123 3.92 3.44 18.3 1 0 5 4
17.8 200 167.6 123 3.92 3.44 18.9 1 0 5 4
16.4 200 275.8 180 3.07 4.07 17.4 0 0 5 3
17.3 200 275.8 180 3.07 3.73 17.6 0 0 5 3
15.2 200 275.8 180 3.07 3.78 18 0 0 5 3
10.4 200 472 205 2.93 5.25 17.98 0 0 5 4
[ omitted 17 entries ]
Alternatives
copy(MT)[, `:=`(cyl = 200, gear = 5)][]

copy(MT)[, c("cyl", "gear") := .(200, 5)][]
copy(MT)[, mean_cyl := mean(cyl, na.rm = TRUE)][]
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_cyl
21 6 160 110 3.9 2.62 16.46 0 1 4 4 6.188
21 6 160 110 3.9 2.875 17.02 0 1 4 4 6.188
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 6.188
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 6.188
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 6.188
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 6.188
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 6.188
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 6.188
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 6.188
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 6.188
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 6.188
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 6.188
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 6.188
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 6.188
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 6.188
[ omitted 17 entries ]
copy(MT)[, gearplus := shift(gear, 1, type = "lead")][] # lead, lag, cyclic
data.table [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb gearplus
21 6 160 110 3.9 2.62 16.46 0 1 4 4 4
21 6 160 110 3.9 2.875 17.02 0 1 4 4 4
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 3
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 3
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 4
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 4
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 4
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 4
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 3
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 3
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 3
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 3
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 3
[ omitted 17 entries ]

1.5.3 Dynamic trans/mutate

LHS <- "mean_mpg"
RHS <- "mpg"
mtcars |> mutate({{LHS}} := mean(mpg))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
mtcars |> mutate("{LHS}" := mean(.data[[RHS]]))
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec vs am gear carb mean_mpg
21 6 160 110 3.9 2.62 16.46 0 1 4 4 20.091
21 6 160 110 3.9 2.875 17.02 0 1 4 4 20.091
22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 20.091
21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 20.091
18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 20.091
18.1 6 225 105 2.76 3.46 20.22 1 0 3 1 20.091
14.3 8 360 245 3.21 3.57 15.84 0 0 3 4 20.091
24.4 4 146.7 62 3.69 3.19 20 1 0 4 2 20.091
22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2 20.091
19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4 20.091
17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4 20.091
16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3 20.091
17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3 20.091
15.2 8 275.8 180 3.07 3.78 18 0 0 3 3 20.091
10.4 8 472 205 2.93 5.25 17.98 0 0 3 4 20.091
[ omitted 17 entries ]
mtcars |> mutate({{LHS}} := cur_data()[[RHS]] |> mean())
data.frame [32 x 12]
mpg cyl disp hp drat wt qsec