library(here) # Working directory management
library(pipebind) # Piping goodies
library(data.table) # Fast data manipulation (in-RAM)
library(tibble) # Extending data.frames (Tidyverse)
library(dplyr) # Manipulating data.frames - core (Tidyverse)
library(tidyr) # Manipulating data.frames - extras (Tidyverse)
library(stringr) # Manipulating strings (Tidyverse)
library(purrr) # Manipulating lists (Tidyverse)
library(lubridate) # Manipulating date/time (Tidyverse)
library(broom) # Tidying up models output (Tidymodels)
data.table::setDTthreads(parallel::detectCores(logical = FALSE))v1: 2022-05-19
v2: 2022-05-26
- Improved the section on keys (for ordering & filtering)
- Adding a section for translations of
Tidyr(and other similar packages)
- Capping tables to display 15 rows max when unfolded
- Improving table display (stripping, hiding the contents of nested columns, …)
v3: 2022-07-20
- Updating
data.table’s examples of dynamic programming usingenv
- Added new entries in processing examples
- Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
- Added
pivot_widerexamples to match thedcastones in the Pivots section
- Added some new examples here and there across the Basic Operations section
- Added an entry for operating inside nested data.frames/data.tables
- Added a processing example for run-length encoding (i.e. successive event tagging)
v4: 2022-08-05
- Improved
pivotsection: example of one-hot encoding (and reverse operation) + better examples of partial pivots with.value
- Added
tidyr::uncount()(row duplication) example.
- Improved both light & dark themes (code highlight, tables, …)
v5: 2023-03-12
- Revamped the whole document with grouped tabsets by framework for better readability
- Revamped the whole Basic Operations section: better structure, reworked examples, …
- Revamped the whole Joins section: better structure, new examples (e.g.
join_by), better explanations, …
- Updated code to reflect recent updates of the
Tidyverse:-
dplyr(1.1.0):.by,reframe,join_by,consecutive_id, …
-
purrr(1.0.0):list_rbind,list_cbind, …
-
tidyr(1.3.0): updated theseparate/separate_rowssection to the newerseparate_wider/longer_*
-
- Updated code to reflect recent updates of
data.table(1.14.9):let,DT(), …
Setup
Applying a custom theme to all gt tables
#-----------------------#
####🔺gt knit_prints ####
#-----------------------#
library(knitr)
library(gt)
knit_print.grouped_df <- function(x, options, ...) {
if ("grouped_df" %in% class(x)) x <- ungroup(x)
cl <- intersect(class(x), c("data.table", "data.frame"))[1]
nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
cat("<summary>\n")
cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
cat("</summary>\n<br>\n")
print(gt::as_raw_html(style_table(x, nrows)))
cat("</details>\n\n")
}
registerS3method("knit_print", "grouped_df", knit_print.grouped_df)
knit_print.data.frame <- function(x, options, ...) {
cl <- intersect(class(x), c("data.table", "data.frame"))[1]
nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
cat("<summary>\n")
cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
cat("</summary>\n<br>\n")
print(gt::as_raw_html(style_table(x, nrows)))
cat("</details>\n\n")
}
registerS3method("knit_print", "data.frame", knit_print.data.frame)1 Basic Operations
data.table general syntax:
DT[row selector (filter/sort), col selector (select/mutate/summarize/reframe/rename), modifiers (group/join by)]
Data
MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]1.1 Arrange / Order
1.1.1 Basic ordering
1.1.2 Ordering with keys
- Keys physically reorders the dataset within the RAM (by reference)
- No memory is used for sorting (other than marking which columns is the key)
- No memory is used for sorting (other than marking which columns is the key)
- The dataset is marked with an attribute “sorted”
- The dataset is always sorted in ascending order, with NA first
- Using
keybyinstead ofbywhen grouping will set the grouping factors as keys
See this SO post for more information on keys.
To see over which keys (if any) the dataset is currently ordered:
Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.
1.1.3 Ordering with (secondary) indices
-
setindexcreates an index for the provided columns, but doesn’t physically reorder the dataset in RAM.
- It computes the ordering vector of the dataset’s rows according to the provided columns in an additional attribute called index
data.table [32 x 11]
| [ omitted 17 entries ] |
We can see the additional index attribute added to the data.table:
names(attributes(MT))[1] "names" "row.names" "class"
[4] ".internal.selfref" "index"
We can get the currently used indices with:
indices(MT)[1] “cyl__gear”
Adding a new index doesn’t remove a previously existing one:
We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will be using to group or subset by frequently !
1.2 Subset / Filter
1.2.1 Basic filtering
mtcars |> filter(cyl >= 6 & disp < 180)data.frame [5 x 11]
1.2.2 Filter based on a range
1.2.3 Filter with a pattern
mtcars |> filter(str_detect(disp, "^\\d{3}\\."))data.frame [9 x 11]
MT[disp %like% "^\\d{3}\\."]data.table [9 x 11]
1.2.4 Filter on row number (slicing)
mtcars |> slice(1) # slice_head(n = 1)data.frame [1 x 11]
Slice a random sample of rows:
mtcars |> slice_sample(n = 5)data.frame [5 x 11]
MT[1]data.table [1 x 11]
MT[.N]data.table [1 x 11]
Slice a random sample of rows:
MT[sample(.N, 5)]data.table [5 x 11]
1.2.5 Filter distinct/unique rows
mtcars |> distinct(mpg, hp, .keep_all = TRUE)data.frame [31 x 11]
| [ omitted 16 entries ] |
Number of unique rows/values
n_distinct(mtcars$gear)[1] 3
1.2.6 Filter by keys
When keys or indices are defined, we can filter based on them, which is often a lot faster.
We do not even need to specify the column name we are filtering on: the values will be attributed to the keys in order.
setkey(MT, cyl)
MT[.(6)] # Equivalent to MT[cyl == 6]data.table [7 x 11]
setkey(MT, cyl, gear)
MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]data.table [4 x 11]
1.2.7 Filter by indices
To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn’t already exist).
IRIS["setosa", on = "Species"]data.table [50 x 5]
| [ omitted 35 entries ] |
Since the time to compute the secondary indices is quite small, we don’t have to use setindex, unless the task involves repeated subsetting on the same columns.
When using on with multiple values, the nomatch = NULL argument avoids creating combinations that do not exist in the original data (i.e. for cyl == 5 here)
MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]data.table [12 x 11]
1.2.8 Filtering on multiple columns
Filtering with one function taking multiple columns:
f_dat <- \(d) with(d, gear > cyl) # Function taking the data and comparing fix columns
f_dyn <- \(x, y) x > y # Function taking dynamic columns and comparing themcols <- c("gear", "cyl")Manually:
mtcars |> filter(f_dyn(gear, cyl))data.frame [2 x 11]
Dynamically:
Taking column names:
Taking the data:
Manually:
MT[f_dyn(gear, cyl),]data.table [2 x 11]
Dynamically:
Taking column names:
data.table [2 x 11]
Taking the data:
MT[f_dat(MT),] # Can't use .SD in idata.table [2 x 11]
In two steps:
.SD in the i clause of a data.table
But we can bypass that constraint by doing the operation in two steps:
- Obtaining a vector stating if each row of the table matches or not the conditions
- Filtering the original table based on the vector
MT[MT[, f_dat(.SD)]]data.table [2 x 11]
Combining multiple filtering functions:
This function filters rows that have 2 or more non-zero decimals, and we’re going to call it on multiple columns:
decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 2cols <- c("drat", "wt", "qsec")Manually:
MT[decp(drat) & decp(wt) & decp(qsec), ]data.table [13 x 11]
Dynamically:
In two steps:
1.3 Rename
setnames changes column names in-place
Manually:
mtcars |> rename(CYL = cyl, MPG = mpg)data.frame [32 x 11]
| [ omitted 17 entries ] |
Dynamically:
mtcars |> rename_with(\(c) toupper(c), .cols = matches("^d"))data.frame [32 x 11]
| [ omitted 17 entries ] |
1.4 Select
1.4.1 Basic selection
MT[, .(mpg, disp)]data.table [32 x 2]
| [ omitted 17 entries ] |
Remove a column:
MT[, !"cyl"] # MT[, -"cyl"]data.table [32 x 10]
| [ omitted 17 entries ] |
In-place:
copy(MT)[, cyl := NULL][]data.table [32 x 10]
| [ omitted 17 entries ] |
Select & Extract:
mtcars |> pull(disp) [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
Select & Rename:
mtcars |> select(dispp = disp)data.frame [32 x 1]
| [ omitted 17 entries ] |
Select & Extract:
MT[, disp] [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
Select & Rename:
MT[, .(dispp = disp)]data.table [32 x 1]
| [ omitted 17 entries ] |
1.4.2 Dynamic selection
1.4.2.1 By name
cols <- c("cyl", "disp")Removing a column:
mtcars |> select(!{{cols}}) # select(-matches(cols))data.frame [32 x 9]
| [ omitted 17 entries ] |
MT[, ..cols]data.table [32 x 2]
| [ omitted 17 entries ] |
Removing a column:
MT[, !..cols]data.table [32 x 9]
| [ omitted 17 entries ] |
In-place:
copy(MT)[, (cols) := NULL][]data.table [32 x 9]
| [ omitted 17 entries ] |
1.4.2.2 By pattern
1.4.2.3 By column type
iris |> select(where(\(x) !is.numeric(x)))data.frame [150 x 1]
| [ omitted 135 entries ] |
IRIS[, .SD, .SDcols = !is.numeric]data.table [150 x 1]
| [ omitted 135 entries ] |
1.5 Mutate / Transmute
data.table can mutate in 2 ways:
- Using = creates a new DT with the new columns only (like dplyr::transmute)
- Using := (or let) modifies the current dt in place (like dplyr::mutate)
The function modifying a column should be the same size as the original column (or group).
If only one value is provided with :=, it will be recycled to the whole column/group.
If the number of values provided is smaller than the original column/group:
- With := or let, an error will be raised, asking to manually specify how to recycle the values.
- With =, it will behave like dplyr::summarize (if a grouping has been specified).
1.5.1 Basic transmute
Only keeping the transformed columns.
mtcars |> transmute(cyl = cyl * 2)data.frame [32 x 1]
| [ omitted 17 entries ] |
MT[, .(cyl = cyl * 2)]data.table [32 x 1]
| [ omitted 17 entries ] |
Transmute & Extract:
MT[, (cyl = cyl * 2)] [1] 12 12 8 12 16 12 16 8 8 12 12 16 16 16 16 16 16 8 8 8 8 16 16 16 16
[26] 8 8 8 16 12 16 8
1.5.2 Basic mutate
Modifies the transformed column in-place and keeps every other column as-is.
1.5.3 Dynamic trans/mutate
LHS <- "mean_mpg"
RHS <- "mpg"