library(here) # Working directory management
library(pipebind) # Piping goodies
library(data.table) # Fast data manipulation (in-RAM)
library(tibble) # Extending data.frames (Tidyverse)
library(dplyr) # Manipulating data.frames - core (Tidyverse)
library(tidyr) # Manipulating data.frames - extras (Tidyverse)
library(stringr) # Manipulating strings (Tidyverse)
library(purrr) # Manipulating lists (Tidyverse)
library(lubridate) # Manipulating date/time (Tidyverse)
library(broom) # Tidying up models output (Tidymodels)
data.table::setDTthreads(parallel::detectCores(logical = FALSE))
v1: 2022-05-19
v2: 2022-05-26
- Improved the section on keys (for ordering & filtering)
- Added a section for translations of tidyr (and other similar packages)
- Capped tables to display 15 rows max when unfolded
- Improved table display (striping, hiding the contents of nested columns, …)
v3: 2022-07-20
- Updated data.table's examples of dynamic programming using env
- Added new entries in processing examples
- Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
- Added pivot_wider examples to match the dcast ones in the Pivots section
- Added some new examples here and there across the Basic Operations section
- Added an entry for operating inside nested data.frames/data.tables
- Added a processing example for run-length encoding (i.e. successive event tagging)
v4: 2022-08-05
- Improved the pivot section: example of one-hot encoding (and reverse operation) + better examples of partial pivots with .value
- Added a tidyr::uncount() (row duplication) example
- Improved both light & dark themes (code highlighting, tables, …)
v5: 2023-03-12
- Revamped the whole document with grouped tabsets by framework for better readability
- Revamped the whole Basic Operations section: better structure, reworked examples, …
- Revamped the whole Joins section: better structure, new examples (e.g. join_by), better explanations, …
- Updated code to reflect recent updates of the Tidyverse:
  - dplyr (1.1.0): .by, reframe, join_by, consecutive_id, …
  - purrr (1.0.0): list_rbind, list_cbind, …
  - tidyr (1.3.0): updated the separate/separate_rows section to the newer separate_wider/longer_*
- Updated code to reflect recent updates of data.table (1.14.9): let, DT(), …
Setup
Applying a custom theme to all gt tables
#-----------------------#
####🔺gt knit_prints ####
#-----------------------#
library(knitr)
library(gt)
# Custom knit_print methods: collapse printed tables into <details> blocks (style_table() is a helper defined elsewhere in the post)
knit_print.grouped_df <- function(x, options, ...) {
if ("grouped_df" %in% class(x)) x <- ungroup(x)
cl <- intersect(class(x), c("data.table", "data.frame"))[1]
nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
cat("<summary>\n")
cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
cat("</summary>\n<br>\n")
print(gt::as_raw_html(style_table(x, nrows)))
cat("</details>\n\n")
}
registerS3method("knit_print", "grouped_df", knit_print.grouped_df)
knit_print.data.frame <- function(x, options, ...) {
cl <- intersect(class(x), c("data.table", "data.frame"))[1]
nrows <- ifelse(!is.null(options$total_rows), as.numeric(options$total_rows), dim(x)[1])
is_open <- ifelse(!is.null(options[["details-open"]]), as.logical(options[["details-open"]]), FALSE)
cat(str_glue("\n<details{ifelse(is_open, ' open', '')}>\n"))
cat("<summary>\n")
cat(str_glue("\n*{cl} [{scales::label_comma()(nrows)} x {dim(x)[2]}]*\n"))
cat("</summary>\n<br>\n")
print(gt::as_raw_html(style_table(x, nrows)))
cat("</details>\n\n")
}
registerS3method("knit_print", "data.frame", knit_print.data.frame)
1 Basic Operations
data.table general syntax: DT[row selector (filter/sort), column selector (select/mutate/summarize/reframe/rename), modifiers (group/join by)]
Data
MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]
1.1 Arrange / Order
1.1.1 Basic ordering
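A minimal sketch, sorting by cyl (ascending) then mpg (descending):
mtcars |> arrange(cyl, desc(mpg))
MT[order(cyl, -mpg)]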
1.1.2 Ordering with keys
- Keys physically reorder the dataset within the RAM (by reference)
- No memory is used for sorting (other than marking which columns are the key)
- The dataset is marked with a "sorted" attribute
- The dataset is always sorted in ascending order, with NAs first
- Using keyby instead of by when grouping will set the grouping columns as keys
See this SO post for more information on keys.
To see over which keys (if any) the dataset is currently ordered:
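For instance (assuming keys were previously set with setkey()):
setkey(MT, cyl, gear)
key(MT)
# [1] "cyl"  "gear"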
Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.
1.1.3 Ordering with (secondary) indices
- setindex creates an index for the provided columns, but doesn't physically reorder the dataset in RAM.
- It computes the ordering vector of the dataset's rows according to the provided columns, stored in an additional attribute called index.
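For example, creating an index on cyl and gear (matching the indices() output shown below):
setindex(MT, cyl, gear)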
data.table [32 x 11]
[ omitted 17 entries ] |
We can see the additional index attribute added to the data.table:
names(attributes(MT))
[1] "names" "row.names" "class"
[4] ".internal.selfref" "index"
We can get the currently used indices with:
indices(MT)
[1] "cyl__gear"
Adding a new index doesn’t remove a previously existing one:
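A sketch (hypothetical second index on hp):
setindex(MT, hp)
indices(MT)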
We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will frequently group or subset by!
1.2 Subset / Filter
1.2.1 Basic filtering
mtcars |> filter(cyl >= 6 & disp < 180)
data.frame [5 x 11]
1.2.2 Filter based on a range
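A minimal sketch, keeping rows with mpg between 15 and 25 (bounds included):
mtcars |> filter(between(mpg, 15, 25))
MT[mpg %between% c(15, 25)]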
1.2.3 Filter with a pattern
mtcars |> filter(str_detect(disp, "^\\d{3}\\."))
data.frame [9 x 11]
MT[disp %like% "^\\d{3}\\."]
data.table [9 x 11]
1.2.4 Filter on row number (slicing)
mtcars |> slice(1) # slice_head(n = 1)
data.frame [1 x 11]
Slice a random sample of rows:
mtcars |> slice_sample(n = 5)
data.frame [5 x 11]
MT[1]
data.table [1 x 11]
MT[.N]
data.table [1 x 11]
Slice a random sample of rows:
MT[sample(.N, 5)]
data.table [5 x 11]
1.2.5 Filter distinct/unique rows
mtcars |> distinct(mpg, hp, .keep_all = TRUE)
data.frame [31 x 11]
[ omitted 16 entries ] |
Number of unique rows/values
n_distinct(mtcars$gear)
[1] 3
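The data.table equivalent is uniqueN():
uniqueN(MT$gear)
uniqueN(MT, by = c("mpg", "hp")) # Number of unique rows for those columns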
1.2.6 Filter by keys
When keys or indices are defined, we can filter based on them, which is often a lot faster.
We do not even need to specify the column name we are filtering on: the values will be attributed to the keys in order.
setkey(MT, cyl)
MT[.(6)] # Equivalent to MT[cyl == 6]
data.table [7 x 11]
setkey(MT, cyl, gear)
MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]
data.table [4 x 11]
1.2.7 Filter by indices
To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn't already exist).
IRIS["setosa", on = "Species"]
data.table [50 x 5]
[ omitted 35 entries ] |
Since the time to compute the secondary indices is quite small, we don't have to use setindex, unless the task involves repeated subsetting on the same columns.
When using on with multiple values, the nomatch = NULL argument avoids creating combinations that do not exist in the original data (i.e. for cyl == 5 here):
MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]
data.table [12 x 11]
1.2.8 Filtering on multiple columns
Filtering with one function taking multiple columns:
f_dat <- \(d) with(d, gear > cyl) # Function taking the data and comparing fixed columns
f_dyn <- \(x, y) x > y # Function taking dynamic columns and comparing them
cols <- c("gear", "cyl")
Manually:
mtcars |> filter(f_dyn(gear, cyl))
data.frame [2 x 11]
Dynamically:
Taking column names:
Taking the data:
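A sketch of both dplyr variants (assuming cols = c("gear", "cyl") as defined above, and dplyr >= 1.1 for pick()):
mtcars |> filter(f_dyn(.data[[cols[1]]], .data[[cols[2]]])) # Taking column names
mtcars |> filter(f_dat(pick(everything())))                 # Taking the data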
Manually:
MT[f_dyn(gear, cyl),]
data.table [2 x 11]
Dynamically:
Taking column names:
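A sketch using get(), or the env argument of recent data.table versions:
MT[f_dyn(get(cols[1]), get(cols[2]))]
MT[f_dyn(x, y), env = list(x = cols[1], y = cols[2])]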
data.table [2 x 11]
Taking the data:
MT[f_dat(MT),] # Can't use .SD in i
data.table [2 x 11]
In two steps:
.SD can't be used in the i clause of a data.table, but we can bypass that constraint by doing the operation in two steps:
- Obtaining a vector stating whether each row of the table matches the conditions
- Filtering the original table based on that vector
MT[MT[, f_dat(.SD)]]
data.table [2 x 11]
Combining multiple filtering functions:
This function keeps rows whose values have 2 or more decimal digits, and we're going to call it on multiple columns:
decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 2
cols <- c("drat", "wt", "qsec")
Manually:
MT[decp(drat) & decp(wt) & decp(qsec), ]
data.table [13 x 11]
Dynamically:
In two steps:
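Sketches of the dynamic and two-step variants (assuming cols as defined above):
mtcars |> filter(if_all(all_of(cols), decp))             # Dynamically (dplyr)
MT[MT[, Reduce(`&`, lapply(.SD, decp)), .SDcols = cols]] # In two steps (data.table)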
1.3 Rename
setnames changes column names in-place.
Manually:
mtcars |> rename(CYL = cyl, MPG = mpg)
data.frame [32 x 11]
[ omitted 17 entries ] |
Dynamically:
mtcars |> rename_with(\(c) toupper(c), .cols = matches("^d"))
data.frame [32 x 11]
[ omitted 17 entries ] |
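A sketch of the data.table equivalents with setnames (in-place, hence the copy()):
setnames(copy(MT), c("cyl", "mpg"), c("CYL", "MPG"))[]        # Manually
setnames(copy(MT), str_subset(colnames(MT), "^d"), toupper)[] # Dynamically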
1.4 Select
1.4.1 Basic selection
MT[, .(mpg, disp)]
data.table [32 x 2]
[ omitted 17 entries ] |
Alternatives
MT[ , .SD, .SDcols = c("mpg", "disp")]
MT[, .SD, .SDcols = patterns("mpg|disp")]
Remove a column:
MT[, !"cyl"] # MT[, -"cyl"]
data.table [32 x 10]
[ omitted 17 entries ] |
In-place:
copy(MT)[, cyl := NULL][]
data.table [32 x 10]
[ omitted 17 entries ] |
Select & Extract:
mtcars |> pull(disp)
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
Select & Rename:
mtcars |> select(dispp = disp)
data.frame [32 x 1]
[ omitted 17 entries ] |
Select & Extract:
MT[, disp]
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
Select & Rename:
MT[, .(dispp = disp)]
data.table [32 x 1]
[ omitted 17 entries ] |
1.4.2 Dynamic selection
1.4.2.1 By name
cols <- c("cyl", "disp")
Removing columns:
mtcars |> select(-all_of(cols)) # select(-matches(cols))
data.frame [32 x 9]
[ omitted 17 entries ] |
MT[, ..cols]
data.table [32 x 2]
[ omitted 17 entries ] |
Removing columns:
MT[, !..cols]
data.table [32 x 9]
[ omitted 17 entries ] |
In-place:
copy(MT)[, (cols) := NULL][]
data.table [32 x 9]
[ omitted 17 entries ] |
1.4.2.2 By pattern
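The dplyr twins of the data.table selections below:
mtcars |> select(!matches("^d"))
mtcars |> select(where(\(x) all(x != 0))) # Only keep columns where no value == 0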
MT[, .SD, .SDcols = !patterns("^d")]
data.table [32 x 9]
[ omitted 17 entries ] |
MT[, .SD, .SDcols = \(x) all(x != 0)] # Only keep columns where no value == 0
data.table [32 x 9]
[ omitted 17 entries ] |
1.4.2.3 By column type
iris |> select(where(\(x) !is.numeric(x)))
data.frame [150 x 1]
[ omitted 135 entries ] |
IRIS[, .SD, .SDcols = !is.numeric]
data.table [150 x 1]
[ omitted 135 entries ] |
1.5 Mutate / Transmute
data.table can mutate in 2 ways:
- Using = creates a new DT with the new columns only (like dplyr::transmute)
- Using := (or let) modifies the current DT in-place (like dplyr::mutate)
The function modifying a column should return a vector of the same size as the original column (or group).
If only one value is provided with :=, it will be recycled to the whole column/group.
If the number of values provided is smaller than the original column/group:
- With := or let, an error will be raised, asking to manually specify how to recycle the values.
- With =, it will behave like dplyr::summarize (if a grouping has been specified).
1.5.1 Basic transmute
Only keeping the transformed columns.
mtcars |> transmute(cyl = cyl * 2)
data.frame [32 x 1]
[ omitted 17 entries ] |
MT[, .(cyl = cyl * 2)]
data.table [32 x 1]
[ omitted 17 entries ] |
Transmute & Extract:
MT[, (cyl = cyl * 2)]
[1] 12 12 8 12 16 12 16 8 8 12 12 16 16 16 16 16 16 8 8 8 8 16 16 16 16
[26] 8 8 8 16 12 16 8
1.5.2 Basic mutate
Modifies the transformed column in-place and keeps every other column as-is.
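A minimal sketch mirroring the transmute example above:
mtcars |> mutate(cyl = cyl * 2)
copy(MT)[, cyl := cyl * 2][] # or copy(MT)[, let(cyl = cyl * 2)][]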
1.5.3 Dynamic trans/mutate
LHS <- "mean_mpg"
RHS <- "mpg"
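A sketch of building the mean_mpg column from those strings (matching the 12-column outputs below):
mtcars |> mutate("{LHS}" := mean(.data[[RHS]]))                  # dplyr + glue syntax
copy(MT)[, (LHS) := mean(get(RHS))][]                            # data.table
copy(MT)[, lhs := mean(rhs), env = list(lhs = LHS, rhs = RHS)][] # data.table with env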
data.frame [32 x 12]
[ omitted 17 entries ] |
data.frame [32 x 12]
[ omitted 17 entries ] |
data.frame [32 x 12]
[ omitted 17 entries ] |
data.table [32 x 12]
[ omitted 17 entries ] |
data.table [32 x 12]
[ omitted 17 entries ] |
1.5.4 Conditional trans/mutate
Mutate everything based on multiple conditions:
One condition:
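A sketch (hypothetical Size column, mirroring the nested example below):
mtcars |> mutate(Size = if_else(cyl >= 6, "BIG", "small"))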
data.frame [32 x 12]
[ omitted 17 entries ] |
Nested conditions:
mtcars |> mutate(Size = case_when(
cyl %between% c(2,4) ~ "small",
cyl %between% c(4,8) ~ "BIG",
.default = "Unk"
))
data.frame [32 x 12]
[ omitted 17 entries ] |
Mutate only rows meeting conditions:
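A sketch mirroring the data.table example below (only mutating rows where am == 1):
mtcars |> mutate(BIG = if_else(am == 1, cyl >= 6, NA))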
Mutate everything based on multiple conditions:
One condition:
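A data.table sketch with fifelse (hypothetical Size column):
copy(MT)[, Size := fifelse(cyl >= 6, "BIG", "small")][]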
data.table [32 x 12]
[ omitted 17 entries ] |
Nested conditions:
copy(MT)[, Size := fcase(
cyl %between% c(2,4), "small",
cyl %between% c(4,8), "BIG",
default = "Unk"
)][]
data.table [32 x 12]
[ omitted 17 entries ] |
Mutate only rows meeting conditions:
copy(MT)[am == 1, BIG := cyl >= 6][]
data.table [32 x 12]
[ omitted 17 entries ] |
1.5.5 Complex trans/mutate
1.5.5.1 Column-wise operations
Apply one function to multiple columns:
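A sketch with across() (hypothetical *_x2 columns, matching the 2 extra columns in the output below):
mtcars |> mutate(across(c(mpg, disp), \(x) x * 2, .names = "{.col}_x2"))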
data.frame [32 x 13]
[ omitted 17 entries ] |
As a transmute:
data.frame [32 x 2]
[ omitted 17 entries ] |
Dynamically:
Apply multiple functions to one or multiple columns:
col <- "mpg"
cols <- c("mpg", "disp")
data.frame [32 x 13]
[ omitted 17 entries ] |
data.frame [32 x 13]
[ omitted 17 entries ] |
Multiple columns:
data.frame [32 x 15]
[ omitted 17 entries ] |
1.5.5.2 Row-wise operations
Apply one function to multiple columns (row-wise):
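A sketch with rowwise() + c_across() (hypothetical total column):
mtcars |> rowwise() |> mutate(total = sum(c_across(c(mpg, disp)))) |> ungroup()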
data.frame [32 x 12]
[ omitted 17 entries ] |
data.frame [32 x 12]
[ omitted 17 entries ] |
Hybrid base R-Tidyverse:
Apply multiple functions to multiple columns (row-wise):
Apply an anonymous function inside the DT:
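A sketch (hypothetical total column, summing two columns row-wise):
copy(MT)[, total := apply(.SD, 1, \(r) sum(r)), .SDcols = c("mpg", "disp")][]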
1.6 Group / Aggregate
The examples listed apply a grouping but do nothing (using .SD to simply keep all columns as-is).
1.6.1 Basic grouping
mtcars |> group_by(cyl, gear)
data.frame [32 x 11]
[ omitted 17 entries ] |
Dynamic grouping:
Use any_of if you expect some columns to be missing in the data.
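A sketch (hypothetical grouping-columns vector):
grps <- c("cyl", "gear")
mtcars |> group_by(across(any_of(grps)))
MT[, .SD, by = grps]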
1.6.2 Current group info
mtcars |>
group_by(cyl) |>
filter(cur_group_id() == 1) |> # To only keep one plot
group_walk(\(d, g) with(d, plot(hp, mpg, main = paste("Cyl:", g$cyl))))
1.7 Row numbers & indices
1.7.1 Adding row or group indices
- .I: row indices
- .N: number of rows
- .GRP: group index
- .NGRP: number of groups
1.7.1.1 Adding row indices:
mtcars |> mutate(I = row_number())
data.frame [32 x 12]
[ omitted 17 entries ] |
copy(MT)[ , I := .I][]
data.table [32 x 12]
[ omitted 17 entries ] |
1.7.1.2 Adding group indices:
Adding group indices (same index for each group):
mtcars |> summarize(GRP = cur_group_id(), .by = cyl)
data.frame [3 x 2]
Mutate instead of summarize:
mtcars |> mutate(GRP = cur_group_id(), .by = cyl)
data.frame [32 x 12]
[ omitted 17 entries ] |
Adding row numbers within each group:
mtcars |> mutate(I_GRP = row_number(), .by = gear)
data.frame [32 x 12]
[ omitted 17 entries ] |
Adding group indices (same index for each group):
MT[, .GRP, by = cyl]
data.table [3 x 2]
Mutate instead of summarize:
copy(MT)[, GRP := .GRP, by = cyl][]
data.table [32 x 12]
[ omitted 17 entries ] |
Adding row numbers within each group:
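A data.table sketch with rowid():
copy(MT)[, I_GRP := rowid(gear)][] # or seq_len(.N) with by = gear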
1.7.2 Filtering based on row numbers (slicing)
1.7.2.1 Extracting a specific row
1.7.2.2 Slicing rows
tail(mtcars, 10)
data.frame [10 x 11]
data.frame [10 x 11]
mtcars |> slice_tail(n = 10)
data.frame [10 x 11]
tail(MT, 10)
data.table [10 x 11]
MT[(.N-9):.N]
data.table [10 x 11]
MT[MT[, .I[(.N-9):.N]]] # Gets the last 10 rows' indices and filters based on them
data.table [10 x 11]
1.7.2.3 Slicing groups
Random sample by group:
mtcars |> slice_sample(n = 5, by = cyl)
data.frame [15 x 11]
Filter groups by condition:
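A sketch, keeping only the cyl groups with at least 8 rows (matching the 25-row output below):
mtcars |> filter(n() >= 8, .by = cyl)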
data.frame [25 x 11]
[ omitted 10 entries ] |
mtcars |> group_by(cyl) |> group_modify(\(d,g) if (nrow(d) >= 8) d else data.frame())
data.frame [25 x 11]
[ omitted 10 entries ] |
Random sample by group:
MT[, .SD[sample(.N, 5)], keyby = cyl]
data.table [15 x 11]
Filter groups by condition:
MT[, if(.N >= 8) .SD, by = cyl]
data.table [25 x 11]
[ omitted 10 entries ] |
MT[, .SD[.N >= 8], by = cyl]
data.table [25 x 11]
[ omitted 10 entries ] |
1.7.3 Extracting row indices
1.7.3.1 Getting the row numbers of specific observations:
Row number of the first and last observation of each group:
mtcars |> reframe(I = cur_group_rows()[c(1, n())], .by = cyl)
data.frame [6 x 2]
… while keeping all other columns:
mtcars |> mutate(I = row_number()) |> slice(c(1, n()), .by = cyl)
data.frame [6 x 12]
1.7.3.2 Extracting row indices after filtering
Extracting row numbers in the original dataset:
mtcars |> mutate(I = row_number()) |> filter(gear == 4) |> pull(I)
[1] 1 2 3 8 9 10 11 18 19 20 26 32
Extracting row numbers in the new dataset (after filtering):
mtcars |> filter(gear == 4) |> mutate(I = row_number()) |> pull(I)
[1] 1 2 3 4 5 6 7 8 9 10 11 12
.I gives the vector of row numbers after any subsetting/filtering has been done.
Extracting row numbers in the original dataset:
MT[, .I[gear == 4]]
[1] 1 2 3 8 9 10 11 18 19 20 26 32
Extracting row numbers in the new dataset (after filtering):
MT[gear == 4, .I]
[1] 1 2 3 4 5 6 7 8 9 10 11 12
1.8 Relocate
1.8.1 Basic reordering
Relocate a new column (mutate + relocate):
mtcars |> mutate(GRP = cur_group_id(), .by = cyl, .before = 1)
data.frame [32 x 12]
[ omitted 17 entries ] |
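A sketch of the dplyr twin of the setcolorder examples below (moving cyl to the end):
mtcars |> relocate(cyl, .after = last_col())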
setcolorder(copy(MT), "cyl", after = last(colnames(MT)))[]
data.table [32 x 11]
[ omitted 17 entries ] |
setcolorder(copy(MT), c(setdiff(colnames(MT), "cyl"), "cyl"))[]
data.table [32 x 11]
[ omitted 17 entries ] |
Relocate a new column (mutate + relocate):
setcolorder(copy(MT)[ , GRP := .GRP, by = cyl], "GRP")[]
data.table [32 x 12]
[ omitted 17 entries ] |
1.8.2 Reordering by column names
setcolorder(copy(MT), sort(colnames(MT)))[]
data.table [32 x 11]
[ omitted 17 entries ] |
1.9 Summarize/Reframe
With data.table, one needs to use the = operator to summarize. It takes a function that returns a value (or list of values) smaller than the original column (or group) size. By default, it will only keep the modified columns (like a transmute).
1.9.1 Basic summary
MT[, .(mean_cyl = mean(cyl))]
data.table [1 x 1]
1.9.2 Grouped summary
dplyr::summarize with .by keeps the groups in their order of appearance; with group_by(), the result is arranged by the grouping factor.
To order by the grouping factor, use group_by() instead of .by:
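A sketch mirroring the data.table examples below:
mtcars |> summarize(N = n(), .by = cyl)       # order of appearance
mtcars |> group_by(cyl) |> summarize(N = n()) # ordered by cyl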
By default, data.table keeps the order the groups originally appear in:
MT[, .N, by = cyl]
data.table [3 x 2]
To order by the grouping factor, use keyby instead of by:
MT[, .N, keyby = cyl]
data.table [3 x 2]
Grouped on a temporary variable:
MT[, .N, by = .(cyl > 6)]
data.table [2 x 2]
1.9.3 Column-wise summary
1.9.3.1 Apply one function to multiple columns:
mtcars |> summarize(across(everything(), mean), .by = cyl)
data.frame [3 x 11]
By column type:
By matching column names:
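Sketches of both selections:
mtcars |> summarize(across(where(is.numeric), mean), .by = cyl) # By column type
mtcars |> summarize(across(matches("^d"), mean), .by = cyl)     # By matching column names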
1.9.3.2 Applying multiple functions to one column:
With column names:
With column names:
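A sketch for both frameworks (mean and sd of mpg, per cyl):
mtcars |> summarize(across(mpg, list(mean = mean, sd = sd)), .by = cyl)
MT[, .(mpg_mean = mean(mpg), mpg_sd = sd(mpg)), by = cyl]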
1.9.3.3 Apply multiple functions to multiple columns:
Depending on the output we want (i.e. having the function's output as columns or rows), we can either provide a list of functions to apply (list_of_fns), or a function returning a list (fn_returning_list).
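The examples below assume definitions along these lines (stand-ins, matching the mean/sd columns in the outputs):
cols <- c("mpg", "disp")
list_of_fns <- list(mean = mean, sd = sd)
fn_returning_list <- \(x) list(mean = mean(x), sd = sd(x))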
One column per function, one row per variable:
data.frame [6 x 4]
One column per variable, one row per function:
data.frame [6 x 4]
One column per function/variable combination:
One column per function, one row per variable:
data.table [6 x 4]
One column per variable, one row per function:
data.table [6 x 4]
One column per function/variable combination:
data.table [3 x 5]
data.table [3 x 5]
Different column order & naming scheme:
MT[,
lapply(list_of_fns, \(f) lapply(.SD, f)) |>
unlist(recursive = FALSE),
by = cyl, .SDcols = cols
]
data.table [3 x 5]
Using dcast (see the next section for more on pivots):
dcast(MT, cyl ~ ., fun.agg = list_of_fns, value.var = cols) # list(mean, sd)
data.table [3 x 5]
2 Pivots
2.1 Melt / Longer
Data:
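A stand-in for the FAM1 table, modeled on the data.table pivots vignette (one dob_child* column per child):
FAM1 <- data.table(
  family_id = 1:5,
  age_mother = c(30, 27, 26, 32, 29),
  dob_child1 = as.IDate(c("1998-11-26", "1996-06-22", "2002-07-11", "2004-10-10", "2000-12-05")),
  dob_child2 = as.IDate(c("2000-01-29", NA, "2004-04-05", "2009-08-27", "2005-02-28")),
  dob_child3 = as.IDate(c(NA, NA, "2007-09-02", "2012-07-21", NA))
)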
2.1.1 Basic Melt/Longer
data.table::melt does partial argument matching and thus accepts shortened versions of its arguments, e.g. variable.name <=> variable (or var), value.name <=> value (or val), measure.vars <=> measure, id.vars <=> id, pattern <=> pat, …
One group of columns –> single value column
pivot_longer(FAM1, cols = matches("dob_"), names_to = "variable")
data.frame [15 x 4]
One group of columns –> multiple value columns
# No direct equivalent
melt(FAM1, measure = patterns(child1 = "child1$", child2 = "child2$|child3$"))
data.table [10 x 5]
2.1.2 Merging multiple yes/no columns
Melting multiple presence/absence columns into a single variable:
Data:
(MOVIES_WIDE <- as.data.table(movies_wide))
data.table [3 x 4]
pivot_longer(
movies_wide, -ID, names_to = "Genre",
values_transform = \(x) ifelse(x == 0, NA, x), values_drop_na = TRUE
) |> select(-value)
data.frame [6 x 2]
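A data.table sketch of the same operation:
melt(MOVIES_WIDE, id = "ID", variable.name = "Genre")[value != 0, !"value"]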
2.1.3 Partial pivot
Multiple groups of columns –> Multiple value columns
Using .value:
Using the .value special identifier allows us to do a "half" pivot: the values that would be listed as rows under .value are instead used as columns.
pivot_longer(fam2, matches("^dob|^gender"), names_to = c(".value", "child"), names_sep = "_child")
data.frame [15 x 5]
Using .value
:
data.table [15 x 5]
Manually:
colA <- str_subset(colnames(FAM2), "^dob")
colB <- str_subset(colnames(FAM2), "^gender")
melt(FAM2, measure = list(colA, colB), val = c("dob", "gender"), var = "child")
data.table [15 x 5]
Alternatives
melt(FAM2, measure = list(a, b), val = c("dob", "gender"), var = "child") |>
substitute2(env = list(a = I(str_subset(colnames(FAM2), "^dob")), b = I(str_subset(colnames(FAM2), "^gender")))) |> eval()
Using measure and value.name:
melt(FAM2, measure = measure(value.name, child = \(x) as.integer(x), sep = "_child"))
data.table [15 x 5]
2.2 Dcast / Wider
General idea:
- Pivot around the combination of id.vars (LHS of the formula)
- The measure.vars (RHS of the formula) are the ones whose values become column names
- The value.var columns are the ones the values are taken from to fill the new columns
Data:
(fam1l <- as.data.frame(FAM1L))
data.frame [15 x 4]
(fam2l <- as.data.frame(FAM2L))
data.frame [15 x 5]
2.2.1 Basic Dcast/Wider
pivot_wider(fam1l, id_cols = c("family_id", "age_mother"), names_from = "variable")
data.frame [5 x 5]
dcast(FAM1L, family_id + age_mother ~ variable)
data.table [5 x 5]
Using all the columns as IDs:
pivot_wider(fam1l, names_from = variable)
data.frame [5 x 5]
By default, id_cols = everything()
FAM1L |> dcast(... ~ variable)
data.table [5 x 5]
... <=> "every unused column"
Multiple value columns –> Multiple groups of columns:
pivot_wider(
fam2l, id_cols = c("family_id", "age_mother"), values_from = c("dob", "gender"),
names_from = "child", names_sep = "_child"
)
data.frame [5 x 8]
Dynamic names in the formula:
var_name <- "variable"
id_vars <- c("family_id", "age_mother")
pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = {{ var_name }})
data.frame [5 x 5]
Multiple dynamic names:
pivot_wider(fam1l, id_cols = all_of(id_vars), names_from = variable)
data.frame [5 x 5]
data.table [5 x 5]
dcast(FAM1L, family_id + age_mother ~ x) |> substitute2(env = list(x = var_name)) |> eval()
data.table [5 x 5]
Multiple dynamic names:
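A sketch, building the dcast formula from the strings defined above:
dcast(FAM1L, as.formula(str_c(str_c(id_vars, collapse = " + "), " ~ ", var_name)))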
2.2.2 Renaming (prefix/suffix) the columns
pivot_wider(fam1l, names_from = variable, values_from = value, names_prefix = "Attr: ")
data.frame [5 x 5]
pivot_wider(fam1l, names_from = variable, values_from = value, names_glue = "Attr: {variable}")
data.frame [5 x 5]
2.2.3 Unused combinations
The logic is inverted between dplyr (keep) and data.table (drop):
pivot_wider(fam1l, names_from = variable, values_from = value, id_expand = T, names_expand = F)
data.frame [25 x 5]
[ omitted 10 entries ] |
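The data.table equivalent, keeping unused combinations with drop = FALSE:
dcast(FAM1L, family_id + age_mother ~ variable, drop = FALSE)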
2.2.4 Subsetting
fam1l |> filter(value >= lubridate::ymd(20030101)) |>
pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")
data.frame [3 x 5]
AFAIK, pivot_wider can't do this on its own.
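With data.table, dcast has a dedicated subset argument; a sketch:
dcast(FAM1L, family_id + age_mother ~ variable, subset = .(value >= as.IDate("2003-01-01")))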
2.2.5 Aggregating
In data.table, not specifying the column holding the measure vars (the names) will result in an empty column counting the number of columns that should have been created for all the measures (i.e. the length() of the result).
Customizing the default behavior (length()) using the fun.aggregate (<=> fun.agg or fun) argument:
Here, we count the number of children for each combination of (family_id + age_mother) by summing all non-NA values:
(pivot_wider(
fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x)
)
|> mutate(child_count = apply(pick(matches("_child")), 1, \(x) sum(x)))
|> select(-matches("^dob_"))
)
data.frame [5 x 3]
Alternatives
(pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x))
|> mutate(child_count = pmap_int(pick(matches("_child")), \(...) sum(...)))
|> select(-matches("^dob_"))
)
(pivot_wider(fam1l, id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(x) !is.na(x))
|> rowwise()
|> mutate(child_count = sum(c_across(matches("_child"))))
|> ungroup()
|> select(-matches("^dob_"))
)
Applying multiple fun.agg:
Data:
(DTL <- data.table(
id1 = sample(5, 20, TRUE),
id2 = sample(2, 20, TRUE),
group = sample(letters[1:2], 20, TRUE),
v1 = runif(20),
v2 = 1L
)
)
data.table [20 x 5]
[ omitted 5 entries ] |
- Multiple aggregation functions applied to one variable:
(pivot_wider(
DTL, id_cols = c("id1", "id2"), names_from = "group", values_from = "v1",
names_glue = "{.value}_{.name}", names_vary = "slowest", names_sort = TRUE,
values_fn = \(x) tibble("sum" = sum(x), "mean" = mean(x))
)
|> unnest(cols = starts_with("v1"), names_sep = "_")
)
data.frame [9 x 6]
- Multiple aggregation functions applied to multiple variables (all combinations):
(DTL |> pivot_wider(
id_cols = c("id1", "id2"), names_from = "group", names_vary = "slowest", names_sort = TRUE,
values_from = c("v1", "v2"), values_fn = \(x) tibble("sum" = sum(x), "mean" = mean(x))
)
|> unnest(cols = matches("^v1|^v2"), names_sep = "_")
)
data.frame [9 x 10]
- Multiple aggregation functions applied to multiple variables (one-to-one):
# Not possible with pivot_wider AFAIK
- Multiple aggregation functions applied to one variable:
- Multiple aggregation functions applied to multiple variables (all combinations):
- Multiple aggregation functions applied to multiple variables (one-to-one):
Here, we apply sum to v1 (for both groups a & b), and mean to v2 (for both groups a & b).
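Sketches of the three dcast variants (a list value.var maps each function to one variable, one-to-one):
dcast(DTL, id1 + id2 ~ group, fun.agg = list(sum, mean), value.var = "v1")             # one variable
dcast(DTL, id1 + id2 ~ group, fun.agg = list(sum, mean), value.var = c("v1", "v2"))    # all combinations
dcast(DTL, id1 + id2 ~ group, fun.agg = list(sum, mean), value.var = list("v1", "v2")) # one-to-one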
2.2.6 One-hot encoding
Making each level of a variable into a presence/absence column:
movies_long
data.frame [6 x 3]
pivot_wider(
movies_long, names_from = "Genre", values_from = "Genre",
values_fn = \(x) !is.na(x), values_fill = FALSE
)
data.frame [6 x 5]
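A data.table sketch (using ... ~ Genre to keep every other column as an ID, mirroring pivot_wider's default):
dcast(as.data.table(movies_long), ... ~ Genre, value.var = "Genre",
      fun.agg = \(x) length(x) > 0, fill = FALSE)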
3 Joins
3.1 Mutating Joins
The purpose of mutating joins is to add columns/information from one table to another, by matching their rows.
Data:
(CITIES <- as.data.table(cities))
data.table [10 x 3]
(COUNTRIES <- as.data.table(countries))
data.table [9 x 2]
3.1.1 Left/Right Join
Both left & right joins append the columns of one table to those of another, in the order they are given (i.e. columns of the first table will appear first in the result). However, how rows are matched (and how the ones not finding a match are handled) depends on the type of join:
- Left joins match on the rows of the first (left) table. Unmatched rows from the left table will be kept, but not the right’s.
- Right joins match on the rows of the second (right) table. Unmatched rows from the right table will be kept, but not the left’s.
To find out which country each city belongs to, we’re going to merge countries into cities.
Here, we want to add data to the cities table by matching each city to a country (by their country_id). The ideal output would have the columns of cities first, and keep all rows from cities, even if unmatched: thus we will use a left join.
- As a left join:
left_join(cities, countries, by = "country_id", multiple = "all")
data.frame [10 x 4]
data.table natively only supports right joins:
It filters the rows of the first table by those of the second (FIRST[SECOND]), but only keeps the unmatched rows from the second table.
The normal output of the join
CITIES[COUNTRIES, on = .(country_id)]
data.table [10 x 4]
The unmatched rows from countries were kept, but not the ones from cities. Here are two possible workarounds:
Inverting the two tables (countries first), and then inverting the order of the columns in the result:
COUNTRIES[CITIES, .(city_id, city, country_id, country), on = .(country_id)]
data.table [10 x 4]
Adding the columns of countries (in-place) to cities during the join:
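A sketch of the usual update-join idiom:
copy(CITIES)[COUNTRIES, country := i.country, on = .(country_id)][]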
We could accomplish a similar result with a right join by inverting the order of appearance of the columns. But the order of the columns in the result will be less ideal (countries first):
- As a right join:
right_join(countries, cities, by = "country_id", multiple = "all")
data.frame [10 x 4]
COUNTRIES[CITIES, on = .(country_id)][order(country_id)]
data.table [10 x 4]
3.1.2 Full Join
Fully merges the two tables, keeping the unmatched rows from both tables.
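Minimal sketches:
full_join(cities, countries, by = "country_id")
merge(CITIES, COUNTRIES, by = "country_id", all = TRUE)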
3.1.3 Cross Join
Generating all combinations of the IDs of both tables.
cross_join(select(cities, city), select(countries, country))
data.frame [90 x 2]
[ omitted 75 entries ] |
CJ(city = CITIES[, city], country = COUNTRIES[, country])
data.table [90 x 2]
[ omitted 75 entries ] |
3.1.4 Inner Join
Merges the columns of both tables and only returns the rows that matched between both tables (no unmatched rows are kept).
inner_join(countries, cities, by = "country_id", multiple = "all")
data.frame [9 x 4]
COUNTRIES[CITIES, on = .(country_id), nomatch = NULL]
data.table [9 x 4]
3.1.5 Self join
Merging the table with itself. Typically used on graph-type data represented as a flat table (e.g. hierarchies).
Data:
data.frame [5 x 4]
The goal here is to find the identity of everyone’s n+1 by merging the table on itself:
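A sketch under assumed column names (an employees table where each row's hypothetical boss_id points to another row's id; table and columns here are illustrative, not the post's actual data):
employees |> left_join(employees, by = join_by(boss_id == id), suffix = c("", "_boss"))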
3.2 Filtering Joins
Used to filter one table (left) based on another (right): it will only keep the columns from the left table and will either keep (semi join) or discard (anti join) the rows whose IDs match between both tables.
3.2.1 Semi join
Will give the same result as an inner join, but will only keep the columns of the first table (no information is added).
Here, it will filter countries to only keep the countries having a matching country_id in the cities table.
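Minimal sketches:
semi_join(countries, cities, by = "country_id")
COUNTRIES[country_id %in% CITIES$country_id]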
3.2.2 Anti join
Here, it will filter countries to only keep the countries having no matching country_id in the cities table.
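Minimal sketches (! in i performs an anti join in data.table):
anti_join(countries, cities, by = "country_id")
COUNTRIES[!CITIES, on = .(country_id)]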
3.3 Non-equi joins
Non-equi joins are joins where the conditions used to match rows are no longer strict equalities between the tables' ID columns.
We can divide non-equi joins between:
- Inequality joins: a general inequality condition between IDs, which could result in multiple matches.
- Rolling joins: only keep the match that minimizes the distance between the IDs (i.e. the closest to perfect equality).
- Overlap joins: matching to all values within a range.
Please refer to this page of the second edition of R4DS for more detailed explanations.
Data:
Events:
data.table [3 x 4]
Strikes:
data.table [4 x 4]
3.3.1 Inequality join
Inequality joins are joins (left, right, inner, …) that use inequalities (<, <=, >=, or >) to specify the matching criteria.
The condition has to be a simple inequality between existing columns: it cannot be an arbitrary function (e.g. date.x <= min(date.y) * 2 will not work).
- For each event, which strikes occurred (finished) before the event?
inner_join(events, strikes, join_by(e.start >= s.end))
data.frame [2 x 8]
EVENTS[STRIKES, on = .(e.start >= s.end), nomatch = NULL]
data.table [2 x 7]
When specifying an equality or inequality condition, data.table will merge the two columns: only one will remain, with the values of the second column and the name of the first. Here, e.start will have the values of s.end (which will be removed). I'm not sure if this is a bug or not.
Data:
data.frame [3 x 2]
All permutations: with duplicates (order matters)
cross_join(people, people)
data.frame [9 x 4]
All combinations: without duplicates (order doesn’t matter)
inner_join(people, people, join_by(id < id))
data.frame [3 x 4]
3.3.2 Rolling joins
Rolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, we get the one where the IDs are the closest to equality.
- Which strike started the soonest after the beginning of an event?
inner_join(events, strikes, join_by(closest(e.start <= s.start)))
data.frame [3 x 8]
- Which strike ended the soonest before the start of an event?
inner_join(events, strikes, join_by(closest(e.start >= s.end)))
data.frame [1 x 8]
- Which strike started the soonest after the beginning of an event?
EVENTS[STRIKES, on = .(e.start == s.start), roll = "nearest"
][, .SD[which.min(abs(e.start - e.end))], by = "e.id"]
data.table [3 x 7]
Using the roll argument relaxes the equality constraint of the join (e.g. the e.start == s.start above).
- Which strike ended the soonest before the start of an event?
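A sketch (assuming roll = Inf rolls each event's start back onto the latest strike end at or before it):
STRIKES[EVENTS, on = .(s.end == e.start), roll = Inf, nomatch = NULL]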
3.3.3 Overlap joins
dplyr helper functions
dplyr provides three helper functions to make it easier to work with intervals:
- between(x, y_min, y_max) <=> x >= y_min, x <= y_max: a value of the first table is within a given range of the second
- within(x_min, x_max, y_min, y_max) <=> x_min >= y_min, x_max <= y_max: the ranges of the first table are contained within the second's
- overlaps(x_min, x_max, y_min, y_max) <=> x_min <= y_max, x_max >= y_min: the two ranges overlap partially or totally, in any direction
- Between: Which events had a strike starting in the two hours before the beginning of the event?
First, we need to create the new "2 hours before the beginning of the event" column, since we cannot use arbitrary functions in join_by() (e.g. we cannot do between(s.start, e.start - hours(2), e.start)).
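A sketch of how events2 could be built (mirroring the data.table version further below):
events2 <- mutate(events, e.start_minus2 = e.start - hours(2))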
inner_join(strikes, events2, join_by(between(s.start, e.start_minus2, e.start))) |>
select(colnames(events), colnames(strikes)) # Re-ordering the columns
data.frame [1 x 8]
By default, the value to match needs to be from the first table, and the range it falls within needs to be from the second table. Depending on the column order we need, this can force us to reorder the columns post-join (as in the above example).
This can be alleviated by manually specifying which table each column comes from, using x$col and y$col (x referring to the first table).
inner_join(events2, strikes, join_by(between(y$s.start, x$e.start_minus2, x$e.start))) |>
select(-e.start_minus2)
data.frame [1 x 8]
Manually:
inner_join(events2, strikes, join_by(e.start_minus2 <= s.start, e.start >= s.start)) |>
select(-e.start_minus2)
data.frame [1 x 8]
- Within: Which strikes occurred entirely within the period of an event?
inner_join(strikes, events, join_by(within(s.start, s.end, e.start, e.end)), multiple = "all") |>
select(colnames(events), colnames(strikes)) # Re-ordering the columns
data.frame [1 x 8]
As before, within() requires the first range to be within the second by default, meaning the first table must be the one with the smaller range. Using x$col and y$col resolves the issue of column order.
inner_join(events, strikes, join_by(within(y$s.start, y$s.end, x$e.start, x$e.end)), multiple = "all")
data.frame [1 x 8]
Manually:
inner_join(events, strikes, join_by(e.start <= s.start, e.end >= s.end), multiple = "all")
data.frame [1 x 8]
- Overlaps: Which events overlap with each other?
inner_join(events, events, join_by(e.id < e.id, overlaps(e.start, e.end, e.start, e.end)))
data.frame [1 x 8]
Manually:
inner_join(events, events, join_by(e.id < e.id, e.start <= e.end, e.end >= e.start))
data.frame [1 x 8]
- Between: Which events had a strike starting in the two hours before the beginning of the event?
copy(EVENTS)[, e.start_minus2 := e.start - hours(2)
][STRIKES, on = .(e.start_minus2 <= s.start, e.start >= s.start), nomatch = NULL
][, -"e.start_minus2"]
data.table [1 x 7]
- Within: Which strikes occurred entirely within the period of an event?
EVENTS[STRIKES, on = .(e.start <= s.start, e.end >= s.end), nomatch = NULL]
data.table [1 x 6]
- Overlaps: Which events overlap with each other?
EVENTS[EVENTS, on = .(e.id < e.id, e.start <= e.end, e.end >= e.start), nomatch = NULL]
data.table [1 x 5]
4 Tidyr & Others
4.1 Remove NA
na.omit(IRIS, cols = "Species")
data.table [150 x 5]
[ omitted 135 entries ] |
na.omit(IRIS, cols = str_subset(colnames(IRIS), "Sepal"))
data.table [150 x 5]
[ omitted 135 entries ] |
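The tidyr equivalents with drop_na():
iris |> drop_na(Species)
iris |> drop_na(contains("Sepal"))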
4.2 Unite
Combine multiple columns into a single one:
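A minimal sketch (hypothetical gear_carb column):
mtcars |> unite("gear_carb", gear, carb, sep = "_")
copy(MT)[, gear_carb := str_c(gear, carb, sep = "_")][]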
4.3 Separate / Extract
4.3.1 Separate wider (extract)
(MT.ext <- MT[, .(x = str_c(gear, carb, sep = "_"))])
data.table [32 x 1]
[ omitted 17 entries ] |
Based on a delimiter:
MT.ext |> separate_wider_delim(x, delim = "_", names = c("gear", "carb"))
data.frame [32 x 2]
[ omitted 17 entries ] |
Based on a regex:
MT.ext |> separate_wider_regex(x, patterns = c(gear = "\\d{1}", "_", carb = "\\d{1}"))
data.frame [32 x 2]
[ omitted 17 entries ] |
Based on position:
MT.ext |> separate_wider_position(x, widths = c(gear = 1, delim = 1, carb = 1))
data.frame [32 x 3]
[ omitted 17 entries ] |
separate_wider_* supersedes both extract and separate.
Based on a delimiter:
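A data.table sketch with tstrsplit():
copy(MT.ext)[, c("gear", "carb") := tstrsplit(x, "_")][]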
data.table [32 x 3]
[ omitted 17 entries ] |
Based on a regex:
copy(MT.ext)[, c("gear", "carb") := str_extract_all(x, "\\d") |> list_transpose()][]
data.table [32 x 3]
[ omitted 17 entries ] |
4.3.2 Separate longer/rows
Separating a row into multiple rows, duplicating the rest of the values.
Data
(SP <- data.table(
val = c(1,"2,3",4),
date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"), origin = "1970-01-01")
)
)
data.table [3 x 2]
Based on a delimiter:
SP |> separate_longer_delim(val, delim = ",")
data.frame [4 x 2]
Based on position:
SP |> separate_longer_position(val, width = 1) |> filter(val != ",")
data.frame [4 x 2]
separate_longer_* now supersedes separate_rows.
Old syntax
SP |> separate_rows(val, sep = ",", convert = TRUE)
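A data.table sketch of the same row-splitting:
SP[, .(val = unlist(strsplit(val, ","))), by = date]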
4.4 Duplicates
4.4.1 Duplicated rows
4.4.1.1 Only keeping duplicated rows
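Minimal sketches (rows sharing the same mpg + hp combination):
mtcars |> group_by(mpg, hp) |> filter(n() > 1) |> ungroup()
MT[, if (.N > 1) .SD, by = .(mpg, hp)]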
4.4.1.2 Removing duplicated rows
This is different from distinct/unique, which will keep one of the duplicated rows of each group.
This removes all groups which have duplicated rows.
MT[, if(.N == 1) .SD, by = .(mpg, hp)]
data.table [30 x 11]
[ omitted 15 entries ] |
Alternatives
# More convoluted
MT[!MT[, if(.N > 1) .SD, by = .(mpg, hp)], on = colnames(MT)]
fsetdiff(MT, setcolorder(MT[, if(.N > 1) .SD, by = .(mpg, hp)], colnames(MT)))
4.4.2 Duplicated values (per row)
(DUPED <- data.table(
A = c("A1", "A2", "B3", "A4"),
B = c("B1", "B2", "B3", "B4"),
C = c("A1", "C2", "D3", "C4"),
D = c("A1", "D2", "D3", "D4")
)
)
data.table [4 x 4]
mutate(DUPED, Repeats = apply(
pick(everything()), 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", ")
)
)
data.table [4 x 5]
With duplication counter:
dup_counts <- function(v) {
rles <- as.data.table(unclass(rle(v[which(duplicated(v))])))[, lengths := lengths + 1]
paste(apply(rles, 1, \(r) paste0(r[2], " (", r[1], ")")), collapse = ", ")
}
DUPED |> mutate(Repeats = apply(pick(everything()), 1, \(r) dup_counts(r)))
data.table [4 x 5]
DUPED[, Repeats := apply(.SD, 1, \(r) dup_counts(r))][]
data.table [4 x 5]
4.5 Expand & Complete
Here, we are missing an entry for person B on year 2010, which we want to fill:
(CAR <- data.table(
year = c(2010,2011,2012,2013,2014,2015,2011,2012,2013,2014,2015),
person = c("A","A","A","A","A","A", "B","B","B","B","B"),
car = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes", "Citroen","Citroen", "Citroen", "Toyota", "Toyota")
)
)
data.table [11 x 3]
4.5.1 Expand
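Minimal sketches, generating all year × person combinations:
CAR |> tidyr::expand(year, person)
CJ(year = unique(CAR$year), person = unique(CAR$person))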
4.5.2 Complete
Joins the original dataset with the expanded one:
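Minimal sketches:
CAR |> complete(year, person)
CAR[CJ(year = unique(year), person = unique(person)), on = .(year, person)]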
4.6 Uncount
Duplicating aggregated rows to get back the un-aggregated version.
Data
cols <- c("Mild", "Moderate", "Severe")
dat_agg
data.frame [10 x 6]
dat_agg |>
pivot_longer(cols = all_of(cols), names_to = "Severity", values_to = "Count") |>
uncount(Count) |>
mutate(ID_new = row_number(), .after = "ID") |>
pivot_wider(
names_from = "Severity", values_from = "Severity",
values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill = 0
)
data.frame [23 x 7]
[ omitted 8 entries ] |
Solution 1:
(melt(DAT_AGG, measure.vars = cols, variable.name = "Severity", value.name = "Count")
[rep(1:.N, Count)][, ID_new := .I]
|> dcast(... ~ Severity, value.var = "Severity", fun.agg = \(x) ifelse(is.na(x), 0, 1), fill = 0)
|> _[, -"Count"]
)
data.table [23 x 7]
[ omitted 8 entries ] |
Solution 2:
4.7 List / Unlist
When a column contains a simple vector/list of values (of the same type, without structure)
4.7.1 One listed column
Single ID (grouping) column:
Data:
MT_LIST
data.table [3 x 2]
mt_list |> unnest(cols = mpg)
data.frame [32 x 2]
[ omitted 17 entries ] |
MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]
data.table [32 x 2]
[ omitted 17 entries ] |
Alternative that bypasses the need for grouping when unlisting, by growing the data.table back to its original number of rows before unlisting:
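A sketch of that idiom:
MT_LIST[rep(seq_len(.N), lengths(mpg))][, mpg := unlist(MT_LIST$mpg)][]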
Multiple ID (grouping) columns:
Data:
mt_list2
data.frame [8 x 3]
mt_list2 |> unnest(cols = mpg) # group_by(cyl, gear) is optional
data.frame [32 x 3]
[ omitted 17 entries ] |
4.7.2 Multiple listed columns
Data:
mt_list_mult
data.frame [8 x 4]
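A sketch, assuming the listed columns are named mpg and disp:
mt_list_mult |> unnest(cols = c(mpg, disp))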
4.8 Nest / Unnest
When a column contains a data.table/data.frame (with multiple columns, structured)
4.8.1 One nested column
Nesting
mtcars |> tidyr::nest(data = -cyl) # Data is inside tibbles
data.frame [3 x 2]
Nesting while keeping the grouping variable inside the nested tables:
mtcars |> tidyr::nest(data = everything(), .by = cyl)
data.frame [3 x 2]
MT[, .(data = .(.SD)), keyby = cyl]
data.table [3 x 2]
Nesting while keeping the grouping variable inside the nested tables:
MT[, .(data = list(data.table(cyl, .SD))), keyby = cyl]
data.table [3 x 2]
Unnesting
Data:
mtcars_nest <- mtcars |> tidyr::nest(data = -cyl)
MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl]
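Minimal unnesting sketches:
mtcars_nest |> unnest(data)
MT_NEST[, data[[1]], keyby = cyl]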
4.8.2 Multiple nested columns
Nesting:
Unnesting:
Using a pattern to specify the columns to unnest:
4.8.3 Operate on nested/list columns
Data:
mt_nest
data.frame [3 x 2]
Creating a new column using the nested data:
Keeping the nested column:
Dropping the nested column:
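A sketch (hypothetical n column counting the rows of each nested table):
mt_nest |> mutate(n = map_int(data, nrow))         # Keeping the nested column
mt_nest |> transmute(cyl, n = map_int(data, nrow)) # Dropping the nested column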
Creating multiple new columns using the nested data:
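The linreg() helper isn't reproduced here; the example below assumes something like a tidied linear model (which would yield the 2 rows per cyl group seen in the output):
linreg <- \(d) broom::tidy(lm(mpg ~ wt, data = d))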
mt_nest |> group_by(cyl) |> group_modify(\(d, g) linreg(unnest(d, everything()))) |> ungroup()
data.frame [6 x 6]
Operating inside the nested data:
mt_nest |>
mutate(data = map(data, \(t) mutate(t, sum = pmap_dbl(pick(everything()), sum)))) |>
unnest(data)
data.frame [32 x 12]
[ omitted 17 entries ] |
Using the nplyr package:
library(nplyr)
mt_nest |>
nplyr::nest_mutate(data, sum = apply(pick(everything()), 1, sum)) |>
unnest(data)
4.9 Rotate / Transpose
Using pivots:
MT_SUMMARY |>
pivot_longer(!cyl, names_to = "Statistic") |>
pivot_wider(id_cols = "Statistic", names_from = "cyl", names_prefix = "Cyl ")
data.frame [6 x 4]
With dedicated functions:
# No function exists to do this AFAIK
data.table::transpose(MT_SUMMARY, keep.names = "Statistic", make.names = 1)
data.table [6 x 4]
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os Ubuntu 22.04.3 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz Europe/Paris
date 2024-02-07
pandoc 3.1.11
Quarto 1.5.9
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P broom * 1.0.5 2023-06-09 [?] CRAN (R 4.3.0)
P crayon * 1.5.2 2022-09-29 [?] CRAN (R 4.3.0)
P data.table * 1.15.0 2024-01-30 [?] CRAN (R 4.3.1)
P dplyr * 1.1.4 2023-11-17 [?] CRAN (R 4.3.1)
P ggplot2 * 3.4.4 2023-10-12 [?] CRAN (R 4.3.1)
P gt * 0.10.0 2023-10-07 [?] CRAN (R 4.3.1)
P here * 1.0.1 2020-12-13 [?] CRAN (R 4.3.0)
P knitr * 1.44 2023-09-11 [?] CRAN (R 4.3.0)
P lubridate * 1.9.3 2023-09-27 [?] CRAN (R 4.3.1)
P pipebind * 0.1.2 2023-08-30 [?] CRAN (R 4.3.0)
P purrr * 1.0.2 2023-08-10 [?] CRAN (R 4.3.0)
P stringr * 1.5.0 2022-12-02 [?] CRAN (R 4.3.0)
P tibble * 3.2.1 2023-03-20 [?] CRAN (R 4.3.0)
P tidyr * 1.3.0 2023-01-24 [?] CRAN (R 4.3.0)
[1] /home/mar/Dev/Projects/R/ma-riviere.com/renv/library/R-4.3/x86_64-pc-linux-gnu
[2] /home/mar/.cache/R/renv/sandbox/R-4.3/x86_64-pc-linux-gnu/9a444a72
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
Citation
@online{rivière2022,
author = {Rivière, Marc-Aurèle},
title = {Data Wrangling with Data.table and the {Tidyverse}},
date = {2022-05-19},
url = {https://ma-riviere.com/content/code/posts/data.table},
langid = {en},
abstract = {This post showcases various ways to accomplish most data
wrangling operations, from basic filtering/mutating to pivots and
non-equi joins, with both `data.table` and the Tidyverse (`dplyr`,
`tidyr`, `purrr`, `stringr`).}
}