Blog – Marc-Aurèle Rivière

Fast spatial data matching in R

Exploring different solutions to match locations based on their geographical distance

DuckDB

Spatial

GIS

Big Data

SQL

This post showcases various solutions to efficiently match unknown locations to known ones by their geographical proximity (using lat/long coordinates), on a dataset with millions of entries.

Jun 18, 2022

42 min

MCMC for ‘Big Data’ with Stan

Faster sampling with CmdStan using within-chain parallelization

Statistics

Bayesian Modeling

Stan

Big Data

This post is an extension (and a translation to R) of PyMC-Labs’ benchmarking of MCMC for “Big Data”.

The Stan code was updated to use within-chain parallelization and compiler optimization for faster CPU sampling. Running purely on CPU, Stan reached sampling speeds similar to those of PyMC’s JAX + GPU solution.

Jun 5, 2022

10 min

Data wrangling with data.table and the Tidyverse

Common data wrangling operations with both data.table and the Tidyverse.

Tidyverse

data.table

This post showcases various ways to accomplish most data wrangling operations, from basic filtering/mutating to pivots and non-equi joins, with both data.table and the Tidyverse (dplyr, tidyr, purrr, stringr).

May 19, 2022

45 min

Bayesian Rock Climbing Rankings

With R and Stan

Statistics

Bayesian Modeling

Stan

This post is a port to R of Ethan Rosenthal’s blog post on modeling rock climbing route difficulty using a Bayesian IRT (Item Response Theory) model.

The original Stan code was updated to use within-chain parallelization and compiler optimization for faster CPU sampling.

Several data processing solutions are showcased, using either data.table or dbplyr (with a DuckDB backend), with timings to compare their speed.

Apr 19, 2022

16 min

Reuse

CC BY 4.0
