aggregate data for density plots

31 Mar 2018

When producing descriptive stats it’s quite useful to draw some histograms. A nice approach is the density plot, which you can draw with ggplot2::geom_density; it is also useful to combine that with ggplot2::facet_grid to produce a grid of density plots.

When the data is not too big this is relatively easy to do, but if the data is large the usual process does not work because R can’t cope with that much data in memory.

Here I show a way of pre-aggregating the data so the plots can still be drawn.

I’ll start with an example that R, geom_density and facet_grid can handle.

library(dplyr)
library(ggplot2)
library(knitr)
simulator <- function(n) {
  # builds 2.5 * n rows: x is a mixture of three normals,
  # y records which component each value came from,
  # z and w are random factors used later for faceting
  t <- tibble(
    x = c(rnorm(n = n/2, 0, 1), rnorm(n = n, 2, 2), rnorm(n = n, 2, 3)),
    y = as.factor(c(rep("A", n/2), rep("B", n), rep("C", n))),
    z = as.factor(rbinom(n * 2.5, size = 1, prob = 0.5)),
    w = as.factor(sample(1:5, replace = TRUE, prob = rep(0.2, 5), size = n * 2.5))
  )
  return(t)
}

t <- simulator(1e5)
glimpse(t)
## Observations: 250,000
## Variables: 4
## $ x <dbl> 0.34173925, 2.09904014, -0.03973865, 0.45381049, 2.14291862,...
## $ y <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, A, ...
## $ z <fct> 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, ...
## $ w <fct> 4, 4, 2, 5, 2, 1, 4, 5, 4, 2, 1, 5, 3, 1, 4, 2, 5, 5, 2, 4, ...
summary(t)
##        x             y          z          w        
##  Min.   :-10.57888   A: 50000   0:124881   1:50406  
##  1st Qu.: -0.08537   B:100000   1:125119   2:50048  
##  Median :  1.37243   C:100000              3:49551  
##  Mean   :  1.60387                         4:50004  
##  3rd Qu.:  3.19154                         5:49991  
##  Max.   : 14.92003
p <- 
  t %>% 
    ggplot(aes(x = x, colour = y, fill = y)) +
      geom_density(alpha = 0.2, adjust = 1) +
      xlim(c(-10, 10)) +
      theme_bw() +
      facet_grid(w ~ z,
                 margins = TRUE,
                 as.table = TRUE)
p

And this is what the plot looks like:

[Figure: density of x by y, faceted on w ~ z with margins]

Now let’s try to run the same process, but with 25 million records.

t <- simulator(1e7)

t %>% 
  ggplot(aes(x = x, colour = y, fill = y)) +
    geom_density(alpha = 0.2, adjust = 1) +
    xlim(c(-10, 10)) +
    theme_bw() +
    facet_grid(w ~ z,
               margins = TRUE,
               as.table = TRUE)

R chokes, and we get a message similar to this:

Error: cannot allocate vector of size 762.9 Mb
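As a rough back-of-the-envelope check (my numbers, not from the original post), the raw tibble alone is already around half a gigabyte, and the density estimation plus faceting needs working copies on top of that:

# Rough memory footprint of the raw data (base R)
# 2.5e7 rows * (8 bytes for the double + 3 * 4 bytes for the
# integer-backed factors) is roughly 500 Mb
format(object.size(t), units = "Mb")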

A solution to that is to aggregate the data before drawing the plots. Of course, there will be some information loss, but that should be fine if the aggregation is done properly.

To make sure that I was doing it right I started with a simple density plot. No groups.

To get an aggregated set of values I took the mean within each decile.

In this case I used just 10,000 records, enough to check that the different options capture the differences.

nt <- 
  t %>% 
    # assign each observation to its decile of x
    mutate(parts = case_when(
      x < quantile(x, probs = 0.1) ~ 1,
      x < quantile(x, probs = 0.2) ~ 2,
      x < quantile(x, probs = 0.3) ~ 3,
      x < quantile(x, probs = 0.4) ~ 4,
      x < quantile(x, probs = 0.5) ~ 5,
      x < quantile(x, probs = 0.6) ~ 6,
      x < quantile(x, probs = 0.7) ~ 7,
      x < quantile(x, probs = 0.8) ~ 8,
      x < quantile(x, probs = 0.9) ~ 9,
      x <= quantile(x, probs = 1) ~ 10))
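As an aside, dplyr::ntile() expresses the same decile assignment in one line; its handling of ties at the bucket boundaries may differ slightly from the explicit quantile() comparisons above.

# Sketch: the same bucketing via dplyr::ntile()
t %>% mutate(parts = ntile(x, 10))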

agg <- 
  nt %>% 
    group_by(parts, y) %>%   # one mean per decile within each y group
    summarise(s = mean(x))
agg %>% 
  ggplot(aes(x = s)) +
    geom_density(alpha = 0.2,
                 adjust = 1,
                 fill = "red") +
    xlim(c(-2.5, 7.5)) +
    theme_bw()

[Figure: density of the decile means (red)]

Which is a bit different from the original, shown in blue below.

t %>% 
  ggplot(aes(x = x)) +
    geom_density(alpha = 0.2,
                 adjust = 2,
                 fill = "blue") +
    xlim(c(-2.5, 7.5)) +
    theme_bw()

[Figure: density of the original data (blue)]

The quantilisation code was ugly and difficult to adjust if I wanted more precision, which I thought I would.

I made two changes: first, I dropped the mean and took the quantile values directly; second, I built it in a way that made the number of quantiles easy to adjust.

qx <- tibble(q = quantile(t$x, probs = seq(0, 1, by = 0.0001)))
qx %>% 
  ggplot(aes(x = q)) +
    geom_density(alpha = 0.2,
                 adjust = 2,
                 fill = "green") +
    xlim(c(-2.5, 7.5)) +
    theme_bw()

[Figure: density of the 10,001 quantile values (green)]
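To make the resolution an explicit knob (this is my sketch, not code from the original post), the number of quantile points can be pulled out into a variable:

# n_q controls the precision of the aggregation;
# n_q = 1e4 reproduces probs = seq(0, 1, by = 0.0001) above
n_q <- 1e4
qx <- tibble(q = quantile(t$x, probs = seq(0, 1, length.out = n_q + 1)))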

This option was fast enough, so I expanded it to use one grouping variable.

library(tidyr)
library(broom)
library(purrr)

qx <- 
  t %>%
    nest(-y) %>%   # one nested data frame per level of y (tidyr < 1.0 syntax)
    # 10,001 quantiles of x within each group
    mutate(qtls = map(data, ~ quantile(.$x, probs = seq(0, 1, by = 0.0001)))) %>% 
    # broom::tidy() turns each named quantile vector into a data frame
    # with columns names and x, which unnest() stacks back together
    unnest(map(qtls, tidy))

qx %>% 
  ggplot(aes(x = x, colour = y, fill = y)) +
    geom_density(alpha = 0.2,
                 adjust = 2) +
    xlim(c(-2.5, 7.5)) +
    theme_bw()

[Figure: density of the per-group quantile values, coloured by y]

Which looks extremely similar to the original one:

t %>% 
  ggplot(aes(x = x, colour = y, fill = y)) +
    geom_density(alpha = 0.2,
                 adjust = 2) +
    xlim(c(-2.5, 7.5)) +
    theme_bw()

[Figure: density of the original data, coloured by y]
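Beyond eyeballing the curves, a quick numeric check (my addition) is to compare group means: the mean of evenly spaced quantiles approximates the mean of the underlying distribution, so the two should agree closely.

# Group means from the raw data and from the quantile summary
t %>% group_by(y) %>% summarise(raw_mean = mean(x))
qx %>% group_by(y) %>% summarise(agg_mean = mean(x))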

And adding in the facet_grid, it still runs fast and the plots look very similar to what they would look like with the original data.

qx <- 
  t %>%
    nest(-y,
         -w,
         -z) %>%   # one nested data frame per y/w/z combination
    mutate(Quantiles = map(data, ~ quantile(.$x, probs = seq(0, 1, by = 0.0001)))) %>% 
    unnest(map(Quantiles, tidy))

qx %>% 
  ggplot(aes(x = x, colour = y, fill = y)) +
    geom_density(alpha = 0.2, adjust = 1) +
    xlim(c(-10, 10)) +
    theme_bw() +
    facet_grid(w ~ z,
               margins = TRUE)

[Figure: faceted densities drawn from the aggregated data]
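The aggregated table is also tiny compared with the raw data; by my arithmetic (assuming every facet combination occurs), 3 levels of y times 5 of w times 2 of z gives 30 groups, each contributing 10,001 quantile points.

# Roughly 30 * 10001 = 300,030 rows in qx, versus 25 million in t
nrow(qx)
nrow(t)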

And the original, for comparison, was:

p

[Figure: faceted densities drawn from the original data]

As always, the Stack Overflow crowd provided 90% of the solution.

Send me any comments.