
Cumulative percentiles in PostgreSQL

17 Sep 2014

The Prison Visits Booking service uses process metrics to determine how well each prison is doing in terms of the time it takes to process a booking. There’s a longer back story behind this which I’d be happy to share over a pint, but for the purpose of this short write-up, let’s just say it’s a form-to-email system that submits visit requests directly to a mailbox within a prison.

Given that we had a system in place to submit visit requests, it was only reasonable to also have a system that would let us track how long it takes to process a visit. This can be used for a number of things, including identifying establishments with problems of various kinds, which can then be followed up on.

We needed a reasonably good metric to determine how well prisons were doing, and it was pretty clear that an average time would not cut it. This was due to all the usual problems with averages - they are heavily skewed by outliers and say little about the shape of the underlying distribution.
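
To see why, here’s a minimal sketch with made-up numbers (times in seconds): four visits processed in around five minutes, and one that sat in a mailbox for a day.

> SELECT avg(t)
  FROM unnest(ARRAY[300, 310, 320, 330, 86400]) AS t;

The average comes out at 17,532 seconds - almost five hours - even though four out of five visits were handled in about five minutes. A percentile, by contrast, tells us the time within which a given share of visits was actually processed.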

While I was an intern at Google, we used the 99th, 99.95th and 99.98th percentiles of request processing times and benchmarked our performance based on those metrics. Given the amount of data we had, this was very useful - we’d have a figure describing the processing time for most requests, ignoring a small number of outliers that could have surfaced due to deployment delays, caches not being warmed, data being moved from/to virtual memory, etc.

It only made sense to use a similar approach here.

Given that we already had processing times for each visit, we could easily generate a summary using PostgreSQL. First, we retrieve a list of all visits for a particular establishment, which yields a table similar to the one below:

> SELECT prison_name, end_to_end_time
  FROM visit_metrics_entries
  WHERE prison_name = 'Alcatraz'
        AND end_to_end_time IS NOT NULL;

 prison_name | end_to_end_time
-------------+-----------------
 Alcatraz    |             284
 Alcatraz    |             301
 Alcatraz    |             321
 Alcatraz    |             321
 Alcatraz    |             342
 Alcatraz    |             389
 Alcatraz    |             434

Let us add another column indicating the cumulative percentile rank for each row. The cume_dist() window function returns, for each row, the fraction of rows with a value less than or equal to its own.

> SELECT prison_name, end_to_end_time, cume_dist()
         OVER (ORDER BY end_to_end_time)
  FROM visit_metrics_entries
  WHERE prison_name = 'Alcatraz'
  AND end_to_end_time IS NOT NULL;

 prison_name | end_to_end_time |      cume_dist
-------------+-----------------+---------------------
 Alcatraz    |             284 | 0.00268817204301075
 Alcatraz    |             301 | 0.00537634408602151
 Alcatraz    |             321 |   0.010752688172043
 Alcatraz    |             321 |   0.010752688172043
 Alcatraz    |             342 |  0.0134408602150538
 Alcatraz    |             389 |  0.0161290322580645
 Alcatraz    |             434 |  0.0188172043010753
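
As an aside, nothing here is specific to a single prison - a hypothetical variant using PARTITION BY (a sketch against the same table) would compute the ranking for every establishment in one pass:

> SELECT prison_name, end_to_end_time, cume_dist()
         OVER (PARTITION BY prison_name ORDER BY end_to_end_time)
  FROM visit_metrics_entries
  WHERE end_to_end_time IS NOT NULL;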

Since we’re interested in the 95th percentile, let us limit ourselves solely to rows which have cume_dist >= 0.95 and take the first of those. For this to happen, we need to issue a subquery (note the ORDER BY in the outer query - without it, PostgreSQL makes no guarantee about which qualifying row LIMIT 1 picks):

> SELECT prison_name, end_to_end_time, cume_dist FROM (
    SELECT prison_name, end_to_end_time, cume_dist()
           OVER (ORDER BY end_to_end_time)
    FROM visit_metrics_entries
    WHERE prison_name = 'Alcatraz'
    AND end_to_end_time IS NOT NULL
  ) AS foo
  WHERE cume_dist >= 0.95
  ORDER BY end_to_end_time
  LIMIT 1;

 prison_name | end_to_end_time |     cume_dist
-------------+-----------------+-------------------
 Alcatraz    |          214916 | 0.951612903225806
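
As an aside, PostgreSQL 9.4 (in beta at the time of writing) introduces ordered-set aggregates that compute this directly, so the subquery above collapses into a one-liner:

> SELECT percentile_disc(0.95) WITHIN GROUP (ORDER BY end_to_end_time)
  FROM visit_metrics_entries
  WHERE prison_name = 'Alcatraz'
  AND end_to_end_time IS NOT NULL;

percentile_disc returns an actual value from the set, matching the cume_dist approach above; percentile_cont would interpolate between neighbouring values instead.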

Need to plot frequency histograms over a set of values? Postgres will do that for you as well - try this:

> SELECT width_bucket(end_to_end_time, 0, 604800, 20), COUNT(*)
  FROM visit_metrics_entries
  WHERE end_to_end_time IS NOT NULL
  AND prison_name = 'Alcatraz'
  GROUP BY width_bucket
  ORDER BY width_bucket;

 width_bucket | count
--------------+-------
            1 |    30
            2 |    11
            3 |    21
            4 |    12
            5 |    11
            6 |    88
            7 |    51
            8 |    56
            9 |    80
           10 |    11
           11 |    23
           12 |    44
           13 |    23
           14 |    76
           15 |    52
           16 |     2
           17 |    42
           18 |    10
           19 |     4
           20 |    22
           21 |   280

This will generate 20 buckets between a lower bound of 0 and an upper bound of 604800 (one week, in seconds). Note that the buckets are 1-indexed, and values outside the range get their own overflow buckets: 0 for values below the lower bound and 21 for values above it - which is where the 280 visits that took longer than a week ended up. Turning this into a graph is very easy to do.
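
For axis labels, each regular bucket spans 604800 / 20 = 30240 seconds (8.4 hours). A hypothetical variant that attaches the lower bound of each bucket might look like this (the overflow bucket 21 ends up labelled with the upper bound itself):

> SELECT width_bucket(end_to_end_time, 0, 604800, 20) AS bucket,
         (width_bucket(end_to_end_time, 0, 604800, 20) - 1) * 30240
           AS lower_bound_seconds,
         COUNT(*)
  FROM visit_metrics_entries
  WHERE end_to_end_time IS NOT NULL
  AND prison_name = 'Alcatraz'
  GROUP BY bucket, lower_bound_seconds
  ORDER BY bucket;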
