The Prison Visits Booking service uses process metrics to determine how well each prison is doing in terms of the time it takes to process a booking. There’s a longer back story behind this, which I’d be happy to share over a pint, but for the purposes of this short write-up, let’s just say it’s a form-to-email system that submits visit requests directly to a mailbox within a prison.
Given that we had a system in place to submit visit requests, it was only reasonable to also track how long each visit takes to process. This information can be put to a number of uses, including identifying establishments with problems of various kinds, which can then be followed up on.
We needed a reasonably good metric to determine how well prisons were doing, and it was pretty clear that an average would not cut it, for all the usual reasons: a mean says nothing about the shape of the underlying distribution, and a handful of extreme outliers can drag it far away from what a typical visitor actually experiences.
While I was an intern at Google, we used the 99th, 99.95th and 99.98th percentiles of request processing times and benchmarked our performance against them. Given the amount of data we had, this was very useful: we would get a figure describing the processing time for the overwhelming majority of requests, while ignoring the small number of outliers caused by deployment delays, cold caches, data being paged in and out of virtual memory, and so on.
It only made sense to use a similar approach here.
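To see just how badly a single outlier can skew a mean, here is a toy query with entirely made-up figures: three requests handled in around five minutes, and one that sat in a mailbox for a week.
> SELECT round(avg(t)) AS mean_seconds, max(t) AS slowest
FROM (VALUES (300), (310), (320), (604800)) AS samples(t); -- made-up figures, in seconds
 mean_seconds | slowest
--------------+---------
       151433 |  604800
A mean of nearly two days, even though three out of four requests finished in about five minutes; a high percentile describes the experience far more faithfully.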
Given that we already had processing times for each visit, we could easily generate a summary using PostgreSQL. First, we retrieve the list of all visits for a particular establishment, which yields a table similar to the one below (end_to_end_time is in seconds):
> SELECT prison_name, end_to_end_time
FROM visit_metrics_entries
WHERE prison_name = 'Alcatraz'
AND end_to_end_time IS NOT NULL;
prison_name | end_to_end_time
-------------+-----------------
Alcatraz | 284
Alcatraz | 301
Alcatraz | 321
Alcatraz | 321
Alcatraz | 342
Alcatraz | 389
Alcatraz | 434
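A quick aside on performance: if this summary is generated every time someone loads a dashboard, a composite index covering both the filter and the sort column keeps the scan cheap. A sketch, with a hypothetical index name, since the real schema may well have this already:
> CREATE INDEX visit_metrics_prison_time_idx -- hypothetical name
ON visit_metrics_entries (prison_name, end_to_end_time);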
Let us add another column indicating the cumulative distribution of each row, using the cume_dist() window function:
> SELECT prison_name, end_to_end_time, cume_dist()
OVER (ORDER BY end_to_end_time)
FROM visit_metrics_entries
WHERE prison_name = 'Alcatraz'
AND end_to_end_time IS NOT NULL;
prison_name | end_to_end_time | cume_dist
-------------+-----------------+---------------------
Alcatraz | 284 | 0.00268817204301075
Alcatraz | 301 | 0.00537634408602151
Alcatraz | 321 | 0.010752688172043
Alcatraz | 321 | 0.010752688172043
Alcatraz | 342 | 0.0134408602150538
Alcatraz | 389 | 0.0161290322580645
Alcatraz | 434 | 0.0188172043010753
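Notice that the two 321-second rows share a cume_dist value. That is exactly how cume_dist() is defined: the fraction of rows whose value is less than or equal to the current row’s. A toy query over made-up values makes the behaviour with ties obvious:
> SELECT t, cume_dist() OVER (ORDER BY t) -- tied values share a rank
FROM (VALUES (10), (20), (20), (40)) AS v(t);
 t  | cume_dist
----+-----------
 10 |      0.25
 20 |      0.75
 20 |      0.75
 40 |         1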
Since we’re interested in the 95th percentile, let us limit ourselves solely to rows with cume_dist >= 0.95. Window functions cannot be referenced in a WHERE clause, so we need to issue a subquery (the ORDER BY on the outer query ensures that LIMIT 1 deterministically picks the first row at or past the threshold):
> SELECT prison_name, end_to_end_time, cume_dist FROM (
SELECT prison_name, end_to_end_time, cume_dist()
OVER (ORDER BY end_to_end_time)
FROM visit_metrics_entries
WHERE prison_name = 'Alcatraz'
AND end_to_end_time IS NOT NULL
) AS foo WHERE cume_dist >= 0.95 ORDER BY end_to_end_time LIMIT 1;
prison_name | end_to_end_time | cume_dist
-------------+-----------------+-------------------
Alcatraz | 214916 | 0.951612903225806
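As an aside: if you’re on PostgreSQL 9.4 or newer, the ordered-set aggregate percentile_disc() collapses all of the above into a single query. A sketch against the same table:
> SELECT prison_name,
       -- the first value whose cumulative distribution reaches 0.95,
       -- i.e. the same figure the subquery above produces
       percentile_disc(0.95) WITHIN GROUP (ORDER BY end_to_end_time)
FROM visit_metrics_entries
WHERE prison_name = 'Alcatraz'
AND end_to_end_time IS NOT NULL
GROUP BY prison_name;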
Need to plot frequency histograms over a set of values? Postgres will do that for you as well, via width_bucket() - try this:
> SELECT width_bucket(end_to_end_time, 0, 604800, 20), COUNT(*)
FROM visit_metrics_entries
WHERE end_to_end_time IS NOT NULL
AND prison_name = 'Alcatraz'
GROUP BY width_bucket
ORDER BY width_bucket;
width_bucket | count
--------------+-------
1 | 30
2 | 11
3 | 21
4 | 12
5 | 11
6 | 88
7 | 51
8 | 56
9 | 80
10 | 11
11 | 23
12 | 44
13 | 23
14 | 76
15 | 52
16 | 2
17 | 42
18 | 10
19 | 4
20 | 22
21 | 280
This generates 20 buckets of equal width, with the lower bound being 0 and the upper being 604800 seconds (one week). Note that buckets are 1-indexed, and that width_bucket() assigns anything above the upper bound to bucket num_buckets + 1, which is where the 280 entries in bucket 21 come from: visits that took more than a week to process. Turning this into a graph is very easy to do.
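One gotcha when plotting: GROUP BY only emits buckets that actually contain rows, so sparse data leaves holes in the histogram. A minimal sketch that emits every bucket together with its lower bound in seconds, using generate_series() to fill the gaps:
> SELECT b.bucket,
       -- each bucket spans 604800 / 20 = 30240 seconds
       (b.bucket - 1) * (604800 / 20) AS lower_bound_seconds,
       COUNT(v.end_to_end_time) AS visits -- 0 for empty buckets
FROM generate_series(1, 21) AS b(bucket)
LEFT JOIN visit_metrics_entries v
  ON width_bucket(v.end_to_end_time, 0, 604800, 20) = b.bucket
  AND v.prison_name = 'Alcatraz'
  AND v.end_to_end_time IS NOT NULL
GROUP BY b.bucket
ORDER BY b.bucket;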