How to calculate confidence intervals in SQL
Statistical overconfidence: Dangerous and easy
Imagine you have a small online business. This month 200 users signed up on your website, and 10 of them bought your $800 service. Great! You’ve made $8k of income. How much should you expect to make this year?
The straightforward answer is $8k * 12 = $96k. But how confident should you be? Will your conversion rate always be so close to 5%? You could pad the estimate ±20% for safety, guessing at $77k to $115k. If $77k would cover all your expenses, should you feel secure?
This is a question of binomial probability. Using our favorite binomial confidence interval calculator, the 95% confidence interval for your conversion rate is about 2.5% to 9%.
With a confidence interval that wide, you should expect to make somewhere between $48k and $172k (2.5% and 9% of 200 sign-ups, times $800, times 12 months). Yikes! You could end up with half of your simple guess, and that’s if your business doesn’t change.
Automating statistics: Calculating confidence intervals in SQL
These confidence intervals are very informative, but turning to a calculator for every metric is tedious. If you’ve got hundreds of metrics across dozens of dashboards, it’s downright unsustainable.
Fortunately, the math for calculating a confidence interval is simple to implement. The normal approximation interval formula for binomial confidence intervals is:
n = number of users
x = number of conversions
p = probability of conversion = (x / n)
se = standard error of p = sqrt((p * (1 - p)) / n)
confidence interval = p ± (1.96 * se)
See Normal approximation interval on Wikipedia. Note that the 1.96 constant specifies a 95% interval on a two-tailed normal distribution.
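To see the formula in action before wiring it up to a table, here is a quick sanity check against the numbers from the introduction (200 users, 10 conversions), with the values hard-coded as literals; the subquery aliases vals and stats are just illustrative:
-- Normal approximation interval for 10 conversions out of 200 users
select
  p - 1.96 * se as low,
  p as mid,
  p + 1.96 * se as high
from (
  select p, sqrt(p * (1 - p) / n) as se
  from (select 200.0 as n, 10 / 200.0 as p) vals
) stats;
-- low ≈ 0.02, mid = 0.05, high ≈ 0.08
The high end here comes out around 8% rather than the 9% quoted in the introduction; the reason for that difference is explained further down.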
Implementing the formula in SQL
Let’s start with a query that returns, for each month, the total number of users and how many of them converted. Any data that represents a rate (conversions per user, server errors per request, etc.) will also work.
select
  count(1) as n,
  sum(case when converted then 1 else 0 end) as x
from users
group by date_trunc('month', created_at);
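If you want each row labeled with the month it covers, one small variation (assuming the same users table with created_at and converted columns) is to select the truncated date as well:
select
  date_trunc('month', created_at) as month,
  count(1) as n,
  sum(case when converted then 1 else 0 end) as x
from users
group by date_trunc('month', created_at)
order by month;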
With our basic data in hand, we want to implement the above formula in SQL. To keep things clear, we compute each step of the calculation in its own subquery:
- Calculate the conversion rate, p.
- Using p, calculate the standard error, se.
- Compute the low and high ends of the confidence interval.
- Include the original p conversion rate as our mid estimate.
select
  intervals.n as users,
  intervals.x as conversions,
  p - se * 1.96 as low,
  intervals.p as mid,
  p + se * 1.96 as high
from (
  select
    rates.*,
    sqrt(p * (1 - p) / n) as se -- calculate se
  from (
    select
      conversions.*,
      x / n::float as p -- calculate p
    from (
      -- Our conversion rate table from above
      select
        count(1) as n,
        sum(case when converted then 1 else 0 end) as x
      from users
      group by date_trunc('month', created_at)
    ) conversions
  ) rates
) intervals;
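If the nesting above is hard to follow, the same calculation can also be written with common table expressions. This is purely a restructuring of the query above; the logic is unchanged:
with conversions as (
  -- Our conversion rate table from above
  select
    count(1) as n,
    sum(case when converted then 1 else 0 end) as x
  from users
  group by date_trunc('month', created_at)
), rates as (
  select conversions.*, x / n::float as p -- calculate p
  from conversions
), intervals as (
  select rates.*, sqrt(p * (1 - p) / n) as se -- calculate se
  from rates
)
select
  n as users,
  x as conversions,
  p - se * 1.96 as low,
  p as mid,
  p + se * 1.96 as high
from intervals;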
You might be wondering why we’re seeing 8% on the high end, rather than the 9% mentioned in the introduction. We used the Adjusted Wald method in the introduction, which produces more accurate estimates for small amounts of data.
A refinement for little data: The Adjusted Wald method
The math explained above, though quite accurate with hundreds of users and a healthy conversion rate, becomes increasingly biased with less data or extremely high or low rates. A rule of thumb is to avoid using it with fewer than 5 conversions or 100 users.
One way to adjust for these shortcomings is to use a more robust binomial proportion confidence interval technique like the Adjusted Wald method. In short, it adds a bit of fuzziness to the estimated probability to smooth out the extremely high or low rates which are more common with few data points.
Given the z-score needed to reach a certain confidence level (1.96 for a 95% confidence), add 0.5 * z^2 to the number of conversions, and z^2 to the number of users. This is roughly +2 and +4 for the 1.96 z-score for 95%.
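Written out in the same style as the formula above, the adjusted estimate that the query below uses is:
z = 1.96 (the z-score for a 95% interval, so z^2 ≈ 3.84)
p = adjusted probability of conversion = (x + 0.5 * z^2) / (n + z^2) = (x + 1.92) / (n + 3.84)
se and the confidence interval are then computed exactly as before, using the adjusted p.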
select
  intervals.n as users,
  intervals.x as conversions,
  p - se * 1.96 as low,
  intervals.p as mid,
  p + se * 1.96 as high
from (
  select
    rates.*,
    sqrt(p * (1 - p) / n) as se -- calculate se
  from (
    select
      conversions.*,
      (x + 1.92) / (n + 3.84)::float as p -- calculate p
    from (
      -- Our conversion rate table from above
      select
        count(1) as n,
        sum(case when converted then 1 else 0 end) as x
      from users
      group by date_trunc('month', created_at)
    ) conversions
  ) rates
) intervals;
The important adjustment is here, where we add the constants to the numerator and denominator when calculating p:
(x + 1.92) / (n + 3.84)::float as p -- calculate p
This isn’t a magical solution to having too little data: if you have an expected 1% conversion rate and only 100 users, this adjustment will roughly triple the estimated conversion rate, giving you a confidence interval of 0-6%. More data is the answer. At 10 conversions and 1,000 users, the interval shrinks to 0.5% to 1.9%.
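You can reproduce those numbers with the adjusted formula and hard-coded values. Here is a quick sketch that uses a values list in place of the users table; the clamp on the low end is added here because the raw formula can dip slightly below zero:
-- Adjusted Wald intervals for 1 of 100 and 10 of 1,000 conversions
select
  n as users,
  x as conversions,
  greatest(p - se * 1.96, 0) as low, -- clamp at zero: a rate can't be negative
  p as mid,
  p + se * 1.96 as high
from (
  select rates.*, sqrt(p * (1 - p) / n) as se
  from (
    select t.*, (x + 1.92) / (n + 3.84) as p
    from (values (100.0, 1.0), (1000.0, 10.0)) as t(n, x)
  ) rates
) intervals;
-- roughly 0% to 6% for 1/100, and 0.5% to 1.9% for 10/1000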
In general, the more data you have, the more statistical approaches like these will be helpful to you.
Who are we?
We build a tool that makes data analysis on large SQL databases fast and easy. You could use our Snippets feature to implement this logic once, and apply it to any dataset.
If you have a database with many millions or billions of rows, and running hundreds of analyses is getting slow and cumbersome, we think you’ll really love it. Sign up for a free demo. We can also set you up with a free trial on the same day!