Calculating averages and percentiles
Use case
We want to understand the distribution of values for a certain numeric property within a dataset. We’re used to average values and intuitively understand how to calculate them. However, we also know that average values can be misleading for skewed distributions which are common in the real world: for example, 2.5 is the average value for both(1, 2, 3, 4)
and (0, 0, 0, 10).
So, it’s usually better to use
percentiles. Parameterized by a
fractional number n = 0..1, where the n-th percentile is equal to a value that
exceeds a specified ratio of values in the distribution. The
median is a special case: it’s defined
as the 50th percentile (n = 0.5), and it can be casually thought of as “the
middle” value. 2.5 and 0 are the medians of (1, 2, 3, 4) and (0, 0, 0, 10),
respectively.
Data modeling
Let’s explore the data in theusers cube that contains various demographic
information about users, including their age:
avg type.
Calculating the percentiles would require using database-specific functions.
However, almost every database has them under names of PERCENTILE_CONT and
PERCENTILE_DISC,
Postgres and
Snowflake
included. For BigQuery,
you’d need to use the APPROX_QUANTILES function.
Result
Using the measures defined above, we can explore statistics about the age of our users.Source code
Please feel free to check out the full source code or run it with thedocker-compose up command. You’ll see the result, including
queried data, in the console.