Getting started with pre-aggregations
Often at the beginning of an analytical application’s lifecycle, when queries execute over a smaller dataset, the application works well and delivers responses within acceptable thresholds. However, as the dataset grows, time-to-response from a user’s perspective often suffers heavily. This is true of both application databases and purpose-built data warehousing solutions. This leaves us with a chicken-and-egg problem: application databases can deliver low-latency responses for small-to-large datasets but struggle with massive analytical datasets, while data warehousing solutions usually make no guarantees beyond eventually delivering a response, which means latency can vary wildly from query to query.

| Database Type | Low Latency? | Massive Datasets? |
|---|---|---|
| Application (Postgres/MySQL) | ✅ | ❌ |
| Analytical (BigQuery/Redshift) | ❌ | ✅ |
Pre-Aggregations without Time Dimension
To illustrate pre-aggregations with an example, let’s use a sample e-commerce database. We have a data model representing all our orders:
| id | status | completed_at |
|---|---|---|
| 1 | completed | 2021-02-15T12:21:11.290 |
| 2 | completed | 2021-02-25T18:15:12.369 |
| 3 | shipped | 2021-03-15T20:40:57.404 |
| 4 | processing | 2021-03-13T10:30:21.360 |
| 5 | completed | 2021-03-10T18:25:32.109 |
orders cube:
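The cube definition itself isn’t shown above; a minimal sketch in Cube’s JavaScript data model (the measure, dimension, and pre-aggregation names here are illustrative) might look like:

```javascript
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    status: {
      sql: `status`,
      type: `string`,
    },

    completedAt: {
      sql: `completed_at`,
      type: `time`,
    },
  },

  preAggregations: {
    // Rolls the orders table up to one row per status.
    // No time dimension is included, so the rollup has no granularity.
    ordersByStatus: {
      measures: [CUBE.count],
      dimensions: [CUBE.status],
    },
  },
});
```

Because the pre-aggregation only includes the `count` measure and the `status` dimension, it condenses the orders table down to one row per distinct status.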
Pre-Aggregations with Time Dimension
Using the same data model as before, we now find that users frequently query for the number of orders completed per day, and that this query performs poorly. This query might look something like the following, against our orders cube:
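The query and the updated cube definition aren’t shown above; the following are sketches, assuming Cube’s JSON query format and JavaScript data model (member names are illustrative, and the query is shown with month granularity to match the pre-aggregation described below):

```json
{
  "measures": ["Orders.count"],
  "timeDimensions": [
    {
      "dimension": "Orders.completedAt",
      "granularity": "month"
    }
  ]
}
```

And the orders cube, extended with a pre-aggregation that includes the time dimension:

```javascript
cube(`Orders`, {
  sql: `SELECT * FROM orders`,

  measures: {
    count: {
      type: `count`,
    },
  },

  dimensions: {
    completedAt: {
      sql: `completed_at`,
      type: `time`,
    },
  },

  preAggregations: {
    // Includes the completedAt time dimension with month granularity,
    // so Cube stores one aggregated row per month.
    ordersByCompletedAt: {
      measures: [CUBE.count],
      timeDimension: CUBE.completedAt,
      granularity: `month`,
    },
  },
});
```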
We can improve this by adding a granularity property with a value of month to this definition. This allows Cube to aggregate the dataset to a single entry for each month.
The next time the API receives the same JSON query, Cube will build the pre-aggregated dataset (if it doesn’t already exist), store it in the source database, and use it for any subsequent matching queries. A sample of the data in this pre-aggregated dataset might look like:
| completed_at | count |
|---|---|
| 2021-02-01T00:00:00.000 | 2 |
| 2021-03-01T00:00:00.000 | 3 |
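Conceptually, this rollup corresponds to a SQL query along the following lines; this is a sketch, and the exact SQL Cube generates depends on the source database:

```sql
-- Aggregate orders to one row per month of completed_at.
-- Note it counts all orders by month, regardless of status,
-- which matches the sample rollup above (2 for February, 3 for March).
SELECT
  date_trunc('month', completed_at) AS completed_at,
  count(*) AS count
FROM orders
GROUP BY 1
ORDER BY 1;
```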