Implementing data snapshots
Use case
For a dataset that contains a sequence of changes to a property over time, we want to be able to get the most recent state of said property at any given date. In this recipe, we’ll learn how to calculate snapshots of statuses at any given date for a cube withProduct Id, Status, and Changed At dimensions.
We can consider the status property to be a
slowly changing dimension
(SCD) of type 2. Modeling data with slowly changing dimensions is an essential
part of the data engineering skillset.
Data modeling
Let’s explore thestatuses cube that contains data like this:
shipped status at a particular date?
First, we need to generate a range with all dates of interest, from the earliest
to the latest. Second, we need to join the dates with the statuses and leave
only the most recent statuses to date.
To generate a range of dates, here we use the
GENERATE_SERIES function
which is Postgres-specific. Other databases have similar functions, e.g.,
GENERATE_DATE_ARRAY
in BigQuery.status_snapshots cube
extend the original statuses
cube in order to reuse the dimension definitions. We only need to add a new
dimension that indicates the date of a snapshot. We’re also referencing the
definition of the statuses cube with the
sql() property.
Query
To count orders that remained in theshipped status at a particular date, we
will send a query that selects a snapshot by this date and also filters by the
status:
Result
If we execute a couple of such queries for distinct dates, we’ll spot the change:Source code
Please feel free to check out the full source code or run it with thedocker-compose up command. You’ll see the result, including
queried data, in the console.