Joins between cubes
Joins create relationships between cubes in the data model. They allow to build complex queries that involve members from multiple cubes. They also allow to reference members from other cubes in calculated members, views, and pre-aggregations. When defining joins, it’s important to understand join types and the direction of joins as well as how join paths and join hints are used to work with the joined cubes.Join types
Cube supports three types of join relationships often found in SQL databases:one_to_one, one_to_many, and many_to_one.
For example, let’s take two cubes, customers and orders:
customers cube:
one_to_many relationship and its’
resulting SQL, any guest checkouts would be excluded from the results. To remedy
this, we’ll remove the join from the customers cube and instead define a join
with a many_to_one relationship on the orders cube:
orders cube defines the relationship between
itself and the customer cube. The same JSON query now results in the following
SQL query:
orders, and customers is in
the LEFT JOIN clause; this means any orders without a customer will also be
retrieved.
Many-to-many joins
A many-to-many relationship occurs when multiple records in a cube are associated with multiple records in another cube. For example, let’s say we have two cubes,topics and posts, pointing to the
topics and posts tables in our database, respectively. A post can have
more than one topic, and a topic may have more than one post.
In a database, you would most likely have an associative table (also known as a
junction table or cross-reference table). In our example, this table name might
be post_topics.
The diagram below shows the tables posts, topics, post_topics, and their
relationships.
post_topics table was specifically created to handle this
association in the database, we need to create an associative cube
post_topics, and declare the relationships from it to topics cube and from
posts to post_topics.
The following example uses the
many_to_one relationship on the post_topics
cube; this causes the direction of joins to be posts → post_topics → topics.
Read more about the direction of joins.emails and
transactions. The goal is to calculate the amount of transactions per
campaign. Both emails and transactions have a campaign_id column. We don’t
have a campaigns table, but data about campaigns is part of the emails table.
Let’s take a look at the emails cube first:
campaigns cube:
Campaigns cube:
one_to_many
relationship on the associative cube,
campaigns in our case.
Join tree
When Cube analyzes a query, it builds a join tree that connects all cubes involved in the query in compliance with the direction of joins. If the join tree cannot be built, the query will fail to execute. The presence of bidirectional joins or diamond subgraphs can complicate the join tree structure.Direction of joins
All joins in Cube’s data model are directed. They flow from the source cube (the one where the join is defined) to the target cube (the one referenced in the join). Cube will respect the join graph when generating SQL queries. It means that source cubes will be on the left side ofJOIN clauses and target cubes will be on the right
side of JOIN clauses. Consider the following data model, consisting of cubes left
and right:
left to right. If you query for left.value
and right.value, Cube will generate the following SQL query. As you can see, left is
on the left side of the JOIN clause, and right is on the right side:
left cube and uncomment the join
definition in the right cube, Cube will generate the following SQL for the same query:
Bidirectional joins
As a rule of thumb, it’s not recommended to define bidirectional joins in the data model (i.e., having both cubes define a join to each other) by default. However, it can still be useful for some valid analytical use cases. Consider the following data model withorders and customers for an e-commerce that
has both registered and guest customers (they have NULL as customer_id):
customers.name and orders.order_count will produce the following result:
customers cube is on the left side of the JOIN clause, since the direction of
joins is from customers to orders. This means that the query will only return
registered customers and all orders by guest customers will be excluded.
Now, if you uncomment the join definition in the orders cube and comment out the
join definition in the customers cube, running the same query will produce the following
result:
Eve). Check the
generated SQL query, which reveals why:
Diamond subgraphs
A diamond subgraph is a specific type of join structure where there’s more than one join path between two cubes, e.g.,users.schools.countries and
users.employers.countries. Join structures like a.b.c + a.c or a.b.c.d + a.b.d
are also be considered diamond subgraphs for the purpose of this section.
In the following example, four cubes are joined together as a diamond: a joins to b
and c, and both b and c join to d:
a.d_via_b, Cube will generate the following SQL query, joining through
b:
a.d_via_c, Cube will generate the following SQL query, joining
through c:
Join paths
Join paths serve as the recommended mechanism to remove the ambiguity of joins in the data model, including cases of bidirectional joins and diamond subgraphs. A join path is defined as a sequence of cube names, separated by dots, that will be followed in that specific order when resolving a cube or its member, e.g.,users.locations.countries.flag_aspect_ratio.
Calculated members
When referencing a member of another cube in a calculated member, you can use a join path to specify the path to follow, as demonstrated in the diamond subgraphs example.Views
When referencing cubes in a view, you literally provide join paths via thejoin_path parameter. The bidirectional joins
example can be disambiguated with the following views:
Pre-aggregations
When referencing members of another cubes in a pre-aggregation, you can also use join paths, as shown in the following example:a is joined to c though b here, even though a
direct join from a to c is also defined in the data model. However, Cube respects
join paths from the pre-aggregation definition and uses them to generate the SQL query.
Join hints
Join hints serve as an auxiliary mechanism to remove the ambiguity of joins at the query time. When possible, use join paths in the data model instead.SQL API
In queries to the SQL API, cubes can be joined via theCROSS JOIN clause or via __cubeJoinField. In any case, Cube will analyze the
query and follow provided join hints.
Let’s run the following query with the data model from the diamond subgraphs
example:
CROSS JOIN c instead of CROSS JOIN b, then the
generated SQL query would contain a join through c instead of b.
REST API
In queries to the REST API, join hints can be provided via thejoinHints parameter.
The SQL API query from the previous section can be rewritten as a REST API query
with join hints as follows:
joinHints parameter contained [["a", "c"], ["c", "d"]] instead of
[["a", "b"], ["b", "d"]], then the generated SQL query would contain a join through
c instead of b.
Troubleshooting
Can't find join path
Sometimes, you might come across the following error message: Can't find join path to join 'cube_a', 'cube_b'.
It indicates that a query failed because it includes members from cubes that can’t be
joined in order to generate a valid query to the upstream data source.
Please check that you’ve defined necessary joins and that they have correct
directions.
Also, please consider using views since they
incapsulate join paths and completely remove the possibility of the error in question.
You might also consider setting the public parameter to false
on your cubes to hide them from end users.
If you’re building a custom data application, you might use the meta endpoint
of the REST API. It groups cubes into connectedComponents to help select those ones
that can be joined together.
Primary key is required when join is defined
Sometimes, you might come across the following error message: cube_a cube: primary key for 'cube_a' is required when join is defined in order to make aggregates work properly.
It indicates that you have a cube with joins and pre-aggregations.
However, that cube doesn’t have a primary key.
When generating SQL queries, Cube uses primary keys to avoid fanouts. A fanout happens
when two tables are joined and a single value gets duplicated in the end result, meaning
that some values can be double counted.
Please define a primary key dimension in this cube to make joins and
pre-aggregations work correctly.
If your data doesn’t have a natural primary key, e.g., id, you can define a composite
primary key by concatenating most or all of the columns in the table. Example:
Transitive join pitfalls
Let’s consider an example where we have a many-to-many relationship betweenusers and companies through the companies_to_users cube:
users.count as a measure and companies.name
as a dimension would yield the following error: Can't find join path to join 'users', 'companies'.
The root cause is that joins are directed and there’s no
join path that goes by join definitions in the data model from users to
companies or in the opposite direction.
In Visual Modeler, you can see that both users and companies
are to the right of companies_to_users, meaning that there’s no way to go
from users to companies moving left to right or right to left:
companies_to_users
cube to either users or companies cube. Please note that it would affect
the query semantics and thus the final result:
users to companies_to_users to companies. Visual Modeler
also shows that you can reach companies from cubes by going left to right.