Running in production
Cube makes use of two different kinds of cache:

- In-memory storage of query results
- Pre-aggregations
Using Windows? We strongly recommend using WSL2 for Windows 10
to run the following commands.
Cube Store can further be configured via environment variables. To see a
complete reference, please consult the CUBESTORE_* environment variables in
the Environment Variables reference.

First, run Cube Store so that it is available on localhost (on
the default port 3030). Then, set CUBEJS_CUBESTORE_HOST to let Cube know
where Cube Store is running.
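For example, a minimal sketch with Docker (the cubejs/cubestore image and port 3030 are the defaults; adjust to your environment):

```shell
# Start Cube Store, exposing its default port 3030 on localhost
docker run -d -p 3030:3030 cubejs/cubestore

# In Cube's environment, point Cube at the running Cube Store instance
export CUBEJS_CUBESTORE_HOST=localhost
```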
You can also use Docker Compose to achieve the same:
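A minimal Docker Compose sketch might look like this (service names are illustrative):

```yaml
services:
  cube:
    image: cubejs/cube
    ports:
      - 4000:4000
    environment:
      # Resolve Cube Store by its Compose service name
      - CUBEJS_CUBESTORE_HOST=cubestore
    depends_on:
      - cubestore

  cubestore:
    image: cubejs/cubestore
```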
Architecture
Scaling
Scaling Cube Store for higher concurrency is relatively simple when running in cluster mode. Because the storage layer is decoupled from the query processing engine, you can horizontally scale your Cube Store cluster for as much concurrency as you require. In cluster mode, Cube Store runs two kinds of nodes:

- one or more router nodes, which handle incoming client connections, manage database metadata, and serve simple queries
- multiple worker nodes, which execute SQL queries
You can use the EXPLAIN and EXPLAIN ANALYZE SQL
commands to see how many partitions would be used in a specific Cube Store
query.
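For instance, assuming a pre-aggregation table named dev_pre_aggregations.orders_main exists in your Cube Store (the table and column names here are illustrative):

```sql
-- Show the logical plan, including which partitions would be scanned
EXPLAIN SELECT status, SUM(amount) FROM dev_pre_aggregations.orders_main GROUP BY 1;

-- Execute the query and report execution details per node
EXPLAIN ANALYZE SELECT status, SUM(amount) FROM dev_pre_aggregations.orders_main GROUP BY 1;
```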
Resources required for the router and worker nodes can vary depending on the
configuration. With default settings, you should expect to allocate at least 4
CPUs and up to 8 GB of memory per router or worker node.
The configuration required for each node can be found in the table below. More
information about these variables can be found in the Environment Variables
reference.
| Environment Variable  | Specify on Router? | Specify on Worker? |
| --------------------- | ------------------ | ------------------ |
| CUBESTORE_SERVER_NAME | ✅ Yes             | ✅ Yes             |
| CUBESTORE_META_PORT   | ✅ Yes             | —                  |
| CUBESTORE_WORKERS     | ✅ Yes             | ✅ Yes             |
| CUBESTORE_WORKER_PORT | —                  | ✅ Yes             |
| CUBESTORE_META_ADDR   | —                  | ✅ Yes             |
The CUBESTORE_WORKERS and CUBESTORE_META_ADDR variables should be set to
stable addresses that do not change. In environments where stable IP addresses
can't be guaranteed, you can use stable DNS names and put load balancers in
front of your worker and router instances to fulfill the stable name requirement.
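Putting the table above together, a cluster with one router and two workers might be configured as follows (service names and ports are illustrative; the variables are those from the table above):

```yaml
services:
  cubestore_router:
    image: cubejs/cubestore
    environment:
      - CUBESTORE_SERVER_NAME=cubestore_router:9999
      - CUBESTORE_META_PORT=9999
      - CUBESTORE_WORKERS=cubestore_worker_1:10001,cubestore_worker_2:10002

  cubestore_worker_1:
    image: cubejs/cubestore
    environment:
      - CUBESTORE_SERVER_NAME=cubestore_worker_1:10001
      - CUBESTORE_WORKER_PORT=10001
      - CUBESTORE_WORKERS=cubestore_worker_1:10001,cubestore_worker_2:10002
      - CUBESTORE_META_ADDR=cubestore_router:9999

  cubestore_worker_2:
    image: cubejs/cubestore
    environment:
      - CUBESTORE_SERVER_NAME=cubestore_worker_2:10002
      - CUBESTORE_WORKER_PORT=10002
      - CUBESTORE_WORKERS=cubestore_worker_1:10001,cubestore_worker_2:10002
      - CUBESTORE_META_ADDR=cubestore_router:9999
```

Note that every node, router and workers alike, gets the same CUBESTORE_WORKERS list, while only workers point back at the router via CUBESTORE_META_ADDR.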
To fully take advantage of the worker nodes in the cluster, we strongly
recommend using partitioned pre-aggregations.
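As a sketch, a partitioned pre-aggregation in a Cube data model (YAML flavor; the cube and member names are illustrative) could look like this:

```yaml
cubes:
  - name: orders
    sql_table: public.orders

    pre_aggregations:
      - name: orders_by_day
        measures:
          - count
        time_dimension: created_at
        granularity: day
        # Split the pre-aggregation into monthly partitions so worker
        # nodes can build and query them in parallel
        partition_granularity: month
```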
Replication and High Availability
The open-source version of Cube Store doesn't support replicating any of its nodes. The router node and each worker node should always run as a single instance, even when served behind a load balancer or service address. Replication will lead to undefined behavior of the cluster, including connection errors and data loss. If any cluster node is down, it'll lead to a complete cluster outage. If Cube Store replication and high availability are required, please consider using Cube Cloud.
Storage
A Cube Store cluster uses both persistent and scratch storage.
Persistent storage
Cube Store makes use of a separate storage layer for storing metadata as well as for persisting pre-aggregations as Parquet files. Cube Store can be configured to use either AWS S3, Google Cloud Storage (GCS), or Azure Blob Storage as persistent storage. If desired, a local path on the server can also be used in case all Cube Store cluster nodes are co-located on a single machine.

Cube Store can only use one type of remote storage at the same time.
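For example, configuring AWS S3 as persistent storage might look like this (the bucket name, region, and credentials are placeholders for your own values):

```shell
# Use an S3 bucket as Cube Store's persistent storage
export CUBESTORE_S3_BUCKET=my-cubestore-bucket
export CUBESTORE_S3_REGION=us-east-1
export CUBESTORE_AWS_ACCESS_KEY_ID="<your-access-key-id>"
export CUBESTORE_AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
```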
Available on Enterprise and above plans.
As an additional layer on top of standard AWS S3, Google Cloud Storage (GCS), or
Azure Blob Storage encryption, persistent storage can optionally use Parquet
encryption for data-at-rest protection.
Scratch storage
Separately from persistent storage, Cube Store requires local scratch space to warm up partitions by downloading Parquet files before querying them. By default, this folder is mounted to .cubestore/data inside the
container and can be configured via the CUBESTORE_DATA_DIR environment variable.
It is advised to use local SSDs for this scratch space to maximize querying
performance.
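For instance, assuming a fast local SSD is mounted at /mnt/ssd on the host (the paths here are illustrative), the scratch directory could be placed on it:

```shell
# Keep Cube Store's scratch space on a local SSD for faster partition warm-up
docker run -d -p 3030:3030 \
  -v /mnt/ssd/cubestore:/cube/data \
  -e CUBESTORE_DATA_DIR=/cube/data \
  cubejs/cubestore
```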
AWS
Cube Store can retrieve security credentials from instance metadata automatically. This means you can skip defining the CUBESTORE_AWS_ACCESS_KEY_ID and CUBESTORE_AWS_SECRET_ACCESS_KEY environment
variables.
Garbage collection
Cleanup isn't done in export buckets; however, it is done in the persistent storage of Cube Store. The default time-to-live (TTL) for orphaned pre-aggregation tables is one day. The refresh worker should be able to finish a pre-aggregation refresh before garbage collection starts, i.e., all pre-aggregation partitions should be built before any tables are removed.
Supported file systems
The garbage collection mechanism relies on the ability of the underlying file system to report the creation time of a file. If the file system does not support getting the creation time, you will see an error message to that effect in the Cube Store logs. XFS is known to not support getting the creation time of a file.
Please see this issue
for possible workarounds.
Security
Authentication
Cube Store does not have any built-in authentication mechanisms. For this reason, we recommend running your Cube Store cluster with a network configuration that only allows access from the Cube deployment.
Data-at-rest encryption
Persistent storage is secured using the standard AWS S3, Google Cloud Storage (GCS), or Azure Blob Storage encryption. Cube Store also provides optional data-at-rest protection by utilizing the modular encryption mechanism of Parquet files in its persistent storage. Pre-aggregation data is secured using the AES cipher with 256-bit keys. Data encryption and decryption are completely seamless to Cube Store operations.

Available on the Enterprise Premier plan.
Also requires the M Cube Store Worker tier.
Troubleshooting
Heavy pre-aggregations
When building some pre-aggregations, you might encounter an error indicating that the pre-aggregation query timed out. You can increase the timeout via the CUBEJS_DB_QUERY_TIMEOUT environment variable.
However, it is recommended that you optimize your pre-aggregations instead:
- Use an export bucket if your data source supports it. Cube will then load the pre-aggregation data in a much more efficient way.
- Use partitions. Cube will then run a separate query to build each partition.
- Build pre-aggregations incrementally. Cube will then build only the necessary partitions with each pre-aggregation refresh.
- Set an appropriate build range if you don’t need to query the whole date range. Cube will then include only the necessary data in the pre-aggregation.
- Check that your pre-aggregation includes only necessary dimensions. Each additional dimension usually increases the volume of the pre-aggregation data.
- If you include a high cardinality dimension, Cube needs to store a lot of data in the pre-aggregation. For example, if you include the primary key into the pre-aggregation, Cube will effectively need to store a copy of the original table in the pre-aggregation, which is rarely useful.
- If a single pre-aggregation is used by queries with different sets of dimensions, consider creating separate pre-aggregations for each set of dimensions. This way, Cube will only include necessary data in each pre-aggregation.
- Check if you have a heavy calculation in the sql expression of your cubes (rather than a simple sql_table reference). If that's the case, you can build an additional original_sql pre-aggregation and instruct Cube to use it when building other pre-aggregations for this cube.
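The last point can be sketched in a data model like this (YAML flavor; the cube, columns, and heavy calculation are illustrative):

```yaml
cubes:
  - name: orders
    # A heavy calculation in `sql` rather than a plain `sql_table` reference
    sql: >
      SELECT *, amount * exchange_rate AS amount_usd
      FROM public.orders

    pre_aggregations:
      # Materialize the result of the heavy `sql` expression once
      - name: base
        type: original_sql

      # Build rollups from the materialized table instead of re-running
      # the heavy SQL for each rollup
      - name: orders_by_day
        measures:
          - count
        time_dimension: created_at
        granularity: day
        use_original_sql_pre_aggregations: true
```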