Evals are currently in preview, and the user experience and file format may still change. Reach out to the Cube support team to activate this feature for your account.
Evals let you benchmark your agent’s answers against a known-correct ground truth, on any branch. You author a set of questions, each with the SQL or certified query that represents the right answer, and run your agent against them. Each run gives you a per-question pass/fail and an overall accuracy score, so you can see objectively whether a data-model or agent change made the agent better or worse. You’ll find evals in the model IDE under the Evaluate tab, which has two sub-tabs: Evaluations (runs) and Questions (the benchmark set).
Figure: Evaluation run results, showing the question list with pass/fail icons and a selected question’s detail with the agent’s SQL next to the ground truth SQL.

Concepts

| Term | What it is |
| --- | --- |
| Question | A natural-language question plus its ground truth (the correct answer, as SQL or a certified-query reference). Authored as code in your data model. |
| Evaluation (run) | One execution of the agent against the whole question set, on a specific branch and agent. |
| Result | The agent’s answer to a single question in a run, graded against that question’s ground truth. |
| Accuracy | passed / total for a run, shown as NN% (passed/total). |
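As a small illustration of the accuracy figure, the sketch below formats a run's score in the NN% (passed/total) shape. The function name `format_accuracy` is hypothetical, and both the rounding to a whole percent and the handling of an empty run are assumptions rather than documented behavior.

```python
def format_accuracy(passed: int, total: int) -> str:
    """Render a run's accuracy as 'NN% (passed/total)'."""
    if total <= 0:
        # Assumption: how a run with no questions is displayed is not documented.
        return f"n/a ({passed}/{total})"
    percent = round(100 * passed / total)  # assumption: rounded to a whole percent
    return f"{percent}% ({passed}/{total})"


print(format_accuracy(7, 10))
```

A run where 7 of 10 questions pass would be displayed along the lines of `70% (7/10)`.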

Authoring benchmark questions

Questions live in your data model repository, versioned and branched like the rest of it — under agents/eval_questions/*.yml. Each file has a top-level eval_questions list. A question needs a unique name, a question, and exactly one ground truth: a certifiedQuery reference or inline sql.
```yaml
# agents/eval_questions/revenue.yml
eval_questions:
  - name: revenue_by_quarter
    question: What was our revenue by quarter over the last two years?
    certifiedQuery: revenue_by_quarter        # reference an existing certified query by name

  - name: arr_last_4_years
    question: What was our ARR over the last 4 years?
    sql: |                                    # ...or inline SQL ground truth
      SELECT date_trunc('year', created_at) AS year, SUM(arr) AS arr
      FROM subscriptions GROUP BY 1 ORDER BY 1
```
  • certifiedQuery references a certified query by name. Define it under agents/certified_queries/ (or via Certify this query in chat). A reference that doesn’t resolve to an existing certified query is flagged as a validation error.
  • sql is inline ground-truth SQL, run through the same Cube SQL API the agent uses (so MEASURE(...) and friends work).
  • Omitting both — or setting both — is a validation error.
  • An optional top-level space key scopes a file’s questions to a named space (defaults to auto). Question names are unique per space.
While in preview, the Questions tab is a read-only view of these files. To add or edit questions, edit the YAML in the IDE — there’s no in-product question editor yet.
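The authoring rules above can be checked locally before you push a branch. The following is a minimal sketch, not Cube's validator: `validate_eval_files` is a hypothetical helper, it assumes each YAML file has already been parsed into a Python dict, and it assumes you can supply the set of certified query names yourself.

```python
def validate_eval_files(files: list[dict], certified_query_names: set[str]) -> list[str]:
    """Return a list of human-readable validation errors (empty if all files are valid)."""
    errors = []
    seen = set()  # (space, name) pairs, because names must be unique per space
    for doc in files:
        space = doc.get("space", "auto")  # the optional space key defaults to auto
        for q in doc.get("eval_questions", []):
            name = q.get("name")
            if not name:
                errors.append(f"{space}: a question is missing its name")
                continue
            label = f"{space}/{name}"
            if (space, name) in seen:
                errors.append(f"{label}: duplicate question name in this space")
            seen.add((space, name))
            if not q.get("question"):
                errors.append(f"{label}: missing question text")
            has_cq = "certifiedQuery" in q
            has_sql = "sql" in q
            if has_cq == has_sql:
                # Both set, or both omitted: exactly one ground truth is required.
                errors.append(f"{label}: set exactly one of certifiedQuery or sql")
            elif has_cq and q["certifiedQuery"] not in certified_query_names:
                errors.append(f"{label}: certifiedQuery does not resolve to a certified query")
    return errors


example = [{
    "eval_questions": [
        {"name": "revenue_by_quarter",
         "question": "What was our revenue by quarter over the last two years?",
         "certifiedQuery": "revenue_by_quarter"},
        {"name": "arr_last_4_years",
         "question": "What was our ARR over the last 4 years?",
         "sql": "SELECT 1"},
    ]
}]
print(validate_eval_files(example, {"revenue_by_quarter"}))
```

Passing all parsed files in one call lets the sketch catch name clashes between files that share a space, which a per-file check would miss.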

Running an evaluation

On the Evaluations tab, click New evaluation and choose:
  • Branch: which branch’s data model and agent configuration to run against. Defaults to the active branch.
  • Agent: auto (the implicit auto-agent) or a configured agent name.
The run starts immediately and you can close the dialog — it executes in the background. The run list shows live progress and then the outcome:
| Column | Meaning |
| --- | --- |
| Evaluation name | When the run was created. |
| Environment | Where it ran: dev (your personal dev-mode branch, shown as “Name Dev Mode”), staging, or prod (the deploy branch, e.g. master or main). |
| Agent | The agent used. |
| Execution status | Running, Completed, or Failed. |
| Accuracy | NN% (passed/total). |
| Created by | Who triggered the run. |
| Last updated | When it finished. |

Reading the results

Open a run to see per-question results: the question list on the left, with a pass/fail icon for each, and the selected question’s detail on the right.
  • Assessment: pass, fail, review, or error.
  • Score reason: when a question doesn’t pass, a tag categorizing why: Row count mismatch, Missing columns, Value mismatch, Unexpected rows, Query error, Ground truth query failed, Ground truth not found, or Agent error.
  • Failure analysis: a plain-English explanation, e.g. “The agent returned 3 rows, but the ground truth has 5 rows.”
  • Model output · SQL vs. Ground truth SQL answer: the agent’s query side by side with the ground truth, so you can spot the difference.
  • Response: the agent’s full text answer, rendered as Markdown.

How grading works

Grading is execution-based, not text-based — the same approach used by industry text-to-SQL benchmarks such as BIRD and Spider 2.0. The agent’s SQL and the ground-truth SQL are both executed, and their result sets are compared. So an answer that’s worded or written differently but produces the same data still passes. The comparison is:
  • Sort-invariant — row order never matters.
  • Numeric-tolerant — values are compared to 4 significant figures, so float/representation noise (6646 vs. 6646.0) doesn’t fail.
  • Column-name-agnostic and lenient on extra columns — each ground-truth column must be reproduced by some agent column, matched by its values, so revenue vs. total aliases don’t matter. Extra columns the agent adds are ignored.
  • No standalone row-count gate — row count falls out of the comparison: a “top 5” question is enforced because the golden result has exactly 5 rows.
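The comparison rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not Cube's implementation: `normalize` and `results_match` are hypothetical names, result sets are assumed to be lists of dicts keyed by column name, and rounding each value to 4 significant figures is one assumed reading of the numeric tolerance.

```python
import math
from collections import Counter
from itertools import product


def normalize(value, sig=4):
    """Round finite numbers to `sig` significant figures; leave other values untouched."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return value
    if value == 0 or not math.isfinite(value):
        return float(value)
    digits = sig - 1 - math.floor(math.log10(abs(value)))
    return round(float(value), digits)


def results_match(agent_rows, truth_rows):
    """True if every ground-truth column is reproduced, row for row, by some agent column."""
    if not truth_rows:
        return not agent_rows
    truth_cols = list(truth_rows[0])
    agent_cols = list(agent_rows[0]) if agent_rows else []

    # For each ground-truth column, find agent columns holding the same bag of values.
    # Names are never compared, so aliases such as revenue vs. total do not matter.
    candidates = []
    for t in truth_cols:
        want = Counter(normalize(r[t]) for r in truth_rows)
        found = [a for a in agent_cols
                 if Counter(normalize(r[a]) for r in agent_rows) == want]
        if not found:
            return False
        candidates.append(found)

    # Compare whole rows as multisets: order is ignored, row count is implied,
    # and agent columns that were not chosen are simply dropped.
    truth_bag = Counter(tuple(normalize(r[t]) for t in truth_cols) for r in truth_rows)
    for choice in product(*candidates):
        agent_bag = Counter(tuple(normalize(r[a]) for a in choice) for r in agent_rows)
        if agent_bag == truth_bag:
            return True
    return False


truth = [{"quarter": "Q1", "revenue": 6646}, {"quarter": "Q2", "revenue": 7100}]
agent = [{"total": 7100.0, "q": "Q2", "note": "x"}, {"total": 6646.0, "q": "Q1", "note": "y"}]
print(results_match(agent, truth))
```

In this example the agent's result differs in row order, column names, numeric representation, and has an extra column, yet it still matches. Dropping one of the agent's rows would make the row multisets differ, which is how a row-count mismatch surfaces without a separate check.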
Verdicts:
| Verdict | When |
| --- | --- |
| pass | The agent’s result set matches the ground truth. |
| fail | It ran but the result set doesn’t match (see the score reason). |
| review | Nothing to compare automatically: the question has no ground truth, or the agent didn’t run a query. Compare manually. |
| error | The agent run failed, the ground-truth query failed, or a referenced certified query wasn’t found. |
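The verdict rules can be read as a small decision function. The sketch below is illustrative only: `assess`, its parameters, and the status strings are hypothetical, and the order in which the checks are applied (error before review) is an assumption that the documentation does not state.

```python
def assess(agent_failed: bool, ground_truth_status: str,
           agent_ran_query: bool, results_equal: bool = False) -> str:
    """Map the outcome of one question to pass, fail, review, or error.

    ground_truth_status is a hypothetical label, one of:
    "ok", "none" (question has no ground truth), "failed", "not_found".
    """
    # Assumption: errors take precedence over the review case.
    if agent_failed or ground_truth_status in ("failed", "not_found"):
        return "error"
    # Nothing to compare automatically, so a person has to look at it.
    if ground_truth_status == "none" or not agent_ran_query:
        return "review"
    return "pass" if results_equal else "fail"


print(assess(agent_failed=False, ground_truth_status="ok",
             agent_ran_query=True, results_equal=True))
```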

Preview limitations

  • Evals must be activated for your account by the Cube support team.
  • Questions are authored as code only; the Questions tab is read-only.
  • Very large question sets can be slow to run.
  • Grading is execution-based on the result set; it does not semantically judge prose answers.