SignalPilot is now #1 on Spider 2.0-DBT with 65.63.

Plausibly Wrong Is Worse Than No Agent at All

Daniel Schaffield

8 min read

Every data team is being told to hand its pipelines to an AI agent. The pitch is seductive and the demos are clean. The problem shows up later, in production, where an agent that's plausibly wrong is worse than no agent at all.

A wrong number that looks right doesn't get caught in review. It passes dbt run, passes a glance at the output, ships to a dashboard, and surfaces three weeks later in a leadership meeting when finance asks why revenue is 12% too high. By then nobody remembers which model the agent touched.

SignalPilot is built for that reality: produce data that is correct, not merely plausible, and stay safe to run on a real warehouse. Here are the production failures we uniquely prevent, the security model that makes an agent safe to point at your data, and the benchmark results that prove all of it.

the real risk

The failures that pass review and blow up later

The dangerous bugs in analytics engineering aren't the ones that crash. They're the ones that run cleanly and return the wrong answer. Four we see constantly, and what they cost in production:

Silent fan-out

An agent joins a lookup table with duplicate keys. The SQL looks textbook and runs clean, but every order quietly counts two or three times.

Revenue overstated in the board deck. Nobody notices until the quarter closes.

Dropped entities

A churn model is aggregated off the events table instead of the customer table. Customers with zero activity simply vanish.

Exactly the accounts a retention team needs to see are missing. The dashboard looks complete. It isn't.

Double-counted metrics

A column that's already a running total gets re-summed instead of carried forward.

Lifetime value doubles across the board, and every downstream metric inherits the error.

Broken contracts

An agent “optimizes” by stripping a column it judged empty or redundant.

A downstream model, test, or dashboard that depended on it breaks, or worse, silently returns nulls.

Every one of these is plausible SQL: the kind of code that passes "it ran" and a quick human skim. It's also the kind of code most AI agents produce, because they treat a warehouse like a text box: generate, ship, hope.

real questions, real stakes

The same failures, caught

Each of these starts from a real benchmark task: two from Spider 2.0-DBT, which poses the kind of plain business question a stakeholder would actually ask, and one from ADE-Bench. We don't have the other agents' code, so the mistake shown first in each case is the plausible one any of them could make. The result shown after it is what SignalPilot actually did.

Revenue counted more than once
Spider 2.0-DBT · retail001
the question a team asks

“Which countries have the highest total revenue, and what are the top 10 by revenue?”

The plausible mistake (hypothetical)

An invoice has several line items, so the agent joins invoices to their lines and sums the invoice total once per line. The top-10 table looks clean and ships. One market lands roughly 2.4× too high, and the team reprioritizes around revenue that was never there.

Board deck overstates a region's revenue by ~2.4×.

What SignalPilot does

SignalPilot catches the duplication in verification: it compares the rows it produced against the source rows and flags the multiplier before anything is final. Each invoice is counted once.

Country totals match the ledger, to the row.
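
To make the multiplier concrete, here is a minimal sketch using Python's built-in sqlite3. The tables (invoices, invoice_lines), their values, and the fan_out_ratio helper are made up for illustration; they are not the retail001 schema or SignalPilot's verifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, country TEXT, total REAL);
    CREATE TABLE invoice_lines (invoice_id INTEGER, sku TEXT);
    INSERT INTO invoices VALUES (1, 'DE', 100.0), (2, 'FR', 50.0);
    INSERT INTO invoice_lines VALUES (1, 'a'), (1, 'b'), (1, 'c'), (2, 'd');
""")

# Plausible but wrong: the invoice total repeats once per line item.
naive = dict(conn.execute("""
    SELECT i.country, SUM(i.total)
    FROM invoices i
    JOIN invoice_lines l ON l.invoice_id = i.invoice_id
    GROUP BY i.country
""").fetchall())

# Correct: revenue is summed at invoice grain, one row per invoice.
correct = dict(conn.execute(
    "SELECT country, SUM(total) FROM invoices GROUP BY country"
).fetchall())


def fan_out_ratio(conn):
    """Rows produced by the join divided by rows in the source table."""
    joined = conn.execute("""
        SELECT COUNT(*)
        FROM invoices i
        JOIN invoice_lines l ON l.invoice_id = i.invoice_id
    """).fetchone()[0]
    source = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
    return joined / source


ratio = fan_out_ratio(conn)
print(naive, correct, ratio)
```

A ratio above 1.0 means the join multiplied rows, so any sum of an invoice-level column taken across it is suspect.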

Customers that quietly disappear
Spider 2.0-DBT · asana001
the question a team asks

“Aggregate metrics for every team and user, including open tasks, completed tasks, and average close times.”

The plausible mistake (hypothetical)

The agent builds the report off the tasks table. Any team with nothing in flight has no rows there, so it silently drops out, exactly the idle teams a manager is looking for. The dashboard looks complete because the missing rows leave no trace.

The accounts that need attention are the ones that vanish.

What SignalPilot does

SignalPilot keeps the team and user as the driving table and joins activity onto it, so every entity appears, with a clean 0 where there's no work rather than a gap.

Every team is in the report, zero-activity ones included.
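
The difference between the two approaches fits in a few lines. This is a sketch with hypothetical teams and tasks tables, not the asana001 schema: the first query drives from tasks, the second drives from teams and joins activity onto it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE teams (team_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tasks (task_id INTEGER PRIMARY KEY, team_id INTEGER, completed INTEGER);
    INSERT INTO teams VALUES (1, 'Platform'), (2, 'Growth'), (3, 'Idle');
    INSERT INTO tasks VALUES (10, 1, 0), (11, 1, 1), (12, 2, 1);
""")

# Plausible but wrong: team 3 has no tasks, so it never appears.
from_tasks = conn.execute("""
    SELECT team_id, SUM(completed = 0) AS open_tasks
    FROM tasks
    GROUP BY team_id
    ORDER BY team_id
""").fetchall()

# Entity-driven: every team survives, and missing activity becomes 0.
from_teams = conn.execute("""
    SELECT t.team_id,
           COALESCE(SUM(k.completed = 0), 0) AS open_tasks,
           COALESCE(SUM(k.completed = 1), 0) AS completed_tasks
    FROM teams t
    LEFT JOIN tasks k ON k.team_id = t.team_id
    GROUP BY t.team_id
    ORDER BY t.team_id
""").fetchall()

print(from_tasks)
print(from_teams)
```

The COALESCE matters as much as the LEFT JOIN: without it the idle team would show NULL rather than a clean 0.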

A running total, added up twice
ADE-Bench · f1006
the question a team asks

“The season points look way too high. Figure out what's wrong and fix it.”

The plausible mistake (hypothetical)

The standings column is already a running total that grows each race. Summing it across the season stacks every race on top of the last, so the final number balloons, large enough to notice, plausible enough that an agent ships it anyway.

Season totals inflated several times over.

What SignalPilot does

SignalPilot recognizes the column is cumulative from its role in the data, not its name, and carries the final value forward instead of re-adding it.

Totals line up with the official standings.
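
A small sketch of the same trap, on an invented standings table. The looks_cumulative helper is only one possible data-driven heuristic (values never decrease within a driver), shown to illustrate detecting a running total from the data rather than the column name; it is an assumption, not SignalPilot's actual logic. Taking MAX is valid here because the running total never decreases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE standings (driver TEXT, round INTEGER, points REAL);
    -- points is cumulative: each row already includes all earlier rounds
    INSERT INTO standings VALUES
        ('A', 1, 25), ('A', 2, 43), ('A', 3, 68),
        ('B', 1, 18), ('B', 2, 18), ('B', 3, 33);
""")

# Plausible but wrong: re-summing a running total stacks every round.
wrong = dict(conn.execute(
    "SELECT driver, SUM(points) FROM standings GROUP BY driver"
).fetchall())

# Carry the final value instead: for a non-decreasing total, that is MAX.
right = dict(conn.execute(
    "SELECT driver, MAX(points) FROM standings GROUP BY driver"
).fetchall())


def looks_cumulative(conn):
    """Heuristic: within each driver, points never decrease round to round."""
    rows = conn.execute(
        "SELECT driver, points FROM standings ORDER BY driver, round"
    ).fetchall()
    last = {}
    for driver, points in rows:
        if driver in last and points < last[driver]:
            return False
        last[driver] = points
    return True


print(wrong, right, looks_cumulative(conn))
```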

how it gets it right

Reason, verify, then refuse to break things

SignalPilot doesn't guess its way to plausible. It reasons about the data and then checks itself.

01
Reasons about grain, not names

Knows a cumulative total carries with MAX, a childless parent stays with a zero, and the table you aggregate onto isn't always the one named first.

02
Verifies every model it builds

Deterministic checks before anything is called done: row counts, fan-out ratios, cardinality, column completeness, value spot-checks. Bad output fails its own gate.

03
Cannot break your warehouse

Every query passes a fail-closed gateway. DROP/ALTER/INSERT/DELETE blocked at the wire, auto-LIMIT and read-only, budget caps, full audit with PII redaction.

deterministic verification, every model
row counts · fan-out ratios · cardinality · column completeness · value spot-checks
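
As a rough illustration of what a deterministic gate of this kind could look like, the sketch below runs three of the checks against toy tables. The verify_model function, its parameters, and the table names are hypothetical, and "column completeness" is interpreted here as "every expected column is present"; none of this is SignalPilot's real verifier.

```python
import sqlite3


def verify_model(conn, model, source, key, expected_columns):
    """Return a list of failed checks; an empty list means the gate passes.

    Identifiers are interpolated into SQL, so this sketch assumes they are trusted.
    """
    failures = []
    n_model = conn.execute(f"SELECT COUNT(*) FROM {model}").fetchone()[0]
    n_source = conn.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0]
    n_keys = conn.execute(f"SELECT COUNT(DISTINCT {key}) FROM {model}").fetchone()[0]

    # Row count: the model should have one row per source entity.
    if n_model != n_source:
        failures.append(f"row count: model has {n_model}, source has {n_source}")

    # Cardinality: the key must be unique at the model's grain.
    if n_keys != n_model:
        failures.append(f"cardinality: {n_model - n_keys} duplicate {key} rows")

    # Column completeness: every expected column must exist in the model.
    present = {row[1] for row in conn.execute(f"PRAGMA table_info({model})")}
    missing = sorted(set(expected_columns) - present)
    if missing:
        failures.append(f"completeness: missing columns {missing}")

    return failures


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    INSERT INTO customers VALUES (1), (2), (3);
    CREATE TABLE good_model (customer_id INTEGER, ltv REAL);
    INSERT INTO good_model VALUES (1, 10.0), (2, 0.0), (3, 5.0);
    CREATE TABLE bad_model (customer_id INTEGER);
    INSERT INTO bad_model VALUES (1), (1), (2), (3);
""")

good = verify_model(conn, "good_model", "customers", "customer_id", ["customer_id", "ltv"])
bad = verify_model(conn, "bad_model", "customers", "customer_id", ["customer_id", "ltv"])
print(good)
print(bad)
```

The useful property is that the checks are plain counts and set differences: the same input always yields the same verdict, with no model judgment involved.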

And none of it matters if the agent can destroy the warehouse on the way. Every query passes through a fail-closed gateway.

AST validation
DROP, ALTER, INSERT, DELETE blocked at the wire
Auto-LIMIT + read-only
every query bounded, every scan capped to a budget
Full audit + PII redaction
every query logged with lineage, cost, and policy reason
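
The fail-closed idea can be sketched with SQLite's authorizer hook, which inspects each statement as it is compiled rather than pattern-matching the SQL text. This is a stand-in chosen for illustration, not SignalPilot's gateway: the allow-list, the guarded_query helper, and the row cap used in place of auto-LIMIT are all assumptions.

```python
import sqlite3

# Allow-list of authorizer action codes. Anything not listed is denied,
# so an unknown or new statement type fails closed.
READ_ONLY_ACTIONS = {
    sqlite3.SQLITE_SELECT,
    sqlite3.SQLITE_READ,
    sqlite3.SQLITE_FUNCTION,
}


def read_only_authorizer(action, arg1, arg2, db_name, source):
    return sqlite3.SQLITE_OK if action in READ_ONLY_ACTIONS else sqlite3.SQLITE_DENY


def guarded_query(conn, sql, max_rows=1000):
    """Run one statement under the authorizer and cap the rows returned."""
    return conn.execute(sql).fetchmany(max_rows)


conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
""")
conn.set_authorizer(read_only_authorizer)  # from here on, only reads compile

allowed = guarded_query(conn, "SELECT order_id FROM orders ORDER BY order_id", max_rows=2)

blocked = []
for stmt in ("DROP TABLE orders", "DELETE FROM orders", "INSERT INTO orders VALUES (4, 1.0)"):
    try:
        guarded_query(conn, stmt)
    except sqlite3.Error:  # SQLite refuses the statement as not authorized
        blocked.append(stmt)

remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(allowed, blocked, remaining)
```

An allow-list is the design choice that makes this fail closed: a deny-list of DROP, ALTER, INSERT, and DELETE would let through anything its author forgot to name.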

Correctness gets you on the leaderboard. Governance is what makes the leaderboard mean something for your real data.

security

Safe to run on your actual warehouse

Getting the data right is only half of it. The other half is the reason a lot of teams still won't hand an agent the keys to production: they don't trust it not to leak a credential or quietly break something. That's fair. So we built AutoFyn, our open-source security agent, to go looking for exactly those problems. It has already found real vulnerabilities in projects as widely used as Next.js and MetaMask.

Before this release went out, we pointed it at SignalPilot and let it try to break in. Everything the agent touches, from the query gateway to the notebook sandbox, got tightened up based on what it found.

Vulnerabilities it has disclosed: Next.js, MetaMask
Now continuously hardening: SignalPilot

So "secured by AutoFyn" isn't a box someone ticked once. It's a security agent that re-checks the whole thing every time we ship, which is how the entire experience, your queries, your pipelines, and your notebooks, stays locked down between releases.

Governed agentic notebooks

It's also what lets us do the thing other data agents can't do safely. The moment an agent needs Python, for a feature, a forecast, a chart, you've handed it arbitrary code execution on top of your warehouse. Most "AI notebooks" answer this by giving the agent a kernel with full access and hoping. SignalPilot Notebooks run every session inside a per-tenant, isolated pod.

gVisor-sandboxed kernels
non-root, read-only root filesystem, argv-only exec.
Default-deny networking
per-org namespaces, the gateway is the only ingress/egress, cloud-metadata (IMDS) blocked outright.
Least-privilege by construction
scoped RBAC, admission policies that enforce the sandbox, session token via tmpfs.
A governed Data SDK
import signalpilot as sp, the same audited, read-only, budget-capped gateway, now inside a notebook.

The agent gets to do real data science. You don't get the blast radius. It's the same trade SignalPilot has always made, capability with guardrails, extended from queries to compute.

the proof

The hardest benchmarks in data

None of the above is worth much as a claim, so we measure it against the only honest scoreboard in this space and lead it. Spider 2.0-DBT drops an agent into broken, real-world enterprise dbt repos and grades whether it can actually fix them.

SignalPilot: 65.6
Databao (JetBrains): 60.29
Shadowfax + GPT-5: 41.18
Spider-Agent-Extended + GPT-5: 39.71

ADE-Bench is dbt Labs' analytics-engineering benchmark, graded on exact row-level output equality. Across the full 64-task suite SignalPilot resolves 62 of 64 (96.9%). On the 43-task subset the rest of the field reports against:

SignalPilot: 97.7%
Paradime DinoAI: 88.4%
Altimate AI: 74.4%
dbt Labs: 58.1%
Claude Code (base harness): 39.5%

39.5%: base model, bare harness
97.7%: same model, inside SignalPilot

A ~58-point swing on identical tasks. That isn't the model. It's the governance, the verification, and the data-aware reasoning wrapped around it.
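
The arithmetic behind those figures is easy to re-check from the numbers quoted above; the variable names are just for illustration.

```python
# Full ADE-Bench suite: 62 of 64 tasks resolved.
resolved, total = 62, 64
full_suite_pct = round(100 * resolved / total, 1)

# 43-task subset: same model inside SignalPilot vs. the bare harness.
with_signalpilot, bare_harness = 97.7, 39.5
swing_points = round(with_signalpilot - bare_harness, 1)

print(full_suite_pct)  # 96.9
print(swing_points)    # 58.2
```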

The benchmarks aren't the point. They're how we prove, on tasks we didn't write, that the agent does in the lab exactly what it has to do in your warehouse.

the method

How we did it

There's no trick here, and no autonomous loop quietly grinding the score up overnight. We closed the gap the unglamorous way: run the full benchmark, read every failure against the actual expected output, find the single decision that went wrong, and fix it at the narrowest layer that generalizes. Never with a patch aimed at one task.

A few of the changes that moved the number:

01
One source of truth for the driving table

The agent kept building from the wrong table because two different tools were quietly recommending different ones. We removed the conflicting heuristic so the data-driven project scan is the single authority on which table to build from. That one change fixed the disjoint-key failures outright.

02
Grain over phrasing

We taught the agent that "aggregate X by Y" describes what to summarize and the output grain, not the table to drop into the FROM clause, and that a parent row with no matching children belongs in the output with a zero, not dropped.

03
Verifiers that don't punish correct work

Our verification subagents were flagging legitimately all-NULL metric columns (childless parents in a LEFT JOIN) as defects, which pushed the agent to "fix" correct output into wrong output. The bug was in the verifier, so we fixed the verifier, not the agent.

04
Metrics from the right rows

Event timestamps (first reply, last close, and the like) are aggregations of the detail rows, never a denormalized convenience column copied off the parent.

05
Explicit instructions outrank the schema

When a task says "don't include column X," that wins over a YML contract that happens to list it.

06
No two skills fighting over one decision

We resolved a date-arithmetic conflict between two skills, and adopted a standing rule that each topic lives in exactly one skill and the others never reference it, so the agent never gets two answers to the same question.
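
Items 03 and 04 are easier to see with data. In the sketch below, a metric is aggregated from detail rows through a LEFT JOIN, and the check only flags a NULL when detail rows exist to explain a value. The tickets, replies, and ticket_metrics tables and the unexplained_nulls helper are hypothetical, not the benchmark schema or SignalPilot's verifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY);
    CREATE TABLE replies (ticket_id INTEGER, replied_at TEXT);
    INSERT INTO tickets VALUES (1), (2), (3);
    INSERT INTO replies VALUES
        (1, '2024-01-02'), (1, '2024-01-05'), (2, '2024-01-03');
    -- Model under test: first_reply_at is aggregated from the detail rows.
    CREATE TABLE ticket_metrics AS
        SELECT t.ticket_id, MIN(r.replied_at) AS first_reply_at
        FROM tickets t
        LEFT JOIN replies r ON r.ticket_id = t.ticket_id
        GROUP BY t.ticket_id;
""")


def unexplained_nulls(conn):
    """Ticket ids whose metric is NULL even though detail rows exist."""
    rows = conn.execute("""
        SELECT m.ticket_id
        FROM ticket_metrics m
        WHERE m.first_reply_at IS NULL
          AND EXISTS (SELECT 1 FROM replies r
                      WHERE r.ticket_id = m.ticket_id
                        AND r.replied_at IS NOT NULL)
        ORDER BY m.ticket_id
    """).fetchall()
    return [row[0] for row in rows]


# A naive "no NULLs allowed" check would flag ticket 3, which is correct output.
naive_null_count = conn.execute(
    "SELECT COUNT(*) FROM ticket_metrics WHERE first_reply_at IS NULL"
).fetchone()[0]
clean = unexplained_nulls(conn)

# Simulate a real defect: ticket 2 has replies, but its metric is lost.
conn.execute("UPDATE ticket_metrics SET first_reply_at = NULL WHERE ticket_id = 2")
defects = unexplained_nulls(conn)
print(naive_null_count, clean, defects)
```

The childless ticket keeps its NULL without tripping the check, while a NULL on a ticket that does have replies is still caught.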

The constraint we held the whole way: every fix had to be general. Not one of them hardcodes a benchmark answer or a specific table name. When a failure analysis suggested "just memorize the expected value here," we threw it out, because a rule that only works on a benchmark is worthless on your warehouse (and, frankly, it's cheating). The only two tasks we don't pass are the two with no valid answer key at all: one ships no expected output, the other's entire prompt is "do nothing."

That's the method. Read the failures honestly, fix the root cause once, make sure it generalizes, and never teach the agent the test.

the bar

Trusted by default, not by accident

Two things make SignalPilot different, and both are things you can check rather than take on faith. It produces data that's correct where other agents produce data that's merely plausible, provable on the hardest public benchmarks. And it's safe to run on real production data, because the same autonomous rigor that finds vulnerabilities in Next.js and MetaMask is pointed back at SignalPilot itself.

Correctness you can verify. Security you can audit. That's the whole bar for an agent you'd actually let near your warehouse.

get started

Try it today

SignalPilot is open-source and installs in about a minute:

git clone https://github.com/SignalPilot-Labs/signalpilot.git
cd signalpilot && docker compose up -d

Add it to Claude Code:

/plugin marketplace add ./plugin
/plugin install signalpilot-dbt@signalpilot
Now on Codex:

# (Optional) Install the plugin for skills + agents (Codex)
codex plugin marketplace add SignalPilot-Labs/codex-signalpilot-plugin
codex plugin add signalpilot@signalpilot

We're building the vendor-neutral autonomous data stack: agents that are trusted by default, not trusted by accident. Whether you're a data engineer tired of babysitting pipelines, a founder betting on dbt, or an investor watching the AI-native data layer take shape, we want to hear from you.

Star the repo. Break the agent. Tell us what's missing. The next twelve months are going to be wild.

Tags: ai-agents · dbt · data-engineering · autofyn · security · benchmark · governance
