Where coding agents quietly get the warehouse wrong

A coding agent will write a hundred lines of transformation SQL in the time it takes to describe the table you want, and most of it will look right. It names the staging models the way your project names them, wires up the joins, adds a couple of tests, and writes the documentation block. The code compiles. The pipeline runs green. The tests you already had still pass. And the numbers are wrong.

This is the failure that matters most in agent-assisted warehouse work, and it is not the one teams brace for. The agentic warehouse engineering pillar maps the full agentic workflow and the tool ecosystem around it; this article zooms in on one slice of that picture, the correctness slice, while data warehouse testing covers the checks that catch these failures and the loading and operations pillar covers the disciplines that keep production loads honest. The narrow focus here is the specific ways a coding agent produces warehouse code that is syntactically valid, operationally green, and semantically wrong. These are AI-generated SQL errors that do not look like errors, a text-to-SQL reliability problem in agent-written data models rather than a syntax one. It covers where those failures cluster, why a normal review misses them, and what actually catches them, for the data or analytics engineer who has started handing real transformation work to an agent and wants to know where to keep both hands on the wheel.

TL;DR. Warehouse code rarely crashes when it is wrong; it returns a plausible number. Coding agents are fluent enough to produce code that survives a read-through and a green test suite while answering a different question than the one that was asked. The failures cluster in a few predictable places, and the only reliable defense is reconciliation against an independent source of truth, not a closer reading of the generated SQL.

Why warehouse code fails without crashing

Application code tends to fail loudly. Pass the wrong type or dereference something that is not there, and the program throws where you can see it. Analytical SQL has almost none of that protection. A join to the wrong grain, a filter that drops a partition, an aggregate that skips nulls, a window that resets on the wrong key: each produces a result set, not an error. The query runs, returns rows, and the rows look like data. Nothing in the engine knows the answer is wrong, because correctness here is a question about business meaning, and the engine has no opinion about business meaning.

That is the structural reason agent-generated warehouse code carries a risk that agent-generated application code often does not. The agent inherits the same missing guardrails and adds fluency on top. A model that writes confident, idiomatic SQL raises the ceiling on how plausible a wrong answer can look, which is the wrong direction for a failure you detect by noticing that something looks off.

The benchmark picture is blunt about the gap between easy schemas and real ones. On Spider 1.0, the text-to-SQL benchmark that anchored the field for years, strong systems reach the high eighties on execution accuracy, with the best published result around 86 percent. On Spider 2.0, built from realistic enterprise schemas that carry hundreds to thousands of columns, the same class of frontier model falls to roughly 10 percent unaided, and the strong agentic systems in the original evaluations landed in the teens to mid-thirties. Public-leaderboard entries have since climbed much higher, but those are largely self-reported vendor submissions against a benchmark whose own gold answers carry documented errors, so they are weak evidence of real-world accuracy. BEAVER, assembled from real private warehouse query logs, puts the best agent near 11 percent. These are different tasks, not one task measured twice, and that is the point: the toy schema hands the model the right columns, the real warehouse does not. The errors are mostly not crashes. An analysis of several thousand incorrect text-to-SQL queries found that roughly a third executed cleanly and returned the wrong answer. Those are the ones that reach a report.

The failure catalog: how AI-generated SQL goes wrong

The failures are not random. Across the published error taxonomies and the hands-on write-ups, agent-generated warehouse code goes wrong in a small number of recognizable places. They share a shape: the code is valid, the result is plausible, and the mistake is a question of meaning rather than syntax.

Grain and fan-out. The agent joins a fact to a dimension, or two facts to each other, on a relationship that turns out to be one-to-many, and the result fans out. Every measure on the many side is now repeated, and any sum over it is inflated. The query reads fine line by line, and the row count goes up rather than down, which to a quick glance looks like more complete data rather than duplicated data. This is the classic fan trap, with its cousin the chasm trap, and it is the single most common way an agent quietly doubles a total. The defense is to assert the grain of every model and to reconcile row counts and totals against the source after the join, not to trust that the join looks right.

Join drift after a schema change. The agent writes an INNER JOIN where a LEFT JOIN was meant, or the reverse. On the day it is written, against the data that exists then, the two produce identical results and the test passes. Weeks later a source starts emitting nulls in the join key, or a key stops matching, and the INNER JOIN silently discards the unmatched rows. Nothing changed in the code, so nothing draws a reviewer's eye; the data changed underneath a query that was reviewed once and trusted thereafter. Row-count deltas against the prior load, and anti-join checks that count what failed to match, are what surface this. Reading the SQL again does not.

Merge keys and duplicate rows. Incremental and slowly changing dimension loads turn on a MERGE, and the MERGE turns on a key that is genuinely unique. Give it a key that is not unique, or one that can be null, and the load can update the wrong row or insert duplicates, leaving two current rows for one business entity. Whether this fails loudly or quietly depends on the engine and the configuration, which is its own trap: some warehouses raise an error on a non-deterministic merge by default and only go silent when that guard is switched off, while a dbt incremental model with no unique key, or a unique key containing nulls, will append duplicates without complaint. The guardrail is a uniqueness test on the key plus an invariant test that no business key carries more than one active row, both enforced as part of the build rather than assumed by the model. (Change data capture and surrogate key management cover the load mechanics this depends on.)

Nulls in aggregates. SQL aggregate functions ignore nulls, by standard and by design. AVG, SUM, and COUNT over a column skip the null rows entirely, so an average over six values where one is null divides by five, not six. When the intent was to treat the missing value as zero, the agent's perfectly standard SQL produces a number that comes out too high, and there is no error anywhere because the SQL is correct SQL doing what SQL does. Deciding null handling explicitly, with COALESCE where missing means zero, and reconciling the denominator against a known count, is what catches it.

Metric semantics the schema does not contain. Ask an agent for revenue, or active users, or monthly recurring revenue, and it will compute something. Whether it computes the definition your business actually uses is a separate question, because that definition usually does not live in the schema. Monthly recurring revenue is not the sum of order values; first response time is not first resolution time; an active user is whatever your company decided an active user is. The agent has the column names and the data types and does not have the tribal knowledge or the metrics catalog, so it picks a reasonable-looking definition and computes it flawlessly. This is the failure that better SQL accuracy cannot fix, because the information needed to be correct is not in the database. It is the strongest argument for a governed semantic layer that holds metric definitions as the source of truth rather than leaving each query, human or agent, to reinvent them.

Hallucinated and near-miss columns. Faced with a wide, unfamiliar schema, models reach for the column that sounds right. The published error taxonomies put wrong-table selection at roughly a quarter to a third of semantic errors, and wrong output columns higher still, and both get worse as the schema grows, which is precisely the condition a real warehouse presents. A near-miss column, amount instead of net_amount, or created_at instead of effective_date, returns data, passes the type check, and answers a subtly different question. Grounding the agent in the actual model and testing the output columns against expectations is the guard; trusting that a named column means what it sounds like is the exposure.

Silent data loss at the source. Before any of the modeling, there is extraction, and an agent wiring up a source can lose data there too. In one widely read 2026 experiment, a coding agent built a complete, runnable dbt project end to end, with staging, dimensions, snapshots, tests, and documentation, and it still hit a pagination limit that pulled roughly 1,493 of about 5,458 available records without flagging the gap, dropped several columns, and applied change tracking to only one of two entities that needed it. The build was green. As the author put it, wrong is worse than absent: a missing table gets noticed, a table that is quietly missing nearly three-quarters of its rows does not.

Why reviewing the output is not enough

The natural answer to all of this is review: a human reads the generated SQL before it merges. The problem is that the silent failures are, by construction, the ones a read-through does not reveal. The code looks right because looking right is exactly what a fluent model is good at; the join, the merge, and the aggregate are all idiomatic. Running the code does not reveal it either, because running it returns a number. The only thing that distinguishes a correct result from a plausible wrong one is comparison against an independent source of truth, and producing that comparison is the expensive part the agent did not do.

There is research-grade reason to distrust the cheap checks. When models are asked to evaluate their own SQL for semantic errors, they catch them only around three quarters of the time, so asking the agent to check its own work leaves a quarter of the silent failures in place. The benchmarks the field uses to measure all this are themselves unreliable: a 2026 audit found that more than half of the gold answers in one widely used set, and roughly three fifths in an enterprise set, were wrong or ambiguous, and in at least one case a query scored as correct only because the test database held too few rows to expose the bug. If the graders miss it, a busy reviewer on a Friday afternoon will too. There is also a human factor: studies of automation in the loop find that higher trust in a tool correlates with accepting more of its wrong suggestions.

This is why the bottleneck in agent-assisted warehouse work is verification, not generation. A single agent can open more correct-looking pull requests than a team can genuinely verify, and the work of verification did not get cheaper because the work of writing got faster. A 2026 dbt Labs survey of analytics engineers (a vendor-run, self-selected sample) captured the imbalance directly: roughly three quarters were prioritizing AI for writing code, and only about a quarter for the testing and observability that would tell them whether the code was right.

Design-time AI.

Deterministic runtime.

AI helps you build. Production runs deterministic SQL on your warehouse. No LLM calls at runtime.

See a demo

What actually catches silent failure

The pattern underneath every control is a single stance: treat a result as unverified until it has been reconciled against an independent source of truth, because in a warehouse validity is not correctness. Adopting that stance is what separates a team that gets a multiplier from its agents from a team that gets faster wrong answers, and none of the controls that follow from it are new. What changed is that the volume of generated code makes skipping them more dangerous, and the same agents that introduce the risk also make the controls cheaper to build and keep current.

Reconciliation against the source is the highest-leverage check and the one most often missing. A daily comparison of warehouse counts and totals against the corresponding source counts and totals catches the whole class of failures where a pipeline dropped rows, fanned out a total, or read from a stale replica, regardless of which specific mistake caused it. It is the one test that does not depend on a human having anticipated the failure in advance.

Tests are guardrails, not decoration, and the distinction matters with an agent in the loop. Uniqueness on every declared key, the active-row invariant on every versioned dimension, referential integrity on every foreign key, and explicit null-rate and range bounds on the columns that carry business logic are the assertions that turn a silent wrong answer into a failed build. The agent will happily write these tests; the discipline is requiring them and reviewing the assertions rather than the SQL.

Regression against an approved baseline catches the change that alters output without anyone intending it. Running the new version against a known dataset, diffing the result against the previously approved one, and requiring an explicit acknowledgment before the baseline moves is the mechanism, and it is what surfaces an INNER-for-LEFT swap or a redefined metric at the pull request rather than in a report.

A governed semantic or metrics layer is the structural fix for the one failure SQL accuracy cannot reach. When metric definitions live in one modeled place, neither a human nor an agent has to rederive them per query, and the definition is reviewed once instead of guessed repeatedly. The major platforms have moved in this direction: Snowflake, Databricks, dbt, and others have each shipped governed semantic or metric-layer features in front of their natural-language and agent querying, on the shared premise that an agent should query through a curated model rather than against raw tables. The specific features differ in scope and maturity by vendor. The accuracy case for the layer is suggestive rather than settled, and most of it comes from the vendors who sell the layers. One open benchmark from dbt Labs reported near-perfect accuracy through its semantic layer against the mid-eighties to around ninety for raw text-to-SQL, though only on the queries the layer already covered, and three of its eleven questions required adding new models before the layer could answer them at all. A separate vendor-authored study found accuracy rising from roughly 17 percent to 54 percent when a curated knowledge-graph layer sat between the question and the database, with the gain concentrated on simpler questions and much weaker, around the high thirties, on the most complex multi-table ones. The honest reading is not that a semantic layer makes agents accurate. It is that querying through a reviewed model is the only place the metric-semantics problem gets solved at all, and the independent evidence that raw enterprise schemas defeat agents is what makes the modeled layer worth its cost.

Underneath all of it is a reframe. A coding agent lowers the cost of writing warehouse code and does almost nothing about the cost of owning it. The durable work moves to either side of the agent: specifying intent precisely enough that the generated code can be right, and verifying output rigorously enough to know that it is. That is where the engineer's judgment now lives.

Closing

The useful mental model is that a coding agent is a fast, fluent junior who never says "I am not sure." It will produce a clean answer to the question it thinks you asked, and in a warehouse the gap between that question and the real one shows up as a number, not an error. Validity is not correctness, a green test suite is not a correct one, and a confident explanation of a query is not evidence the query is right. Build the reconciliation, the invariants, and the modeled definitions that make wrongness loud, and the agent becomes a genuine multiplier. Skip them, and it becomes a faster way to ship numbers nobody can trust. For the testing disciplines that operationalize this, see data warehouse testing; for the loading correctness it sits on top of, see the loading and operations pillar.

Sources

The figures and findings above trace to the following sources.

Toy-versus-enterprise text-to-SQL accuracy (around 86 percent on Spider 1.0, roughly 10 percent on Spider 2.0): the Spider 2.0 benchmark and DAIL-SQL (VLDB 2024).
Best agent near 11 percent on real private-warehouse query logs: BEAVER.
Roughly a third of incorrect text-to-SQL queries run cleanly and return the wrong answer: a 2025 error analysis of text-to-SQL systems.
Models detect their own semantic SQL errors only around three-quarters of the time, and the wrong-table and wrong-output-column breakdown: NL2SQL-BUGs (KDD 2025).
More than half of one benchmark's gold answers, and about three-fifths of an enterprise set's, are wrong or ambiguous, including a query that passed only because the test table was too small to expose the bug: a 2026 audit of text-to-SQL benchmarks.
Fan traps, chasm traps, and join hallucination as silent double-counting: Snowflake engineering on joins in complex schemas.
Metric definitions living outside the schema (first response versus first resolution, recurring versus total revenue): Omni on why text-to-SQL fails.
MERGE behavior on duplicate-key matches (an error by default in some engines, silent only when the guard is disabled): the Snowflake MERGE reference and Delta Lake MERGE.
dbt incremental models appending duplicates without a unique key, or with a key that contains nulls: the dbt incremental models documentation.
SQL aggregate functions silently skipping nulls: an aggregate-and-null gotcha walkthrough.
The coding agent that built a complete dbt project but silently retrieved about 1,493 of 5,458 records: Robin Moffatt's 2026 experiment.
Higher trust in an automated tool correlating with accepting more of its wrong suggestions: the "Bias in the Loop" preprint (arXiv, forthcoming HDSR).
The split between prioritizing AI for writing code (about three-quarters) and for testing and observability (about a quarter): dbt Labs' 2026 State of Analytics Engineering report.
Accuracy through a semantic layer versus raw text-to-SQL, on covered queries: dbt Labs' Semantic Layer versus Text-to-SQL benchmark.
Major platforms moving toward governed semantic/metric layers for agent querying (vendor product announcements, positioning rather than independent evidence): Databricks Unity Catalog metric views (GA April 2026), Snowflake Cortex Analyst, and the dbt Semantic Layer.
The 17-percent-to-54-percent knowledge-graph result (vendor-authored, and uneven across question complexity): data.world's knowledge-graph benchmark.

Data warehouse testing is the operational companion to this article: the four categories of test that turn silent failures into failed builds. Slowly changing dimensions and change data capture cover the load mechanics where merge-key and grain mistakes do the most damage, and surrogate key management covers the key discipline those loads depend on. Referential integrity and the grain and semantic layer glossary entries define the terms this article leans on. When the deciding factor is who owns and verifies the result, data warehouse automation vs AI coding agents is the build-vs-buy decision this article feeds into.