Building a warehouse with coding agents

For two years the AI question in data engineering was whether a model could write a passable SQL query. That question is settled, and it was the wrong one. By 2026 a coding agent does not just complete a query; it explores a source schema, scaffolds a staging-to-mart project, writes the models and the tests and the documentation, opens a pull request, and, when a scheduled run fails at 3 a.m., reads the logs and proposes a fix. The unit of work is no longer the query. It is the project, and increasingly the operation of the project over time.

This pillar is about that practice, which the field has started to call agentic engineering, or agentic data engineering: a data or analytics engineer directing AI coding agents to build and run an analytical warehouse, in a version-controlled project, under human review. It is a different activity from the model-driven automation covered in the data warehouse automation pillar, where the intelligence is embedded in a platform and the model is the source of truth that the tool generates pipelines from. Here the actor is a general-purpose coding agent operating your repo and reaching your warehouse, and the governing question is not "should the tool generate the pipeline from the model" but "should I delegate building the pipeline to an agent, and how do I keep it correct." The honest answer, supported by the evidence below, is that agents have become a real force multiplier on the mechanical and connective parts of warehouse work and remain unreliable on the parts that carry business meaning, so the durable work has shifted from writing transformations to specifying, reviewing, and governing the ones an agent generates.

TL;DR. Agentic warehouse engineering spans the whole lifecycle: source discovery, modeling, transformation, testing, documentation, deployment, and operations. A bare coding agent cannot do any of it safely; what turns it into a warehouse engineer is the tool ecosystem wired around it, chiefly the Model Context Protocol, governed semantic layers, and agent-side patterns like skills, spec-driven development, and subagents. Benchmarks show the same split everywhere: agents reliably handle the extract-and-load front of a pipeline and fail most of the time on the transformation back, where correctness is a question of meaning, not syntax. The workflow that works in 2026 is agent-as-first-pass with a mandatory human review gate.

What "agentic" means here, and the boundary that defines it

The vocabulary is recent and worth getting right. "Vibe coding," the term Andrej Karpathy coined in February 2025, describes steering an LLM in natural language and accepting its output without reviewing it closely: fine for a throwaway prototype, dangerous for a warehouse. "Agentic engineering," the framing Karpathy introduced at Sequoia's AI Ascent in April 2026, is the opposite stance: coordinating capable but fallible agents to go faster while the human stays accountable for the result and the quality bar holds. He frames the two as complementary rather than as a rename, with vibe coding raising the floor on who can build software and agentic engineering preserving the ceiling. The distinction matters for a warehouse because the cost of being wrong is not a broken prototype; it is a metric that disagrees with itself across the business.

The defining empirical fact about agentic warehouse work in 2026 is a boundary, and it runs straight through the middle of the pipeline. ELT-Bench, a benchmark built from a hundred real pipelines, separates the extract-and-load stage from the transformation stage and scores them independently. In its 2026 verified run, autonomous extraction and loading into a warehouse reached about 96 percent; transformation correctness reached only about 23 percent, rising to roughly 33 percent after benchmark errors were corrected. Put plainly, an agent can reliably move data into the warehouse and, even on the corrected figure, gets the modeling wrong about two times in three. That asymmetry is the through-line of everything that follows. The mechanical front of the pipeline (connectors, loads, boilerplate) is where agents earn their keep; the semantically loaded back (grain, surrogate keys, slowly changing dimensions, metric definitions) is where they fail, and they fail silently because warehouse code returns a plausible number rather than an error. The field guide to those silent failures covers the specific modes; this pillar covers the workflow and the tooling that surround them.

The toolchain that turns a coding agent into a warehouse engineer

A coding agent on its own is a confident text generator with no access to your warehouse. Point one at a real schema and it hallucinates table names, invents joins, and guesses at column meanings. What converts it into something that can do warehouse work is not a better model; it is the layer of tooling wired around it, and that layer is where most of the genuinely new engineering of the last year has happened.

The connective tissue is the Model Context Protocol. MCP is an open standard Anthropic introduced in November 2024, and it shipped with a Postgres reference server on day one, which tells you that database access was a founding use case rather than an afterthought. It went cross-vendor through 2025 as OpenAI and Google adopted it, and in December 2025 it was donated to the Linux Foundation's Agentic AI Foundation, with more than ten thousand active public servers reported at that point. For warehouse work, the significant thing is that every major platform now ships an MCP server on the same architectural pattern: a vendor-hosted, authenticated server that runs queries in place inside the warehouse and returns governed, bounded result sets, rather than pulling raw data into the agent's context. Governance is enforced at the warehouse layer (Unity Catalog on Databricks, IAM on BigQuery, role restrictions on Snowflake), and output is capped. The effect is that an agent can introspect a live schema, run a query, and read lineage without ever exceeding the permissions of the user it connects as, and every action is auditable. That is the difference between an agent guessing at your schema and an agent reading it.
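The pattern of in-place execution, warehouse-enforced permissions, and capped output can be sketched in a few lines. This is a minimal illustration and not any vendor's MCP server: SQLite stands in for the warehouse, `PRAGMA query_only` stands in for a least-privilege role, and `run_bounded_query`, `make_demo_connection`, and `MAX_ROWS` are hypothetical names chosen for the example.

```python
import sqlite3

MAX_ROWS = 100  # hypothetical cap; real servers set their own limits


def make_demo_connection() -> sqlite3.Connection:
    """Build a toy in-memory 'warehouse', then lock the session to read-only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(i, i * 1.5) for i in range(250)],
    )
    conn.commit()
    # Stand-in for a least-privilege role: the engine rejects writes from here on.
    conn.execute("PRAGMA query_only = ON")
    return conn


def run_bounded_query(conn: sqlite3.Connection, sql: str, max_rows: int = MAX_ROWS) -> dict:
    """Run a query inside the database and return at most max_rows rows."""
    cursor = conn.execute(sql)  # permissions are enforced by the engine, not here
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchmany(max_rows + 1)  # one extra row reveals truncation
    return {
        "columns": columns,
        "rows": rows[:max_rows],
        "truncated": len(rows) > max_rows,
    }
```

The design choice worth noticing is that the tool function contains no permission logic of its own. A write attempt fails because the session is read-only, which mirrors the point that governance lives in the warehouse rather than in the agent.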

The richest example is the dbt MCP server, an open-source server for both dbt Core and dbt Cloud. It exposes four distinct capability groups: governed metric retrieval through the semantic layer, so an agent asks for "revenue by region" against a defined metric rather than writing raw SQL against tables; discovery and metadata, including lineage, model health, and parent and child relationships; the dbt command-line operations (build, run, test, compile); and code generation for model YAML, sources, and staging models. It splits into a local server with full command and codegen access and a remote, consumption-only server for metric and metadata queries. This is the shape of the ecosystem in miniature: the agent is given scoped, governed access to what the project already knows about itself.
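The difference between asking for a governed metric and writing raw SQL can be shown with a toy registry. This is a sketch under stated assumptions, not the dbt semantic layer or its API: `Metric`, `METRICS`, `compile_metric_query`, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str        # the single agreed aggregate expression
    table: str
    dimensions: frozenset  # the only columns this metric may be grouped by


# Hypothetical registry standing in for a governed semantic layer.
METRICS = {
    "revenue": Metric(
        name="revenue",
        expression="SUM(amount_net)",
        table="fct_orders",
        dimensions=frozenset({"region", "order_month"}),
    )
}


def compile_metric_query(metric_name: str, group_by: list) -> str:
    """Turn a metric request into SQL using only the registered definition."""
    if metric_name not in METRICS:
        raise KeyError(f"unknown metric: {metric_name}")
    metric = METRICS[metric_name]
    unknown = [dim for dim in group_by if dim not in metric.dimensions]
    if unknown:
        raise ValueError(f"dimensions not defined for {metric_name}: {unknown}")
    dims = ", ".join(group_by)
    select_list = f"{dims}, " if dims else ""
    group_clause = f" GROUP BY {dims}" if dims else ""
    return (
        f"SELECT {select_list}{metric.expression} AS {metric.name} "
        f"FROM {metric.table}{group_clause}"
    )
```

In this sketch a request for revenue by region compiles to `SELECT region, SUM(amount_net) AS revenue FROM fct_orders GROUP BY region`, and a request for an unregistered metric or dimension fails loudly instead of letting the caller improvise a definition.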

MCP supplies access; a second layer supplies knowledge. Agent skills, an open format introduced in late 2025 and adopted across more than thirty agents, are markdown files that teach an agent how to do a class of work. Each one combines instructions with sample code and is disclosed progressively so that it does not flood the context window. dbt's own Agent Skills, shipped in February 2026, encode the analytics-engineering loop directly: before changing a model the agent previews real data, before writing a test it inspects the values it is asserting on so it does not hallucinate them, after a change it runs summary statistics to check the shape of the output, and it applies warehouse-specific economies like avoiding full-table scans during exploration. The distinction the dbt team draws is the useful one: MCP is how you give an agent access to tools, skills are how you give it the knowledge to use them well, and the two are complementary rather than substitutes. Above both, the semantic layer is consolidating into a vendor-neutral standard, the Open Semantic Interchange, so that metric definitions an agent queries against can be portable across tools, though real cross-tool interoperability is still emerging.
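The "inspect the values before asserting on them" step can be sketched as follows. This is an illustration of the idea rather than the contents of any published skill: SQLite stands in for the warehouse, and `sample_value_counts`, `propose_accepted_values`, and the `stg_orders` table are hypothetical.

```python
import sqlite3


def sample_value_counts(conn, table, column, sample_rows=1000):
    """Count distinct values in a bounded sample instead of scanning the table."""
    # Identifiers are interpolated for brevity; a real tool would validate them.
    sql = (
        "SELECT v, COUNT(*) AS n "
        f"FROM (SELECT {column} AS v FROM {table} LIMIT {int(sample_rows)}) "
        "GROUP BY v ORDER BY n DESC, v"
    )
    return conn.execute(sql).fetchall()


def propose_accepted_values(conn, table, column):
    """Draft an accepted-values test from observed data, for a human to confirm."""
    counts = sample_value_counts(conn, table, column)
    return {
        "column": column,
        "accepted_values": sorted(v for v, _ in counts if v is not None),
        "null_rows_in_sample": sum(n for v, n in counts if v is None),
        "needs_human_confirmation": True,  # a sample can miss rare values
    }


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_orders (order_id INTEGER, status TEXT)")
    conn.executemany(
        "INSERT INTO stg_orders VALUES (?, ?)",
        [(1, "placed"), (2, "shipped"), (3, "shipped"), (4, "returned"), (5, None)],
    )
    return propose_accepted_values(conn, "stg_orders", "status")
```

The proposal is drawn from what the data actually contains, and it is flagged for confirmation because a bounded sample trades completeness for cost.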

The two tiers of agent, and how they combine

The agents themselves fall into two tiers that are built to be combined rather than chosen between.

The first tier is general coding agents like Cursor, GitHub Copilot, OpenAI's Codex, Gemini's command-line agent, and Claude Code. Their strength is the repository: multi-file edits, cross-language work, git and pull-request mechanics, and orchestration of a multi-step task. Their weakness is that they have no native knowledge of your warehouse and reach it only through an MCP bridge. Without that bridge they are back to hallucinating schemas. These agents also carry the patterns that make delegation tractable, which the next section covers.

The second tier inverts the profile: warehouse-native agents that live inside a platform. Snowflake's in-warehouse agent, renamed CoCo at its June 2026 conference, runs under the warehouse's existing access controls with model inference kept inside the platform's security perimeter, adding audit logging, query tagging, and cost controls. Databricks shipped Genie Code to general availability in March 2026, an agent that builds declarative pipelines, runs models, generates dashboards, and debugs production pipelines, all under Unity Catalog governance. Google's BigQuery Data Engineering Agent reached general availability in April 2026; it generates pipeline code from natural-language prompts but, notably, cannot execute pipelines directly, so a human runs them. dbt's own Developer agent and its assistant, both in preview through mid-2026, are warehouse-adjacent and metadata-grounded: they read lineage, model health, tests, and semantic definitions rather than operating on raw tables, and they explicitly position themselves against general coding agents. Amazon's Q in Redshift is the lightest of these, effectively a generative-SQL copilot rather than an autonomous pipeline agent. These agents know your schema and run inside your governance; what they lack is the broad, cross-tool coding range of the first tier.

The combination is the point. The general agent keeps the reasoning, the file and git work, and the composition, and delegates live warehouse operations to the native agent over MCP. A vendor walkthrough in mid-2026 had a general coding agent drive a warehouse-native agent through an entire multi-step build from a single prompt, which is a useful illustration of the pattern even though, as a vendor demonstration, it is not a benchmark. Two cautions belong here. The clean "general agents are good at breadth, native agents are good at depth" division is well attested but its sharpest articulations are vendor-authored, so treat it as a working model rather than a measured law. And the benchmark scores vendors publish for their own agents are self-reported and run-dependent, not a reconciled leaderboard, so they are worth little as comparative evidence.

The workflow: agent as first pass, human as gate

Across the lifecycle the operating model that actually works in 2026 is consistent: the agent does the first pass, a human owns the review, and nothing reaches production without passing through a pull request.

Figure 1. The agentic build lifecycle. The build runs as a loop. An agent does a first pass across authoring, but every change reaches production only through a human review gate, and operations feeds back into discovery as sources change.

The agent-side patterns that make this safe are worth naming, because they are the difference between agentic engineering and vibe coding. Spec-driven development, codified by open toolkits that have appeared over the last year, makes a written specification rather than the code the source of truth: the engineer and agent agree on a spec and a plan, including a data-model artifact, before any SQL is generated. The approach is explicitly positioned as the antidote to vibe coding and maps cleanly onto warehouse schema design. Subagents, a delegation mechanism now supported across several coding agents, let the main agent spawn a separate agent instance with its own context window, its own system prompt, and a scoped set of tools and permissions, defined as version-controlled configuration, so a "staging-model" subagent or a "test-writer" subagent does the noisy work and returns only a summary. Parallelism extends this, with mechanisms that split one large change across many isolated agents that each open their own pull request. And the safety patterns are now first-class. Open agent harnesses for data work ship scoped modes: a builder mode that writes only with approval and hard-blocks destructive statements like DROP and TRUNCATE, an analyst mode restricted to reads, and a plan mode that touches no data at all. They pair these with loop detection and a plan-first refinement step.
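The three scoped modes reduce to a small authorization rule. The sketch below is a deliberately naive illustration, not the logic of any particular harness: it classifies a statement by its leading keyword, which a real implementation would replace with a proper SQL parser, and `Mode`, `leading_keyword`, and `authorize` are hypothetical names.

```python
import re
from enum import Enum


class Mode(Enum):
    PLAN = "plan"        # touches no data at all
    ANALYST = "analyst"  # reads only
    BUILDER = "builder"  # writes only with approval


DESTRUCTIVE = {"DROP", "TRUNCATE"}
READS = {"SELECT", "WITH"}  # naive: some dialects allow writes after WITH


def leading_keyword(sql: str) -> str:
    """Return the first keyword of a statement, ignoring SQL comments."""
    stripped = re.sub(r"--[^\n]*|/\*.*?\*/", " ", sql, flags=re.DOTALL).strip()
    return stripped.split(None, 1)[0].upper() if stripped else ""


def authorize(mode: Mode, sql: str, approved: bool = False) -> bool:
    """Decide whether a statement may run in the given mode."""
    keyword = leading_keyword(sql)
    if mode is Mode.PLAN or not keyword:
        return False
    if keyword in DESTRUCTIVE:
        return False  # hard block: approval does not override it
    if keyword in READS:
        return True
    # Everything else is treated as a write.
    return mode is Mode.BUILDER and approved
```

Under these rules, `authorize(Mode.BUILDER, "DROP TABLE t", approved=True)` is refused while an approved `INSERT` in builder mode is allowed. The ordering matters: the destructive check runs before the approval check, so no approval path reaches a DROP or TRUNCATE.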

Operations is where the loop closes, and it is the clearest example of the human-gated pattern. A documented production setup from early 2026 fires on a webhook when a scheduled run fails, pulls the failure logs, searches a team knowledge base, and runs a headless agent in a sandbox with read-only production data and a cloned development copy. The agent follows a fixed diagnose-and-fix loop, reading the error summary, tracing dependencies, investigating the data, applying and verifying a fix, classifying the failure as a code problem or a source problem, and then opening a pull request. Every fix routes to a human for approval; nothing auto-merges. The broader idea of pipelines that detect and repair their own failures is real and advancing, but presenting it as a settled 2026 baseline overstates it; it is an emerging pattern, and the part that is solid is precisely the part that keeps a human in the loop.
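The fixed shape of that loop, with the human gate at the end, can be sketched as a small driver. This is a hypothetical outline and not the documented setup's code: the step names follow the description above, and `STEPS`, `Proposal`, `run_remediation`, and the handler functions are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# The fixed order of the diagnose-and-fix loop described above.
STEPS = (
    "read_error_summary",
    "trace_dependencies",
    "investigate_data",
    "apply_and_verify_fix",
    "classify_failure",
)


@dataclass
class Proposal:
    classification: str                        # "code" or "source"
    steps_run: list = field(default_factory=list)
    auto_merge: bool = False                   # never enabled in this sketch


def run_remediation(failure: dict, handlers: dict) -> Proposal:
    """Run every step in order and return a proposal for human review."""
    context = {"failure": failure}
    steps_run = []
    for step in STEPS:
        context = handlers[step](context)  # each handler returns the updated context
        steps_run.append(step)
    classification = context.get("classification")
    if classification not in {"code", "source"}:
        raise ValueError("classify_failure must set 'code' or 'source'")
    return Proposal(classification=classification, steps_run=steps_run)
```

The point of the structure is that the only output is a proposal. Nothing in the driver can merge or deploy, so approval stays with a person whatever the handlers do.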

Where it earns its keep, and where it fails without announcing itself

The honest accounting is that agents are genuinely valuable across a wide span of the lifecycle and genuinely unreliable on a narrow, critical part of it.

Where they earn their keep is the mechanical and connective work. The extract-and-load front, as the benchmark shows, is largely solved. Scaffolding a project, generating staging models and boilerplate transformations, drafting tests and documentation, translating SQL across dialects during a migration, and iterating on a broken DAG until it builds are all tasks agents do well and fast. Navigating lineage and metadata to answer what feeds a given model, and drafting first-pass semantic-model and metric definitions, are first-class wins in their own right rather than just SQL authoring. Human-gated operational remediation, the failure-diagnosis loop above, is a real productivity gain. Migration is the most cited win, and there are partner accounts of agent-assisted dbt migrations completing far faster than by hand, though those figures come from vendor case studies of single engagements and differ between tellings, so they are better read as illustrative than as benchmarks.

Where they fall down is transformation correctness, and the failure is silent by nature. The same independent practitioner builds that demonstrate the capability also document the danger. One widely read 2026 experiment had a coding agent build a complete, runnable dbt project and still silently retrieve only part of a paginated source, drop columns, and apply change tracking to one of two entities that needed it, all with a green build. The author's verdict was that a data engineer with an agent beats a data engineer alone, but the agent alone cannot be trusted to one-shot a production pipeline, and that a wrong result is worse than an absent one. Another build used inner joins that silently dropped rows and rebuilt a dimension that already existed, and its author concluded that reviewing the agent's output took every bit as much effort as writing it would have. The recurring diagnosis is that agents lack the operational instincts a practitioner takes for granted; as one observability vendor put it, an agent will confidently query a table that has not been refreshed in two weeks. The detailed taxonomy of these failures (grain and fan-out, join drift after a schema change, merge-key duplication, nulls in aggregates, misread metric semantics) lives in the field guide on where agents get the warehouse wrong, and the testing disciplines that catch them are the same ones that have always caught silent data errors.
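The inner-join case is easy to reproduce. The sketch below uses SQLite and made-up tables (`stg_orders`, `dim_customer`) to show a join that runs without error and still loses half the rows; `join_row_counts` is a hypothetical helper.

```python
import sqlite3


def join_row_counts() -> dict:
    """Compare source row count with inner-join and left-join row counts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
        CREATE TABLE dim_customer (customer_id INTEGER, region TEXT);
        -- Order 3 has no customer; order 4 points at a customer not yet loaded.
        INSERT INTO stg_orders VALUES (1, 10, 50.0), (2, 11, 20.0), (3, NULL, 30.0), (4, 99, 15.0);
        INSERT INTO dim_customer VALUES (10, 'EMEA'), (11, 'APAC');
        """
    )

    def count(sql: str) -> int:
        return conn.execute(sql).fetchone()[0]

    join_on = "ON o.customer_id = c.customer_id"
    return {
        "source_rows": count("SELECT COUNT(*) FROM stg_orders"),
        "inner_join_rows": count(
            f"SELECT COUNT(*) FROM stg_orders o JOIN dim_customer c {join_on}"
        ),
        "left_join_rows": count(
            f"SELECT COUNT(*) FROM stg_orders o LEFT JOIN dim_customer c {join_on}"
        ),
    }
```

Here the source has four orders, the inner join keeps two, and the left join keeps all four. Neither query raises an error, which is why a row-count comparison against the source has to be an explicit check rather than something the build is trusted to surface.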

The consequence for the workflow is that the review burden does not disappear; it moves and, by several independent accounts, barely shrinks. Generation got cheaper. Verification did not. When one agent can open more correct-looking pull requests than a team can genuinely check, the bottleneck becomes the checking, and an organization that measures only how much code its agents produce is measuring the wrong half of the system.

What this does to the engineer's job, and the governance it demands

The role that emerges is engineer as reviewer and orchestrator. The agent does the first-pass scaffolding, implementation, testing, and documentation; the human owns the architecture, the correctness review, and the judgment calls the agent cannot make because the information it needs is not in the database. This is not a diminished role. Holding the mental model of the warehouse well enough to catch a silently wrong number is harder than writing the query, and it is where the expertise now concentrates.

The adoption data says the same thing from the outside. Independent community surveys in 2026, which vary by sample, put daily AI use among data practitioners above eighty percent, overwhelmingly for code generation rather than autonomous multi-step work, while organizational embedding lags far behind individual use, on the order of ten percent of organizations with AI genuinely embedded in their workflows and around five percent running agents live in production with real users. The dbt 2026 survey captures the gap precisely: about three-quarters of teams prioritize AI for writing code and only about a quarter prioritize it for the testing and observability that would tell them whether the code is right, even as more than seventy percent worry about hallucinated outputs reaching stakeholders. Generation is outpacing governance, and the teams that close that gap are the ones that treat the governance as the product.

What that governance looks like is a set of category-level controls, none of them new, all of them now load-bearing because of the volume of generated code. Ground the agent in a curated context layer (lineage, model health, and metric definitions reached through MCP and skills) rather than letting it guess against raw tables. Keep metric definitions in a governed semantic layer so neither a human nor an agent rederives "revenue" per query. Make tests guardrails rather than decoration: uniqueness on every key, the active-row invariant on every versioned dimension, reconciliation against the source, all enforced in the build. Scope the agent's permissions hard: read-only or approval-gated modes, destructive statements blocked, and least-privilege warehouse access to limit the prompt-injection and exfiltration risks that come with wiring an agent into a data store. Put a cost ceiling on agent-issued queries, because an unbounded scan on consumption-priced compute is a financial incident waiting to happen. And keep the human pull-request gate, especially at the transformation and deployment stages where the silent-failure risk is highest.
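The three guardrail tests named above can each be written as a query that returns violating rows, so that zero rows means the check passes. This is a minimal sketch on SQLite with invented tables and columns (`dim_customer`, `src_orders`, `fct_orders`, `is_current`, `amount_cents`); `CHECKS` and `failed_checks` are hypothetical names.

```python
import sqlite3

# Each check is a query that returns violating rows; zero rows means pass.
CHECKS = {
    "unique_customer_sk": """
        SELECT customer_sk FROM dim_customer
        GROUP BY customer_sk HAVING COUNT(*) > 1
    """,
    "one_active_row_per_customer": """
        SELECT customer_id FROM dim_customer
        GROUP BY customer_id
        HAVING SUM(CASE WHEN is_current = 1 THEN 1 ELSE 0 END) <> 1
    """,
    # IS NOT is SQLite's null-safe inequality; other dialects spell it differently.
    "revenue_reconciles_to_source": """
        SELECT s.total, m.total
        FROM (SELECT SUM(amount_cents) AS total FROM src_orders) AS s
        CROSS JOIN (SELECT SUM(amount_cents) AS total FROM fct_orders) AS m
        WHERE s.total IS NOT m.total
    """,
}


def failed_checks(conn: sqlite3.Connection) -> list:
    """Return the names of checks that found at least one violating row."""
    return sorted(name for name, sql in CHECKS.items() if conn.execute(sql).fetchall())


def demo() -> list:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_sk INTEGER, customer_id INTEGER, is_current INTEGER);
        CREATE TABLE src_orders (order_id INTEGER, amount_cents INTEGER);
        CREATE TABLE fct_orders (order_id INTEGER, amount_cents INTEGER);
        -- Customer 11 has two active versions; the mart is missing order 3.
        INSERT INTO dim_customer VALUES (1, 10, 0), (2, 10, 1), (3, 11, 1), (4, 11, 1);
        INSERT INTO src_orders VALUES (1, 100), (2, 200), (3, 300);
        INSERT INTO fct_orders VALUES (1, 100), (2, 200);
        """
    )
    return failed_checks(conn)
```

On the demo data the surrogate key is unique, so that check passes, while the active-row and reconciliation checks both fail. Enforcing them in the build means treating a non-empty result from `failed_checks` as a failed run rather than a warning.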

The durable claim, the one that will still be true when this year's tool names have changed, is that agentic engineering moves the warehouse engineer's work from writing transformations to specifying them precisely, verifying them rigorously, and governing the system that generates them. The agent is a fast, fluent contributor that never says it is unsure. The substrate that makes its output survive production, a conformed, well-modeled, auditable warehouse with its definitions and tests owned by people, is the thing that was always the real work. Agents make that substrate cheaper to build and far more dangerous to skip.

The data warehouse automation pillar covers the other route to less hand-coding, where the intelligence is embedded in a model-driven platform rather than driven by a coding agent, and the boundary between the two routes is where the two pillars overlap most. The field guide on where coding agents quietly get the warehouse wrong is the correctness companion to this pillar, and the decision page "Data warehouse automation vs AI coding agents" is the framework for choosing between this route and the model-driven one. Data warehouse testing covers the checks that turn silent failures into failed builds, the dimensional modeling pillar covers the model structures agents most often get wrong, and the warehouse loading and operations pillar covers the load mechanics underneath. The semantic layer and data contract glossary entries define two of the governance primitives this pillar leans on.

Sources

The figures and findings above trace to the following sources.

The extract-load versus transformation split (about 96 percent loading, about 23 to 33 percent transformation correct): ELT-Bench.
The vibe-coding versus agentic-engineering framing (Andrej Karpathy, Sequoia AI Ascent, April 2026): Karpathy's AI Ascent 2026 recap.
dbt Agent Skills and the explore-before-you-change loop, and the access-versus-knowledge distinction between MCP and skills: dbt Labs on Agent Skills.
Agent Skills as an open format adopted across more than thirty agents: the open Agent Skills standard.
MCP as an open standard, its database-first origin, cross-vendor adoption, the move to the Linux Foundation, and the 10,000-plus server figure: Anthropic on donating the Model Context Protocol.
The vendor MCP pattern (hosted, authenticated, in-warehouse execution, governance at the warehouse layer, bounded output): the BigQuery MCP server documentation.
The dbt MCP server's capability groups (governed metrics, discovery and lineage, CLI, codegen; local versus remote): the dbt MCP repository.
Spec-driven development as the source-of-truth alternative to vibe coding: GitHub Spec-Kit.
Subagents as scoped, separately-permissioned delegated instances: the Claude Code subagents documentation.
Scoped agent-safety modes (builder, analyst, plan), blocked destructive statements, and loop detection: Altimate's open-source agent harness.
Snowflake's in-warehouse agent (renamed CoCo, in-perimeter execution under RBAC, audit logging and cost controls): Snowflake on CoCo (vendor-authored).
Databricks Genie Code (general availability, declarative pipelines, Unity Catalog governance): Databricks on Genie Code (vendor-authored).
The BigQuery Data Engineering Agent (general availability, generates pipeline code, cannot execute directly): the BigQuery Data Engineering Agent documentation (vendor docs).
Daily AI use above 80 percent, code generation dominating, and organizational embedding near 10 percent: the 2026 State of Data Engineering Survey; the roughly 5 percent agents-in-production figure is from vendor-authored Cleanlab on agents in production (directional).
The 72-percent-coding-versus-24-percent-pipeline-management split and the hallucination concern: dbt Labs' 2026 State of Analytics Engineering report.
Confident silent error as the dominant failure mode, and review taking as much effort as writing: Robin Moffatt's 2026 experiment.
The human-gated operations agent (webhook to sandbox to diagnose-and-fix loop to approved pull request): Hiflylabs on an AI agent for pipeline operations.
The semantic layer converging on a vendor-neutral standard, a multi-vendor effort spanning Snowflake, Salesforce, dbt Labs, and Databricks: the Open Semantic Interchange initiative.
