How to Build Self-Healing Data Pipelines with Claude MCP and Autonomous Agents
A practical guide to building data pipelines that detect, diagnose, and fix issues autonomously using Claude MCP and tiered guardrails -- with a working open-source implementation.
How to Build Self-Healing Data Pipelines with Claude MCP and Autonomous Agents
The Problem: Data Pipelines Break Silently
Every data engineer has lived this scenario. A pipeline fails at 3 AM. The on-call engineer wakes up, investigates for an hour, and discovers it was a schema change in a source system. The fix takes five minutes. The investigation took sixty.
Now multiply that by dozens of pipelines, hundreds of tables, and a team that is already stretched thin. The traditional approach -- monitor, alert, wake someone up, fix manually -- does not scale. What if the pipeline could fix itself?
This is not a thought experiment. I built it. The full implementation is open-source at github.com/michael-eng-ai/agentic-data-pipeline-mcp.
Why MCP Changes Everything for Data Infrastructure
The Model Context Protocol (MCP) has rapidly become the default protocol for AI agent connectivity. Originally introduced by Anthropic, MCP standardizes how AI models interact with external tools, databases, and APIs. Pinterest recently demonstrated MCP at production scale across their entire data ecosystem, validating what many of us saw early: MCP is to AI agents what REST was to web services.
For data engineering, MCP unlocks something specific and powerful. Instead of building custom integrations between an LLM and every tool in your stack (Spark, dbt, Great Expectations, Airflow), you expose each tool as an MCP server. The agent connects to all of them through a single protocol. Add a new tool? Add a new MCP server. The agent adapts immediately.
This is the foundation of the agentic-data-pipeline-mcp project. Claude connects to your data infrastructure through MCP servers, gaining the ability to inspect schemas, run quality checks, execute fixes, and log every action -- all through a standardized interface.
Architecture: Three Layers of Self-Healing
The system is designed around three distinct layers, each with increasing levels of autonomy and risk.
+--------------------------------------------------+
| ORCHESTRATION LAYER |
| Airflow / Prefect / Cron |
| Triggers pipeline runs, monitors outcomes |
+--------------------------------------------------+
| |
v v
+---------------------+ +---------------------+
| DATA PIPELINE | | AGENT LAYER |
| Spark / dbt | | Claude + MCP |
| Extract, Transform| | Detect, Diagnose |
| Load | | Fix, Audit |
+---------------------+ +---------------------+
| |
v v
+--------------------------------------------------+
| GUARDRAIL ENGINE |
| Tiered permissions: AUTO / REVIEW / BLOCKED |
| Policy-as-code definitions |
+--------------------------------------------------+
|
v
+--------------------------------------------------+
| STRUCTURED AUDIT LOG |
| Every decision, every action, full traceability |
+--------------------------------------------------+
Layer 1: Detection
The agent monitors pipeline execution through MCP-connected observability tools. When a run fails or a quality check flags an anomaly, the agent receives the context: which table, which check, what the expected vs actual values were.
Detection covers three categories:
- Schema drift: A source table added, removed, or changed a column. The agent compares the current schema against the registered data contract and identifies the delta.
- Quality degradation: Null rates spike, value distributions shift, freshness SLAs are missed. Great Expectations and Soda checks feed results into the agent.
- Execution failures: Runtime errors, timeout issues, dependency failures. The agent reads the error logs and stack traces directly.
Layer 2: Diagnosis and Decision
This is where the tiered guardrail system becomes critical. Not every fix should be automated. The agent classifies each issue and maps it to one of three tiers:
AUTO tier -- The agent fixes it immediately, no human approval needed. Examples:
- A new nullable column appeared in a source table. The data contract is updated, the downstream model ignores the new column, pipeline resumes.
- A freshness SLA was missed because the source system delayed by 30 minutes. The agent retriggers the extraction with a backoff window.
- A quality check failed because a known upstream system had a temporary outage. The agent marks the run for retry.
REVIEW tier -- The agent proposes a fix but waits for human approval. Examples:
- A column was renamed in the source system. The agent generates the migration script and the updated data contract, but a human must approve before execution.
- A quality check shows a distribution shift that could be legitimate business change or could be a data issue. The agent flags it with context.
BLOCKED tier -- The agent cannot and should not attempt a fix. Examples:
- A table was dropped from the source system entirely.
- Data contains PII that was not present before, triggering compliance concerns.
- The fix would require modifying production schemas or permissions.
The tier assignment is not hardcoded per error type. It is policy-as-code: a YAML configuration that maps issue patterns to tiers, with override rules based on table criticality, data sensitivity, and business impact.
Layer 3: Execution and Audit
For AUTO tier fixes, the agent executes the remediation through MCP tools. For REVIEW tier, it creates a pull request or a Slack notification with the proposed fix. For BLOCKED tier, it escalates with full diagnostic context.
Every single action -- detection, diagnosis, decision, execution -- is written to a structured audit log. This is non-negotiable. The log captures:
{
"timestamp": "2026-04-06T03:14:22Z",
"pipeline": "crm_contacts_daily",
"issue_type": "schema_drift",
"issue_detail": "Column 'phone_secondary' added to source",
"tier": "AUTO",
"action": "update_data_contract",
"action_detail": "Added phone_secondary as nullable string, no downstream impact",
"agent_confidence": 0.94,
"human_approved": false,
"execution_status": "success",
"duration_ms": 1847
}
This audit trail is what makes the system production-ready. Every autonomous decision is traceable, reviewable, and auditable.
Schema Drift Detection in Practice
Schema drift is the most common silent failure in data pipelines. A source system changes a column type from integer to string, and suddenly your pipeline produces garbage without any error.
The agent handles this through a schema comparison loop:
- Before each extraction, the agent fetches the current source schema via MCP.
- It compares against the registered data contract (stored as versioned YAML).
- If there is a delta, it classifies the change: additive (new column), destructive (removed column), or mutation (type change).
- Additive changes on non-critical tables go to AUTO. Destructive changes always go to BLOCKED. Mutations go to REVIEW with a proposed casting strategy.
This pattern alone eliminated roughly 60% of our manual pipeline interventions in testing.
Quality Fix Agents: Beyond Simple Retries
The quality fix agents are specialized sub-agents, each trained on a specific category of data quality issue. When the main orchestrating agent identifies a quality failure, it delegates to the appropriate specialist:
- Freshness Agent: Handles SLA violations by checking source system status, adjusting extraction windows, and retriggering with appropriate backoff.
- Completeness Agent: Investigates null spikes by comparing against historical baselines and upstream system health.
- Consistency Agent: Detects referential integrity violations and proposes remediation (reprocessing upstream dependencies).
- Accuracy Agent: Flags statistical anomalies in value distributions and cross-references with business event calendars (promotions, holidays, system migrations).
This multi-agent approach mirrors how Pinterest structures their MCP ecosystem: specialized agents for specialized domains, coordinated by an orchestrator. The difference is that each agent operates within the same tiered guardrail system.
Connecting to the Broader MCP Ecosystem
The agentic-data-pipeline-mcp project is designed to be composable. Each MCP server is independent:
mcp-server-spark: Schema inspection, data sampling, job submissionmcp-server-quality: Great Expectations and Soda check executionmcp-server-contracts: Data contract CRUD operationsmcp-server-git: Branch creation, commit, PR submission for REVIEW tier fixesmcp-server-notifications: Slack, email, PagerDuty integration
You can use any subset. If you only want schema drift detection without autonomous fixing, use the contracts and spark servers only. If you want the full self-healing loop, connect all five.
This composability is exactly why MCP is winning as the standard for AI agent connectivity. The protocol handles the complexity of tool registration, authentication, and context passing. You focus on the domain logic.
What This Is Not
Let me be direct about the boundaries:
- This is not a replacement for proper data engineering. You still need well-designed pipelines, clean data contracts, and solid orchestration.
- This is not a magic fix for bad data culture. If your organization does not value data quality, no agent will save you.
- This is not fully autonomous. The tiered guardrail system exists precisely because some decisions require human judgment. The goal is to handle the 70% of incidents that are mechanical and repetitive, freeing engineers for the 30% that actually need expertise.
Getting Started
The full implementation, including the MCP servers, guardrail engine, audit logging, and example pipelines, is available at github.com/michael-eng-ai/agentic-data-pipeline-mcp.
The repository includes:
- Complete MCP server implementations for each integration
- Guardrail policy-as-code YAML templates
- Structured audit log schema and query examples
- Docker Compose setup for local development
- Integration tests for each agent type
Start with the schema drift detection agent. It is the highest-value, lowest-risk entry point. Once you see your pipeline automatically updating a data contract for a new nullable column at 3 AM instead of paging you, you will understand why this approach matters.
The future of data engineering is not writing more pipelines. It is building pipelines that maintain themselves. MCP and autonomous agents make that future practical today.