Building AI Agents That Don't Break: Lessons from 50+ Deployments
Production AI agents break in ways that demos never reveal. After 50+ deployments, here are the failure modes we've seen and the engineering practices that prevent them.
The Demo-to-Production Gap
AI agent demos are exceptional at hiding the things that kill production systems. A demo uses clean, curated inputs. It runs in a controlled environment. It shows the happy path. The developer is watching and can intervene if anything looks off. It runs once, successfully, and then gets screenshotted.
Production is the opposite. Inputs are messy, inconsistent, and occasionally adversarial. The environment has dependencies that go down. Edge cases appear at volume. No one is watching individual executions. The system runs continuously and must handle the 100th percentile of cases, not just the 80th.
After building and deploying agents across logistics, financial services, operations, and customer service, we've catalogued the ways agents break in production. The failure modes are consistent enough that we can now anticipate most of them in design reviews before a line of production code is written.
Failure Mode 1: Context Window Creep
Agentic conversations accumulate context. Tool call results, intermediate outputs, error messages, retry attempts — they all go into the context window. An agent session that starts at 2,000 tokens can easily reach 50,000 tokens over the course of a complex workflow. At a certain point, the model starts losing coherence on earlier parts of the conversation. At context limit, the agent crashes entirely.
The solution is progressive summarization. At defined checkpoints (every N tool calls, or when context reaches a threshold), summarize completed workflow stages and replace them with compact summaries. The summary preserves the essential state — what has been done, what was decided, what constraints are in effect — without the full verbatim history.
Implement this from the start. Retrofitting context management into a running production agent is painful. The summarization logic, the checkpoint triggers, and the context reconstruction after summarization all need to be part of the initial design.
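A minimal sketch of the checkpoint trigger and compaction logic, assuming a chat-style message list. The token estimate is a crude character-count proxy, and `summarize_stage` stands in for a model call; both are illustrative, not a recommended implementation.

```python
CONTEXT_TOKEN_THRESHOLD = 20_000  # illustrative trigger, not a recommendation

def estimate_tokens(messages):
    # Crude proxy: ~4 characters per token. Real systems should use
    # the model's own tokenizer.
    return sum(len(m["content"]) for m in messages) // 4

def summarize_stage(messages):
    # Placeholder for an LLM summarization call. A real summary must
    # preserve what was done, what was decided, and active constraints.
    return "SUMMARY: " + "; ".join(m["content"][:40] for m in messages)

def maybe_compact(history, keep_recent=4, threshold=CONTEXT_TOKEN_THRESHOLD):
    """Replace all but the most recent messages with a single summary
    once the estimated context size crosses the threshold."""
    if estimate_tokens(history) < threshold:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = {"role": "system", "content": summarize_stage(old)}
    return [summary] + recent
```

Keeping the most recent messages verbatim matters: the model needs full fidelity on the current stage, and only compact state for completed ones.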
Failure Mode 2: Cascading Errors
An agent extracts a date from a document. The extraction is subtly wrong: it reads "03/01/2024" as March 1 when the document, written with day-first conventions, meant January 3. That wrong date is used in the next step to calculate a deadline, which produces a wrong deadline, which is used in a customer communication, which causes a compliance issue.
Each individual step looked reasonable in isolation. The agent was doing what it was told. But an early error propagated silently through the workflow and materialized as a consequence several steps removed from the source.
Prevention: validate outputs at every stage boundary, not just at the end. Treat each workflow stage as a mini-system with its own input/output contracts. Before a stage's output becomes the next stage's input, run the validations. Explicitly check values that look like they could be plausibly wrong in a systematic way: dates, numbers, identifiers, references to external entities.
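One way to express a stage's output contract, assuming a hypothetical deadline-calculation stage. The field name and plausibility windows are illustrative; the point is that checks run at the boundary, before the next stage consumes the output.

```python
from datetime import date

def validate_deadline_stage(output: dict) -> list[str]:
    """Contract checks on one stage's output before the next stage runs.
    Returns a list of violations; empty means the output passes."""
    errors = []
    deadline = output.get("deadline")
    if not isinstance(deadline, date):
        errors.append("deadline missing or not a date")
        return errors
    # Systematic-plausibility checks: dates in the past or far in the
    # future are more likely extraction errors than real deadlines.
    today = date(2024, 6, 1)  # fixed for the example; use date.today() in practice
    if deadline < today:
        errors.append(f"deadline {deadline} is in the past")
    elif (deadline - today).days > 365 * 5:
        errors.append(f"deadline {deadline} is implausibly far out")
    return errors
```

A wrong-but-plausible date like the one above might still pass these checks, which is why stages downstream should validate too: the goal is to shorten how far an error can travel, not to guarantee it never enters.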
Failure Mode 3: Prompt Drift
Your prompt works on 95% of production inputs. The remaining 5% breaks it in ways you didn't anticipate. This isn't a bug in the model — it's a mismatch between the prompt's implicit assumptions and the real distribution of production inputs.
Common prompt drift causes: documents in unexpected languages or character encodings, inputs with formatting that the prompt wasn't designed for, edge cases in domain-specific terminology, inputs that are deliberately or accidentally adversarial, inputs that combine multiple cases the prompt handles individually but not in combination.
The fix is an edge case library. Every time a prompt fails on a production input, that input goes into a labeled test set. Before any prompt change ships, it runs against the edge case library. Over time, the library becomes a comprehensive regression suite that encodes institutional knowledge about where your specific prompts are fragile. This is one of the most valuable artifacts a team can build.
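The regression run itself can be very simple. A sketch, assuming each edge case is stored as a JSON file with `input` and `expected` fields and `run_prompt` wraps the model call (both are assumptions about your setup, not a prescribed format):

```python
import json
from pathlib import Path

def run_edge_case_library(library_path, run_prompt):
    """Run every labeled edge case through the current prompt version
    and collect failures. A non-empty result blocks the prompt change."""
    failures = []
    for case_file in sorted(Path(library_path).glob("*.json")):
        case = json.loads(case_file.read_text())
        actual = run_prompt(case["input"])
        if actual != case["expected"]:
            failures.append((case_file.name, case["expected"], actual))
    return failures
```

Exact-match comparison is the simplest gate; for free-text outputs you would swap in a task-appropriate scorer, but the library structure stays the same.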
Failure Mode 4: External Dependency Failures
Agents call external systems: databases, APIs, file systems, internal services. Those systems go down. They time out. They return unexpected data formats. They return valid data that is stale or inconsistent. An agent that treats external dependencies as reliable is an agent that will fail unpredictably in production.
Circuit breakers for external calls are non-negotiable. If a dependency fails three times in a row, stop hitting it and fail fast with a clear error rather than continuing to retry and accumulate timeout latency. Each external call should have an explicit timeout — never wait indefinitely for a response.
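A minimal circuit breaker along these lines, assuming three consecutive failures open the circuit and a fixed cool-off before a probe call is allowed. Thresholds are illustrative; production breakers usually add per-dependency tuning and metrics.

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive failures, calls fail fast
    until `reset_after` seconds pass, then allow one probe."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: dependency failing, not retrying")
            self.opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit
        return result
```

The per-call timeout lives in `fn` itself (e.g. the HTTP client's timeout parameter); the breaker's job is only to stop sending calls at a dependency that is already failing.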
The dependency failure policy
For each external dependency in an agent workflow, define explicitly: what happens when this is unavailable? Can the workflow complete without it in degraded form? Can it checkpoint and resume when the dependency recovers? Does a failure in this dependency warrant human escalation, or should the agent retry silently? These decisions made in advance are decisions that don't become incidents.
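One way to make that policy explicit and reviewable is a small declarative table. The dependency names and chosen behaviors below are hypothetical examples, not recommendations for any particular system:

```python
from dataclasses import dataclass
from enum import Enum

class OnFailure(Enum):
    DEGRADE = "degrade"        # complete the workflow without this dependency
    CHECKPOINT = "checkpoint"  # pause and resume when it recovers
    ESCALATE = "escalate"      # route to a human
    RETRY = "retry"            # retry with backoff, no escalation

@dataclass(frozen=True)
class DependencyPolicy:
    name: str
    on_failure: OnFailure
    max_retries: int = 0
    timeout_s: float = 10.0  # never wait indefinitely

# Illustrative policies; real values depend on the workflow.
POLICIES = {
    "pricing_api": DependencyPolicy("pricing_api", OnFailure.RETRY, max_retries=2),
    "document_store": DependencyPolicy("document_store", OnFailure.CHECKPOINT),
    "compliance_check": DependencyPolicy("compliance_check", OnFailure.ESCALATE),
}
```

The value of the table is less the code than the design review it forces: every dependency must have an answer on record before launch.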
Failure Mode 5: Over-Confident Wrong Answers
Language models do not reliably know what they don't know. An agent can produce a confidently stated, grammatically perfect, completely wrong answer. In a human workflow, a reviewer might sense uncertainty in how something is stated and double-check. An automated workflow that takes model confidence at face value will process the wrong answer straight through.
Building in explicit uncertainty quantification helps but doesn't fully solve this. We ask models to self-rate confidence and flag ambiguous cases, but we don't rely on self-rated confidence as the primary quality gate. Business rule validation, consistency checks, and statistical anomaly detection (is this output unusual compared to the distribution of outputs on similar inputs?) are more reliable signals than a model's own confidence assessment.
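The statistical check is the easiest of the three to sketch. A simple z-score flag on a numeric feature of the output, assuming you keep a history of outputs from similar inputs (the threshold is illustrative):

```python
import statistics

def is_anomalous(value, history, z_threshold=3.0):
    """Flag an output whose numeric feature deviates more than
    `z_threshold` standard deviations from historical outputs
    on similar inputs."""
    if len(history) < 2:
        return False  # not enough data to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold
```

An anomalous output is not necessarily wrong, so the action on a flag is usually routing to human review rather than rejection.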
Engineering Practices That Prevent Breakage
- Checkpoint and resume: agents on long workflows should checkpoint state at each major stage. If the agent crashes, it should be able to resume from the last checkpoint rather than starting over. This requires explicitly serializable state — if your state contains objects that can't be serialized, you can't checkpoint.
- Graceful degradation: define what the agent does when it can't complete its task. Routing to a human review queue, returning a partial result with a flag, or returning a structured error — any of these is better than an unhandled exception or a silent failure.
- Adversarial testing: before launch, throw bad inputs at the agent deliberately. Empty documents, documents in wrong formats, documents with contradictory information, inputs designed to confuse the extraction logic. Surface the failure modes before your customers do.
- Production replay testing: replay real production inputs against new versions before deploying. Catch regressions on real data before they reach customers.
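The checkpoint-and-resume practice can be sketched in a few lines, assuming stage functions that take and return JSON-serializable state (the file-per-workflow layout is an assumption, not a prescribed design):

```python
import json
from pathlib import Path

def save_checkpoint(path, stage, state):
    """Persist workflow state after a completed stage. State must be
    JSON-serializable — exactly the constraint noted above."""
    Path(path).write_text(json.dumps({"stage": stage, "state": state}))

def load_checkpoint(path):
    """Return (next_stage, state) to resume from, or (0, {}) for a fresh run."""
    p = Path(path)
    if not p.exists():
        return 0, {}
    data = json.loads(p.read_text())
    return data["stage"], data["state"]

def run_workflow(stages, checkpoint_path):
    """Run stages in order, checkpointing after each; a restarted run
    skips stages that already completed."""
    start, state = load_checkpoint(checkpoint_path)
    for i in range(start, len(stages)):
        state = stages[i](state)
        save_checkpoint(checkpoint_path, i + 1, state)
    return state
```

A real agent would also record tool-call results and pending side effects in the state, so a resumed run doesn't repeat non-idempotent actions.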
Monitoring: Alert on Quality, Not Just Errors
Error rate monitoring catches hard failures. Quality monitoring catches the degradation that happens before hard failures — the gradual drift in extraction accuracy, the creeping increase in human review rates, the slowly growing tail of edge cases that aren't being handled well.
Alert when validation failure rates exceed baseline by more than 2 standard deviations. Alert when human review escalation rates increase week-over-week. Alert when average confidence scores drop. These signals arrive before the system fails visibly, which is when you want to investigate — not after. Claude's Constitutional AI training does reduce certain categories of failure (harmful outputs, certain hallucination patterns) but it doesn't substitute for application-layer quality monitoring on domain-specific tasks.
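The two-standard-deviation rule above is straightforward to implement over a rolling baseline. A sketch, assuming daily validation-failure rates as the metric:

```python
import statistics

def validation_failure_alert(baseline_rates, current_rate, sigmas=2.0):
    """Alert when the current validation failure rate exceeds the
    historical baseline mean by more than `sigmas` standard deviations."""
    mean = statistics.fmean(baseline_rates)
    stdev = statistics.stdev(baseline_rates)
    return current_rate > mean + sigmas * stdev
```

The same shape works for review escalation rates and confidence scores (inverted: alert on drops); the important part is that the baseline comes from your own system's history, not a fixed industry number.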
Want to talk through your project?
We're always happy to discuss real problems. No sales pitch.
Book a Discovery Call