Engineering · 8 min read

How We Deploy Claude in Production Environments

A practical guide to taking Claude from API playground to production system — covering prompt management, error handling, observability, and the operational concerns most tutorials ignore.

The Playground-to-Production Gap

Every enterprise AI project we've touched starts the same way: a developer spends an afternoon in the Anthropic console, gets impressive results, and walks into a meeting with a demo. The demo works. The stakeholders are convinced. Then the real work begins — and it's nothing like the demo.

The playground strips away every concern that makes production hard. There's no concurrent load, no malformed input, no rate limits, no cost pressure, no audit trail, and no one calling you at 2am because the system returned garbage. Getting Claude working in the playground is maybe 10% of the engineering effort. The other 90% is everything else.

What follows is the accumulated knowledge from deploying Claude-powered systems across financial services, logistics, and operations — the practices that actually make a difference between a demo and a system you can rely on.

Environment Management and Prompt Versioning

Treat prompts like code. That means version control, environment-specific configurations, and a deployment pipeline. We store all system prompts in the application repository under a prompts/ directory, structured by environment. Dev prompts can be loose and exploratory. Staging prompts mirror production but run against a test dataset. Production prompts are locked, reviewed, and tested before deployment.

Every prompt gets a version identifier — we use semver. When a prompt changes, it gets a new version, and that version is logged with every API call. This sounds like overhead until you need to debug why extraction quality dropped on Tuesday: you pull the logs, see that prompt v2.3.1 was deployed at 14:23, and immediately know where to look.
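A minimal sketch of what this looks like in practice. The registry layout, prompt names, and field names here are illustrative, not our actual code; the point is that the version travels with the prompt and lands in every log entry:

```python
# Versioned prompt registry sketch: the prompt text and its semver tag are
# loaded together, so the version can be attached to every API call log.
from dataclasses import dataclass

@dataclass(frozen=True)
class Prompt:
    name: str
    version: str  # semver, bumped on every change
    text: str

# In practice these are loaded from the prompts/<environment>/ directory
# at startup; a hardcoded example entry stands in for that here.
PROMPTS = {
    "invoice-extraction": Prompt(
        name="invoice-extraction",
        version="2.3.1",
        text="Extract the following fields from the invoice...",
    ),
}

def get_prompt(name: str) -> Prompt:
    """Return the prompt together with its version, never the text alone."""
    return PROMPTS[name]

def log_fields(prompt: Prompt, model: str) -> dict:
    """Fields attached to the log entry for every API call."""
    return {
        "prompt_name": prompt.name,
        "prompt_version": prompt.version,
        "model": model,
    }
```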

We also maintain a prompt test suite. Each prompt has a directory of input/expected-output pairs. Before any prompt change ships, the test suite runs automatically. It won't catch everything — LLM outputs aren't deterministic — but it catches regressions on known-good cases, which is most of what you need.
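The suite runner itself can be very small. This sketch assumes one file-naming convention (`<case>.input.json` paired with `<case>.expected.json`); the model-calling function is injected so the same suite can run against a stub in CI or the real API before a prompt ships:

```python
import json
from pathlib import Path

def run_prompt_suite(case_dir: Path, call_model) -> list[str]:
    """Run every input/expected pair under case_dir; return failing case names.

    call_model takes the parsed input payload, sends it through the prompt
    under test, and returns the parsed output to compare against expected.
    """
    failures = []
    for case in sorted(case_dir.glob("*.input.json")):
        expected_path = case.with_name(case.name.replace(".input.", ".expected."))
        expected = json.loads(expected_path.read_text())
        actual = call_model(json.loads(case.read_text()))
        if actual != expected:
            failures.append(case.stem)
    return failures
```

Exact-match comparison only works for structured outputs on known-good cases, which is exactly the regression-catching role described above.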

Error Handling That Actually Works

The Anthropic API returns errors in predictable categories, and each category requires a different response. Rate limit errors (429) should trigger exponential backoff with jitter — we start at 1 second and cap at 60, with full jitter to avoid a thundering herd on retry. Timeout errors depend on context: if you're processing a document batch at 3am, retry immediately; if a user is waiting for a response, fail fast and show a graceful error.
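The backoff schedule described above (1-second base, 60-second cap, full jitter) can be sketched as a generic retry wrapper. The exception-classification hook is left as a parameter because the exact exception classes depend on your client library:

```python
import random
import time

def with_backoff(call, max_attempts=6, base=1.0, cap=60.0, is_retryable=None):
    """Retry `call` on retryable errors with full-jitter exponential backoff.

    Sleep is drawn uniformly from [0, min(cap, base * 2**attempt)], so many
    clients retrying at once don't synchronize into a thundering herd.
    `is_retryable` decides which exceptions warrant a retry (e.g. a 429);
    everything else is re-raised immediately.
    """
    is_retryable = is_retryable or (lambda exc: True)
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            time.sleep(random.uniform(0.0, min(cap, base * 2 ** attempt)))
```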

Content filtering is the one that catches teams off guard. Claude will occasionally decline to process a request that is completely legitimate — a financial disclosure document that triggers safety filters, a legal contract with terminology that looks dangerous out of context. Build explicit handling for these cases. Log the refusal, alert your team, and have a fallback path (manual review queue, alternative prompt framing, or a different model). Never let a content filter silently drop a business-critical request.
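How a refusal surfaces varies by model and API version, so the detection heuristic below is purely illustrative, as are the queue and alert shapes; the part worth copying is the explicit log-alert-queue path instead of a silent drop:

```python
# Hypothetical refusal routing. The marker strings are a stand-in for
# whatever refusal signal your integration actually observes.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm not able to")

def route_response(text: str, request_id: str, review_queue: list, alert):
    """Return the text if usable; otherwise log, alert, and queue for review."""
    if any(marker in text.lower() for marker in REFUSAL_MARKERS):
        alert(f"content filter refusal on request {request_id}")
        review_queue.append(request_id)  # never silently drop the request
        return None
    return text
```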

Malformed outputs are the third category. Claude is usually very good, but "usually" is not a production-grade reliability guarantee. When you ask for JSON and get a preamble paragraph followed by JSON, your parser breaks. When you ask for a specific schema and get a near-match, downstream systems break. Build output normalization and validation into every pipeline. We'll say more about this below.
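The preamble-before-JSON case, for example, is cheap to normalize. This is a deliberately naive sketch: it grabs the outermost brace-delimited span and fails loudly if nothing parses, rather than letting a broken parse propagate:

```python
import json

def extract_json(raw: str) -> dict:
    """Pull the first JSON object out of a response that may carry a preamble.

    Naive normalization: take the span from the first '{' to the last '}'
    and parse it. Raises ValueError when no object is found so the failure
    is explicit instead of silent.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])
```

This breaks if the preamble itself contains braces; structured output modes (covered in the tips below) are the more robust fix, with this as a backstop.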

Observability: What to Actually Log

Most teams log too little early on, then scramble to add observability after something breaks. These are the fields we log on every API call, without exception:

  • Prompt version and model version — so you can correlate quality changes to deployments
  • Input and output token counts — for cost tracking and detecting prompt expansion over time
  • End-to-end latency and time-to-first-token — separately, because they tell different stories
  • Output validation result — did the output pass schema validation and business rule checks?
  • A truncated hash of the input — for deduplication and cache hit analysis, without storing PII
  • Retry count and error type if any — to surface systemic API issues

We also run a weekly quality sampling process: a random 2% of outputs go through a lightweight automated scoring pass that checks for known quality indicators specific to the use case. Not perfect, but it catches gradual degradation before it becomes a business problem.

Cost Management in Practice

API costs at scale are real. A system processing 100,000 documents per day with an average of 2,000 input tokens and 500 output tokens per document, running on Claude Sonnet, costs roughly $2,000–$4,000 per day at standard rates. That's manageable, but it can surprise finance teams who approved a project without a real cost model.
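The arithmetic behind a cost model is simple enough to hand to finance as a function. Per-million-token rates change over time and differ by model, so they are deliberately parameters here rather than hardcoded constants:

```python
def daily_cost(docs_per_day: int, in_tokens: int, out_tokens: int,
               rate_in_per_mtok: float, rate_out_per_mtok: float) -> float:
    """Daily API cost in dollars for a document-processing workload.

    Rates are dollars per million tokens; look up current pricing rather
    than baking assumed numbers into the model.
    """
    input_cost = docs_per_day * in_tokens / 1e6 * rate_in_per_mtok
    output_cost = docs_per_day * out_tokens / 1e6 * rate_out_per_mtok
    return input_cost + output_cost
```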

The most effective cost lever is model routing. Not every task needs Sonnet. We maintain a classification layer that routes requests by complexity: simple extraction tasks that are well-defined and low-stakes go to Haiku, medium-complexity analysis goes to Sonnet, and anything that requires careful reasoning or has high business impact gets Sonnet with extended thinking or Opus. Getting this right typically reduces costs by 40–60% with no perceptible quality degradation on the Haiku-routed tasks.
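A routing layer can start as something this small. The model identifiers and the complexity/stakes signals are placeholders; in a real system the classification feeding this function is where most of the work lives:

```python
# Complexity-based model routing sketch. Model names are placeholders for
# whatever model IDs are current when you deploy.
def route_model(task_type: str, stakes: str) -> str:
    """Pick a model tier from a coarse task classification."""
    if task_type == "simple_extraction" and stakes == "low":
        return "claude-haiku"
    if stakes == "high" or task_type == "careful_reasoning":
        return "claude-opus"  # or Sonnet with extended thinking
    return "claude-sonnet"
```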

Prompt caching is the second lever. Anthropic's prompt caching feature can dramatically reduce costs for workloads where the system prompt is large and repeated. We've seen 60–80% cost reductions on document processing workloads with large context windows by caching the system prompt and static document context. The implementation overhead is minimal — it's largely a configuration change.
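In outline, the change is marking the large static blocks as cacheable in the request body. The `cache_control` shape below follows Anthropic's prompt caching API as documented at the time of writing, and the model string is a placeholder; verify both against current documentation before relying on this:

```python
def build_cached_request(system_prompt: str, static_context: str,
                         user_msg: str) -> dict:
    """Request-body sketch with the static, repeated blocks marked cacheable.

    The system prompt and the static document context carry cache_control;
    the per-request user message stays uncached.
    """
    return {
        "model": "claude-sonnet-4-5",  # placeholder model ID
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": static_context,
                 "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": user_msg},
            ]},
        ],
    }
```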

Output Validation: The Non-Negotiable

Never trust raw model output in a production workflow. This is not a criticism of Claude — it's an acknowledgment of reality. Language models are probabilistic systems. They will, occasionally, produce output that is well-formed, confident, and wrong. In a document extraction pipeline, that means a field gets wrong data. In an agentic system, it means a wrong action gets executed.

Every output should pass through at least three validation layers before it touches anything consequential. First, structural validation: does the output match the expected format? JSON schema validation, regex checks on key fields, type checking. Second, business rule validation: does the extracted data make sense given domain knowledge? A date of birth in 1850 is a red flag. A price field with letters in it is a red flag. Third, consistency checks: if you extracted multiple related fields, do they agree with each other?
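The three layers compose into a single validator. The field names and rules below are a hypothetical invoice schema, not a real one; the structure (collect every error, never short-circuit to a crash) is the part that carries over:

```python
import re

def validate_extraction(out: dict) -> list[str]:
    """Run structural, business-rule, and consistency checks; return errors."""
    errors = []
    # 1. Structural: required fields present with the right types.
    for field, typ in [("total", (int, float)), ("currency", str),
                       ("line_items", list)]:
        if not isinstance(out.get(field), typ):
            errors.append(f"structural: {field} missing or wrong type")
    # 2. Business rules: values plausible for the domain.
    if isinstance(out.get("total"), (int, float)) and out["total"] < 0:
        errors.append("business: negative total")
    if isinstance(out.get("currency"), str) and \
            not re.fullmatch(r"[A-Z]{3}", out["currency"]):
        errors.append("business: currency not an ISO 4217 code")
    # 3. Consistency: related fields agree with each other.
    items = out.get("line_items")
    if isinstance(items, list) and isinstance(out.get("total"), (int, float)):
        if items and abs(sum(items) - out["total"]) > 0.01:
            errors.append("consistency: line items do not sum to total")
    return errors
```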

Fallback Pattern

When validation fails, don't crash — route to a fallback. For critical paths, the fallback is a human review queue. For non-critical paths, it might be a retry with a more explicit prompt, or a default value with a flag for later review. What it should never be is silent failure or blindly passing invalid data downstream.

We've adopted a pattern we call "extract with confidence" — the model is asked to return each field alongside a confidence score (high/medium/low) and a reason if confidence is not high. High-confidence extractions are auto-processed. Medium confidence gets a lightweight automated cross-check. Low confidence always routes to human review. The false negative rate on this pattern is low enough to make the economics work well even with human review costs factored in.
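The dispatch side of the pattern is deliberately boring. Queue names are illustrative; the one design choice worth noting is that anything other than an explicit "high" or "medium" falls through to human review, so an unexpected confidence value fails safe:

```python
def route_field(field: str, value, confidence: str, reason: str = "") -> str:
    """Dispatch one extracted field by its self-reported confidence level.

    Returns the name of the queue the field should land in.
    """
    if confidence == "high":
        return "auto_process"
    if confidence == "medium":
        return "automated_cross_check"
    return "human_review"  # low, or anything unexpected: fail safe
```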

Practical Tips from Real Deployments

  • Set explicit max_tokens on every call. Open-ended generation is an attack surface for runaway costs and a source of slow response times.
  • Use structured output (JSON mode or tool use with schemas) wherever possible. Free-text output that you then parse is fragile by design.
  • Build a prompt canary system: a small set of canonical test inputs that run against every new model version before it routes production traffic.
  • Separate your prompt engineering work from your application code. Prompts that live deep in application logic are impossible to test and painful to change.
  • Plan for Anthropic releasing new model versions. Have a tested upgrade path and don't pin to deprecated model versions until the last possible moment.
  • Document your failure modes before launch. Know what the system does when the API is down, when outputs fail validation, and when costs spike unexpectedly.
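The canary idea from the list above fits in a few lines. The canary format (an input plus a predicate) is an assumption for illustration; the gate simply refuses to route traffic unless every canonical case passes against the candidate model:

```python
def run_canaries(call_model, canaries: list[dict]) -> bool:
    """Gate a new model version: every canonical input must pass its check
    before production traffic is routed to it.

    Each canary is {"input": ..., "check": predicate-on-output}.
    """
    return all(c["check"](call_model(c["input"])) for c in canaries)
```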

Production AI is just software engineering with a probabilistic component. The same principles apply: observability, defensive coding, graceful degradation, and the discipline to build things properly even when the demo already works. Teams that treat prompt engineering as the whole job end up with systems that work 90% of the time. The remaining 10% is where production systems are actually built.

Want to talk through your project?

We're always happy to discuss real problems. No sales pitch.
