Engineering · 10 min read

AI-Powered Document Processing: A Technical Deep Dive

A technical walkthrough of how we built a Claude-powered document processing pipeline that handles 10,000+ documents daily — architecture, prompt design, validation, and the lessons learned along the way.

Why Document Processing Is an Ideal AI Use Case

Document processing sits at an intersection that makes it particularly well-suited for AI: high volume, repetitive structure, a genuine need for reasoning, and significant human cost. A financial services firm processing thousands of loan applications, insurance claims, or compliance disclosures daily has a workforce dedicated to reading, extracting, and interpreting information from documents — work that is simultaneously too varied for simple rule-based automation and too repetitive to be an effective use of human expertise.

Before large language models, document processing automation required extensive rules engineering, named entity recognition models trained on domain-specific corpora, and significant manual handling of edge cases. The economics rarely justified the investment except at very large scale. LLMs change the calculus substantially: they can understand context, handle format variation, reason about ambiguous content, and operate across document types without retraining. The challenge has shifted from "can the technology do this" to "how do we build a reliable production system around the technology."

The Naive Approach and Why It Fails

The naive approach is to take a document, pass it directly to the API with a prompt like "extract these fields," and use the output. This works in demos. In production at scale, it fails in predictable ways.

First, it ignores document preprocessing. PDFs are not text — they're visual layout objects. A PDF passed directly to a text extraction layer loses table structure, column relationships, and header/footer context. Scanned PDFs contain no text at all. Before a document reaches the LLM, it needs to go through a preprocessing pipeline that converts it to clean, structured text that the model can reason about.

Second, it provides no validation. Raw model output goes directly to downstream systems. This is an engineering error. Field extractions that fail validation, values that don't match expected formats, cross-field inconsistencies — all of these need to be caught before they touch anything consequential. The naive approach has none of this.

Document Preprocessing: The Foundation

Our preprocessing pipeline has three stages. The first is format detection and routing: is this a native PDF, a scanned PDF, a Word document, an image? Each type requires different handling. Native PDFs are parsed with a library like pdfplumber or pymupdf that preserves spatial relationships. Scanned PDFs and images go through OCR — we use AWS Textract for documents where table structure matters, and Tesseract for simpler cases where cost sensitivity is higher.
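The routing decision can be sketched as a small dispatcher. This is an illustrative sketch, not our production code: the parser names mirror the tools mentioned above, and `extracted_chars` stands in for a hypothetical quick probe of the PDF's text layer.

```python
from pathlib import Path

def route_document(filename: str, extracted_chars: int = 0) -> str:
    """Route a document to the right handler.

    extracted_chars is what a quick text-layer probe found in the file
    (0 for scanned PDFs, which have no text layer).
    """
    suffix = Path(filename).suffix.lower()
    if suffix in {".doc", ".docx"}:
        return "word-parser"
    if suffix in {".png", ".jpg", ".jpeg", ".tiff"}:
        return "ocr"  # Textract where table structure matters, else Tesseract
    if suffix == ".pdf":
        # Native PDFs keep spatial structure via pdfplumber/pymupdf;
        # scanned PDFs fall through to OCR.
        return "native-pdf" if extracted_chars > 0 else "ocr"
    raise ValueError(f"unsupported document type: {suffix!r}")
```

In production this probe would come from the PDF library itself; the point is that routing happens before any model sees the document.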

The second stage is normalization: converting whatever came out of stage one into a consistent text representation that the model can work with. We normalize whitespace, handle encoding issues, remove headers and footers that would otherwise consume context window space on every page, and restructure tables into a format the model handles well (markdown tables, not space-aligned ASCII).
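Header and footer stripping can be done by frequency: a line that recurs on most pages is boilerplate. A minimal sketch, with an illustrative threshold and sample pages:

```python
import re
from collections import Counter

def strip_repeated_lines(pages: list[str], min_ratio: float = 0.8) -> list[str]:
    """Drop lines that recur on most pages (headers/footers), then
    collapse runs of spaces and tabs."""
    counts = Counter()
    for page in pages:
        # count each distinct line once per page
        for line in {l.strip() for l in page.splitlines()}:
            counts[line] += 1
    threshold = max(2, int(len(pages) * min_ratio))
    boiler = {line for line, n in counts.items() if line and n >= threshold}
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if l.strip() not in boiler]
        cleaned.append(re.sub(r"[ \t]+", " ", "\n".join(kept)).strip())
    return cleaned

pages = [
    "ACME BANK - CONFIDENTIAL\nBorrower: Jane Doe\nPage 1",
    "ACME BANK - CONFIDENTIAL\nIncome:  $85,000\nPage 2",
]
cleaned = strip_repeated_lines(pages)  # repeated header removed from both pages
```

Table restructuring into markdown is more involved and depends on the parser's spatial output, so it is omitted here.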

The third stage is chunking for very long documents. Documents over about 100,000 characters are split at logical boundaries (section headers, page breaks) into overlapping chunks. The overlap ensures that information near chunk boundaries is fully captured. Each chunk is processed independently and the results are merged. With Claude's 200K context window, chunking is only necessary for very long documents — most documents process as a single context.
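A minimal version of boundary-aware chunking with overlap might look like this. The character limits match the numbers above; using blank lines as the boundary signal is a simplification of the section-header/page-break logic:

```python
def chunk_text(text: str, max_chars: int = 100_000, overlap: int = 2_000) -> list[str]:
    """Split a long document into overlapping chunks, cutting at blank
    lines (a stand-in for section/page boundaries) where possible."""
    if len(text) <= max_chars:
        return [text]  # most documents fit in a single context
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            cut = text.rfind("\n\n", start, end)
            # only back up if it still leaves forward progress past the overlap
            if cut > start + overlap:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # overlap captures info near the boundary twice
    return chunks
```

The merge step (deduplicating fields extracted from overlapping regions) lives downstream and is not shown.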

The Two-Stage Pipeline: Classify, Then Extract

A key architectural decision: run a fast classification stage before deep extraction. The classification stage uses a lightweight prompt and Haiku to determine the document type, version, and any characteristics relevant to extraction (is this a single-borrower or joint application? is it a standard form or a non-standard document?). Classification is fast and cheap — typically under 200 output tokens.

The extraction stage then receives both the document content and the classification result. This allows the extraction prompt to be specialized for the document type rather than trying to handle all document types with a single universal prompt. Specialized prompts are more reliable, more accurate, and can include type-specific validation rules. A loan application extraction prompt knows the expected fields for a standard 1003 form and can flag when a field that should always be present is missing.

Cost comparison

Classification at 200 output tokens per document costs roughly a tenth of what extraction costs at 2,000 tokens. Routing documents to specialized extraction prompts via the classification result reduces the extraction failures a one-size-fits-all prompt would produce. The two-stage approach costs slightly more than a single-stage approach but reduces human review requirements by more than enough to compensate.
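The token arithmetic behind that 10x figure, with a deliberately made-up per-token rate (this is not real API pricing; only the 200 and 2,000 token counts come from the text). In practice the gap is wider still, because classification also runs on a cheaper model:

```python
# Back-of-envelope only: RATE is a placeholder, NOT real API pricing.
RATE = 15e-6  # hypothetical dollars per output token

classify_cost = 200 * RATE    # classification stage output
extract_cost = 2_000 * RATE   # extraction stage output
ratio = extract_cost / classify_cost
print(f"extraction costs {ratio:.0f}x classification")
```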

Prompt Design for Extraction

Extraction prompts need to be precise about what they want. We structure extraction prompts with four components: role and context (what the model is and why it's doing this task), field definitions (each field with its name, data type, expected format, and any extraction rules), output schema (exact JSON schema the response must conform to), and handling instructions (what to do when a field is absent, ambiguous, or has multiple plausible values).
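A sketch of assembling those four components into a prompt string. The section labels, field descriptor shape, and example values are illustrative, not our exact production prompt:

```python
def build_extraction_prompt(role: str, fields: list[dict], schema: str, handling: str) -> str:
    """Assemble the four components: role/context, field definitions,
    output schema, and handling instructions."""
    field_lines = "\n".join(
        f"- {f['name']} ({f['type']}, format: {f['format']}): {f['rules']}"
        for f in fields
    )
    return (
        f"{role}\n\n"
        f"Fields to extract:\n{field_lines}\n\n"
        f"Respond only with JSON conforming to this schema:\n{schema}\n\n"
        f"Handling instructions:\n{handling}"
    )

prompt = build_extraction_prompt(
    role="You extract fields from loan application documents.",
    fields=[{"name": "loan_amount", "type": "number",
             "format": "USD, no separators", "rules": "use the final approved amount"}],
    schema='{"type": "object", "properties": {"loan_amount": {"type": "number"}}}',
    handling="If a field is absent, return null; if ambiguous, return all candidates with low confidence.",
)
```

Keeping the components separate in code makes it easy to swap field definitions per document type while reusing the scaffold.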

For each field, the confidence output is important. We ask the model to return, alongside the extracted value, a confidence level (high/medium/low) and a source quote from the document that supports the extracted value. The source quote serves double duty: it helps human reviewers verify extractions quickly, and it catches cases where the model is confabulating rather than extracting — if the claimed source text doesn't actually appear in the document, something has gone wrong.
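The confabulation check on source quotes can start as a whitespace- and case-insensitive substring test. A sketch only; production matching may need to be fuzzier to survive OCR noise:

```python
import re

def quote_appears(quote: str, document: str) -> bool:
    """True if the claimed source quote occurs in the document,
    ignoring whitespace and case differences from preprocessing."""
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip().lower()
    return norm(quote) in norm(document)
```

A failed check does not prove the value is wrong, but it is a strong signal to route the field to human review.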

We use Claude's tool use feature with a strict JSON schema for extraction output. This provides stronger output format guarantees than asking for JSON in the message body. The schema validates at the model layer before we even see the response, catching formatting errors early.
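A tool definition for the Messages API follows the `name`/`description`/`input_schema` shape. The field list below is an illustrative subset, not our full schema:

```python
# Illustrative subset of an extraction schema in the Messages API tool format.
extraction_tool = {
    "name": "record_extraction",
    "description": "Record fields extracted from a loan application.",
    "input_schema": {
        "type": "object",
        "properties": {
            "borrower_name": {"type": "string"},
            "loan_amount": {"type": "number"},
            "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
            "source_quote": {"type": "string"},
        },
        "required": ["borrower_name", "loan_amount", "confidence", "source_quote"],
    },
}
# Passed as tools=[extraction_tool], with
# tool_choice={"type": "tool", "name": "record_extraction"}
# to force the model to answer through the schema.
```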

Output Validation and Confidence-Based Routing

After extraction, every document passes through a three-layer validation pipeline. Layer one is structural: does the JSON conform to the schema, are required fields present, are data types correct. Layer two is business rules: dates within expected ranges, values matching enumerated sets, relationships between fields consistent (a property address that contradicts a stated location, for example). Layer three is cross-document: for batches, do outlier values across the batch suggest a systematic error?
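Layers one and two can be sketched as a single function returning a list of violations. Field names, ranges, and rules here are illustrative; layer three needs the whole batch and is omitted:

```python
from datetime import date

def validate(record: dict) -> list[str]:
    """Layer 1 (structural) and layer 2 (business rules), illustrative only."""
    errors = []
    # Layer 1: required fields and types
    for field, typ in (("borrower_name", str),
                       ("loan_amount", (int, float)),
                       ("application_date", str)):
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(f"wrong type for {field}")
    # Layer 2: business rules
    amount = record.get("loan_amount")
    if isinstance(amount, (int, float)) and not (1_000 <= amount <= 10_000_000):
        errors.append("loan_amount outside plausible range")
    try:
        if date.fromisoformat(record.get("application_date", "")) > date.today():
            errors.append("application_date in the future")
    except ValueError:
        errors.append("application_date not ISO formatted")
    return errors
```

Returning all violations rather than failing on the first lets the review interface highlight every problem field at once.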

The routing logic after validation uses both the validation results and the per-field confidence scores. Documents where all fields pass validation and all fields are high-confidence are auto-processed — they go straight to the downstream system. Documents where fields fail validation or have low-confidence extractions are routed to a human review queue, with the review interface pre-populated with the extracted values and highlighted fields that need attention.
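Once validation results and per-field confidences are in hand, the routing rule itself is small (a sketch with hypothetical labels; the full-manual route for non-standard documents is decided earlier, at classification):

```python
def route_after_validation(errors: list[str], confidences: dict[str, str]) -> str:
    """Auto-process only when every field passes validation and every
    per-field confidence is high; everything else goes to review."""
    if not errors and confidences and all(c == "high" for c in confidences.values()):
        return "auto"
    return "review"  # review queue, pre-populated with flagged fields
```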

In our financial services deployment, 78% of documents auto-process with no human touch. 16% require a quick human review (average 90 seconds per document — mostly confirming a few flagged fields). 6% require full human processing, typically because the document is in a non-standard format or has data quality issues. The baseline was 100% human processing at an average of 18 minutes per document.

The Human Review Interface

The design of the review interface significantly affects the throughput and accuracy of the human review stage. Our interface shows reviewers the original document (in a PDF viewer), the extracted fields with their confidence scores, and the source quote for each field. Fields that failed validation or have low confidence are highlighted. Reviewers click a field, see its source quote highlighted in the original document, and confirm or correct the extracted value.

Corrections feed back into the system in two ways: immediately, they update the processed record; over time, they accumulate as labeled training data that we use to improve extraction prompts and identify systematic failure patterns. A field that is corrected consistently tells us something specific about where the extraction logic is failing, which is directly actionable in prompt tuning.
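Aggregating corrections to surface systematic failure patterns can start as simple frequency counting. The record shape and threshold here are assumptions:

```python
from collections import Counter

def systematic_failures(corrections: list[dict], min_count: int = 3) -> list[str]:
    """Fields reviewers correct at least min_count times, most frequent
    first -- candidates for prompt tuning."""
    counts = Counter(c["field"] for c in corrections)
    return [field for field, n in counts.most_common() if n >= min_count]
```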

Performance Numbers and What We'd Do Differently

From our financial services deployment at steady state: 10,400 documents per day; average end-to-end processing time of 23 seconds (preprocessing + classification + extraction + validation); 99.2% pipeline uptime; 97.1% field-level extraction accuracy on auto-processed documents. API costs run $0.38 per document, including all stages and retries. Total cost per processed document, with human review amortized, is $1.12, down from an estimated $8.40 fully-loaded cost in the manual baseline.

What we'd do differently: invest more in preprocessing earlier. We underestimated how much document quality variance would affect extraction accuracy, and spent the first six weeks post-launch improving the preprocessing pipeline rather than the extraction prompts. The extraction prompts were mostly right; the inputs they were receiving were not clean enough. Preprocessing quality is upstream of everything else and deserves proportionate attention in the design phase.

Want to talk through your project?

We're always happy to discuss real problems. No sales pitch.

Book a Discovery Call