The most common question we get before starting an AI project: "Which model should we use?"
It's a reasonable question. But it's usually the wrong question to ask first.
The model isn't the product
Here's the thing about foundation models: they're commodities. Not identical — Claude is better at some things than GPT-4, which is better at some things than Gemini. But the performance differences between top-tier models on most business tasks are smaller than the performance differences between good and bad implementations of the same model.
Put another way: a mediocre implementation of Claude will underperform a good implementation of GPT-3.5 on most real-world tasks.
The model is an input to your system, not the system itself. What matters is what you build around it.
What actually determines production quality
Prompt engineering and context design.
The quality of what you send to the model matters enormously. Clear task framing, relevant context, few-shot examples, explicit output format requirements — these have more impact on output quality than model choice.
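To make that concrete, here is a minimal sketch of deliberate prompt construction — explicit task framing, retrieved context, a few-shot example, and a required output format assembled in one place. All names and example content here are illustrative, not from any particular system.

```python
def build_prompt(task: str, context: str, examples: list[tuple[str, str]],
                 output_format: str) -> str:
    """Assemble a prompt from explicit task framing, relevant context,
    few-shot examples, and an output-format requirement."""
    parts = [f"Task: {task}", f"Context:\n{context}"]
    for example_input, example_output in examples:
        parts.append(f"Example input: {example_input}\n"
                     f"Example output: {example_output}")
    parts.append(f"Respond only in this format: {output_format}")
    return "\n\n".join(parts)

prompt = build_prompt(
    task="Classify the support ticket's urgency.",
    context="Ticket: 'Our checkout page is down for all customers.'",
    examples=[("Ticket: 'How do I reset my password?'",
               '{"urgency": "low"}')],
    output_format='{"urgency": "low" | "medium" | "high"}',
)
```

The point is not this particular template; it's that every one of these elements is a deliberate decision your team owns, regardless of which model sits behind the prompt.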
Retrieval quality.
If you're building a RAG system, your retrieval layer determines what context the model has to work with. Bad retrieval means the model is answering without relevant information, regardless of how capable it is. Chunking strategy, embedding model choice, reranking — these are high-leverage decisions that teams underinvest in.
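Chunking is a good example of one of those underinvested decisions. A sketch of the simplest strategy — fixed-size windows with overlap so that facts straddling a boundary appear intact in at least one chunk (sizes here are arbitrary; real systems tune them, and often chunk on semantic boundaries instead):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    Assumes size > overlap. The overlap keeps boundary-straddling
    content retrievable from at least one chunk.
    """
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last window already covers the end of the text
    return chunks
```

Every parameter here — window size, overlap, whether to split on characters, tokens, or sentences — shifts what the model can see, before the model is ever involved.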
Output validation.
Every model produces garbage sometimes. The question is whether your system catches it. Structured output validation, confidence scoring, human-review queues for edge cases — these are what separate production systems from demos.
Latency and cost architecture.
You don't need GPT-4 for every query. A classification step at the front of your pipeline can route simple queries to smaller, cheaper, faster models. This isn't theoretical — we've seen systems cut per-interaction costs by 60% with tiered model routing, with no measurable quality degradation.
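A sketch of that front-of-pipeline router. A production system would use a trained classifier; the keyword-and-length rules below stand in for one, and the tier names are placeholders:

```python
COMPLEX_SIGNALS = ("explain", "compare", "why", "summarize")

def route_query(query: str) -> str:
    """Route a query to a model tier based on a cheap complexity check.

    Long queries and reasoning-flavored requests go to the expensive
    tier; everything else goes to the small, fast, cheap tier.
    """
    is_long = len(query.split()) > 30
    has_complex_intent = any(w in query.lower() for w in COMPLEX_SIGNALS)
    if is_long or has_complex_intent:
        return "large-model"
    return "small-model"
```

The router itself is cheap enough to run on every request, which is what makes the economics work: you pay the large-model price only for the queries that need it.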
Failure handling.
What happens when the model API returns a 429? What happens when the model returns a response that fails your output schema? What happens when the model is confidently wrong? Systems that handle these cases gracefully are production systems. Systems that don't are demos.

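The first of those questions — the 429 — has a standard answer: retry with exponential backoff, and give up loudly after a bounded number of attempts. A sketch, with a stand-in exception rather than any real SDK's error type:

```python
import time

class RateLimitError(Exception):
    """Stand-in for a 429 from the model API."""

def call_with_retries(call, max_retries: int = 3, base_delay: float = 1.0):
    """Retry a zero-argument model call on rate limits.

    Backs off exponentially (base_delay, 2x, 4x, ...) and re-raises
    after the final attempt so failures surface instead of vanishing.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: fail loudly, don't swallow it
            time.sleep(base_delay * 2 ** attempt)
```

Schema failures and confidently-wrong answers need their own paths — re-prompting, fallback models, human review — but the shape is the same: every failure mode gets an explicit, tested branch.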
When model choice does matter
There are cases where model selection is genuinely consequential: hard latency or cost ceilings that rule out the largest models, compliance constraints that dictate where and how a model can be hosted, and tasks at the edge of current capability, where the gap between top-tier models is real. But these are the exception, not the starting point.
A better way to think about it
Instead of leading with "which model?", lead with "what does success look like?" Define your quality threshold, your latency requirement, your cost ceiling, and your compliance constraints. Those parameters narrow the model options significantly, usually to two or three viable choices.
Then evaluate those options empirically — not on benchmarks, but on your specific task with your specific data. Model choice is a decision that comes after you've defined the problem clearly. It's an engineering decision, not a product decision. Treat it like one.
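The empirical evaluation doesn't need to be elaborate to be useful. A sketch of the core loop — score each candidate on your own labeled cases and compare; `model_fn` is any callable wrapping a candidate model, and exact-match scoring is the simplest possible metric (real tasks often need fuzzier ones):

```python
def evaluate(model_fn, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of task-specific cases where the candidate model's
    output matches the expected answer exactly."""
    correct = sum(1 for inp, expected in test_cases
                  if model_fn(inp) == expected)
    return correct / len(test_cases)
```

Run it once per candidate over the same cases, and the "which model?" debate turns into a table of numbers on your task rather than a benchmark argument.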
We help enterprise teams navigate these decisions as part of our AI strategy practice — and then build the implementation around them.
Learn about AI Strategy