How Can Engineering Teams Cut LLM Costs Without Hurting Quality?
June 2, 2026
Gartner forecast worldwide spending on generative AI to reach $644 billion in 2025, up 76.4% from 2024 (Gartner, 2025). Yet Gartner also predicted that at least 30% of GenAI projects would be abandoned after proof of concept by the end of 2025, naming escalating costs among the drivers (Gartner, 2024).
That gap is the whole problem. The per-token price of intelligence keeps falling, but bills keep climbing. This guide covers four levers engineering teams can pull today — right-sizing models, caching, batching, and trimming tokens — plus how to measure where the money actually goes.
Key Takeaways
- Inference prices drop roughly 10x per year for a fixed quality bar (a16z, 2024) — but rising usage outpaces it.
- Prompt caching cuts costs up to 90% and latency up to 85% on long, repeated prompts (Anthropic, 2025).
- Routing simple tasks to a small model is often 5x–16x cheaper than a frontier model.
- The Batch API gives a flat 50% discount on async work — the fastest win you can ship this week.
Why are token prices collapsing while your bill keeps climbing?
The cost of a fixed level of model quality falls about 10x every year — a trend Andreessen Horowitz calls "LLMflation" (a16z, 2024). Stanford measured the same collapse: querying a GPT-3.5-quality model fell from roughly $20 to $0.07 per million tokens between late 2022 and late 2024 (Stanford HAI, 2025).
So why isn't your invoice shrinking? Because cheap tokens invite more tokens. Teams add retrieval, longer context, multi-step agents, and retries. Each is reasonable on its own. Together they grow usage faster than prices fall.
| Date | Cheapest model at GPT-3 quality | Price ($/1M tokens) |
|---|---|---|
| Nov 2021 | GPT-3 (text-davinci) | $60.00 |
| ~2023 | interim models | ~$1.00 |
| Nov 2024 | Llama 3.2 3B | $0.06 |
Source: a16z, "LLMflation," 2024 — a ~1,000x drop in three years.
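To make the price-versus-usage point concrete, here is a rough sketch of the arithmetic. The starting price, token volume, and 25x usage growth are made-up numbers for illustration, not measurements. The bill is unit price times volume, so a 10x price drop is wiped out whenever volume grows by more than 10x.

```python
def monthly_bill(price_per_million: float, tokens: int) -> float:
    """Bill in dollars: unit price times token volume."""
    return price_per_million * tokens / 1_000_000

# Hypothetical team: the price falls 10x, but retrieval, agents, and retries
# push token volume up 25x over the same year (illustrative assumption).
last_year = monthly_bill(price_per_million=10.00, tokens=500_000_000)
this_year = monthly_bill(price_per_million=1.00, tokens=500_000_000 * 25)

print(f"last year: ${last_year:,.0f}  this year: ${this_year:,.0f}")
```

With these inputs the bill rises 2.5x even though every token got 10x cheaper.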
The lesson isn't "wait for prices to drop." It's that waste compounds. A wasteful pattern shipped today costs you every single day until you fix it. The four sections below are ordered by impact, so start at the top.
Worth knowing: the single biggest cost variable is usually architecture, not price-per-token. Two teams on identical pricing can see 10x different bills purely from how they structure prompts and pick models.
Which model should each request actually use?
Not every request needs your most expensive model. A small model can be 5x to 16x cheaper than a frontier one, and for classification, extraction, or routing, the quality difference is often invisible. Claude Haiku 4.5 lists at $1/$5 per million tokens versus Claude Opus 4.8 at $5/$25 (Anthropic, 2026). GPT-4o mini runs about 16x cheaper on input than GPT-4o (OpenAI).
| Model | Input ($/1M) | Output ($/1M) | Tier |
|---|---|---|---|
| Claude Opus 4.8 | $5.00 | $25.00 | Frontier |
| Claude Sonnet 4.6 | $3.00 | $15.00 | Mid |
| GPT-4o | $2.50 | $10.00 | Mid |
| Claude Haiku 4.5 | $1.00 | $5.00 | Small |
| GPT-4o mini | $0.15 | $0.60 | Small |
Source: Anthropic and OpenAI public pricing, 2026.
The pattern that works is a router: classify the task first, then send it to the cheapest model that can handle it. Reserve the frontier model for genuine reasoning and high-stakes output.
    def choose_model(task):
        if task.high_stakes or task.needs_reasoning:
            return "claude-opus-4-8"    # $5 / $25 per 1M
        if task.simple_classification:
            return "claude-haiku-4-5"   # $1 / $5 per 1M
        return "claude-sonnet-4-6"      # $3 / $15 per 1M
A cheap small-model call can even do the routing — a fast classifier that decides whether the expensive model is warranted. The routing call costs a fraction of one frontier request.
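A minimal sketch of that idea, assuming the classifier is passed in as a plain function. The label names and `keyword_classifier` are hypothetical stand-ins so the example runs offline. In production, `classify` would wrap one cheap small-model call that returns a label.

```python
from typing import Callable

# Model IDs taken from the pricing table above.
SMALL, MID, FRONTIER = "claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-8"

# Hypothetical label set; use whatever categories fit your traffic.
ROUTES = {"simple": SMALL, "standard": MID, "hard": FRONTIER}

def route(prompt: str, classify: Callable[[str], str]) -> str:
    """Return the model ID for this prompt.

    Unknown labels fall back to the mid tier rather than failing,
    since a classifier backed by a model can return unexpected text.
    """
    label = classify(prompt).strip().lower()
    return ROUTES.get(label, MID)

def keyword_classifier(prompt: str) -> str:
    """Stand-in classifier so the sketch runs without an API call."""
    lowered = prompt.lower()
    if len(prompt) < 80 and lowered.startswith(("tag:", "classify:")):
        return "simple"
    if "prove" in lowered or "contract" in lowered:
        return "hard"
    return "standard"

print(route("classify: refund request", keyword_classifier))
```

Falling back to the mid tier on an unrecognized label is a design choice: it trades a little cost for not silently sending hard requests to the smallest model.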
Small models now match last year's frontier on many narrow tasks. Because per-token prices fall ~10x a year for a fixed quality bar (a16z, 2024), the model you "needed" six months ago is frequently overkill today. Re-test your tier choices each quarter — defaults rot fast.
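One way to make the quarterly re-test mechanical is to walk the tiers from cheapest to most expensive and stop at the first one that passes. This is a sketch under assumptions: `run_eval` is a hypothetical hook into your own eval suite, and the scores below are invented for illustration.

```python
from typing import Callable, Optional

# (model ID, input $ per 1M tokens) from the pricing table above.
TIERS = [
    ("claude-haiku-4-5", 1.00),
    ("claude-sonnet-4-6", 3.00),
    ("claude-opus-4-8", 5.00),
]

def cheapest_passing(run_eval: Callable[[str], float], threshold: float) -> Optional[str]:
    """Return the cheapest model whose eval score meets the threshold.

    `run_eval` takes a model ID and returns a score between 0 and 1.
    """
    for model, _price in sorted(TIERS, key=lambda tier: tier[1]):
        if run_eval(model) >= threshold:
            return model
    return None  # nothing passes: revisit the prompt or the eval first

# Made-up scores, for illustration only.
fake_scores = {
    "claude-haiku-4-5": 0.91,
    "claude-sonnet-4-6": 0.95,
    "claude-opus-4-8": 0.97,
}
print(cheapest_passing(fake_scores.__getitem__, threshold=0.90))
```

Run it once per request type, not once per app, so each type settles on its own tier.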
Unique insight: don't pick a model per app. Pick one per request type. A single feature often mixes trivial and hard calls, and blanket-assigning the frontier model to all of them is where budgets quietly bleed.
How much can prompt caching and batching save?
These are the two highest-impact pricing features, and most teams under-use both. Prompt caching reduces costs up to 90% and latency up to 85% for long prompts (Anthropic, 2025) — we break that lever down in depth in our guide to prompt caching. The Batch API then takes a flat 50% off input and output for anything that can run asynchronously.
Caching works by storing a stable prompt prefix (your system instructions, few-shot examples, or a long document) so repeat requests skip reprocessing it. On Anthropic, a cache read costs just 0.1x the base input price; the one-time cache write costs 1.25x, so it pays for itself after a single hit (Anthropic, 2026). OpenAI applies caching automatically, with a 50% discount on cached input for prompts of 1,024 tokens or more (OpenAI, 2024).
| Pricing mode | Multiplier vs. base input | Net effect |
|---|---|---|
| Standard input | 1.0x | Full price |
| 5-min cache write | 1.25x | +25%, one time |
| 1-hr cache write | 2.0x | +100%, one time |
| Cache read (hit) | 0.1x | 90% cheaper |
| Batch API | 0.5x | 50% cheaper |
Source: Anthropic pricing docs, 2026.
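The break-even claim can be checked with the multipliers in the table. The sketch below assumes every request after the first arrives while the cache entry is still alive and counts only the shared prefix. The prefix size and request count are illustrative.

```python
def prefix_cost(requests: int, prefix_tokens: int, base_price: float,
                cached: bool, write_mult: float = 1.25,
                read_mult: float = 0.1) -> float:
    """Dollar cost of sending the same prefix `requests` times.

    base_price is dollars per 1M input tokens. With caching, the first
    request pays the write multiplier and each later one pays the read
    multiplier (assumes every later request is a cache hit).
    """
    per_call = prefix_tokens * base_price / 1_000_000
    if not cached or requests == 0:
        return requests * per_call
    return per_call * (write_mult + read_mult * (requests - 1))

# 10,000-token prefix on a $1 per 1M model, 100 calls inside the cache window.
uncached = prefix_cost(100, 10_000, 1.00, cached=False)
cached = prefix_cost(100, 10_000, 1.00, cached=True)
print(f"uncached ${uncached:.2f}  cached ${cached:.4f}")
```

With two requests the cached path costs 1.35x one call against 2.0x uncached, which is the "pays for itself after a single hit" point. With a single request, caching costs 25% more.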
Here's the catch most teams miss: caching only helps if the prefix is stable. Put your static system prompt and examples first; put the variable user input last. Reorder one message and you invalidate the cache.
    messages = [{
        "role": "user",
        "content": [
            {   # stable prefix: cached
                "type": "text",
                "text": LONG_SYSTEM_CONTEXT,
                "cache_control": {"type": "ephemeral"},
            },
            {   # variable suffix: changes every call
                "type": "text",
                "text": user_question,
            },
        ],
    }]
Batching is the fastest win of all. If a workload doesn't need a real-time answer (nightly summarization, bulk classification, evals, backfills), send it to the Batch API and halve the bill. The prompts themselves don't change; the code change is submitting requests as a batch and collecting the results later instead of waiting on each response. Results return within 24 hours.
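As a sketch of what the switch can look like with Anthropic's Python SDK: each batch entry pairs a `custom_id` with the same parameters a normal call would take. The `ticket-` ID scheme and the helper names are hypothetical, and the submit call reflects the Message Batches interface as of this writing, so check it against the current SDK docs before relying on it.

```python
def build_batch_requests(texts: list[str], model: str = "claude-haiku-4-5") -> list[dict]:
    """Build one batch entry per input; custom_id lets you match results later."""
    return [
        {
            "custom_id": f"ticket-{i}",  # hypothetical ID scheme
            "params": {
                "model": model,
                "max_tokens": 256,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for i, text in enumerate(texts)
    ]

def submit_batch(texts: list[str]) -> str:
    """Submit the batch and return its ID for later polling (needs an API key)."""
    import anthropic  # imported here so the builder above runs without the SDK

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    batch = client.messages.batches.create(requests=build_batch_requests(texts))
    return batch.id

batch_requests = build_batch_requests(["Where is my order?", "Cancel my plan."])
print(batch_requests[0]["custom_id"], len(batch_requests))
```

Keeping the request builder separate from the network call makes the payload easy to unit test and to reuse for a synchronous fallback.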
Worth knowing: caching and batching stack. A cached, batched job can land near a tenth of the naive cost — but only when your prompts are structured to be cacheable in the first place.
How do you stop sending and generating wasteful tokens?
Every token in and out is metered, and output usually costs 4x to 5x as much as input (compare $5 input to $25 output on Opus 4.8). So the cheapest token is the one you never send or generate. Trimming prompts and capping output is unglamorous, but it compounds across millions of calls.
Start with output. Set a sensible max_tokens so a model can't ramble into a 2,000-token answer when 200 will do. Ask for structured output — JSON or a tight schema — instead of prose. Structured responses are shorter, cheaper, and easier to parse.
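    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,  # cap the expensive side
        system="Reply with JSON only: {label, confidence}.",
        messages=[{"role": "user", "content": text}],  # text: the input to classify
    )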
Then attack input. Don't paste an entire knowledge base into context — retrieve the few relevant chunks (RAG) and send only those. Trim conversation history to a rolling window instead of replaying the full thread on every turn. Drop redundant boilerplate from system prompts.
Token discipline is a measurable engineering practice, not a vague aspiration. Capping output with max_tokens, requesting structured responses, and retrieving context instead of stuffing it can cut total tokens per request by as much as half, and because output tokens cost 4x to 5x as much as input, trimming output is where the real money sits.
From practice: the worst offender is usually unbounded conversation history. A chat feature that replays every prior turn sends more input with each message, so cost per message grows linearly and the cost of the whole conversation grows quadratically. A rolling window or periodic summary fixes it in an afternoon.
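A minimal sketch of the rolling-window fix, plus the arithmetic behind it. It assumes, for simplicity, that every message is the same size. The window of 6 and the 200-token message size are illustrative.

```python
from typing import Optional

def rolling_window(messages: list[dict], max_messages: int) -> list[dict]:
    """Keep only the most recent `max_messages` chat messages."""
    return messages[-max_messages:] if max_messages > 0 else []

def cumulative_input_tokens(turns: int, tokens_per_message: int,
                            window: Optional[int] = None) -> int:
    """Total input tokens over a conversation that resends history each turn.

    Turn n sends n messages with no window, or at most `window` with one.
    """
    total = 0
    for n in range(1, turns + 1):
        sent = n if window is None else min(n, window)
        total += sent * tokens_per_message
    return total

full = cumulative_input_tokens(turns=50, tokens_per_message=200)
windowed = cumulative_input_tokens(turns=50, tokens_per_message=200, window=6)
print(f"full replay: {full:,} tokens  windowed: {windowed:,} tokens")
```

In a real chat, trim in user/assistant pairs so the window still starts with a user message, and consider pinning a short summary of the dropped turns ahead of the window.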
How do you know where the money actually goes?
You can't cut what you can't see. Teams that treat LLM spend like cloud FinOps (tagged, attributed, and dashboarded) find waste that flat invoices hide. This matters because cost is a common failure mode: Gartner named escalating costs among the reasons it expected at least 30% of GenAI projects to be abandoned after proof of concept (Gartner, 2024), and half of organizations using AI report at least one negative consequence (McKinsey, 2025).
Log tokens per request and tag each call by feature, model, and prompt version. That lets you answer the questions that drive savings: Which feature is most expensive? Which prompt version got pricier after the last edit? Where would caching pay off most?
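A minimal sketch of that logging, assuming an in-memory list stands in for your metrics store. The feature names, prompt versions, and token counts are invented. In practice the token counts would come from the usage data the API returns with each response.

```python
from collections import defaultdict
from dataclasses import dataclass

# $ per 1M tokens as (input, output), from the pricing table earlier in this guide.
PRICES = {
    "claude-haiku-4-5": (1.00, 5.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-8": (5.00, 25.00),
}

@dataclass
class UsageRecord:
    feature: str
    model: str
    prompt_version: str
    input_tokens: int
    output_tokens: int

    @property
    def cost(self) -> float:
        """Dollar cost of this single call at list price."""
        in_price, out_price = PRICES[self.model]
        return (self.input_tokens * in_price + self.output_tokens * out_price) / 1_000_000

def cost_by(records: list[UsageRecord], key: str) -> dict[str, float]:
    """Sum dollar cost grouped by an attribute name such as "feature"."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost
    return dict(totals)

# Hypothetical log entries.
log = [
    UsageRecord("ticket-triage", "claude-haiku-4-5", "v3", 3_700, 40),
    UsageRecord("ticket-triage", "claude-haiku-4-5", "v4", 3_700, 90),
    UsageRecord("contract-review", "claude-opus-4-8", "v1", 12_000, 1_500),
]
print(cost_by(log, "feature"))
print(cost_by(log, "prompt_version"))
```

The same records answer all three questions in the paragraph above by changing only the grouping key.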
A worked example shows why attribution matters. Classifying 10,000 support tickets on Haiku 4.5 at ~3,700 input tokens each comes to 37 million input tokens, or about $37 at $1 per million. Assuming the output is a short label per ticket, it adds comparatively little, so the figures below count input only:
| Configuration | Approx. input cost per 10k tickets |
|---|---|
| Standard pricing | ~$37 |
| With Batch API (0.5x) | ~$18–19 |
| Batched + heavy cache reads | materially lower again |
Illustrative math from Anthropic list pricing, 2026.
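The arithmetic behind the table can be reproduced in a few lines. The sketch treats the ~3,700 tokens per ticket as input and ignores the short output. The 3,000-token shared prefix in the last step is an assumption made up for illustration, as is the premise that every request hits the cache.

```python
TICKETS = 10_000
INPUT_TOKENS = 3_700           # per ticket, from the example above
HAIKU_INPUT = 1.00             # $ per 1M input tokens
BATCH, CACHE_READ = 0.5, 0.1   # multipliers from the pricing-mode table

standard = TICKETS * INPUT_TOKENS * HAIKU_INPUT / 1_000_000
batched = standard * BATCH

# Assumption: 3,000 of the 3,700 tokens are a shared, cacheable prefix and
# every request is a cache hit (the one-time cache write is ignored).
SHARED = 3_000
effective_tokens = SHARED * CACHE_READ + (INPUT_TOKENS - SHARED)
batched_cached = TICKETS * effective_tokens * HAIKU_INPUT * BATCH / 1_000_000

print(f"standard ${standard:.2f}  batched ${batched:.2f}  "
      f"batched+cached ${batched_cached:.2f}")
```

Treat the last figure as a best case: cache hits inside a batch may not be guaranteed, and a smaller shared prefix shrinks the saving.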
Without per-feature logging, that $37 is invisible inside a five-figure invoice. With it, the batching and caching wins above become obvious targets. Make spend a first-class metric in your dashboards, alongside latency and error rate.
Tie cost to prompt version specifically. A one-line prompt change can quietly double output length — and your bill — with no error to alert you. Version your prompts and watch cost per version, not just cost per feature.
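Since nothing errors when a prompt edit inflates output, a simple check on logged output lengths can raise the alarm instead. This is a sketch with invented sample data. The 25% tolerance is an arbitrary starting point, and `output_regressions` is a hypothetical helper, not part of any library.

```python
from statistics import mean

def output_regressions(samples: dict[str, list[int]], baseline: str,
                       tolerance: float = 0.25) -> dict[str, float]:
    """Flag versions whose mean output tokens exceed the baseline by > tolerance.

    `samples` maps a prompt version to observed output-token counts.
    Each flagged value is that version's mean divided by the baseline mean.
    """
    base = mean(samples[baseline])
    flagged = {}
    for version, counts in samples.items():
        ratio = mean(counts) / base
        if version != baseline and ratio > 1 + tolerance:
            flagged[version] = round(ratio, 2)
    return flagged

# Hypothetical output-token counts logged per prompt version.
observed = {"v3": [40, 45, 38, 41], "v4": [88, 92, 85, 95]}
print(output_regressions(observed, baseline="v3"))
```

Running a check like this in CI or on a daily job turns a silent cost regression into a visible one.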
PromptVault makes this lever practical. Ship prompt changes without a redeploy, A/B test versions for both quality and cost, and track which version is the cheapest in production. Start with PromptVault to put your prompts — and their spend — under version control.
Frequently Asked Questions
Does prompt caching work for short prompts?
Mostly no. OpenAI applies caching automatically, but only to prompts of 1,024 tokens or more (OpenAI, 2024). Anthropic caching pays off after one hit because reads cost 0.1x base input (Anthropic, 2026), but it also enforces a minimum cacheable prompt length that varies by model, and short, one-off prompts see little benefit either way. Caching shines on long, repeated prefixes.
What's the cheapest model overall?
The cheapest model is the smallest one that still passes your evals for that task. Small models like GPT-4o mini ($0.15/$0.60 per 1M) or Claude Haiku 4.5 ($1/$5) cost a fraction of frontier models (Anthropic and OpenAI, 2026). Test per task type, because "cheapest" depends entirely on the job.
Does using the Batch API hurt output quality?
No. The Batch API runs the same models at a 50% discount; the only trade-off is latency, with results returned within 24 hours (Anthropic, 2026). Quality is identical. Use it for any workload that doesn't need a real-time response, like evals or bulk processing.
How much can a team realistically save?
It varies, but the levers stack. Caching alone can cut long-prompt costs up to 90% (Anthropic, 2025), batching takes 50% off async work, and right-sizing makes eligible calls 5x–16x cheaper. Combined, these can cut blended costs by more than half without touching output quality.
Conclusion
Cheaper tokens won't save you — discipline will. Per-token prices fall about 10x a year (a16z, 2024), yet costs balloon when usage runs unchecked. The fix is four stacked levers: route each request to the right-sized model, cache stable prefixes, batch anything async, and trim the tokens you send and generate.
Start by measuring. Tag spend per feature and per prompt version, find your most expensive call, and apply the levers in order of impact. The fastest win — switching async work to the Batch API for a flat 50% off — you can ship this week.
Once the levers are in place, the durable habit is treating every prompt edit as a tracked change — see our guide to prompt management for teams.