Prompt Caching: How to Cut LLM Token Costs Without Touching Quality
June 2, 2026
Here's a bill most teams never read closely: a large share of what you pay an LLM provider is for text you already sent. The system prompt, the few-shot examples, the retrieved context — the same tokens, re-billed on every single call. Usage grows, the prompt prefix stays identical, and the meter keeps running on words the model has effectively seen a thousand times.
Enterprise LLM spend more than doubled in six months, from $3.5B in late 2024 to $8.4B by mid-2025 (Menlo Ventures, 2025). At that pace, the repeated portion of your prompts stops being a rounding error. This guide covers prompt caching — what the providers actually discount, where the savings come from, and how to structure prompts so you capture them.
Key Takeaways
- Anthropic, OpenAI, and Google now discount cached input tokens by 50–90%, with latency drops up to 85% on long prompts (Anthropic, 2025).
- The savings come from the static prefix — system prompt, examples, and RAG context — that repeats on every call. Caching bills that part once, then near-free on reads.
- Cache hits depend on a stable, unchanging prefix, which is exactly what disciplined prompt management already produces.
Why are LLM costs climbing faster than teams expect?
Because adoption and per-call payload are rising together. Worldwide generative AI spending is forecast to hit $644B in 2025, up 76.4% year over year (Gartner, 2025). Inside that number, enterprise investment in AI tripled to $37B, with foundation-model APIs alone capturing $12.5B (Menlo Ventures, 2025).
The usage curve is just as steep. Stack Overflow's 2025 survey found 84% of developers now use or plan to use AI tools, and 51% of professional developers use them daily (Stack Overflow, 2025). Every one of those calls carries a prompt, and modern prompts are heavy — a system message, a handful of examples, maybe thousands of tokens of retrieved context. Multiply a fat prefix by daily traffic and the cost isn't the model thinking. It's the model re-reading.
That's the part caching attacks. Not the work the model does once, but the setup it re-ingests every time.
What is prompt caching, and what does it actually save?
Prompt caching stores the processed form of a prompt prefix on the provider's side, so repeated calls reuse it instead of paying full price to process it again. The discounts are steep and now standard across the major providers. Anthropic reports up to 90% cost reduction and up to 85% latency reduction on long cached prompts (Anthropic, 2025).
The mechanics differ by provider, and the differences matter. Anthropic charges cache reads at 0.10x the base input rate — a 90% discount — while a cache write costs 1.25x (5-minute TTL) or 2.0x (1-hour TTL), with a 1,024-token minimum and up to four cache breakpoints (Anthropic, 2025). OpenAI applies a 50% discount on cached input automatically, with no code change and no extra fee, triggering on prompts of 1,024 tokens or more (OpenAI, 2024). Google's Gemini bills cached tokens at 10% of the standard input rate on Gemini 2.5 and newer (Google, 2025).
So the floor is a 50% discount and the ceiling is 90%, depending on which engine you run. Either way, you pay full freight once and a fraction thereafter.
Where do the savings actually come from?
From the part of your prompt that never changes. A production prompt is usually a long static prefix followed by a short dynamic suffix. The prefix — system instructions, few-shot examples, tool definitions, retrieved documents — is identical call after call. The suffix — the user's actual question — is the only thing that varies. In RAG applications, that static context often runs 5,000 to 30,000 tokens per request and doesn't change between turns (Groq, 2025).
Put real numbers on it. Say you run a 100,000-token prompt — a long document plus instructions — many times an hour on a premium model. Uncached, every call bills all 100,000 tokens at full input rate. With Anthropic's pricing, the first call writes the cache at 1.25x, then every subsequent read of that prefix costs 0.10x. A prefix that would bill at, say, $500 per million tokens uncached drops to roughly $50 per million on the cached reads. The expensive call is the first one. The rest are nearly free.
That's the whole game: isolate the repeated tokens and pay for them once.
How do you structure a prompt to maximize cache hits?
Put the stable content first and the volatile content last. Caching keys on the prefix, so the longest unchanging run at the start of your prompt is what gets reused. Reorder a prompt so the system message, examples, and context come first, and the user's variable input comes last — that single change is often the difference between a cache hit and a full re-bill.
The mistake we see most often is invisible churn. A timestamp in the system prompt, a session ID interpolated near the top, a "today's date is…" line — any of these changes the prefix on every call, and a changed prefix is a cache miss. The fix is boring and effective: freeze the prefix. Move per-request values out of the cached region and into the dynamic suffix. Keep the system prompt byte-for-byte identical across calls, and set your cache breakpoint at the boundary between stable and variable.
A quick checklist:
- Order static-first. Instructions, examples, and context before user input.
- Kill prefix churn. No timestamps, request IDs, or random ordering in the cached region.
- Set breakpoints deliberately. Mark the end of the stable block so the provider caches exactly that.
- Mind the TTL. Anthropic's default cache lives five minutes; steady traffic keeps it warm, sporadic traffic lets it expire.
How does caching fit into prompt management?
Caching rewards exactly the discipline good prompt management already enforces: stable, versioned, model-pinned prompts. A cache hit requires a prefix that doesn't change between calls — and a prompt that's edited carelessly, hardcoded in three places, or silently different per environment will never hold a cache. The same volatility that makes a prompt hard to debug also makes it impossible to cache.
This is where versioning and caching reinforce each other. When a prompt is a named, versioned artifact, its prefix is deterministic — you know precisely what bytes ship, so the cache stays warm. When you do change it, you change it on purpose, accept the one-time cache-write cost, and let the new version warm up. For the full picture on treating prompts as production artifacts, see our guide to prompt versioning best practices. Caching is the cost payoff of that same discipline.
Pin each prompt to its model, too. Cache behavior, token minimums, and discount rates vary by provider, so a prompt tuned and cached for one engine won't carry its savings to another untouched.
Frequently Asked Questions
Is prompt caching automatic, or do I have to enable it?
It depends on the provider. OpenAI applies caching automatically on prompts of 1,024 tokens or more, with no code changes and a 50% discount on cached input (OpenAI, 2024). Anthropic requires you to mark cache breakpoints explicitly in the request. Either way, structuring prompts static-first is what makes the caching effective.
Does caching change the model's output?
No. Caching stores the processed prefix to skip recomputation — it doesn't alter what the model generates. You get the same output you'd get uncached, faster and cheaper. Anthropic measured time-to-first-token on a 100K-token prompt dropping from 11.5s to 2.4s, a 79% reduction, with no change to the response (Anthropic, 2025).
How big does a prompt need to be for caching to help?
Both Anthropic and OpenAI set a 1,024-token minimum before caching kicks in (Anthropic, 2025). Below that, there's nothing worth caching. The bigger and more repeated your static prefix, the larger the payoff — long system prompts and RAG context are where it shines.
Does versioning my prompts hurt cache hit rates?
The opposite. A versioned prompt has a deterministic, stable prefix — exactly what caching needs to score a hit. Uncontrolled edits bust the cache; deliberate version bumps cost one cache write and then warm up. Discipline helps both reproducibility and cost.
The takeaway
Most LLM bills are inflated by repetition — the same prefix, re-billed on every call. Prompt caching fixes that directly, and the providers have made it cheap: a 50% discount at the floor, 90% at the ceiling, with latency cut by up to 85% on long prompts. The technique is simple. Put your stable content first, stop churning the prefix, and let the cache do the rest. Caching is one lever among several — for the full set, see how to cut LLM costs without hurting quality.
The catch is that caching only pays off when your prompts are stable enough to cache — which means treating them as versioned, model-pinned artifacts rather than strings you edit in a hurry. PromptVault was built for exactly that: managing, versioning, and deploying prompts so they stay stable, cacheable, and cheap to run. See how it works.