Multi-model prompt portability: managing prompts across Claude, GPT, and Gemini
June 6, 2026
Here's a failure mode you only notice in production: the prompt you tuned for months against GPT gets routed to Claude one morning — cheaper, faster, a clean swap — and the JSON it returns no longer parses. Nothing in your prompt changed. The model did. And the contract around that prompt, the part you never thought of as "the prompt," changed with it. Most teams now run more than one model, so this isn't a rare edge case anymore. It's Tuesday.
Key takeaways
- 81% of enterprises now run three or more model families in testing or production, up from 68% a year earlier (a16z, 2025).
- A prompt almost never breaks on the words — it breaks on the harness contract: system-prompt handling, tool-calling format, and output mode all differ across Claude, GPT, and Gemini.
- Model churn makes portability non-optional: Anthropic gives only 60 days' notice before retiring a model, and several were deprecated in a single year.
- A gateway routes the call; it doesn't version, diff, or evaluate the prompt variant each model needs. That gap is where portability actually lives.
Why are teams running prompts across multiple models?
Because one model is no longer the bet anyone wants to make. 81% of enterprises now use three or more model families in testing or production, up from 68% less than a year earlier, and 37% run five or more in production (a16z, 2025). Multi-model is the default posture now, not a hedge a few advanced teams take.
Why the spread? Partly differentiation — Gemini Flash for cheap high-volume calls, Claude for careful reasoning, GPT for breadth. Partly survival. The market keeps reshuffling underneath you. Anthropic's share of enterprise LLM spend climbed to 40% in 2025, up from 24% the year before, while OpenAI slipped to 27% from a one-time high of 50% (Menlo Ventures, 2025). When the leaderboard moves that fast, betting your whole product on one vendor looks less like focus and more like exposure.
Developers feel this directly. In the latest survey, 81% reported using OpenAI's GPT models, 43% Claude Sonnet, and 35% Gemini Flash (Stack Overflow, 2025). Those numbers overlap heavily — most developers already touch more than one provider. The question stopped being whether your prompts run on multiple models and became how well they survive the trip.
Our take: vendor lock-in used to be the headline reason teams went multi-model. It's now table stakes. The real driver is matching each job to the model that does it best — which means your prompts have to travel.
For the deeper argument on why the prompt is the surface you should be able to move freely, see our piece on why prompts are config, not code.
What actually breaks when you move a prompt between models?
Rarely the wording. What breaks is the contract around the words — how each provider expects the system prompt, the tool calls, and the output to be shaped. Claude takes the system prompt as a top-level system request parameter, while OpenAI expects a role: "system" message inside the messages array (Anthropic, 2026). Same instruction, different envelope — and the envelope is code, not prose.
The differences compound once you involve tools and output formatting. Here's the shape of it across the three providers most teams actually use:
| Contract surface | Claude | GPT (OpenAI) | Gemini |
|---|---|---|---|
| System prompt | Top-level system parameter | role: "system" message in array | systemInstruction field |
| Tool-call identity | tool_use blocks with IDs | tool_call_id echoed back | Per-call id echoed in functionResponse (Gemini 3) |
| Tool-choice modes | auto / any / tool | auto / required / named | AUTO / ANY / VALIDATED |
| Temperature | Deprecated on Opus 4.7+ (400 error if set) | Accepted | Accepted |
| OpenAI-SDK drop-in | Compatibility layer, caveats apply | Native | Base-URL swap; extras via extra_body |
That temperature row is the kind of thing that ruins an afternoon. On Claude Opus 4.7 and later, setting temperature, top_p, or top_k to a non-default value returns a 400 error (Anthropic, 2026). A parameter that's harmless boilerplate on GPT and Gemini is a hard failure on the newest Claude models. Gemini adds its own wrinkle: you can reach it through the OpenAI SDK by swapping the base URL, but its distinctive features — thinking config, cached content, search grounding — only exist through an extra_body parameter outside the OpenAI schema (Google, 2026).
What we keep seeing: teams debug these failures as if the prompt text were wrong. It usually isn't. The instruction is fine — the harness it rides in doesn't match the new model's contract. Fix the contract, and the same words work again.
Why does model churn make portability non-optional?
Because the model you wrote for today has an expiration date. Anthropic commits to at least 60 days' notice before retiring a publicly released model (Anthropic, 2026). Sixty days sounds generous until you're carrying a dozen prompts tuned to a model that's about to vanish. And the retirements come in waves:
| Model | Retirement date |
|---|---|
| Claude Sonnet 3.5 | Oct 28, 2025 |
| Claude Opus 3 | Jan 5, 2026 |
| Claude Sonnet 3.7 | Feb 19, 2026 |
| Claude Haiku 3.5 | Feb 19, 2026 |
Source: Anthropic model deprecations, 2026.
That's four retirements from one provider inside roughly a year. Multiply by three providers and the migration treadmill never really stops. Every forced move is a portability tax: re-test the prompt, adjust the harness, re-run your evals, ship. If your prompts aren't built to travel, you pay that tax in panic each time, against a 60-day clock.
There's a quieter cost too. When migration hurts, teams avoid it — and end up clinging to a model that's slower or pricier than what shipped last quarter, just because moving feels risky. Portability is what turns a deprecation email from a fire drill into a config change.
How do you actually make a prompt portable?
You design for portability instead of bolting it on after the first broken migration. The teams that move models smoothly tend to share five habits, and none of them are exotic:
- Separate the harness from the instruction. Keep the model-specific plumbing — system-prompt placement, tool schema, output parsing — in a thin adapter per provider. The instruction itself stays clean and provider-agnostic. When you swap models, you change the adapter, not the prompt.
- Put a gateway in front. Abstraction layers like LiteLLM (a unified interface to 100+ LLM APIs) and OpenRouter (a single API to hundreds of models) normalize the request and response shape so one call site reaches any provider (Xenoss, 2025).
- Tier your prompt variants. A prompt for a frontier reasoning model and one for a cheap high-volume model shouldn't be identical. Maintain variants by capability tier, not one prompt stretched to fit all.
- Evaluate per model, not once. A prompt that scores well on GPT can quietly regress on Gemini. Run your eval set against each target model before you route real traffic to it.
- Version everything with instant rollback. Every variant is a numbered version with a diff and an author, so a bad migration is one click to undo — no redeploy.
Notice how much of that list is about management, not routing. The gateway is necessary but not sufficient. For the mechanics of versioning specifically, our guide to prompt versioning best practices covers the diffs, authorship, and rollback that make a migration reversible.
Where does the gateway stop and prompt management begin?
Right at the prompt. A gateway routes the call — it normalizes the request and picks an endpoint. It does nothing about the content you send. It won't tell you that your Gemini variant drifted from your Claude variant, won't store the diff between them, and won't gate a swap on an eval. That's a different layer of the stack, and skipping it is how multi-model setups quietly rot.
Think about what a clean swap actually requires. You need the right prompt variant for the target model, proof it passes your evals there, a record of who changed what, and a way back if it misbehaves. The gateway gives you none of that. It hands the model your prompt; it has no opinion on whether that's the right prompt for that model.
Our take: the gateway and the prompt manager solve adjacent problems people constantly conflate. One answers "which endpoint?" The other answers "which prompt, validated how, and can I undo it?" You need both, and most teams only build the first.
That second layer is exactly what we built PromptVault for — versioned, evaluated, instantly reversible prompts that travel across providers without living in a Google Doc. If your prompts already serve more than one model, see how we think about prompt management for teams.
Frequently asked questions
Can't I just use the OpenAI SDK for everything?
Partly. Both Anthropic and Google offer OpenAI-compatible endpoints, so a base-URL swap gets you basic calls. But provider-specific features fall outside the shared schema — Gemini's thinking config and grounding require an extra_body parameter (Google, 2026). The SDK unifies the easy 80%, not the parts that break.
Do I need a separate prompt for each model?
Often, yes — at least separate variants. A prompt tuned for a frontier reasoning model rarely performs identically on a cheap, fast one, and 37% of enterprises now run five or more models in production (a16z, 2025). Maintain variants by capability tier and evaluate each against its target model before routing traffic.
Is a gateway like LiteLLM enough on its own?
It solves routing, not prompt management. LiteLLM and OpenRouter normalize calls across 100+ and hundreds of models respectively (Xenoss, 2025), but neither versions your prompt variants, diffs them, or gates a model swap on evaluation results. You still need a layer that owns the prompt content itself.
How often do models actually get deprecated?
Often enough to plan for. Anthropic gives at least 60 days' notice and retired four Claude models — Sonnet 3.5, Opus 3, Sonnet 3.7, and Haiku 3.5 — across a single year (Anthropic, 2026). Across three providers, expect a forced migration every few months. Build for it rather than reacting to each notice.
Portability is a property you design in
The prompt that breaks on a model swap was never really about the words. It was about everything wrapped around them — the system-prompt slot, the tool-call contract, the one parameter that's valid on two providers and a 400 error on the third. You can keep discovering those differences the hard way, one production incident per migration. Or you can treat portability as a property you build in: clean instructions, per-provider adapters, variants by tier, evals per model, and versions you can roll back in one click.
Multi-model is the baseline now, and the models will keep churning under you. The teams that stay fast are the ones whose prompts were built to travel. If lowering your bill is part of why you're routing across providers, our guide to reducing LLM costs picks up where this leaves off.