Few-Shot Example Selection: How to Pick Demonstrations That Actually Work
June 10, 2026
Which examples you put in a prompt matters more than most prompt writers realize. In an early GPT-3 study, selecting the semantically closest demonstrations scored 46.0 exact-match on NaturalQuestions, while the farthest ones scored 31.0 (Liu et al., 2022). Same model, same number of examples: a 15-point swing from selection alone.
That's the part of few-shot prompting practitioners tend to under-invest in. We obsess over wording the instruction and then grab whatever examples are lying around. But the examples are the prompt. This post is about choosing and arranging them deliberately.
Key Takeaways
- Example selection alone moved exact-match from 31.0 to 46.0 in one GPT-3 study (Liu et al., 2022).
- Example order can swing a model "between near state-of-the-art and random guess" (Lu et al., 2022).
- Ground-truth labels barely matter — format, label space, and input distribution drive most of the gain (Min et al., 2022).
- Long context unlocks "many-shot" prompting that can rival fine-tuning (Agarwal et al., 2024).
Why does which examples you pick matter so much?
Similarity-based selection consistently beats random sampling. On SST-2 sentiment, randomly chosen examples scored 87.95% with a ±2.74 spread across trials, while retrieving the nearest neighbors (the KATE method) reached 91.99% with almost no variance (Liu et al., 2022). The lift is real, but the collapse in variance is the bigger story.
Random example sets make your prompt a slot machine. One run looks great, the next looks broken, and you can't tell whether your instruction changed anything. Retrieving examples close to the actual input — by embedding similarity — gives the model demonstrations that share vocabulary, structure, and edge cases with the query it's about to answer.
According to Liu et al., demonstrations retrieved by semantic similarity "leverage GPT-3's in-context learning capabilities much more effectively than the random sampling baseline, even when the number of examples is small" (Liu et al., 2022). For production systems, that means a small, well-matched example set often outperforms a large, generic one.
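A minimal sketch of similarity-based selection follows. It is not the KATE implementation: the `embed` function is a toy bag-of-words stand-in for a real sentence-embedding model, and `POOL`, `select_examples`, and the sample texts are hypothetical names and data made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'. Assumption: a real system would call
    a sentence-embedding model here instead."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_examples(query, pool, k=2):
    """Return the k pool entries whose inputs are most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["input"])), reverse=True)
    return ranked[:k]

# Hypothetical example pool for a sentiment task.
POOL = [
    {"input": "The battery dies after an hour", "label": "negative"},
    {"input": "Shipping was quick and the box arrived intact", "label": "positive"},
    {"input": "Battery life is excellent, lasts all day", "label": "positive"},
    {"input": "The plot of the movie dragged on", "label": "negative"},
]

chosen = select_examples("battery drains too fast", POOL, k=2)
for ex in chosen:
    print(ex["label"], "|", ex["input"])
```

With this toy pool, both selected demonstrations are the battery-related ones, which is the behavior you want from the retriever: demonstrations that share vocabulary with the live query.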
How many examples do you actually need?
Historically, returns flattened fast — the original GPT-3 paper saw gains up to about 64 examples, then a plateau (Brown et al., 2020). For years, "few-shot" really meant two to eight demonstrations, because that's all the context window and the marginal value supported.
Long context changed the math. DeepMind's "many-shot" work prompts models with hundreds or thousands of examples and reports "significant performance gains across a wide variety of generative and discriminative tasks," sometimes rivaling fine-tuning — though inference cost rises linearly with example count (Agarwal et al., 2024).
| Regime | Examples | Best for |
|---|---|---|
| Few-shot | 2–8 | Cheap, latency-sensitive calls; clear tasks |
| Mid-shot | 16–64 | Harder classification, more label diversity |
| Many-shot | 100s–1000s | Complex reasoning; a fine-tuning alternative |
Our take: many-shot isn't free. Every example rides along on every single request, so a prompt that's 800 demonstrations deep is a recurring token bill, not a one-time training cost. Reach for it when the task is hard and the volume is low — not as a default.
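To make the recurring-cost point concrete, here is a back-of-the-envelope sketch. Every figure in it (60 tokens per example, 100,000 requests per month) is a made-up assumption, and it ignores prompt caching or any provider-specific discounting.

```python
def monthly_prompt_tokens(n_examples, tokens_per_example, requests_per_month):
    """Tokens spent on demonstrations alone, summed over a month of requests."""
    return n_examples * tokens_per_example * requests_per_month

# Hypothetical workload: 60 tokens per demonstration, 100k requests a month.
few_shot = monthly_prompt_tokens(8, 60, 100_000)
many_shot = monthly_prompt_tokens(800, 60, 100_000)

print(f"few-shot:  {few_shot:,} tokens/month")
print(f"many-shot: {many_shot:,} tokens/month")
print(f"ratio:     {many_shot // few_shot}x")
```

Under these assumptions the demonstration overhead grows from 48 million to 4.8 billion tokens a month, a 100x increase that tracks the example count one for one.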
Does the order of examples change the answer?
Yes, often dramatically. Lu et al. showed that the order in which you provide the same examples "can make the difference between near state-of-the-art and random guess performance," and that this sensitivity is present across model sizes (Lu et al., 2022). Shuffle your demonstrations and you can lose a large share of your accuracy without changing a single example.
Why? Models carry predictable biases. Zhao et al. identified majority-label bias (the most frequent answer in your examples gets over-predicted), recency bias (the last example weighs more), and common-token bias. Their contextual calibration fix improved average accuracy by up to 30.0% absolute and cut variance (Zhao et al., 2021).
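The core idea behind contextual calibration can be sketched in a few lines. This is a simplified reading, not the paper's full procedure: the probabilities below are invented, and `calibrate` is a hypothetical helper. The assumption is that you can ask the model for label probabilities on a content-free input (such as "N/A") placed after the same demonstrations, and treat that output as the prompt's built-in bias.

```python
def calibrate(probs, content_free_probs):
    """Divide each label probability by the model's bias estimate for that
    label, then renormalize so the result sums to 1."""
    scaled = [p / cf for p, cf in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]

labels = ["positive", "negative"]

# Hypothetical numbers: with a content-free input, the prompt already
# leans 70/30 toward "positive".
content_free = [0.7, 0.3]

# Hypothetical raw prediction for a real input.
raw = [0.6, 0.4]

calibrated = calibrate(raw, content_free)
raw_label = labels[raw.index(max(raw))]
calibrated_label = labels[calibrated.index(max(calibrated))]
print(raw_label, "->", calibrated_label)
```

In this made-up case the raw prediction is "positive", but once the prompt's lean toward "positive" is divided out, the calibrated prediction flips to "negative".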
The practical takeaways: balance your labels so no single class dominates, and don't let the last example silently anchor the output. If you can't calibrate, at least test a few orderings rather than trusting the first arrangement you wrote.
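Both habits are easy to automate. The sketch below assumes a hypothetical `score_fn` that runs a prompt built from a given ordering against your dev set and returns accuracy; here it is replaced by an obviously fake stand-in so the code runs on its own. `balanced_sample`, `ordering_spread`, and the pool contents are illustrative names and data.

```python
import itertools
import random
from collections import defaultdict

def balanced_sample(pool, per_label, seed=0):
    """Pick the same number of examples for every label.
    Raises ValueError if a label has fewer than per_label examples."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in pool:
        by_label[ex["label"]].append(ex)
    picked = []
    for label in sorted(by_label):
        picked.extend(rng.sample(by_label[label], per_label))
    return picked

def ordering_spread(examples, score_fn, max_orderings=24):
    """Score up to max_orderings permutations and return (worst, best)."""
    orderings = itertools.islice(itertools.permutations(examples), max_orderings)
    scores = [score_fn(list(order)) for order in orderings]
    return min(scores), max(scores)

# Hypothetical pool with a skewed label distribution.
pool = (
    [{"input": f"good {i}", "label": "positive"} for i in range(5)]
    + [{"input": f"bad {i}", "label": "negative"} for i in range(2)]
)
demos = balanced_sample(pool, per_label=2)

def fake_score(order):
    # Stand-in only: pretends accuracy drops when the last demo is "negative".
    # Replace with a real evaluation against a dev set.
    return 0.9 if order[-1]["label"] == "positive" else 0.6

low, high = ordering_spread(demos, fake_score)
print(f"accuracy across orderings: {low:.2f} to {high:.2f}")
```

A wide gap between the worst and best ordering is a signal that the prompt is fragile and that you should either calibrate or pick the ordering deliberately.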
What actually makes a demonstration work?
Not the part you'd guess. Min et al. found that randomly replacing the labels in your examples — pairing inputs with wrong answers — "barely hurts performance" on a range of tasks (Min et al., 2022). The model isn't learning the input-to-label mapping from your examples the way fine-tuning would.
So what is it learning? Their analysis points to four things: the label space (which answers are even on the table), the distribution of input text, the overall format, and the fact that demonstrations exist at all. The examples teach the model what kind of thing to produce, not the specific right answers.
Our take: this reframes the whole job. Stop agonizing over whether every demonstration is perfectly correct, and start making sure your examples cover the full label space, match the real input distribution, and model the exact output format you need. Format fidelity beats label perfection.
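One way to act on this is a small audit that runs before an example set ships. The sketch below is an assumption about how such a check might look, not a method from Min et al.; `audit_examples`, the label space, and the `Sentiment: ...` output format are all hypothetical.

```python
import re
from collections import Counter

def audit_examples(examples, label_space, output_pattern):
    """Report labels with no demonstration and outputs that break the format."""
    counts = Counter(ex["label"] for ex in examples)
    missing = sorted(set(label_space) - set(counts))
    bad_format = [
        ex["output"] for ex in examples
        if not re.fullmatch(output_pattern, ex["output"])
    ]
    return {"counts": dict(counts), "missing_labels": missing, "bad_format": bad_format}

# Hypothetical task: three-way sentiment with a strict output format.
LABEL_SPACE = ["positive", "negative", "neutral"]
OUTPUT_PATTERN = r"Sentiment: (positive|negative|neutral)"

examples = [
    {"input": "Loved it", "label": "positive", "output": "Sentiment: positive"},
    {"input": "Broke in a day", "label": "negative", "output": "Sentiment: negative"},
    {"input": "Great value", "label": "positive", "output": "Positive!"},
]

report = audit_examples(examples, LABEL_SPACE, OUTPUT_PATTERN)
print(report)
```

Here the audit flags that no demonstration covers "neutral" and that one output ("Positive!") does not follow the format the prompt is supposed to model.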
How should you manage example sets in production?
Treat them as versioned data, not hardcoded strings. Once you accept that selection, ordering, and label balance all move accuracy by double digits, an example set becomes a production asset that deserves the same discipline as code — review, version history, and the ability to roll back a bad change.
In practice that means three habits. Retrieve examples dynamically by similarity to the live input instead of shipping one static block. Pin and version the example pool so you know exactly which demonstrations produced yesterday's results. And re-test sets per model, since order and selection sensitivity differ across them — see our guide to multi-model prompt portability.
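Pinning can be as simple as a content hash over the example pool. The following is a minimal sketch, assuming the pool is JSON-serializable; `pool_version` is a hypothetical helper, not part of any particular tool.

```python
import hashlib
import json

def pool_version(examples):
    """Short content hash of an example pool. Key order inside each example
    is normalized, but the order of examples is kept on purpose, because
    ordering itself affects model behavior."""
    canonical = json.dumps(
        examples, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

pool_a = [
    {"input": "Loved it", "label": "positive"},
    {"input": "Broke in a day", "label": "negative"},
]
# Same content, different key order inside the dicts.
pool_b = [
    {"label": "positive", "input": "Loved it"},
    {"label": "negative", "input": "Broke in a day"},
]
# Same examples, different sequence.
pool_c = list(reversed(pool_a))

print(pool_version(pool_a) == pool_version(pool_b))  # same version
print(pool_version(pool_a) == pool_version(pool_c))  # different version
```

Logging this identifier next to each model response lets you tie a given result back to the exact demonstrations, in the exact order, that produced it.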
This is exactly the gap a prompt management layer fills. If your few-shot examples live in scattered code constants, you can't audit, version, or A/B-test them. PromptVault lets you store prompts and their example sets as versioned, edge-served config — so you can change demonstrations and roll them back without a redeploy. For the underlying argument, see why prompts are config, not code.
Frequently Asked Questions
Are more examples always better in few-shot prompting?
No. Gains historically plateaued around 64 examples (Brown et al., 2020), and every example adds latency and cost on every call. Long-context "many-shot" prompting can keep improving into the hundreds of examples, but the cost is only justified when the task is genuinely hard (Agarwal et al., 2024).
Does the order of few-shot examples really matter?
Significantly. Lu et al. found ordering alone can move a model between near state-of-the-art and random-guess performance, and the effect persists across model sizes (Lu et al., 2022). Balance your labels and avoid letting the final example dominate via recency bias.
Do my example labels need to be correct?
Less than you'd think. Min et al. showed randomly corrupted labels barely hurt accuracy (Min et al., 2022). What matters most is the label space, input distribution, and output format — though correct labels are still safer for high-stakes tasks.
How do I pick examples automatically?
Use embedding similarity. Retrieve the demonstrations whose inputs are nearest to the live query. In one study, the closest demonstrations beat the farthest ones by 15 exact-match points, and nearest-neighbor retrieval sharply reduced run-to-run variance compared with random selection (Liu et al., 2022).
Conclusion
Few-shot prompting isn't about writing clever examples. It's about selecting the right ones, ordering them to dodge known biases, covering the full label space, and matching your real output format. The research is consistent: example choice and arrangement swing accuracy by double-digit margins, often more than the instruction itself.
Start treating your example sets like the production assets they are — retrieved dynamically, versioned, and tested per model. For the next step, see prompt versioning best practices.