Prompt Engineering — talk to AI like a pro — 9. Evaluating and testing your prompts

11 June 2026 · Updated 11 June 2026 · 20 min read


Evaluating and testing your prompts

Chapter objectives

Build a representative test set with its edge cases
Evaluate with a grid of binary criteria rather than an impression
Compare prompt versions and use an LLM judge without trusting it blindly

"It worked once" is not proof

The customer-review pipeline from chapter 8 has been running for three weeks when management proposes going further: automatically generating a draft reply to every negative review. General enthusiasm — then the director's question: "And how do we know the replies will always be good?" Sofia shows three successful examples. "Three examples chosen by you. And the four-hundredth review, the one from a furious, bad-faith customer?" Silence. The director has just pointed at the missing link of the whole method: evaluation.

Until now, you've judged your prompts by eye: you read the output, you like it or you don't, you adjust. That's enough for personal, occasional use. It stops being enough once a prompt runs at scale, serves other people, or feeds a decision: you then need a measurement that is reproducible, comparable and defensible. The good news: you already have all the ingredients. The test set generalizes the "test on 3-4 inputs" of chapter 5; the criteria grid recycles the verifiable constraints of chapter 4; and controlled iteration applies the "one thing at a time" of debugging.

The test set: your sample of reality

A test set is a fixed collection of inputs on which you'll test every version of your prompt. Fixed is the key word: the same inputs every time, otherwise you're comparing apples and oranges. For the review-reply prompt, Sofia assembles 15 reviews: eight typical cases (the common complaints: waiting, bug, price), four edge cases (an ironic review, a bilingual review, a very short review that just says "awful", a 300-word rambling review), and three trap cases (a threat of legal action, an insult, and a review containing a deliberate injection, a reminder of chapter 6).

Composition matters more than size: 12 to 20 entries are plenty, provided they cover the variety of real cases. The method for choosing them: draw from your real data first (the reviews actually received), then complete with the cases you dread. Every time a real case surprises your prompt in production, add it to the test set: that's how the set gets richer and regressions become impossible to ignore.

For tasks with a verifiable answer (classification, extraction), also note the expected output of each input — its "gold answer". Evaluation then becomes a simple count: 14 correct answers out of 15. For open tasks like a reply to a review, there is no single right answer: that's where the criteria grid takes over.
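For a task with gold answers, the count described above fits in a few lines. The sketch below is only an illustration: the `Case` structure, the category labels and the `classify` stand-in are assumptions, not part of the course material. In practice `classify` would call your model with the prompt under test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    text: str  # the fixed input
    kind: str  # "typical", "edge" or "trap"
    gold: str  # expected output (the "gold answer")


def accuracy(cases: list[Case], predict: Callable[[str], str]) -> tuple[int, list[Case]]:
    """Return the number of correct answers and the list of failed cases."""
    failures = [c for c in cases if predict(c.text).strip().lower() != c.gold]
    return len(cases) - len(failures), failures


# Tiny hypothetical test set (a real one would hold 12 to 20 entries).
CASES = [
    Case("Waited 40 minutes for a reply.", "typical", "waiting"),
    Case("The app crashes at checkout.", "typical", "bug"),
    Case("awful", "edge", "other"),
]


def classify(text: str) -> str:
    """Stand-in for the model call: a naive keyword rule, for illustration only."""
    lowered = text.lower()
    if "wait" in lowered:
        return "waiting"
    if "crash" in lowered:
        return "bug"
    return "other"


correct, failed = accuracy(CASES, classify)
print(f"{correct}/{len(CASES)} correct")
for case in failed:
    print(f"FAILED [{case.kind}] {case.text!r}: expected {case.gold}")
```

Keeping the `kind` label on each entry makes it easy to see whether failures cluster on edge or trap cases rather than on typical ones.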

The criteria grid: yes/no, not impressions

How do you judge a reply to a customer review? "It's good" can't be measured. The solution: decompose "good" into binary criteria — questions answered by yes or no. For Sofia: does the reply mention the precise problem raised by the customer? Does it apologize without over-promising? Does it propose a concrete action? Does it respect the brand tone? Is it under 100 words? Does it avoid disputing the customer's word?

Six yes/no questions, and an output is scored in thirty seconds: 6/6, 4/6... The binary format is deliberate: a 1-to-10 scale seems finer, but two reviewers rarely give the same 7, whereas they almost always give the same answer to "does it propose a concrete action?". The reliability of the measurement is worth more than its apparent finesse. And each criterion must be independent of taste: if you can't decide a criterion by quoting a passage of the output, rephrase it.
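To make the grid concrete, here is a minimal sketch of how one output's verdicts could be recorded and totalled. The short criterion names and the helper functions are illustrative assumptions. Only the length criterion is computed automatically; the other five verdicts are still decided by a reviewer.

```python
CRITERIA = ["PROBLEM", "APOLOGY", "ACTION", "TONE", "LENGTH", "RESPECT"]


def length_ok(reply: str, limit: int = 100) -> bool:
    """LENGTH is the one criterion that can be checked without any judgment."""
    return len(reply.split()) <= limit


def score(verdicts: dict[str, bool]) -> str:
    """Turn one output's yes/no verdicts into an N/6 score."""
    missing = [c for c in CRITERIA if c not in verdicts]
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    return f"{sum(verdicts[c] for c in CRITERIA)}/{len(CRITERIA)}"


# Hypothetical reply and reviewer verdicts, for illustration only.
reply = "We are sorry about the 40-minute wait on Friday. Thank you for telling us."
verdicts = {
    "PROBLEM": True,   # quotes "the 40-minute wait on Friday"
    "APOLOGY": True,   # quotes "We are sorry"
    "ACTION": False,   # no passage proposes a concrete action
    "TONE": True,
    "LENGTH": length_ok(reply),
    "RESPECT": True,
}
print(score(verdicts))
```

Raising an error on a missing criterion is a deliberate choice in this sketch: a partially scored output should never slip silently into a total.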

Comparing two versions: the prompt A/B test

Armed with the test set and the grid, comparison becomes mechanical. Version A (the current prompt) runs on the 15 entries: each output is scored on the grid, totals are added. Version B (the modified prompt — one single modification, chapter 5 rule) runs on the same 15 entries: same criteria, new total. The numbers speak: 72/90 versus 81/90, version B wins — and you know exactly on which criteria and which entries it improved.

This protocol flushes out a phenomenon invisible to the naked eye: regression. Version B, optimized to better handle furious reviews, started over-apologizing on lukewarm reviews — two points lost on three entries nobody would have rechecked without the test set. Improving a prompt without a test set is like playing a sliding puzzle: you push one tile and displace another without seeing it. The test set sees everything, every time.

flowchart TD
  P["Current prompt version"] --> M["One targeted modification"]
  M --> J["Run on the full test set"]
  J --> G["Scoring: grid of binary criteria"]
  G --> D{"Better score with no regression?"}
  D -->|"Yes"| A["Adopt: new reference version"]
  D -->|"No"| R["Reject and note the lesson"]
  A --> M
  R --> M

The iteration loop: one modification, one full run, one numbers-based decision — and every lesson documented.
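The comparison and the decision box of the loop can be sketched as follows. The data layout (entry id, then criterion, then verdict), the entry names and the two-criterion grid are simplifying assumptions made for this illustration. The adoption rule encodes the question from the diagram: a better score with no regression.

```python
Scores = dict[str, dict[str, bool]]  # entry id -> criterion -> yes/no verdict


def total(scores: Scores) -> int:
    """Sum of all YES verdicts over the whole test set."""
    return sum(sum(verdicts.values()) for verdicts in scores.values())


def regressions(a: Scores, b: Scores) -> list[tuple[str, str]]:
    """(entry, criterion) pairs that version A passed and version B fails."""
    return [
        (entry, criterion)
        for entry, verdicts in a.items()
        for criterion, passed in verdicts.items()
        if passed and not b[entry][criterion]
    ]


# Hypothetical scores: B handles furious reviews better but over-apologizes elsewhere.
version_a = {
    "furious-1": {"APOLOGY": False, "ACTION": False},
    "furious-2": {"APOLOGY": False, "ACTION": True},
    "lukewarm-1": {"APOLOGY": True, "ACTION": True},
}
version_b = {
    "furious-1": {"APOLOGY": True, "ACTION": True},
    "furious-2": {"APOLOGY": True, "ACTION": True},
    "lukewarm-1": {"APOLOGY": False, "ACTION": True},
}

lost = regressions(version_a, version_b)
adopt = total(version_b) > total(version_a) and not lost
print(f"A: {total(version_a)}  B: {total(version_b)}  regressions: {lost}  adopt: {adopt}")
```

Here version B has the higher total, yet the rule refuses it because of the lost point on the lukewarm review. Whether a single regression should block adoption is a judgment call; the important part is that the regression is listed rather than hidden inside the total.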

The LLM judge: delegating the scoring without abdicating

Scoring 15 outputs on 6 criteria remains tedious to repeat. You can delegate the scoring to the model itself: that's the principle of the LLM judge. You give it the grid, the output to evaluate, and you require a justified verdict per criterion. Properly framed, an LLM judge scores more consistently than a tired human — and it turns an hour of rereading into five minutes of verification.

PROMPT

You are a strict evaluator of customer service replies. You are given a customer review and the reply proposed by our assistant.

Evaluate the reply on these 6 criteria, in this order:
1. PROBLEM: does it explicitly mention the precise problem raised by the customer?
2. APOLOGY: does it apologize without promising what we cannot guarantee?
3. ACTION: does it propose a concrete, feasible action?
4. TONE: does it keep a direct, warm, never corporate tone?
5. LENGTH: is it 100 words or less?
6. RESPECT: does it avoid disputing or minimizing the customer's word?

Format: for each criterion, YES or NO + a quote from the reply that justifies your verdict. End with "Score: N/6".
Be strict: at the slightest doubt on a criterion, answer NO and explain the doubt.

--- REVIEW ---
{{review}}
--- REPLY TO EVALUATE ---
{{reply}}
--- END ---

Spot in this prompt every technique of the course: a strict role (chapter 4), ordered binary criteria, a quote required per verdict (chapter 8 — a verdict without a quote is an opinion), a locked output format, and a deliberate bias toward severity ("at the slightest doubt, NO") — because a complacent judge is useless. Run this judge on your 15 outputs and you get a score table in a few minutes.
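Because the output format is locked, the judge's answer can be checked by a program before it is trusted. The sketch below fills the two placeholders and parses a verdict. It assumes the judge writes one line per criterion in the form `1. PROBLEM: YES ...`; the prompt above does not spell out that exact layout, so adapt the pattern to what your judge really returns. The sample output is invented for illustration.

```python
import re

CRITERIA = ["PROBLEM", "APOLOGY", "ACTION", "TONE", "LENGTH", "RESPECT"]
VERDICT_RE = re.compile(r"^\s*\d+\.\s*([A-Z]+)\s*:\s*(YES|NO)\b", re.MULTILINE)
SCORE_RE = re.compile(r"Score:\s*(\d+)\s*/\s*6")


def fill(template: str, review: str, reply: str) -> str:
    """Substitute both placeholders in a single pass (inserted text is not rescanned)."""
    values = {"review": review, "reply": reply}
    return re.sub(r"\{\{(review|reply)\}\}", lambda m: values[m.group(1)], template)


def parse_verdicts(judge_output: str) -> dict[str, bool]:
    """Extract one YES/NO per criterion and cross-check the announced score."""
    verdicts = {name: answer == "YES" for name, answer in VERDICT_RE.findall(judge_output)}
    if sorted(verdicts) != sorted(CRITERIA):
        raise ValueError(f"expected verdicts for {CRITERIA}, got {sorted(verdicts)}")
    announced = SCORE_RE.search(judge_output)
    if announced is None or int(announced.group(1)) != sum(verdicts.values()):
        raise ValueError("announced score is missing or does not match the verdicts")
    return verdicts


# Invented judge output, in the assumed layout.
SAMPLE = """\
1. PROBLEM: YES - "the 40-minute wait on Friday"
2. APOLOGY: YES - "we are sorry"
3. ACTION: NO - no concrete action is proposed
4. TONE: YES - "thanks for telling us straight"
5. LENGTH: YES - 62 words
6. RESPECT: YES - "you are right"
Score: 5/6"""

print(parse_verdicts(SAMPLE))
```

Rejecting any answer whose announced score disagrees with its own verdicts catches a common failure of generated evaluations at no cost.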

An LLM judge has documented biases: it favors long answers, confident phrasings, and — if shown two answers side by side — the one presented first. Countermeasures: binary criteria with quotes rather than a global grade, evaluation of one answer at a time, and swapping the order when you compare two versions. And above all: verify a sample of its verdicts yourself before trusting it.

Calibrate the judge, then roll

Before delegating, calibrate: score five outputs yourself on the grid, have the judge score the same five, and compare. If you diverge on a criterion, it's almost always because its phrasing is ambiguous: make it more precise in the grid, and make sure your copy and the judge's prompt use the same wording. When the judge and you agree on four outputs out of five, delegation is reasonable: it rolls through the volumes, you keep a control sample. The human-machine relationship of this whole course, once again: the machine executes the measurement, the human defines the yardstick.
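The calibration check can be sketched as a small comparison function. It assumes that "agreeing on an output" means giving identical verdicts on every criterion of that output, which is one possible reading of the rule above. The data and the two-criterion grid are invented for illustration.

```python
Verdicts = dict[str, bool]  # criterion -> yes/no


def agreement(human: list[Verdicts], judge: list[Verdicts]) -> tuple[int, dict[str, int]]:
    """Count outputs scored identically, and disagreements per criterion."""
    if len(human) != len(judge):
        raise ValueError("both lists must score the same outputs")
    outputs_agreed = sum(h == j for h, j in zip(human, judge))
    disagreements: dict[str, int] = {}
    for h, j in zip(human, judge):
        for criterion in h:
            if h[criterion] != j[criterion]:
                disagreements[criterion] = disagreements.get(criterion, 0) + 1
    return outputs_agreed, disagreements


# Five calibration outputs: the human rejects the tone of the last one, the judge does not.
human = [{"ACTION": True, "TONE": True} for _ in range(4)] + [{"ACTION": True, "TONE": False}]
judge = [{"ACTION": True, "TONE": True} for _ in range(5)]

agreed, by_criterion = agreement(human, judge)
print(f"agreement: {agreed}/{len(human)}")
print(f"criteria to rephrase first: {by_criterion}")
```

The per-criterion count is the useful part: it points at the phrasing to sharpen instead of leaving you with a bare agreement rate.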

The head-to-head: comparing two outputs without getting played

Sometimes you want a simpler verdict than six criteria: which of the two versions is better, plain and simple? The head-to-head duel exists, but that's where position bias hits hardest — the judge favors the answer presented first. The countermeasure is mechanical: have the duel judged twice, swapping the order, and only keep concordant verdicts. If the judge picks A then B, the duel is void: settle it yourself or go back to the grid.

PROMPT

You compare two replies to the same customer review. You don't know which is more recent or who wrote them.

Single criterion: which of the two would a dissatisfied restaurant manager perceive as more sincere and more useful?

Proceed as follows:
1. List 2 strengths and 1 weakness of reply X, with quotes.
2. List 2 strengths and 1 weakness of reply Y, with quotes.
3. Verdict: "X" or "Y", with a one-sentence justification. A tie is forbidden.

--- REVIEW ---
{{review}}
--- REPLY X ---
{{version A or B, depending on the draw}}
--- REPLY Y ---
{{the other version}}
--- END ---

Three anti-bias details in this prompt: the versions are anonymized as X and Y (the judge doesn't know which is "the new one", so it can't favor the supposed progress), the strengths/weaknesses analysis is required before the verdict (the judge builds the case instead of rationalizing a preference), and the tie is forbidden (otherwise the judge takes refuge in it whenever the choice is uncomfortable — yet it's precisely the uncomfortable choice that interests you). Run this duel on your 15 entries in both orders: if version B wins 11 concordant duels out of 15, you hold a solid verdict — and a faster one than the full grid for everyday decisions.
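The order-swapping rule is mechanical enough to sketch in code. In the example below, `judge` stands for any function that sends the duel prompt to a model and returns "X" or "Y"; that signature is an assumption made for this illustration, and the two fake judges exist only to show the behaviour without calling a model.

```python
from typing import Callable, Optional

Judge = Callable[[str, str, str], str]  # (review, reply_x, reply_y) -> "X" or "Y"


def duel(judge: Judge, review: str, a: str, b: str) -> Optional[str]:
    """Judge the pair in both orders; return "A", "B", or None if the verdicts clash."""
    first = judge(review, a, b)   # A is shown as X
    second = judge(review, b, a)  # B is shown as X
    if first not in ("X", "Y") or second not in ("X", "Y"):
        raise ValueError("the judge must answer X or Y")
    winner_first = "A" if first == "X" else "B"
    winner_second = "B" if second == "X" else "A"
    return winner_first if winner_first == winner_second else None


def position_biased(review: str, x: str, y: str) -> str:
    """Fake judge that always picks the first reply shown."""
    return "X"


def prefers_shorter(review: str, x: str, y: str) -> str:
    """Fake judge that looks at content (here, simply the length)."""
    return "X" if len(x) <= len(y) else "Y"


review = "awful"
reply_a = "We are deeply, truly and sincerely sorry about everything."
reply_b = "Sorry. What went wrong?"

print(duel(position_biased, review, reply_a, reply_b))  # verdicts clash: duel is void
print(duel(prefers_shorter, review, reply_a, reply_b))  # same winner in both orders
```

A judge that only follows position can never produce a concordant verdict, so its duels are all void; that is exactly the signal that tells you to settle the case yourself or go back to the grid.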

Documenting versions: the memory of iteration

Last link: the trace. Every tested version deserves a few lines in a log: the modification made, the score obtained, the decision (adopted or rejected) and why. This log avoids testing the same idea twice, passes the lessons on to the team ("we already tried adding emojis: -4 points on tone"), and, as we'll see in chapter 10, becomes the prompt's official changelog in the library.

PROMPT

[LOG — review-reply prompt]

v1 (Mar 12) — initial version. Score: 68/90 on test set v1 (15 entries). Adopted by default.
v2 (Mar 14) — added rule "never dispute the customer's word". Score: 75/90. Adopted. Clear progress on trap cases.
v3 (Mar 18) — attempt at a warmer tone via 2 few-shot examples. Score: 71/90. REJECTED: regression on LENGTH, replies exceed 100 words.
v4 (Mar 21) — same few-shot examples but shortened + limit reminder at the end of the prompt. Score: 80/90. Adopted.

Test set: review-tests.md (15 entries, including 3 traps). Grid: 6 binary criteria. Judge: calibrated Mar 13, agreement 4/5.

Look at v3: a documented failure is worth gold. v4 turns it into a success three days later by keeping the idea but fixing its side effect, spotted only thanks to the test set. Sofia went back to the director with this log: "this is how we know the replies will be good, and how we'll still know in six months". The automatic-replies project was approved that very day, with human review of low-confidence cases. The measurement didn't just improve the prompt: it made trust possible.

🛠️ Your turn

Context

Before launching automatic replies to negative reviews, Sofia must prove the prompt's reliability: build the test set, define the grid, calibrate an LLM judge, and run at least two numbers-based iterations with their log. Goal: present management with a score, a progress curve, and the list of cases the system routes to a human.

Instructions

Choose an important prompt from your library (or the review-replies one) and assemble its test set: 12-15 real entries, including 3-4 edge cases and 1-2 trap cases.
Decompose "a good output" into 5-6 binary criteria, each decidable by quoting a passage — rephrase any criterion that remains a matter of taste.
Score the current version's outputs on the grid yourself: that's your reference score.
Write the LLM judge prompt with your grid, required quotes and a severity instruction; calibrate it on 5 outputs against your own scores.
Change ONE thing in your prompt, rerun the full test set through the judge, compare the totals AND look for regressions criterion by criterion.
Open the version log: modification, score, decision, lesson — and add to the test set any real case that surprises you later.

Hint — Start by scoring yourself before delegating to the judge: calibration is what separates a reliable measurement from a decorative number. And at the slightest recurring disagreement, it's the criterion's phrasing that needs sharpening.

In summary

Three successful examples prove nothing: once a prompt runs at scale or serves others, you need a reproducible measurement.
The test set is fixed and composed: typical cases, edge cases, trap cases — 12 to 20 well-chosen entries are enough.
A grid of binary criteria (yes/no, decidable by quote) beats a global grade: reliability is worth more than finesse.
Compare versions on the same test set, one modification at a time: regressions invisible to the eye become numbers.
A well-framed LLM judge (criteria, quotes, severity) rolls through the scoring — after calibration against your own scores.
The judge has biases (length, confidence, order): evaluate one answer at a time and keep a human control sample.
Log every version (modification, score, decision, lesson): documented failures become the next successes.

Quiz — check your understanding

1. Why isn’t "it worked on 3 examples" enough?

Because 3 chosen examples cover neither the variety of the real nor the trap cases
Because you need exactly 10
Because models change every day
Because examples are expensive

The four-hundredth review — furious and in bad faith — looks nothing like the flattering examples. The test set samples reality, traps included.

2. Why prefer binary criteria over a 1-to-10 grade?

It’s faster to write
Two reviewers rarely give the same 7, but almost always answer the same to a yes/no decidable by quote
Models can’t count to 10
Binary gives higher scores

The reliability of the measurement trumps its apparent finesse: a binary criterion anchored in the text reproduces, a numbered impression doesn't.

3. What is a prompt regression?

A prompt that gets shorter
An improvement on some cases that silently degrades other cases
A syntax error
Reverting to a previous version

Sofia's v3 gained warmth but exceeded the word limit on other entries. Only the full test-set run reveals this sliding-puzzle game.

4. Which known biases affect an LLM judge?

It only judges in the morning
It favors long, confident answers, and the first one presented in a comparison
It refuses to score in English
It always scores 5/6

Hence the countermeasures: one answer at a time, binary criteria with quotes, order swapping in comparisons, and a human control sample.

5. How do you calibrate an LLM judge?

Ask it if it feels ready
Run it twice and compare its own scores
Score 5 outputs yourself, compare with its verdicts, and sharpen the criteria in case of divergence
Increase the size of the prompt

Recurring disagreement signals an ambiguous criterion, not a bad judge. When agreement reaches 4/5, delegation with a control sample becomes reasonable.

6. What does a good version-log entry contain?

Only the score
The modification, the score, the decision and the lesson learned
The full text of all the outputs
The model name only

The log avoids retesting the same ideas, passes on the lessons (the rejected v3 feeds the adopted v4) and will become the prompt's changelog in the library.


Author(s)

REHOUMA Haythem

Haythem Rehouma is an AI and cloud engineer and architect, trainer and technical instructor, with a profile focused on medical AI, AWS, MLOps, LLM/RAG and computer vision.