Evaluating and testing your prompts
Chapter objectives
- Build a representative test set with its edge cases
- Evaluate with a grid of binary criteria rather than an impression
- Compare prompt versions and use an LLM judge without trusting it blindly
"It worked once" is not proof
The customer-review pipeline from chapter 8 has been running for three weeks when management proposes going further: automatically generating a draft reply to every negative review. General enthusiasm — then the director's question: "And how do we know the replies will always be good?" Sofia shows three successful examples. "Three examples chosen by you. And the four-hundredth review, the one from a furious, bad-faith customer?" Silence. The director has just pointed at the missing link of the whole method: evaluation.
Until now, you've judged your prompts by eye: you read the output, you like it or you don't, you adjust. That's enough for personal, occasional use. It no longer is once a prompt runs in series, serves others, or feeds a decision: you then need a measurement — reproducible, comparable, defensible. The good news: you already have all the ingredients. The test set generalizes the "test on 3-4 inputs" of chapter 5; the criteria grid recycles the verifiable constraints of chapter 4; and controlled iteration applies the "one thing at a time" of debugging.
The test set: your sample of reality
A test set is a fixed collection of inputs on which you'll test every version of your prompt. Fixed is the key word: the same inputs every time, otherwise you're comparing apples and oranges. For the review-reply prompt, Sofia assembles 15 reviews: eight typical cases (the common complaints: waiting, bug, price), four edge cases (an ironic review, a bilingual review, a very short review "awful", a 300-word rambling review), and three trap cases (a threat of legal action, an insult, and a review containing a deliberate injection — a memento from chapter 6).
Composition matters more than size: 12 to 20 entries are plenty, if they cover the variety of the real. The method for choosing them: draw from your real data first (the reviews actually received), then complete with the cases you dread. Every time a real case surprises your prompt in production, it joins the test set — that's how the set gets richer and regressions become impossible to ignore.
The criteria grid: yes/no, not impressions
How do you judge a reply to a customer review? "It's good" can't be measured. The solution: decompose "good" into binary criteria — questions answered by yes or no. For Sofia: does the reply mention the precise problem raised by the customer? Does it apologize without over-promising? Does it propose a concrete action? Does it respect the brand tone? Is it under 100 words? Does it avoid disputing the customer's word?
Six yes/no questions, and an output is scored in thirty seconds: 6/6, 4/6... The binary is deliberate: a 1-to-10 scale seems finer, but two reviewers rarely give the same 7, whereas they almost always answer the same to "does it propose a concrete action?". The reliability of the measurement is worth more than its apparent finesse. And each criterion must be independent of taste: if you can't decide a criterion by quoting a passage of the output, rephrase it.
Comparing two versions: the prompt A/B test
Armed with the test set and the grid, comparison becomes mechanical. Version A (the current prompt) runs on the 15 entries: each output is scored on the grid, totals are added. Version B (the modified prompt — one single modification, chapter 5 rule) runs on the same 15 entries: same criteria, new total. The numbers speak: 72/90 versus 81/90, version B wins — and you know exactly on which criteria and which entries it improved.
This protocol flushes out a phenomenon invisible to the naked eye: regression. Version B, optimized to better handle furious reviews, started over-apologizing on lukewarm reviews — two points lost on three entries nobody would have rechecked without the test set. Improving a prompt without a test set is like playing a sliding puzzle: you push one tile and displace another without seeing it. The test set sees everything, every time.
flowchart TD
P["Current prompt version"] --> M["One targeted modification"]
M --> J["Run on the full test set"]
J --> G["Scoring: grid of binary criteria"]
G --> D{"Better score with no regression?"}
D -->|"Yes"| A["Adopt: new reference version"]
D -->|"No"| R["Reject and note the lesson"]
A --> M
R --> MThe LLM judge: delegating the scoring without abdicating
Scoring 15 outputs on 6 criteria remains tedious to repeat. You can delegate the scoring to the model itself: that's the principle of the LLM judge. You give it the grid, the output to evaluate, and you require a justified verdict per criterion. Properly framed, an LLM judge scores more consistently than a tired human — and it turns an hour of rereading into five minutes of verification.
You are a strict evaluator of customer service replies. You are given a customer review and the reply proposed by our assistant.
Evaluate the reply on these 6 criteria, in this order:
1. PROBLEM: does it explicitly mention the precise problem raised by the customer?
2. APOLOGY: does it apologize without promising what we cannot guarantee?
3. ACTION: does it propose a concrete, feasible action?
4. TONE: does it keep a direct, warm, never corporate tone?
5. LENGTH: is it 100 words or less?
6. RESPECT: does it avoid disputing or minimizing the customer's word?
Format: for each criterion, YES or NO + a quote from the reply that justifies your verdict. End with "Score: N/6".
Be strict: at the slightest doubt on a criterion, answer NO and explain the doubt.
--- REVIEW ---
{{review}}
--- REPLY TO EVALUATE ---
{{reply}}
--- END ---Spot in this prompt every technique of the course: a strict role (chapter 4), ordered binary criteria, a quote required per verdict (chapter 8 — a verdict without a quote is an opinion), a locked output format, and a deliberate bias toward severity ("at the slightest doubt, NO") — because a complacent judge is useless. Run this judge on your 15 outputs and you get a score table in a few minutes.
Calibrate the judge, then roll
Before delegating, calibrate: score five outputs yourself on the grid, have the judge score them, compare. If you diverge on a criterion, it's almost always because its phrasing is ambiguous — make it more precise in the grid (both versions, yours and the judge's, use the same one). When the judge and you agree on four outputs out of five, delegation is reasonable: it rolls through the volumes, you keep a control sample. The human-machine relationship of this whole course, once again: the machine executes the measurement, the human defines the yardstick.
The head-to-head: comparing two outputs without getting played
Sometimes you want a simpler verdict than six criteria: which of the two versions is better, plain and simple? The head-to-head duel exists, but that's where position bias hits hardest — the judge favors the answer presented first. The countermeasure is mechanical: have the duel judged twice, swapping the order, and only keep concordant verdicts. If the judge picks A then B, the duel is void: settle it yourself or go back to the grid.
You compare two replies to the same customer review. You don't know which is more recent or who wrote them.
Single criterion: which one would a dissatisfied restaurant manager perceive as the most sincere and the most useful?
Proceed as follows:
1. List 2 strengths and 1 weakness of reply X, with quotes.
2. List 2 strengths and 1 weakness of reply Y, with quotes.
3. Verdict: "X" or "Y", with a one-sentence justification. A tie is forbidden.
--- REVIEW ---
{{review}}
--- REPLY X ---
{{version A or B, depending on the draw}}
--- REPLY Y ---
{{the other version}}
--- END ---Three anti-bias details in this prompt: the versions are anonymized as X and Y (the judge doesn't know which is "the new one", so it can't favor the supposed progress), the strengths/weaknesses analysis is required before the verdict (the judge builds the case instead of rationalizing a preference), and the tie is forbidden (otherwise the judge takes refuge in it whenever the choice is uncomfortable — yet it's precisely the uncomfortable choice that interests you). Run this duel on your 15 entries in both orders: if version B wins 11 concordant duels out of 15, you hold a solid verdict — and a faster one than the full grid for everyday decisions.
Documenting versions: the memory of iteration
Last link: the trace. Every tested version deserves three lines in a log: the modification made, the score obtained, the decision (adopted or rejected) and why. This log avoids testing the same idea twice, passes the lessons on to the team ("we already tried adding emojis: -4 points on tone"), and — as we'll see in chapter 10 — becomes the prompt's official changelog in the library.
[LOG — review-reply prompt] v1 (Mar 12) — initial version. Score: 68/90 on test set v1 (15 entries). Adopted by default. v2 (Mar 14) — added rule "never dispute the customer's word". Score: 75/90. Adopted. Clear progress on trap cases. v3 (Mar 18) — attempt at a warmer tone via 2 few-shot examples. Score: 71/90. REJECTED: regression on LENGTH, replies exceed 100 words. v4 (Mar 21) — same few-shot examples but shortened + limit reminder at the end of the prompt. Score: 80/90. Adopted. Test set: review-tests.md (15 entries, including 3 traps). Grid: 6 binary criteria. Judge: calibrated Mar 13, agreement 4/5.
Look at v3: a documented failure is worth gold — v4 turns it into a success two days later by keeping the idea but fixing its side effect, spotted only thanks to the test set. Sofia went back to the director with this log: "this is how we know the replies will be good — and how we'll still know in six months". The automatic-replies project was approved that very day, with human review on low confidences. The measurement didn't just improve the prompt: it made trust possible.
Context
Before launching automatic replies to negative reviews, Sofia must prove the prompt's reliability: build the test set, define the grid, calibrate an LLM judge, and run at least two numbers-based iterations with their log. Goal: present management with a score, a progress curve, and the list of cases the system routes to a human.
Instructions
- Choose an important prompt from your library (or the review-replies one) and assemble its test set: 12-15 real entries, including 3-4 edge cases and 1-2 trap cases.
- Decompose "a good output" into 5-6 binary criteria, each decidable by quoting a passage — rephrase any criterion that remains a matter of taste.
- Score the current version's outputs on the grid yourself: that's your reference score.
- Write the LLM judge prompt with your grid, required quotes and a severity instruction; calibrate it on 5 outputs against your own scores.
- Change ONE thing in your prompt, rerun the full test set through the judge, compare the totals AND look for regressions criterion by criterion.
- Open the version log: modification, score, decision, lesson — and add to the test set any real case that surprises you later.
In summary
- Three successful examples prove nothing: once a prompt runs in series or serves others, you need a reproducible measurement.
- The test set is fixed and composed: typical cases, edge cases, trap cases — 12 to 20 well-chosen entries are enough.
- A grid of binary criteria (yes/no, decidable by quote) beats a global grade: reliability is worth more than finesse.
- Compare versions on the same test set, one modification at a time: regressions invisible to the eye become numbers.
- A well-framed LLM judge (criteria, quotes, severity) rolls through the scoring — after calibration against your own scores.
- The judge has biases (length, confidence, order): evaluate one answer at a time and keep a human control sample.
- Log every version (modification, score, decision, lesson): documented failures become the next successes.
Quiz — check your understanding
1. Why isn’t "it worked on 3 examples" enough?
2. Why prefer binary criteria over a 1-to-10 grade?
3. What is a prompt regression?
4. Which known biases affect an LLM judge?
5. How do you calibrate an LLM judge?
6. What does a good version-log entry contain?