Prompt Engineering — talk to AI like a pro — 8. Structured data: extract, classify, normalize

20 min read
Chapter 08

Structured data: extract, classify, normalize

Chapter 8 of 10

Chapter objectives

  • Extract precise information from free text with a strict schema
  • Classify in batches with a closed taxonomy and per-category definitions
  • Normalize values and route doubtful cases to a human review

The hidden deposit in free text

Back from the Lyon trade show, Sofia brings home a cumbersome haul: 80 prospect emails, conversation notes taken on the fly, and the file of 200 customer reviews from the quarter that has been sleeping since chapter 4. All this material contains precious data — who wants a demo, which restaurant, what team size, what level of urgency — but locked in free text, unreadable for a spreadsheet. Re-typing it by hand? Two days. Having a well-designed prompt extract it? One hour, verification included.

That's the job of this chapter: turning free text into reliable structured data. Chapter 4 laid the first brick: strict JSON with a schema. We're now going to industrialize it: multi-field extraction, classification with a closed taxonomy, value normalization, batch processing, and above all the routing of doubtful cases, because the difference between a demo gadget and a production tool is decided entirely by the cases that don't fit the boxes.

Extraction: a schema, rules per field

Extraction means spotting precise information in a text and filing it into defined fields. Quality depends on three elements you already know separately: an exact schema (chapter 4), a non-invention rule (chapter 6), and — the new part — per-field rules: for each field, what gets extracted, in what form, and what to put when the information is absent.

PROMPT
You extract contact data from prospect emails, for the sales team of a company making scheduling software for restaurants.

For each email between delimiters, return ONLY a JSON object:
{
  "name": "first and last name, or null if absent",
  "establishment": "name of the restaurant or group, or null",
  "team_size": "integer number of employees, or null if not mentioned",
  "request": "demo" | "pricing" | "question" | "other",
  "urgency": "high" | "normal",
  "key_quote": "the sentence of the email that justifies the request field"
}

Rules:
- NEVER invent a value: absent information = null.
- team_size: only if a number is written; "a big team" = null.
- urgency = high only if an explicit deadline is mentioned.
- key_quote: exact copy of the text, no rephrasing.

--- EMAIL ---
{{email}}
--- END ---

The key_quote field is the most profitable trick of this prompt: by requiring the exact sentence that justifies the classification, you make every row auditable in two seconds — the quote matches or it doesn't — and you reduce invention, because the model must anchor its answer in the text. It's the same principle as the verifiable reasoning of chapter 3, transposed to data. As for the per-field rules, they settle in advance the ambiguities you would otherwise discover in production: what to do with "a big team", what deserves "high" urgency.
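The two-second audit can also be automated. Here is a minimal Python sketch, assuming the model's answer has already been parsed into a dict. The function names and the whitespace handling are illustrative choices, not part of the prompt above, and the sample email is invented for the demonstration.

```python
import re

def _squash(text: str) -> str:
    """Collapse runs of whitespace so line wraps in the email don't cause false alarms."""
    return re.sub(r"\s+", " ", text).strip()

def quote_is_verbatim(record: dict, source_email: str) -> bool:
    """Return True only if key_quote appears word for word in the source email."""
    quote = record.get("key_quote")
    if not isinstance(quote, str) or not quote.strip():
        return False  # a missing or empty quote cannot be audited
    return _squash(quote) in _squash(source_email)

# Invented example data, for illustration only.
email = "Hello,\nwe run Chez Lina and would like a demo\nbefore March 15. Thanks, Lina Moreau"
good = {"request": "demo", "key_quote": "would like a demo before March 15."}
bad = {"request": "demo", "key_quote": "We urgently need a demo."}

print(quote_is_verbatim(good, email))  # True
print(quote_is_verbatim(bad, email))   # False
```

A row that fails this check is not necessarily wrong, but it is a row worth looking at by hand.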

Classification: a closed, defined taxonomy

Classifying means filing each item into a category from a closed list. The classic mistake is giving the category names without defining them: "classify these reviews into: service, product, price, other". The model then has its own idea of the boundary between "service" and "product" — and it changes from one review to the next. A professional taxonomy defines each category, gives an example, and specifies the tie-break rules.

PROMPT
Classify each customer review into ONE category from this closed list:

- "feature": the review is about what the software does or doesn't do. E.g.: "impossible to export the schedule to PDF".
- "usability": the review is about ease of use. E.g.: "it took me 3 days to find where to change a shift".
- "support": the review is about the help received from the team. E.g.: "reply in 10 minutes, problem solved".
- "price": the review is about cost or value for money.
- "other": none of the above clearly applies.

Tie-break rules:
- If the review touches several categories, choose the category of the point it develops most and list the others in "secondary_categories".
- In case of genuine doubt, use "other" with "confidence": "low" — never force a category.

Format per review: { "id": N, "category": "...", "secondary_categories": [...], "confidence": "high|low", "key_quote": "..." }

Three mechanisms to remember. Definitions with examples stabilize the boundaries: that's few-shot (chapter 2) applied to categories. The tie-break rule handles the multi-theme case, inevitable with real reviews. And the confidence field gives the model an honorable way out when it hesitates: without it, the model forces a category with full assurance, and your statistics are silently wrong. An "other, low confidence" is honest data; a guessed "price" is toxic data.
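A closed list is only closed if something enforces it. The sketch below, in Python, checks one classified review against the taxonomy of the prompt above. The helper name `taxonomy_errors` is hypothetical, and the check assumes the model's output has already been parsed from JSON.

```python
ALLOWED_CATEGORIES = {"feature", "usability", "support", "price", "other"}
ALLOWED_CONFIDENCE = {"high", "low"}

def taxonomy_errors(record: dict) -> list:
    """Return every way a classified review breaks the closed lists (empty list = valid)."""
    errors = []
    category = record.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        errors.append(f"unknown category: {category!r}")
    confidence = record.get("confidence")
    if not isinstance(confidence, str) or confidence not in ALLOWED_CONFIDENCE:
        errors.append(f"unknown confidence: {confidence!r}")
    secondary = record.get("secondary_categories", [])
    if not isinstance(secondary, list):
        errors.append("secondary_categories is not a list")
        secondary = []
    for extra in secondary:
        if not isinstance(extra, str) or extra not in ALLOWED_CATEGORIES:
            errors.append(f"unknown secondary category: {extra!r}")
    return errors

valid = {"id": 1, "category": "support", "secondary_categories": ["price"],
         "confidence": "high", "key_quote": "reply in 10 minutes, problem solved"}
# "service" and "medium" are not in the closed lists: the model drifted.
drifted = {"id": 2, "category": "service", "secondary_categories": [],
           "confidence": "medium", "key_quote": "..."}

print(taxonomy_errors(valid))    # []
print(taxonomy_errors(drifted))  # ["unknown category: 'service'", "unknown confidence: 'medium'"]
```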

Normalizing: clean data on the first pass

Extracting isn't enough: values must be comparable. If an extraction returns "15", "fifteen employees" and "~15 ppl", your spreadsheet drowns in variants and every sort becomes wrong. Normalization is prescribed in the prompt, field by field: dates in YYYY-MM-DD format, amounts as numbers without symbols, phone numbers as digits with no separators, establishment names in original case but without quotes. Every format specified upfront is an hour of cleaning saved later.
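Formats prescribed in the prompt can be verified in code once the output comes back. A minimal Python sketch follows. The field names `contact_date`, `phone` and `amount` are assumptions chosen for the example, and null (`None`) is accepted everywhere because an absent value is legitimate.

```python
import re
from datetime import datetime

def is_iso_date(value) -> bool:
    """True for a real calendar date written exactly as YYYY-MM-DD."""
    if not isinstance(value, str) or not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")  # rejects impossible dates such as 2024-02-30
    except ValueError:
        return False
    return True

def format_errors(record: dict) -> list:
    """Return the names of the fields whose value is present but badly formatted."""
    errors = []
    date = record.get("contact_date")
    if date is not None and not is_iso_date(date):
        errors.append("contact_date")
    phone = record.get("phone")
    if phone is not None and not (isinstance(phone, str) and re.fullmatch(r"[0-9]+", phone)):
        errors.append("phone")
    amount = record.get("amount")
    if amount is not None and (isinstance(amount, bool) or not isinstance(amount, (int, float))):
        errors.append("amount")
    return errors

# Invented example rows, for illustration only.
clean = {"contact_date": "2024-03-15", "phone": "0612345678", "amount": 49.0}
dirty = {"contact_date": "15/03/2024", "phone": "06 12 34 56 78", "amount": "49 EUR"}

print(format_errors(clean))  # []
print(format_errors(dirty))  # ['contact_date', 'phone', 'amount']
```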

One specific trap deserves a warning: silent conversions. Ask for "the number of employees" and the model may convert "about ten" into 10, an invention disguised as normalization. The "only if a number is written" rule of the extraction prompt exists for that. The general principle: normalization changes the form of a present value; it never creates an absent value. The line between the two must be spelled out in your per-field rules.
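One way to catch a silent conversion after the fact is to check that the extracted number is actually written in the source. The Python sketch below uses a hypothetical helper, `team_size_is_grounded`, and takes a deliberately strict reading: only numbers written in digits count. A number spelled out in words will also be flagged, so treat a failure as "send to review", not as "wrong".

```python
import re

def team_size_is_grounded(team_size, source_text: str) -> bool:
    """True if team_size is null, or if that exact number appears in digits in the source."""
    if team_size is None:
        return True  # null is always an acceptable answer
    if isinstance(team_size, bool) or not isinstance(team_size, int):
        return False  # the schema asks for an integer
    written_numbers = {int(n) for n in re.findall(r"[0-9]+", source_text)}
    return team_size in written_numbers

print(team_size_is_grounded(15, "a team of 15 employees, open 7 days a week"))  # True
print(team_size_is_grounded(10, "We are about ten people"))                     # False
print(team_size_is_grounded(None, "We have a big team"))                        # True
```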

Not everything needs to be JSON. For a one-off analysis meant for your own eyes, a Markdown table ("return a table: ID, category, quote") is more readable and pastes straight into a document. Reserve JSON for outputs meant for a script or a spreadsheet — the right format depends on the consumer of the data.

Before/after: the normalization pass in practice

Normalization can be built into the per-field rules right at extraction — that's the ideal — but it also works as a separate pass, very useful when you inherit data that was already extracted but dirty: a form export, an old spreadsheet filled in by hand, the trade-show contact file typed by three different people. The normalization prompt is then a chain step (chapter 7) in its own right:

PROMPT
You normalize manually entered prospect records, without changing the substance.

For each row between delimiters, return the normalized version:
- contact_date: YYYY-MM-DD format. "last Tuesday" or an ambiguous date = null.
- phone: digits only, no spaces or dots. Incomplete number = null.
- establishment: original case, no quotes, no redundant "Restaurant" prefix.
- team_size: integer only if a number appears in the record, otherwise null.
- city: full name, first letter capitalized, no postal code.

Absolute rules:
- You change the FORM of present values, you NEVER create an absent value.
- Add an "alerts" field listing everything that seemed ambiguous in the row.
- Keep the row identifier as is.

--- ROWS ---
{{rows ID ; date ; phone ; establishment ; size ; city}}
--- END ---

Here the alerts field plays the role that confidence played in classification: an honorable way out for doubt. Is "06.12.34" a truncated phone number? Are "La Brasserie" and "Brasserie" the same establishment? The model flags instead of deciding in silence, and Sofia settles the alerts in one batch at the end: a few minutes, versus hours rereading every row. Normalized data without an audit trail is data you end up re-verifying entirely on the day of the first doubt.

Processing in batches without losing the thread

There remains the question of scale: 200 reviews shouldn't go into one giant prompt. Beyond a certain volume, quality degrades in the middle of the list (the model attends better to the start and the end of a long input, as seen in chapter 4), and a single malformed output can ruin the entire batch. The robust practice: batches of 10 to 20 items, each carrying an explicit identifier, and a verification per batch before moving to the next.
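Cutting the list into identified batches takes a few lines. Below is a minimal Python sketch. The function name `make_batches`, the default size of 15 and the `REVIEW-001` identifier style are illustrative assumptions, and the reviews are placeholders.

```python
def make_batches(texts: list, batch_size: int = 15, prefix: str = "REVIEW") -> list:
    """Number every entry, then cut the numbered list into batches of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    numbered = [{"id": f"{prefix}-{i:03d}", "text": text}
                for i, text in enumerate(texts, start=1)]
    return [numbered[start:start + batch_size]
            for start in range(0, len(numbered), batch_size)]

# Placeholder texts standing in for the 200 customer reviews.
reviews = [f"review text {n}" for n in range(1, 201)]
batches = make_batches(reviews)

print(len(batches))           # 14
print(batches[0][0]["id"])    # REVIEW-001
print(batches[-1][-1]["id"])  # REVIEW-200
print(len(batches[-1]))       # 5
```

Numbering before cutting matters: identifiers stay unique across the whole file, not just within one batch.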

Identifiers are the detail that saves you: number your entries ("REVIEW-001", "REVIEW-002") and require the identifier in every output object. You can then mechanically verify that no entry was skipped or duplicated — the model sometimes skips an item right in the middle of a batch without flagging it. Count the inputs, count the outputs, compare the identifiers: three ten-second checks that catch the silent losses.
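These three checks fit in one small function. The Python sketch below assumes the batch output has already been parsed into a list of dicts; the name `check_batch` and the shape of the report are hypothetical.

```python
from collections import Counter

def check_batch(input_ids: list, output_records: list) -> dict:
    """Compare the identifiers sent with the identifiers that came back."""
    output_ids = [str(record.get("id")) for record in output_records]
    counts = Counter(output_ids)
    expected = set(input_ids)
    return {
        "count_matches": len(input_ids) == len(output_ids),
        "missing": [i for i in input_ids if i not in counts],
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
        "unexpected": sorted(i for i in counts if i not in expected),
    }

sent = ["REVIEW-001", "REVIEW-002", "REVIEW-003", "REVIEW-004"]
# The model skipped REVIEW-003 and returned REVIEW-002 twice.
received = [{"id": "REVIEW-001"}, {"id": "REVIEW-002"},
            {"id": "REVIEW-002"}, {"id": "REVIEW-004"}]

report = check_batch(sent, received)
print(report["count_matches"])  # True
print(report["missing"])        # ['REVIEW-003']
print(report["duplicated"])     # ['REVIEW-002']
```

The example shows why counting alone is not enough: four inputs, four outputs, and still one review lost.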

flowchart LR
  T["Raw text: reviews, emails, notes"] --> L["Batches of 10 to 20 with identifiers"]
  L --> X["Extraction and classification per schema"]
  X --> C{"High confidence and valid JSON?"}
  C -->|"Yes"| D["Spreadsheet or dashboard"]
  C -->|"No"| H["Human review queue"]
  H -->|"Cases resolved"| D
The data pipeline: confident cases flow to the spreadsheet, doubtful cases to a human review — never a forced category.
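The decision node of the diagram can be sketched in Python as follows. The `route` function and the two queue labels are hypothetical names; the only inputs taken from the chapter are the two conditions, valid JSON and high confidence.

```python
import json

def route(raw_output: str) -> tuple:
    """Return ('spreadsheet', record) for confident, valid rows, else ('review', payload)."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError:
        return "review", raw_output  # malformed JSON goes to a human untouched
    if not isinstance(record, dict) or record.get("confidence") != "high":
        return "review", record
    return "spreadsheet", record

outputs = [
    '{"id": 1, "category": "support", "confidence": "high"}',
    '{"id": 2, "category": "other", "confidence": "low"}',
    '{"id": 3, "category": ',  # truncated output
]
destinations = [route(raw)[0] for raw in outputs]
print(destinations)  # ['spreadsheet', 'review', 'review']
```

Anything that is not explicitly "high" goes to review, including a missing confidence field: the default is caution, not acceptance.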

The safety net: route the doubtful, verify by sampling

The diagram above contains the most important architecture decision of the chapter: low-confidence cases do not go into the spreadsheet, they go into a human review queue. Out of Sofia's 200 reviews, 14 came out with low confidence; she settled them by hand in ten minutes. The other 186 were reliable — she verified it by sampling: 15 reviews drawn at random, quote compared to the source text, zero errors. Clean sample, batch accepted. It's the same spirit as the test set we'll systematize in chapter 9.
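The sampling step can be made reproducible with a fixed random seed. This is a minimal Python sketch with hypothetical function names and placeholder data standing in for the 186 high-confidence reviews; it only checks that each sampled quote appears verbatim in its source.

```python
import random

def draw_sample(records: list, sample_size: int = 15, seed: int = 42) -> list:
    """Draw a random sample; the fixed seed lets someone else rerun the same check."""
    rng = random.Random(seed)
    return rng.sample(records, min(sample_size, len(records)))

def quote_mismatches(sample: list, sources: dict) -> list:
    """Return the ids whose key_quote is not found verbatim in the source text."""
    return [r["id"] for r in sample if r["key_quote"] not in sources.get(r["id"], "")]

# Placeholder data, for illustration only.
sources = {f"REVIEW-{i:03d}": f"Review text number {i}, with a quotable sentence."
           for i in range(1, 187)}
records = [{"id": rid, "key_quote": "with a quotable sentence"} for rid in sources]

sample = draw_sample(records)
print(len(sample))                        # 15
print(quote_mismatches(sample, sources))  # []
```

An automated match only confirms that the quote exists in the text. Whether the quote really justifies the category is still a human judgment on those 15 rows.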

If the sample reveals errors, don't fix the rows one by one: look for the pattern. Three errors on ironic reviews? Add a rule or a few-shot example of irony to the prompt and rerun the batch. Fixing the output repairs once; fixing the prompt repairs every following time — that's the lesson of the chapter 5 templates, applied to data. And keep the corrected version of the prompt in your library, with its test cases.

Never process sensitive personal data without asking the confidentiality question: a prospect email contains a name, sometimes a phone number. Check your company's policy and your AI tool's (no-training option, enterprise plan) before pasting customer data into it en masse.

In the end, Sofia's pipeline fits in one sentence: identified batches, a schema with per-field rules, quotes for auditing, a confidence field for routing, a sample for validating. Five simple mechanisms — and the file of 200 reviews that had been sleeping for two months became, in one morning, a dashboard the management checks every week.

🛠️ Your turn

Context

Back from the Lyon trade show, the sales team is waiting for the qualified prospect list: who wants a demo, what establishment size, what urgency. Sofia has 80 emails in bulk. She wants to build the complete extraction pipeline — schema, per-field rules, identified batches, routing of doubtful cases — and deliver a reliable table before Friday, with proof by sampling that the data is correct.

Instructions

  1. Choose a source of free text from your daily work (emails, reviews, notes, tickets) and list the 5-6 fields that would have value in a spreadsheet.
  2. Write the JSON schema with one rule per field: expected form, normalization, and value when absent (null, never invention).
  3. Add a key_quote field and a confidence field with its usage rule: when in doubt, low confidence rather than a forced category.
  4. Number 15-20 real entries, process them as one batch, then verify mechanically: as many outputs as inputs, all identifiers present, valid JSON.
  5. Sample 5 rows: compare each quote to the source text. If there are errors, look for the pattern and fix the prompt, not the rows.
  6. Handle the low-confidence queue by hand, then deliver the final table — and file the validated prompt in your library with its test cases.
Hint — The key_quote field is your best ally: a row whose quote doesn't match the source text is a row to reject — and an error pattern to look for in the prompt.

In summary

  • Reliable extraction = exact schema + per-field rules (form, normalization, value if absent) + a ban on inventing.
  • The key_quote field anchors every data point in the source text: auditable in two seconds, less invention.
  • A professional taxonomy defines each category with an example and tie-break rules — not just names.
  • The confidence field gives doubt an honorable way out: "other, low confidence" beats a forced category that skews the stats.
  • Normalization changes the form of a present value, it never creates an absent one.
  • Process in batches of 10-20 with identifiers, and verify mechanically: output count, identifiers, JSON validity.
  • Route doubtful cases to a human review and validate batches by sampling; if errors, fix the prompt (the pattern), not the rows.

Quiz — check your understanding

1. What is the key_quote field for in an extraction prompt?

The exact quote is verified in two seconds against the source text, and forces the model to base its classification on a real passage.

2. Why define each category of a taxonomy with an example?

"Service" vs "product" is ambiguous without a definition. Definitions + examples + tie-break rules stabilize the boundaries — it's few-shot applied to categories.

3. What do you do with an item the model hesitates to classify?

A forced category silently skews the statistics. Routing the doubtful to a human is the pipeline's key architecture decision.

4. What is the limit of normalization?

Converting "about ten" into 10 is an invention in disguise. "Only if a number is written" draws the line in black and white.

5. Why process 200 reviews in batches of 10-20 rather than one giant prompt?

The model attends better to the start and the end of a long input than to the middle, and small batches with identifiers allow a mechanical check batch by batch.

6. The sample reveals 3 errors on ironic reviews. What do you do?

Fixing the output repairs once; fixing the prompt repairs every following time. The error pattern is the most precious information of the sample.

Author(s)

REHOUMA Haythem

Haythem Rehouma is an AI and cloud engineer and architect, trainer, and technical instructor, with a profile focused on medical AI, AWS, MLOps, LLM/RAG, and computer vision.