Evaluation & Systematic Iteration
Measuring prompt quality and preventing regressions
Prompt engineering without evaluation is guesswork. A rigorous eval process uses a curated test suite of inputs with expected outputs, scores each prompt version against quality metrics, and prevents regressions when prompts are updated. LLM-as-judge scales evaluation beyond what manual review can handle.
The most common failure in prompt engineering is "vibes-based" evaluation — you try a prompt on three examples, it looks good, you ship it. Then it fails on the 50th example in a way you never tested. Systematic evaluation replaces intuition with data. Build a test suite of 30–100 representative inputs covering normal cases, edge cases, adversarial inputs, and known failure modes from previous prompt versions. For each input, define what a correct response looks like — either an exact expected answer or a set of quality criteria (is it grounded? does it follow the format? is it under 200 words?).
Scoring can be automated at three levels. Exact match and regex checks handle structured outputs (did the model return valid JSON with the required keys?). Semantic similarity (embedding cosine distance) checks whether the answer is close to a reference answer in meaning, even if phrased differently. LLM-as-judge uses a capable model to score responses against rubrics ("Rate from 1–5: is this answer grounded in the provided context?"). Each level trades precision for coverage — use exact match where possible, semantic similarity for free-text, and LLM-as-judge for complex quality dimensions like tone, helpfulness, and safety.
Every prompt change must be tested against the full eval suite before deployment. Track scores across versions to detect regressions — a prompt that improves accuracy on one category may degrade another. Store eval results alongside prompt versions in version control so you can trace any production issue back to the specific prompt change that caused it. The eval suite itself evolves: every time you discover a new failure mode in production, add it as a test case. Over time, the suite becomes the institutional knowledge of what your prompt must handle.
Key Concepts
- Build a test suite of 30–100 inputs covering normal, edge, and adversarial cases
- Three scoring levels: exact match (structured), semantic similarity (free-text), LLM-as-judge (quality)
- Every prompt change must pass the full eval suite before deployment — no vibes-based shipping
- Track scores across versions to detect regressions from prompt changes
- Add every production failure as a new test case — the suite is institutional knowledge