LLM-as-Judge replaces informal "does this look good?" checks with explicit, reproducible scoring. A separate judge LLM evaluates the generated answer on three dimensions — accuracy, helpfulness, and specificity — assigning numeric scores on a 1–10 scale for each.
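A minimal sketch of the judge side, assuming a generic `call_llm(prompt, temperature)` helper (hypothetical; substitute your provider's client) and a prompt that asks the judge to return its scores as JSON:

```python
import json

def call_llm(prompt: str, temperature: float) -> str:
    """Placeholder: wire this up to your LLM provider of choice."""
    raise NotImplementedError

JUDGE_PROMPT = """You are an impartial judge. Score the ANSWER to the QUESTION
on three dimensions, each from 1 (worst) to 10 (best):
- accuracy: is the answer factually correct?
- helpfulness: does it address the user's actual need?
- specificity: is it concrete rather than vague?

Respond with JSON only:
{{"accuracy": int, "helpfulness": int, "specificity": int,
  "feedback": "one short paragraph of concrete critique"}}

QUESTION:
{question}

ANSWER:
{answer}
"""

def judge(question: str, answer: str) -> dict:
    """Score an answer; temperature 0 keeps the judge's scoring stable."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer),
                   temperature=0)
    return json.loads(raw)
```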
If the combined score falls below `SCORE_THRESHOLD = 7`, the generator receives the judge's detailed feedback and produces a revised answer. The loop continues until the score passes or `MAX_ATTEMPTS = 3` is reached. Running the judge at temperature 0 keeps its scoring deterministic and reproducible, while a higher generator temperature leaves room for varied revisions; the two roles specialize in complementary directions.
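The retry loop itself then needs only a few lines. A sketch under the same assumptions as above, taking the combined score as the mean of the three dimensions (one reasonable reading; the aggregation rule is otherwise unspecified here):

```python
SCORE_THRESHOLD = 7  # minimum acceptable combined (mean) score
MAX_ATTEMPTS = 3     # give up after this many generate/judge rounds

def generate_with_judge(question: str) -> str:
    answer, feedback = "", ""
    for _ in range(MAX_ATTEMPTS):
        # A higher generator temperature encourages varied revisions.
        prompt = question if not feedback else (
            f"{question}\n\nYour previous answer:\n{answer}\n\n"
            f"Judge feedback:\n{feedback}\n\nWrite an improved answer."
        )
        answer = call_llm(prompt, temperature=0.7)

        scores = judge(question, answer)
        combined = (scores["accuracy"] + scores["helpfulness"]
                    + scores["specificity"]) / 3
        if combined >= SCORE_THRESHOLD:
            return answer              # passed the bar: stop early
        feedback = scores["feedback"]  # feed the critique into the next try
    return answer  # best effort after MAX_ATTEMPTS
```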
This pattern is widely used for automated evaluation of RAG systems, chatbot responses, and generated reports. Because the same judge prompt can be reused across generator variants, it also serves as a consistent yardstick for A/B testing different agent configurations, as in the sketch below.
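With the judge held fixed, A/B testing reduces to scoring each variant's answers with the same `judge` helper; the question and the two variants below are purely illustrative:

```python
QUESTION = "How do I rotate API keys without downtime?"  # illustrative

# Two hypothetical generator variants sharing one judge.
candidates = {
    "variant A": call_llm(QUESTION, temperature=0.3),
    "variant B": call_llm(f"Think step by step.\n\n{QUESTION}",
                          temperature=0.9),
}

for name, answer in candidates.items():
    s = judge(QUESTION, answer)
    mean = (s["accuracy"] + s["helpfulness"] + s["specificity"]) / 3
    print(f"{name}: mean={mean:.1f} scores={s}")
```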