
Self-Consistency & Verification

Multiple reasoning paths and majority vote

Self-consistency samples multiple chain-of-thought reasoning paths for the same question and selects the most common final answer by majority vote. This reduces variance from any single reasoning chain and catches errors that a lone CoT path might produce.

A single chain-of-thought can be wrong even when it looks convincing. Self-consistency addresses this by sampling multiple reasoning paths (using temperature > 0) and aggregating the final answers. The intuition is that correct reasoning is more likely to converge on the same answer from different angles, while errors tend to be idiosyncratic and scatter across different wrong answers. If five CoT paths produce answers [42, 42, 37, 42, 41], the answer 42 wins with three of five votes and is the most plausible result, even though two individual chains got it wrong.
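The vote over the five example answers can be reproduced in a few lines. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from collections import Counter

# Final answers extracted from five sampled reasoning paths (the example above).
answers = [42, 42, 37, 42, 41]

# most_common(1) returns a list containing the (value, count) pair with the highest count.
winner, votes = Counter(answers).most_common(1)[0]

# Share of samples that agree with the winner; a rough signal of how stable the answer is.
agreement = votes / len(answers)

print(winner, votes, agreement)
```

Here 42 receives three of five votes. The agreement ratio is worth keeping alongside the answer, since a narrow win is weaker evidence than a near-unanimous one.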

Implementation is straightforward: send the same prompt N times (typically 5–20) with a moderate temperature (0.5–0.7), extract the final answer from each response, and take the plurality vote, meaning the most frequent answer, which need not exceed half of the samples. For numeric answers, exact match works once the values are normalized (so that "42" and "42.0" count as the same answer). For free-text answers, you may need fuzzy matching or a secondary LLM call to cluster semantically equivalent answers. The technique works best for tasks with a single correct answer: math, factual questions, classification. It is less useful for creative or open-ended generation where there is no single "right" output.
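The loop described above might look like the following sketch. It assumes each response ends with a line such as "Answer: 42"; that convention, the `extract_answer` and `self_consistency` helpers, and the `sample_fn` callable (a stand-in for whatever client sends the prompt at temperature > 0) are all hypothetical. A canned sampler replaces the model so the example runs offline.

```python
import re
from collections import Counter
from typing import Callable, Optional, Tuple

# Assumed output convention: the response states "Answer: <number>" (or "answer = <number>").
ANSWER_RE = re.compile(r"answer\s*[:=]\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)

def extract_answer(response: str) -> Optional[str]:
    """Return the last stated numeric answer as a normalized string, or None."""
    matches = ANSWER_RE.findall(response)
    if not matches:
        return None
    value = float(matches[-1])
    # Normalize so that "42" and "42.0" fall into the same vote bucket.
    return str(int(value)) if value.is_integer() else str(value)

def self_consistency(
    prompt: str, sample_fn: Callable[[str], str], n: int = 5
) -> Tuple[Optional[str], float]:
    """Sample n responses and return (plurality answer, share of all n samples)."""
    extracted = [extract_answer(sample_fn(prompt)) for _ in range(n)]
    answers = [a for a in extracted if a is not None]
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    # Dividing by n (not len(answers)) lets unparseable responses lower the agreement.
    return winner, votes / n

# Hypothetical stand-in for a model call: returns canned responses in order.
canned = iter([
    "6 times 7 is 42. Answer: 42",
    "Adding 7 six times gives 42. Answer: 42.0",
    "6 * 7 = 37? Answer: 37",
    "I think answer = 42",
    "I am not sure about this one.",
])

def fake_sampler(prompt: str) -> str:
    return next(canned)

winner, agreement = self_consistency("What is 6 * 7?", fake_sampler, n=5)
print(winner, agreement)
```

With real responses the extraction step is usually the fragile part, so it helps to instruct the model to end with a fixed answer format. When two answers tie, `Counter.most_common` returns the one encountered first, which is an arbitrary tie-break; a production version may prefer to draw more samples instead.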

The token cost is N× that of a single call. Latency also grows N× if the samples are requested sequentially; issued in parallel, it stays close to that of the slowest single call. Either way, the added expense limits self-consistency to high-value decisions where accuracy matters more than cost or speed. A practical middle ground is to use self-consistency selectively: run a single CoT first, and only trigger multiple samples when the model expresses low confidence or when the task is known to be error-prone. You can also combine self-consistency with verification: after the vote, ask the model to check whether the winning answer is consistent with the question, which can catch residual errors.
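The selective strategy can be sketched as follows. How low confidence is detected is left open in the text, so this example assumes a hypothetical `Sample` record whose `confident` flag has already been derived (for instance from a self-reported confidence line); `answer_selectively` and `verification_prompt` are likewise illustrative names, and the prompt wording is only an example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Sample:
    answer: str
    confident: bool  # hypothetical flag; how it is derived is an assumption

def answer_selectively(
    prompt: str, sample_fn: Callable[[str], Sample], n: int = 5
) -> Tuple[str, int]:
    """Return (answer, number of model calls made)."""
    first = sample_fn(prompt)
    if first.confident:
        # Cheap path: accept the single chain-of-thought answer.
        return first.answer, 1
    # Expensive path: draw n - 1 more samples and vote over all n.
    answers = [first.answer] + [sample_fn(prompt).answer for _ in range(n - 1)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, n

def verification_prompt(question: str, answer: str) -> str:
    """Build a follow-up prompt asking the model to check the winning answer."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer consistent with the question? "
        "Reply 'consistent' or 'inconsistent' with a one-line reason."
    )

# Canned samples stand in for model calls so the sketch runs offline.
easy = iter([Sample("12", True)])
hard = iter([
    Sample("41", False), Sample("42", True), Sample("42", True),
    Sample("37", False), Sample("42", True),
])

easy_result = answer_selectively("What is 3 * 4?", lambda _: next(easy))
hard_result = answer_selectively("What is 6 * 7?", lambda _: next(hard))
print(easy_result, hard_result)
print(verification_prompt("What is 6 * 7?", hard_result[0]))
```

The confident question costs one call, while the uncertain one escalates to five and recovers 42 despite the first sample being wrong. The verification prompt would be sent as one additional call after the vote.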

Key Concepts

  • Self-consistency samples N reasoning paths and takes the majority-vote answer
  • Correct reasoning converges; errors are idiosyncratic — voting filters noise
  • Use temperature 0.5–0.7 for diversity; 5–20 samples balance accuracy against cost
  • Best for tasks with a single correct answer — math, classification, factual QA
  • Combine with verification: after voting, ask the model to double-check the winning answer