Self-RAG hands the LLM control of the entire retrieval-generation pipeline. At each step it decides: "Do I need to retrieve at all, or do I already know the answer?" If retrieval is needed, it grades each retrieved chunk for relevance before passing it to the generator.
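Here is a minimal sketch of those two gates in Python. Everything in it is illustrative rather than actual Self-RAG source: the `RAGState` schema, the `llm_yes_no` helper, and the prompt wording are all assumptions. `decide_retrieval` returns the name of the next step, and `grade_documents` filters the retrieved chunks down to the relevant ones.

```python
from typing import List, TypedDict

class RAGState(TypedDict):
    question: str
    documents: List[str]
    answer: str
    retries: int

def llm_yes_no(prompt: str) -> bool:
    """Hypothetical helper: send `prompt` to your LLM and parse a yes/no verdict."""
    raise NotImplementedError  # wire up your model client here

def decide_retrieval(state: RAGState) -> str:
    """Routing step: does the model need external evidence at all?"""
    needs_docs = llm_yes_no(
        "Do you need retrieved documents to answer this, or can you answer "
        f"from your own knowledge?\nQuestion: {state['question']}"
    )
    return "retrieve" if needs_docs else "generate"

def grade_documents(state: RAGState) -> dict:
    """Keep only the chunks the model grades as relevant to the question."""
    relevant = [
        doc for doc in state["documents"]
        if llm_yes_no(f"Is this chunk relevant to '{state['question']}'?\n\n{doc}")
    ]
    return {"documents": relevant}
```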
After generation, the model performs a self-critique: "Is my answer faithful to the retrieved evidence? Is it actually useful for the question?" If either check fails, the system retries retrieval, up to `MAX_RETRIES` times, with a refined query. The graph transitions to `__end__` only once both checks pass or the retry budget is exhausted.
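Continuing the sketch, here is how that critique-and-retry loop might be wired up as a graph with LangGraph (which the `__end__` node name suggests). It reuses `RAGState`, `decide_retrieval`, `grade_documents`, and `llm_yes_no` from above; `vector_store_search`, `llm_generate`, `llm_rewrite`, and the value of `MAX_RETRIES` are hypothetical stand-ins for your retriever, model calls, and retry cap.

```python
from typing import List
from langgraph.graph import StateGraph, START, END

MAX_RETRIES = 3  # assumption: set this to whatever the text's MAX_RETRIES is

# Hypothetical stand-ins for your retriever and model calls:
def vector_store_search(question: str) -> List[str]: ...
def llm_generate(question: str, documents: List[str]) -> str: ...
def llm_rewrite(question: str) -> str: ...

def retrieve(state: RAGState) -> dict:
    """Fetch candidate chunks for the current (possibly rewritten) query."""
    return {"documents": vector_store_search(state["question"])}

def generate(state: RAGState) -> dict:
    """Draft an answer from the relevance-filtered evidence."""
    return {"answer": llm_generate(state["question"], state["documents"])}

def rewrite_query(state: RAGState) -> dict:
    """Refine the query for the next attempt and spend one retry."""
    return {"question": llm_rewrite(state["question"]),
            "retries": state["retries"] + 1}

def critique(state: RAGState) -> str:
    """Self-critique router: exit when the answer is faithful and useful,
    or when the retry budget is exhausted; otherwise retry retrieval."""
    grounded = llm_yes_no(
        f"Is this answer supported by the evidence?\n{state['answer']}\n\n"
        + "\n".join(state["documents"])
    )
    useful = llm_yes_no(
        f"Does this answer address the question '{state['question']}'?"
        f"\n{state['answer']}"
    )
    if (grounded and useful) or state["retries"] >= MAX_RETRIES:
        return END
    return "rewrite_query"

graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve)
graph.add_node("grade", grade_documents)
graph.add_node("generate", generate)
graph.add_node("rewrite_query", rewrite_query)

# "Do I need to retrieve at all?": skip straight to generation if not.
graph.add_conditional_edges(START, decide_retrieval,
                            {"retrieve": "retrieve", "generate": "generate"})
graph.add_edge("retrieve", "grade")
graph.add_edge("grade", "generate")
# Self-critique after generation: finish, or loop back via a refined query.
graph.add_conditional_edges("generate", critique,
                            {END: END, "rewrite_query": "rewrite_query"})
graph.add_edge("rewrite_query", "retrieve")

app = graph.compile()
# app.invoke({"question": "...", "documents": [], "answer": "", "retries": 0})
```

Note the design choice: `critique` is a router, not a node, so the faithfulness and usefulness checks add no state of their own; they only pick the next edge, which keeps the retry loop a single cycle through `rewrite_query` and `retrieve`.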
This is the most autonomous of the RAG patterns and tends to produce the highest-quality answers, at the cost of variable latency (the retry loop) and higher token consumption (every grading and critique step is an extra LLM call). It's well suited to high-stakes question answering where correctness matters more than speed.