Evaluate your RAG system with precision@k and faithfulness
Stop guessing whether your RAG system is good. Build a golden-Q&A eval harness that measures retrieval precision, answer faithfulness, and answer quality — the difference between 'I built it' and 'I ship it'.
Evals are tests for AI
Every change to a prompt, model, retrieval strategy, or tool without an eval is a gamble. You are guessing it got better. Often it did not.
The fundamental problem: LLM outputs are non-deterministic and hard to judge manually at scale. You cannot read 500 responses and feel whether version B is better than version A. You need automated, repeatable evaluation.
Evals are to AI engineering what tests are to software engineering. They are not a nice-to-have. You wouldn't ship TypeScript without type-checking it; you shouldn't ship RAG without evals.
A RAG eval harness measures two layers:
- Retrieval quality. Do the right chunks show up in the top k? Precision@k, the fraction of the top k retrieved chunks that are actually relevant, answers this; it's the first metric you should instrument.
- Answer quality. Is the generated answer grounded in the retrieved chunks? Does it actually address the question? Faithfulness answers the first question and answer relevance the second; in practice both are often scored by an LLM acting as judge.
This lesson builds both layers — the harness you'll use every time you change retrieval weights, swap embedding models, or rewrite a prompt. It composes Lesson 4's retrieve() and Lesson 5's groundedAnswer() into a measurable pipeline.
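As a minimal sketch of the retrieval layer, precision@k can be computed from chunk IDs alone. The names `precisionAtK`, `hitAtK`, `meanPrecisionAtK`, and the `RetrievalCase` shape are illustrative assumptions, not part of Lesson 4's retrieve(). The sketch also assumes every chunk has a stable, unique ID and that the golden set lists the relevant IDs for each question.

```typescript
// Hypothetical shape: the IDs a retriever returned (ranked) for one golden
// question, plus the IDs a human marked as relevant for that question.
interface RetrievalCase {
  retrievedIds: string[];
  relevantIds: string[];
}

// Precision@k: fraction of the top-k retrieved chunks that are relevant.
function precisionAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  if (k <= 0) throw new RangeError("k must be positive");
  const relevant = new Set(relevantIds);
  const hits = retrievedIds.slice(0, k).filter((id) => relevant.has(id)).length;
  // Divide by k (not by the number returned) so short result lists are penalized.
  return hits / k;
}

// Hit@k: 1 if at least one relevant chunk appears in the top k, otherwise 0.
function hitAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  return precisionAtK(retrievedIds, relevantIds, k) > 0 ? 1 : 0;
}

// Mean precision@k across a golden set of retrieval cases.
function meanPrecisionAtK(cases: RetrievalCase[], k: number): number {
  if (cases.length === 0) throw new RangeError("need at least one case");
  const total = cases.reduce(
    (sum, c) => sum + precisionAtK(c.retrievedIds, c.relevantIds, k),
    0,
  );
  return total / cases.length;
}
```

Dividing by k rather than by the number of chunks returned is a design choice: it penalizes a retriever that comes back with fewer than k results. Hit@k is included because it is the stricter reading of "does the right chunk show up at all", and the two numbers can diverge when a question has only one relevant chunk.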
Not every eval needs a judge model. Mature AI teams usually separate:
- Deterministic checks. Schema validity, refusal regexes, citation substring checks, exact-match or contains assertions, latency budgets.
- Model-graded checks. Faithfulness, answer relevance, completeness, harmfulness, or nuanced factuality.
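The deterministic side needs no model call at all. The sketch below shows a few of the checks listed above as plain functions; the `AnswerRecord` and `Expectation` shapes, the field names, and the refusal regex are all hypothetical and would need to match whatever your own pipeline emits.

```typescript
// Hypothetical record of one pipeline run; field names are illustrative.
interface AnswerRecord {
  answer: string;
  quotes: string[]; // verbatim spans the answer cites
  chunks: string[]; // text of the retrieved chunks
  latencyMs: number;
}

// Hypothetical per-question expectations from the golden set.
interface Expectation {
  shouldRefuse: boolean;
  mustContain: string[];
  latencyBudgetMs: number;
}

interface CheckResult {
  name: string;
  passed: boolean;
}

// Example refusal phrasing only; tune this to your own prompt's refusal text.
const REFUSAL_PATTERN = /\b(i (do not|don't) know|cannot answer)\b/i;

function runDeterministicChecks(record: AnswerRecord, expected: Expectation): CheckResult[] {
  const answerLower = record.answer.toLowerCase();
  return [
    // The answer refuses exactly when the golden set says it should.
    { name: "refusal", passed: REFUSAL_PATTERN.test(record.answer) === expected.shouldRefuse },
    // Every cited quote must appear verbatim in at least one retrieved chunk.
    {
      name: "citations",
      passed: record.quotes.every((q) => record.chunks.some((c) => c.includes(q))),
    },
    // Case-insensitive "contains" assertions on the answer text.
    {
      name: "contains",
      passed: expected.mustContain.every((s) => answerLower.includes(s.toLowerCase())),
    },
    // Latency stays within the budget.
    { name: "latency", passed: record.latencyMs <= expected.latencyBudgetMs },
  ];
}
```

Because these checks are cheap and repeatable, they can run on every example in the golden set, leaving the judge model for the questions that genuinely need judgment.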
These teams also separate:
- Outcome metrics. Did the final answer meet the bar?
- Process metrics. Did retrieval return zero chunks, did the model refuse, how many claims were cited, how long did the request take?
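One way to keep the two kinds of metric apart is to record them as separate fields on every eval run and aggregate them separately. The sketch below assumes a hypothetical `EvalRun` record and a `summarizeRuns` helper; neither name comes from the earlier lessons, and the p95 latency uses the simple nearest-rank method.

```typescript
// Hypothetical per-question record; field names are illustrative.
interface EvalRun {
  passed: boolean; // outcome: did the final answer meet the bar?
  retrievedChunkCount: number; // process: 0 means retrieval came back empty
  refused: boolean; // process: did the model refuse?
  citedClaimCount: number; // process: how many claims carried a citation
  latencyMs: number; // process: end-to-end request time
}

interface EvalSummary {
  passRate: number;
  zeroRetrievalRate: number;
  refusalRate: number;
  meanCitedClaims: number;
  p95LatencyMs: number;
}

function summarizeRuns(runs: EvalRun[]): EvalSummary {
  if (runs.length === 0) throw new RangeError("need at least one run");
  const n = runs.length;
  // Share of runs for which a predicate holds.
  const rate = (pred: (r: EvalRun) => boolean): number => runs.filter(pred).length / n;
  // Nearest-rank 95th percentile over ascending latencies.
  const latencies = runs.map((r) => r.latencyMs).sort((a, b) => a - b);
  const p95Index = Math.ceil(0.95 * n) - 1;
  return {
    passRate: rate((r) => r.passed),
    zeroRetrievalRate: rate((r) => r.retrievedChunkCount === 0),
    refusalRate: rate((r) => r.refused),
    meanCitedClaims: runs.reduce((sum, r) => sum + r.citedClaimCount, 0) / n,
    p95LatencyMs: latencies[p95Index],
  };
}
```

Reading the summary side by side is the point: a drop in `passRate` alongside a rise in `zeroRetrievalRate` points at retrieval, while a drop with steady process metrics points at the prompt or the model.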