Lesson 6 of 8

Evaluate your RAG system with precision@k and faithfulness

Stop guessing whether your RAG system is good. Build a golden-Q&A eval harness that measures retrieval precision, answer faithfulness, and answer quality — the difference between 'I built it' and 'I ship it'.

Step 1 · concept

Eval is tests for AI

Every change to a prompt, model, retrieval strategy, or tool without an eval is a gamble. You are guessing it got better. Often it did not.

The fundamental problem: LLM outputs are non-deterministic and hard to judge manually at scale. You cannot read 500 responses and feel whether version B is better than version A. You need automated, repeatable evaluation.

Evals are to AI engineering what tests are to software engineering. They are not a nice-to-have. You wouldn't ship TypeScript without type-checking it; you shouldn't ship RAG without an eval.

A RAG eval harness measures two layers:

  • Retrieval quality. Do the right chunks show up in the top-k? Precision@k, the fraction of the top-k retrieved chunks that are relevant, answers this; it's the first metric you should instrument.
  • Answer quality. Is the generated answer grounded in the retrieved chunks? Does it actually address the question? Faithfulness and answer relevancy answer these; in practice they are often scored by an LLM acting as a judge.
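To make the retrieval metric concrete, here is a minimal sketch of precision@k, with hit rate@k alongside for comparison. It assumes relevance labels are stored as chunk IDs per golden question; the function names and the ID scheme are illustrative, not part of the lesson's codebase.

```typescript
// Precision@k: fraction of the top-k retrieved chunk IDs that are labeled relevant.
// Assumption: relevance is labeled per chunk ID in the golden set.
function precisionAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  if (!Number.isInteger(k) || k <= 0) throw new RangeError("k must be a positive integer");
  const topK = retrievedIds.slice(0, k);
  const hits = topK.filter((id) => relevantIds.has(id)).length;
  // Divide by k (not topK.length) so returning fewer than k chunks is not rewarded.
  return hits / k;
}

// Hit rate@k: 1 if at least one relevant chunk appears in the top-k, else 0.
function hitAtK(retrievedIds: string[], relevantIds: Set<string>, k: number): number {
  return retrievedIds.slice(0, k).some((id) => relevantIds.has(id)) ? 1 : 0;
}

const retrieved = ["c7", "c2", "c9", "c4", "c1"];
const relevant = new Set(["c2", "c4", "c8"]);
console.log(precisionAtK(retrieved, relevant, 5)); // 0.4 (2 relevant chunks out of 5)
console.log(hitAtK(retrieved, relevant, 1)); // 0 (the top result, c7, is not relevant)
```

Dividing by k rather than by the number of chunks returned is a design choice: it keeps scores comparable across queries, at the cost of penalizing a retriever that deliberately returns fewer results.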

This lesson builds both layers — the harness you'll use every time you change retrieval weights, swap embedding models, or rewrite a prompt. It composes Lesson 4's retrieve() and Lesson 5's groundedAnswer() into a measurable pipeline.
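As a rough picture of how the two layers compose, the sketch below runs a golden set through retrieval and generation and averages both scores. The signatures of retrieve() and groundedAnswer(), the judgeFaithfulness() function, and every type shown here are assumptions made for illustration; the real functions from Lessons 4 and 5 may differ, so they are injected as parameters rather than imported.

```typescript
// All types and signatures below are assumptions for illustration; adapt them
// to the actual retrieve() and groundedAnswer() from Lessons 4 and 5.
interface Chunk { id: string; text: string }
interface GoldenCase { question: string; relevantIds: string[] }
interface Pipeline {
  retrieve: (query: string, k: number) => Promise<Chunk[]>;
  groundedAnswer: (question: string, chunks: Chunk[]) => Promise<string>;
  // Hypothetical judge: returns a faithfulness score in [0, 1].
  judgeFaithfulness: (answer: string, chunks: Chunk[]) => Promise<number>;
}
interface EvalReport { meanPrecisionAtK: number; meanFaithfulness: number; cases: number }

async function runEval(golden: GoldenCase[], p: Pipeline, k: number): Promise<EvalReport> {
  if (golden.length === 0) throw new Error("golden set is empty");
  let precisionSum = 0;
  let faithfulnessSum = 0;
  for (const c of golden) {
    // Layer 1: retrieval quality, scored against the labeled relevant chunk IDs.
    const chunks = (await p.retrieve(c.question, k)).slice(0, k);
    const relevant = new Set(c.relevantIds);
    precisionSum += chunks.filter((ch) => relevant.has(ch.id)).length / k;
    // Layer 2: answer quality, scored by whatever judge was injected.
    const answer = await p.groundedAnswer(c.question, chunks);
    faithfulnessSum += await p.judgeFaithfulness(answer, chunks);
  }
  return {
    meanPrecisionAtK: precisionSum / golden.length,
    meanFaithfulness: faithfulnessSum / golden.length,
    cases: golden.length,
  };
}

// Stub pipeline so the sketch runs without a model or a vector store.
const stubPipeline: Pipeline = {
  retrieve: async () => [
    { id: "a", text: "Refunds take 5 days." },
    { id: "b", text: "Shipping is free." },
  ],
  groundedAnswer: async (_question, chunks) => chunks[0]?.text ?? "",
  judgeFaithfulness: async (answer, chunks) =>
    answer.length > 0 && chunks.some((c) => c.text.includes(answer)) ? 1 : 0,
};

const goldenSet: GoldenCase[] = [{ question: "How long do refunds take?", relevantIds: ["a"] }];
const reportPromise = runEval(goldenSet, stubPipeline, 2);
reportPromise.then((report) => console.log(report));
// { meanPrecisionAtK: 0.5, meanFaithfulness: 1, cases: 1 }
```

Injecting the pipeline as an object means the same harness can score two retriever configurations against the same golden set, which is the comparison you want before merging a change.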

Not every eval needs a judge model. Mature AI teams usually separate:

  • Deterministic checks. Schema validity, refusal regexes, citation substring checks, exact-match or contains assertions, latency budgets.
  • Model-graded checks. Faithfulness, answer relevance, completeness, harmfulness, or nuanced factuality.
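Deterministic checks need no model call, so they are cheap to run on every case. The sketch below shows one way a few of them could look; the RagResult shape, the refusal pattern, and the latency budget are illustrative assumptions rather than a fixed schema.

```typescript
// Field names, the refusal pattern, and the latency budget are illustrative assumptions.
interface RagResult {
  answer: string;
  citations: string[]; // quoted spans the answer claims to cite
  chunkTexts: string[]; // text of the retrieved chunks
  latencyMs: number;
}
interface CheckResult { name: string; pass: boolean }

// No global flag, so test() keeps no state between calls.
const REFUSAL_PATTERN = /\b(i (don't|do not) know|cannot answer|not enough information)\b/i;

function deterministicChecks(r: RagResult, mustContain: string, latencyBudgetMs: number): CheckResult[] {
  return [
    { name: "non-empty answer", pass: r.answer.trim().length > 0 },
    { name: "not a refusal", pass: !REFUSAL_PATTERN.test(r.answer) },
    // Every quoted citation must appear verbatim in at least one retrieved chunk.
    {
      name: "citations are substrings",
      pass: r.citations.every((quote) => r.chunkTexts.some((text) => text.includes(quote))),
    },
    // Case-insensitive "contains" assertion against the expected text.
    { name: "contains expected text", pass: r.answer.toLowerCase().includes(mustContain.toLowerCase()) },
    { name: "within latency budget", pass: r.latencyMs <= latencyBudgetMs },
  ];
}

const sample: RagResult = {
  answer: "Refunds are processed within 5 business days.",
  citations: ["within 5 business days"],
  chunkTexts: ["Refunds are processed within 5 business days of approval."],
  latencyMs: 1200,
};
const failed = deterministicChecks(sample, "5 business days", 1000)
  .filter((check) => !check.pass)
  .map((check) => check.name);
console.log(failed); // ["within latency budget"]
```

Returning named results instead of a single boolean makes it easier to see which check regressed when a run fails.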

And they separate:

  • Outcome metrics. Did the final answer meet the bar?
  • Process metrics. Did retrieval return zero chunks, did the model refuse, how many claims were cited, how long did the request take?

You change the dense/lexical fusion weight in your hybrid retriever from 0.7/0.3 to 0.5/0.5. Three users report the bot feels 'smarter' that afternoon. Should you merge the change?