AI Learning

Step 1 · concept

Evals are tests for AI

Every change to a prompt, model, retrieval strategy, or tool without an eval is a gamble. You are guessing it got better. Often it did not.

The fundamental problem: LLM outputs are non-deterministic and hard to judge manually at scale. You cannot read 500 responses and feel whether version B is better than version A. You need automated, repeatable evaluation.

Evals are to AI engineering what tests are to software engineering. They are not a nice-to-have. You wouldn't ship TypeScript without type-checking; you can't ship RAG without evals.

A RAG eval harness measures two layers:

Retrieval quality. Do the right chunks show up in the top-k? Precision@k, the fraction of the top-k results that are relevant, answers this; it's the first metric you should instrument.
Answer quality. Is the generated answer grounded in the retrieved chunks? Does it actually address the question? Faithfulness and answer relevancy measure these; in practice both are often scored by an LLM-as-judge.
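To make the retrieval layer concrete, here is a minimal sketch of precision@k in TypeScript. The RetrievedChunk shape, the function names, and the sample ids are assumptions made for illustration, not the lesson's actual API. A hit@k helper is included as well, since "did any relevant chunk make the top-k" is a slightly different question from "what fraction of the top-k is relevant".

```typescript
// Hypothetical shape: a retrieved chunk is identified by an id.
type RetrievedChunk = { id: string };

// precision@k: fraction of the top-k retrieved chunks that are labeled relevant.
// This version divides by k, so returning fewer than k results is penalized
// (one common convention; some teams divide by the number actually returned).
function precisionAtK(
  retrieved: RetrievedChunk[],
  relevantIds: Set<string>,
  k: number
): number {
  if (!Number.isInteger(k) || k <= 0) {
    throw new Error("k must be a positive integer");
  }
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((chunk) => relevantIds.has(chunk.id)).length;
  return hits / k;
}

// hit@k: 1 if at least one relevant chunk appears in the top-k, else 0.
function hitAtK(
  retrieved: RetrievedChunk[],
  relevantIds: Set<string>,
  k: number
): number {
  return retrieved.slice(0, k).some((chunk) => relevantIds.has(chunk.id)) ? 1 : 0;
}

// Illustrative data: ranked results for one query, plus hand-labeled relevant ids.
const rankedResults: RetrievedChunk[] = [
  { id: "doc-3" },
  { id: "doc-7" },
  { id: "doc-1" },
  { id: "doc-9" },
];
const labeledRelevant = new Set(["doc-1", "doc-3"]);

const p3 = precisionAtK(rankedResults, labeledRelevant, 3); // 2 of the top 3 are relevant
const p4 = precisionAtK(rankedResults, labeledRelevant, 4); // 2 of the top 4 are relevant
const h1 = hitAtK(rankedResults, labeledRelevant, 1); // the top result is relevant
```

Averaging precisionAtK over every query in a labeled set gives one number you can compare before and after a retrieval change.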

This lesson builds both layers — the harness you'll use every time you change retrieval weights, swap embedding models, or rewrite a prompt. It composes Lesson 4's retrieve() and Lesson 5's groundedAnswer() into a measurable pipeline.

Not every eval needs a judge model. Mature AI teams usually separate:

Deterministic checks. Schema validity, refusal regexes, citation substring checks, exact-match or contains assertions, latency budgets.
Model-graded checks. Faithfulness, answer relevance, completeness, harmfulness, or nuanced factuality.
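The deterministic side needs no model call at all. The sketch below assumes, purely for illustration, that the model returns JSON with an answer string and a citations array of quoted spans. The output shape, the refusal pattern, and the 3000 ms default budget are all hypothetical choices, not something the lesson prescribes.

```typescript
type CheckResult = { name: string; passed: boolean };

// Hypothetical refusal phrases; a real list would be tuned to your model's wording.
const REFUSAL_PATTERN = /\b(?:i can't|i cannot|i'm unable to|i am unable to)\b/i;

// Schema validity: the output must parse as { answer: string, citations: string[] }.
function isValidAnswerShape(raw: string): boolean {
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return false;
    const obj = parsed as Record<string, unknown>;
    const citations = obj["citations"];
    return (
      typeof obj["answer"] === "string" &&
      Array.isArray(citations) &&
      citations.every((c) => typeof c === "string")
    );
  } catch {
    return false; // not valid JSON
  }
}

function runDeterministicChecks(
  rawOutput: string,
  chunks: string[],
  latencyMs: number,
  latencyBudgetMs = 3000 // assumed budget
): CheckResult[] {
  const results: CheckResult[] = [];
  const schemaOk = isValidAnswerShape(rawOutput);
  results.push({ name: "schema", passed: schemaOk });
  results.push({ name: "latency", passed: latencyMs <= latencyBudgetMs });
  if (!schemaOk) return results; // the remaining checks need parsed fields

  const { answer, citations } = JSON.parse(rawOutput) as {
    answer: string;
    citations: string[];
  };
  results.push({ name: "no-refusal", passed: !REFUSAL_PATTERN.test(answer) });

  // Citation substring check: every quoted span must appear verbatim in some chunk.
  const grounded = citations.every((quote) =>
    chunks.some((chunk) => chunk.includes(quote))
  );
  results.push({ name: "citations-grounded", passed: grounded });
  return results;
}

// Illustrative inputs.
const sampleChunks = ["Refunds are accepted within 30 days of purchase."];
const goodOutput = JSON.stringify({
  answer: "You can get a refund within 30 days.",
  citations: ["within 30 days of purchase"],
});
const goodResults = runDeterministicChecks(goodOutput, sampleChunks, 1200);
const badResults = runDeterministicChecks("I cannot help with that.", sampleChunks, 1200);
```

Checks like these are cheap enough to run on every sample, which leaves the judge model for the questions that actually need judgment.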

And they separate:

Outcome metrics. Did the final answer meet the bar?
Process metrics. Did retrieval return zero chunks, did the model refuse, how many claims were cited, how long did the request take?
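One way to keep the two kinds of metric apart is to log them as separate fields on each eval record and aggregate them independently. The record shape and field names below are assumptions for illustration; the sample values are made up.

```typescript
// Hypothetical per-sample record; field names are illustrative.
type EvalRecord = {
  // Outcome metric: did the final answer meet the bar?
  passed: boolean;
  // Process metrics: what happened along the way?
  retrievedChunkCount: number;
  refused: boolean;
  citedClaimCount: number;
  latencyMs: number;
};

type EvalSummary = {
  passRate: number;
  emptyRetrievalRate: number;
  refusalRate: number;
  meanCitedClaims: number;
  maxLatencyMs: number;
};

function summarize(records: EvalRecord[]): EvalSummary {
  const n = records.length;
  if (n === 0) throw new Error("cannot summarize an empty eval run");
  const rate = (predicate: (r: EvalRecord) => boolean): number =>
    records.filter(predicate).length / n;
  const totalCited = records.reduce((sum, r) => sum + r.citedClaimCount, 0);
  return {
    passRate: rate((r) => r.passed),
    emptyRetrievalRate: rate((r) => r.retrievedChunkCount === 0),
    refusalRate: rate((r) => r.refused),
    meanCitedClaims: totalCited / n,
    maxLatencyMs: Math.max(...records.map((r) => r.latencyMs)),
  };
}

// Made-up run of four samples.
const runRecords: EvalRecord[] = [
  { passed: true, retrievedChunkCount: 5, refused: false, citedClaimCount: 3, latencyMs: 900 },
  { passed: false, retrievedChunkCount: 0, refused: true, citedClaimCount: 0, latencyMs: 400 },
  { passed: true, retrievedChunkCount: 5, refused: false, citedClaimCount: 2, latencyMs: 1100 },
  { passed: false, retrievedChunkCount: 5, refused: false, citedClaimCount: 1, latencyMs: 1600 },
];
const summary = summarize(runRecords);
```

Keeping process metrics next to the outcome helps explain a drop in pass rate: in this made-up run, one of the two failures coincides with an empty retrieval and a refusal, which points at the retriever rather than the prompt.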

precision@k

You change the dense/lexical fusion weights in your hybrid retriever from 0.7/0.3 to 0.5/0.5. Three users report the bot feels "smarter" that afternoon. Should you merge the change?