AI Learning

Step 1 · concept

Why chunk at all?

Embedding models have a maximum input length, typically between 512 and 8192 tokens. Real documents such as PDFs, articles, and codebases are often far longer, and text beyond the limit is usually truncated or rejected. You cannot embed a 50-page document as a single vector and expect meaningful retrieval.
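To put rough numbers on that gap, here is a back-of-the-envelope sketch. The words-per-page and tokens-per-word figures are illustrative assumptions, not measurements; real counts depend on the document and on the model's tokenizer.

```python
import math

# Illustrative assumptions (not taken from any specific model or document):
WORDS_PER_PAGE = 500     # assumed typical page of prose
TOKENS_PER_WORD = 1.3    # assumed rough ratio for English text


def estimate_tokens(pages: int) -> int:
    """Estimate a document's token count from its page count."""
    return round(pages * WORDS_PER_PAGE * TOKENS_PER_WORD)


def windows_needed(total_tokens: int, max_input_tokens: int) -> int:
    """Minimum number of model inputs needed to cover every token once."""
    return math.ceil(total_tokens / max_input_tokens)


tokens = estimate_tokens(50)
for limit in (512, 8192):
    print(f"limit={limit}: ~{tokens} tokens -> {windows_needed(tokens, limit)} inputs")
```

Under these assumptions, a 50-page document does not fit in one input even at the upper end of the range, so some form of splitting is unavoidable.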

Even if you could, a single vector for an entire document would average out the meaning of every topic it covers. A query about "error handling in Python" would match poorly against a vector representing an entire Python textbook.
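The dilution effect can be seen in a toy setting. The sketch below uses a bag-of-words count vector as a stand-in for a real embedding model (an assumption made so the example runs with no dependencies), and the textbook sections are made up for illustration. Because cosine similarity ignores vector length, summing word counts over the whole document points in the same direction as averaging them.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sections of a Python textbook, one topic each.
sections = [
    "Variables store values and every value has a type",
    "Loops repeat a block of code for each item in a list",
    "Error handling in Python uses try and except to catch exceptions",
    "Classes bundle data and methods into reusable objects",
]

query = embed("error handling in Python")
whole_document = embed(" ".join(sections))   # one vector for everything
focused_section = embed(sections[2])         # one vector for the relevant topic

print(f"whole document:  {cosine(query, whole_document):.2f}")
print(f"focused section: {cosine(query, focused_section):.2f}")
```

The focused section scores higher than the whole-document vector, because the unrelated topics add weight to the document vector without adding any overlap with the query.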

Chunking solves this: break documents into smaller pieces, embed each piece, and retrieve only the relevant chunks.
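The three steps (split, embed, retrieve) can be sketched end to end as follows. This is a minimal illustration, not a production pipeline: the fixed-size word splitter is the simplest possible strategy, `embed` is a toy bag-of-words stand-in for a real embedding model, and the function names and sample document are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


# Sample document: three ten-word sentences, so size=10 yields one per chunk.
document = (
    "Lists hold ordered items and support indexing and slicing operations. "
    "Functions take arguments and return values to the calling code. "
    "Exceptions signal errors and a try block can handle them. "
)

chunks = chunk_words(document, size=10)
top = retrieve("handle errors with a try block", chunks, k=1)
print(top[0])
```

The sample sentences were written to line up with the chunk size. On real text, a fixed word count will cut across sentence and topic boundaries, which is one reason chunking strategy matters.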

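The quality of your chunks directly determines the quality of your retrieval. Bad chunking is one of the most common causes of poor RAG (retrieval-augmented generation) performance, and unlike a bad prompt, you can't fix it downstream: if the relevant text never reaches the model, neither a better prompt nor a better model will recover it.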


Why can't you just embed an entire 50-page document as a single vector, even with a model that has an 8192-token window?