
RAG chunking strategies that actually improve retrieval

Chunking is the least glamorous part of a RAG pipeline and one of the most consequential. How you split documents before embedding decides what the retriever can ever find. Get it wrong and the answer to a question is physically split across two chunks, so no retrieval strategy can return it whole.

The 2026 benchmarks also carry an uncomfortable message: the clever strategy is not always the winner. Here is how to choose chunking that holds up in production.

Why chunk size decides what is findable

Retrieval works on chunks, not documents. If a chunk is too large, its embedding averages several topics and matches everything weakly, so it ranks below tighter chunks even when it holds the answer. If a chunk is too small, the answer gets severed from the context that makes it meaningful, and the model receives a fragment it cannot use.

This is why chunking sits upstream of every other retrieval fix. As we noted in the piece on two-stage retrieval, a reranker can only reorder what was retrieved; it cannot reassemble an answer your chunking tore in half.

What the 2026 benchmarks actually found

The headline result surprised a lot of teams. On a 50-document real-world retrieval benchmark in early 2026, plain recursive splitting at 512 tokens scored about 69 percent accuracy, while semantic chunking scored about 54 percent. The supposedly smarter, embedding-aware method lost to the simple one on that corpus.

The benchmark-validated default that emerged is recursive character splitting at roughly 512 tokens with 10 to 20 percent overlap (about 50 to 100 tokens). It costs nothing extra to implement, handles mixed document types gracefully, and is the right starting point for most use cases.
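
As a rough illustration of that default, here is a minimal sketch of recursive splitting with overlap. It is not the benchmark's implementation: the names `recursive_split` and `chunk_with_overlap` are hypothetical, whitespace-separated words stand in for model tokens, and the separator list is an assumption. Swap in your embedding model's tokenizer before trusting the counts.

```python
from typing import List

# Coarse-to-fine separators: paragraphs, then lines, then words (an assumption).
SEPARATORS = ["\n\n", "\n", " "]


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def recursive_split(text: str, max_tokens: int, seps: List[str]) -> List[str]:
    """Split on the coarsest separator first, recursing only into oversized parts."""
    text = text.strip()
    if not text:
        return []
    if count_tokens(text) <= max_tokens:
        return [text]
    if not seps:
        # No separators left: hard-cut on word boundaries.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    pieces: List[str] = []
    for part in text.split(seps[0]):
        pieces.extend(recursive_split(part, max_tokens, seps[1:]))
    # Greedily merge adjacent small pieces back up to the size limit.
    merged: List[str] = []
    for piece in pieces:
        if merged and count_tokens(merged[-1]) + count_tokens(piece) <= max_tokens:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


def chunk_with_overlap(text: str, max_tokens: int = 512, overlap: int = 64) -> List[str]:
    """Return chunks of at most max_tokens, each starting with the previous chunk's tail."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    # Reserve room for the overlap so the final chunk stays within max_tokens.
    bodies = recursive_split(text, max_tokens - overlap, SEPARATORS)
    chunks: List[str] = []
    for i, body in enumerate(bodies):
        tail = bodies[i - 1].split()[-overlap:] if i > 0 and overlap > 0 else []
        chunks.append(" ".join(tail + body.split()))
    return chunks
```

The sketch joins merged pieces with a single space, so paragraph breaks inside a chunk are flattened. That is fine for embedding, but keep the original text alongside if you need to display it.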

Overlap is more nuanced than the advice suggests

Overlap exists to stop an answer from being cut at a chunk boundary, but the benefit is conditional. A January 2026 analysis found that with sparse retrieval, overlap provided no measurable benefit and only increased indexing cost, while dense vector search benefited more. The practical reading: use 10 to 20 percent overlap with dense retrieval, and do not assume overlap is free when you index large corpora.
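
To see why overlap is not free, it helps to count chunks. The sketch below assumes a plain sliding window, where each chunk starts `chunk_size - overlap` tokens after the previous one. The function name `chunk_count` and the 100,000-token document are hypothetical, and a recursive splitter will produce somewhat different counts.

```python
import math


def chunk_count(doc_tokens: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover doc_tokens tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if doc_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens contributed by each extra chunk
    return 1 + math.ceil((doc_tokens - chunk_size) / stride)


# Hypothetical 100,000-token document at 512 tokens per chunk.
baseline = chunk_count(100_000, 512, 0)
for overlap in (0, 51, 102):  # 0, about 10, and about 20 percent overlap
    n = chunk_count(100_000, 512, overlap)
    print(f"overlap={overlap:>3}  chunks={n}  vs no overlap: {n / baseline:.2f}x")
```

Under these assumptions, 10 and 20 percent overlap mean roughly a tenth and a quarter more chunks to embed and store, which is the cost to weigh against any retrieval gain.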

Semantic chunking has a real cost

Semantic chunking, which splits on meaning shifts rather than token counts, can help on some document types, but it ran roughly 14 times slower than token-based chunking in the same benchmarks (about 0.33 MB per second versus 4.82 MB per second). On a large or frequently re-indexed corpus that is a real operational bill, not a rounding error. It earns its place only when the accuracy gain on your documents justifies the indexing cost.
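
A back-of-the-envelope calculation makes the bill concrete. The two throughput figures come from the benchmark numbers above; the 10 GB corpus size and the function name `chunking_hours` are assumptions for illustration, and the estimate covers the chunking pass only, not embedding or index writes.

```python
def chunking_hours(corpus_mb: float, mb_per_second: float) -> float:
    """Hours to chunk a corpus at a given throughput (chunking pass only)."""
    if mb_per_second <= 0:
        raise ValueError("throughput must be positive")
    return corpus_mb / mb_per_second / 3600


CORPUS_MB = 10 * 1024   # hypothetical 10 GB corpus
SEMANTIC_MBPS = 0.33    # semantic chunking throughput quoted above
TOKEN_MBPS = 4.82       # token-based chunking throughput quoted above

semantic = chunking_hours(CORPUS_MB, SEMANTIC_MBPS)
token_based = chunking_hours(CORPUS_MB, TOKEN_MBPS)
print(f"semantic:    {semantic:.1f} h")
print(f"token-based: {token_based:.1f} h")
print(f"slowdown:    {semantic / token_based:.1f}x")
```

At these rates the hypothetical corpus takes a bit over eight and a half hours to chunk semantically versus a little over half an hour with token-based splitting, and that gap repeats on every re-index.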

Match the strategy to your documents

The durable insight from 2026's research is that matching your strategy to your document type matters more than picking the "smartest" approach. A few rules of thumb we apply when we ship:

Prose and articles split cleanly with recursive 512-token chunks; the default works well.

Structured documents with tables, code, or specs do better when you chunk along their natural structure (rows, functions, sections) so a unit of meaning stays intact.

Short FAQ-style content is often best chunked one question-answer pair per chunk, which keeps each chunk self-contained.
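
For the FAQ case, structure-aware chunking can be very little code. The sketch below assumes a specific layout that your content may not share: paragraphs separated by blank lines, with each question being a paragraph that ends in a question mark. The function name `chunk_faq` is hypothetical.

```python
import re
from typing import List


def chunk_faq(text: str) -> List[str]:
    """Return one chunk per question-answer pair.

    Assumes blank-line-separated paragraphs where a paragraph ending in "?"
    starts a new pair. Text before the first question is dropped.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for paragraph in paragraphs:
        if paragraph.endswith("?"):
            # A new question closes the previous pair.
            if current:
                chunks.append("\n".join(current))
            current = [paragraph]
        elif current:
            # Answer paragraphs attach to the most recent question.
            current.append(paragraph)
    if current:
        chunks.append("\n".join(current))
    return chunks


faq = """Frequently asked questions

What chunk size should I start with?

About 512 tokens with 10 to 20 percent overlap.

Can a reranker fix bad chunking?

No. It only reorders what was retrieved.
"""

for chunk in chunk_faq(faq):
    print(repr(chunk))
```

Each chunk carries its own question, so the embedding reflects both what is being asked and the answer, and nothing depends on a neighboring chunk.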

Whatever you pick, treat embeddings and chunking as a paired decision. The way text is split interacts with how it embeds; our explainer on how embeddings work covers why a chunk that mixes topics produces a muddy vector.

Measure chunking, do not guess it

Chunking choices are cheap to change and expensive to get wrong silently, which makes them perfect candidates for a small eval set. Take 30 to 50 real questions with known answers, index the corpus under two chunking configurations, and compare recall@k. The faster, simpler configuration often wins outright, and when it does not, you now have a number that justifies the slower one. This is the same evaluation discipline we apply across RAG work; see common RAG evaluation mistakes for the traps to avoid, and the production RAG architecture guide for where chunking fits in the full pipeline.
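
A minimal sketch of that comparison follows. It makes several simplifying assumptions: a question counts as a hit when any of the top k chunks contains the known answer string, each question has exactly one answer, and a toy keyword-overlap retriever stands in for your real one. The names `recall_at_k` and `keyword_retriever` and the sample data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Retriever = Callable[[str, int], List[str]]


def recall_at_k(eval_set: Sequence[Tuple[str, str]], retrieve: Retriever, k: int = 5) -> float:
    """Fraction of questions whose known answer appears in the top k chunks."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = 0
    for question, answer in eval_set:
        top_chunks = retrieve(question, k)
        if any(answer.lower() in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(eval_set)


def keyword_retriever(chunks: List[str]) -> Retriever:
    """Toy stand-in for a real retriever: rank chunks by shared lowercase words."""
    def retrieve(question: str, k: int) -> List[str]:
        query_words = set(question.lower().split())
        ranked = sorted(
            chunks,
            key=lambda c: len(query_words & set(c.lower().split())),
            reverse=True,
        )
        return ranked[:k]
    return retrieve


# Two chunking configurations of the same tiny, made-up corpus.
config_a = ["The refund window is 30 days from delivery.",
            "Shipping takes 5 business days."]
config_b = ["The refund window is",
            "30 days from delivery.",
            "Shipping takes 5 business days."]

eval_set = [("How long is the refund window?", "refund window is 30 days")]

for name, chunks in (("config_a", config_a), ("config_b", config_b)):
    score = recall_at_k(eval_set, keyword_retriever(chunks), k=1)
    print(f"{name}: recall@1 = {score:.2f}")
```

In this toy case the second configuration severs the answer at a chunk boundary, so it scores zero even though the retriever ranks the most relevant fragment first. With a real eval set, keep k and the retriever fixed and vary only the chunking so the difference in recall is attributable to it.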

Frequently asked questions

What chunk size should I start with?

Recursive character splitting at about 512 tokens with 10 to 20 percent overlap is the benchmark-validated default for most document types. Start there and only deviate when an eval on your own documents shows a clear gain.

Is semantic chunking worth it?

Sometimes, but not by default. In 2026 benchmarks it underperformed simple recursive splitting on a real corpus and ran roughly 14 times slower to index. Use it only when it beats the simpler default on your documents by enough to justify the indexing cost.

How much overlap should chunks have?

Use 10 to 20 percent overlap with dense retrieval to avoid severing answers at boundaries. With sparse keyword retrieval, the January 2026 analysis found no measurable benefit and a higher indexing cost, so test it rather than assuming.

Can a reranker fix bad chunking?

No. A reranker reorders retrieved candidates; it cannot reassemble an answer that chunking split across two chunks or surface a chunk that was too diluted to retrieve. Fix chunking before you tune downstream stages.

Chunking is the upstream decision that bounds everything your retriever can do. Get it right with a simple default, measure on your own documents, and only pay for complexity that earns its keep. If you want a retrieval pipeline tuned end to end, see how we ship production AI features for US SaaS teams.
