
How to cut vector database cost with quantization and truncation

Vector search gets expensive quietly. You ship a RAG feature on a few hundred thousand chunks, everything is fast, and then the corpus grows to tens of millions of vectors and your index no longer fits in RAM. Now you are paying for bigger memory-optimized nodes, and the bill scales with the corpus, not with traffic. The good news: most teams are storing their embeddings at 4 to 32 times the size they actually need.

Two techniques do the heavy lifting: quantization (store each dimension in fewer bits) and Matryoshka truncation (store fewer dimensions). Used together they shrink a vector store by an order of magnitude or more, and with a rescoring step you keep recall close to the full-precision baseline. Here is how the math works and how to apply it without wrecking retrieval quality.

  • A float32 embedding uses 4 bytes per dimension. Scalar int8 quantization cuts that to 1 byte (4x smaller); binary quantization cuts it to 1 bit (32x smaller).
  • Matryoshka-trained models let you truncate dimensions (for example 1024 down to 256) with little recall loss, stacking on top of quantization.
  • Combined, a 1024-dim float32 vector (about 4 KB) can drop to a 128-dim binary vector (about 16 bytes) - roughly a 256x reduction in memory.
  • Recover the recall you lose by over-fetching candidates and rescoring the shortlist with full-precision vectors or a reranker.

Where the cost actually lives

The dominant cost of a production vector store is memory, because approximate nearest neighbor indexes like HNSW are held in RAM for low-latency search. Memory footprint is roughly the number of vectors times the bytes per vector, plus index overhead.

Do the arithmetic on a realistic corpus. Take 100 million chunks embedded at 768 dimensions in float32. Each vector is 768 times 4 bytes, about 3 KB, so the raw vectors alone are near 300 GB before index overhead. That does not fit on a commodity node, so you shard across expensive memory-optimized instances. Convert those same vectors to binary and each is 768 bits, about 96 bytes, so the raw set drops to under 10 GB. The workload did not change; the storage format did. Choosing the right database matters too, and we compared the main options in our vector database comparison of pgvector, Pinecone, and Qdrant, but format is the lever that moves the bill the most.
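The arithmetic above can be reproduced with a few lines of plain Python. This is a back-of-the-envelope sketch: the function name `raw_vector_bytes` is made up for illustration, and it counts raw vector storage only, not HNSW graph or other index overhead.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Bytes needed for the raw vectors only; index overhead is not included."""
    bits_per_vector = dims * bits_per_dim
    bytes_per_vector = (bits_per_vector + 7) // 8  # round up to whole bytes
    return num_vectors * bytes_per_vector

NUM_VECTORS = 100_000_000
DIMS = 768

for name, bits in [("float32", 32), ("int8", 8), ("binary", 1)]:
    total = raw_vector_bytes(NUM_VECTORS, DIMS, bits)
    print(f"{name}: {total / 1e9:.1f} GB")
```

For the 100-million-vector example this gives 307.2 GB for float32, 76.8 GB for int8, and 9.6 GB for binary.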

Quantization: fewer bits per dimension

Scalar (int8) quantization

Scalar quantization maps each float32 value to an 8-bit integer using a per-dimension min and max range. That is a clean 4x reduction with modest recall loss, usually a point or two of recall@10 on general corpora. It is the safe default: low risk, meaningful savings, and most vector databases support it natively. If you only make one change, make this one.
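To make the per-dimension min/max mapping concrete, here is a minimal sketch in dependency-free Python. It is not how any particular database implements it: the helper names (`fit_ranges`, `quantize`, `dequantize`) are hypothetical, and it stores unsigned byte codes in 0..255 for simplicity, where a real engine may use signed int8 or percentile-clipped ranges instead.

```python
from typing import List, Tuple

Vector = List[float]
Ranges = List[Tuple[float, float]]

def fit_ranges(vectors: List[Vector]) -> Ranges:
    """Per-dimension (min, max), computed from a sample of the corpus."""
    return [(min(col), max(col)) for col in zip(*vectors)]

def quantize(vec: Vector, ranges: Ranges) -> bytes:
    """Map each float to an integer code in 0..255 using its dimension's range."""
    codes = []
    for value, (lo, hi) in zip(vec, ranges):
        span = (hi - lo) or 1.0            # avoid dividing by zero on constant dims
        clipped = min(max(value, lo), hi)  # values outside the fitted range saturate
        codes.append(round((clipped - lo) / span * 255))
    return bytes(codes)                    # one byte per dimension

def dequantize(codes: bytes, ranges: Ranges) -> Vector:
    """Approximate reconstruction of the original floats."""
    return [lo + c / 255 * (hi - lo) for c, (lo, hi) in zip(codes, ranges)]

sample = [[0.10, -0.50], [0.30, 0.25], [-0.20, 0.50]]
ranges = fit_ranges(sample)
codes = quantize(sample[1], ranges)
print(list(codes))  # [255, 191]
```

The reconstruction error per dimension is bounded by half a quantization step, which is why the recall loss tends to be small when the fitted ranges match the data.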

Binary quantization

Binary quantization is more aggressive: each dimension collapses to a single bit based on its sign. That is a 32x reduction, and distance is computed with Hamming distance, which is extremely fast. The catch is accuracy. Binary vectors lose more information than int8, so raw binary recall can fall noticeably on hard queries. Reserve it for cases where memory is the binding constraint, and always pair it with rescoring. It works best on high-dimensional embeddings (1024 dimensions and up), where there is enough signal left after the sign collapse.
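The sign collapse and the Hamming comparison fit in a few lines. The sketch below assumes a threshold of zero (a bit is set when the value is positive) and uses a Python integer as the bitset; the function names are illustrative only, and a production system would use packed byte arrays and hardware popcount.

```python
from typing import Sequence

def binarize(vec: Sequence[float]) -> int:
    """Pack sign bits into a Python int: bit i is 1 when vec[i] > 0."""
    bits = 0
    for i, value in enumerate(vec):
        if value > 0:
            bits |= 1 << i
    return bits

def hamming(a: int, b: int) -> int:
    """Number of dimensions whose sign differs (XOR, then count set bits)."""
    return bin(a ^ b).count("1")

query = binarize([0.4, -0.1, 0.7, -0.3])
doc_a = binarize([0.2, -0.5, 0.1, -0.9])   # same signs as the query
doc_b = binarize([-0.2, 0.5, 0.1, -0.9])   # first two signs flipped

print(hamming(query, doc_a))  # 0
print(hamming(query, doc_b))  # 2
```

Note that `doc_a` scores as a perfect match even though its magnitudes differ from the query's. That lost magnitude information is exactly what a rescoring stage puts back.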

Matryoshka truncation: fewer dimensions

Matryoshka Representation Learning (MRL) trains a model so that the most important semantic information is packed into the earliest dimensions of the vector. That means you can slice a 1024-dim embedding down to 512 or 256 dimensions by simply taking the front of the vector, and retrieval quality degrades gracefully instead of falling off a cliff. Not every model supports this; you need an embedding model explicitly trained with MRL. Many recent general-purpose embedding models are, but not all, so check the model's documentation before you truncate.

Truncation stacks with quantization because they attack different axes. Truncation reduces the dimension count; quantization reduces the bytes per dimension. A 1024-dim float32 vector at 4 KB, truncated to 128 dimensions and binarized, becomes 16 bytes. That is the 256x figure you see quoted, and it is why a corpus that once needed a fleet of memory nodes can fit on a single machine.
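Here is a small sketch of the stacked pipeline: truncate, then binarize and pack. It assumes the embedding comes from an MRL-trained model, so the leading dimensions are the meaningful ones; the input below is a synthetic stand-in, not a real embedding. Re-normalizing after truncation matters if you score the truncated floats with a dot product. It does not change the sign bits.

```python
import math
from typing import List, Sequence

def truncate(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` values and rescale the result to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

def to_packed_bits(vec: Sequence[float]) -> bytes:
    """Sign-binarize and pack 8 dimensions per byte."""
    out = bytearray((len(vec) + 7) // 8)
    for i, value in enumerate(vec):
        if value > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

full = [math.sin(i + 1) for i in range(1024)]  # stand-in for a real embedding
small = to_packed_bits(truncate(full, 128))
print(len(full) * 4, "->", len(small), "bytes")  # 4096 -> 16 bytes
```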

Keep recall with a rescoring step

The move that makes aggressive compression safe is two-stage retrieval. Search the compressed index to pull a larger candidate set than you need - say the top 200 instead of the top 10 - then rescore that shortlist with full-precision vectors kept on disk, or with a cross-encoder reranker. The cheap index does the coarse filtering across millions of vectors; the expensive scorer only touches a couple hundred. You get most of the compression savings and most of the full-precision recall.
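The two stages can be sketched as follows. This is a brute-force illustration, not a production design: in practice stage one would be the database's ANN index over the compressed vectors, and `search`, `oversample`, and the helper names are assumptions made for the example.

```python
from typing import List, Sequence

def sign_bits(vec: Sequence[float]) -> int:
    """Binary code for a vector: bit i is set when vec[i] > 0."""
    return sum(1 << i for i, value in enumerate(vec) if value > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def search(query: Sequence[float],
           full_vectors: Sequence[Sequence[float]],
           codes: Sequence[int],
           k: int = 10,
           oversample: int = 20) -> List[int]:
    """Return ids of the top-k vectors using coarse search plus rescoring."""
    q_code = sign_bits(query)
    # Stage 1: cheap coarse ranking on binary codes, over-fetching k * oversample.
    n_candidates = min(len(codes), k * oversample)
    by_hamming = sorted(range(len(codes)), key=lambda i: hamming(q_code, codes[i]))
    shortlist = by_hamming[:n_candidates]
    # Stage 2: rescore only the shortlist with the full-precision vectors.
    shortlist.sort(key=lambda i: dot(query, full_vectors[i]), reverse=True)
    return shortlist[:k]

vectors = [[1.0, 0.1], [0.1, 1.0], [-1.0, -1.0], [0.9, 0.9]]
codes = [sign_bits(v) for v in vectors]
print(search([1.0, 0.2], vectors, codes, k=1, oversample=1))  # [0]
print(search([1.0, 0.2], vectors, codes, k=1, oversample=3))  # [3]
```

In the toy data, three vectors share the same binary code as the query. Without over-fetching, the result is whichever one comes first in the tie; with a shortlist of three, the full-precision scorer picks the best match.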

This is the same shape as the retrieval pipeline in our note on two-stage retrieval with reranking, and it composes with lexical signals as covered in hybrid search with BM25 and embeddings. Compression is a retrieval-architecture decision, not a knob you flip in isolation, so validate it against your own eval set rather than trusting a benchmark from a different corpus. If you are still assembling that pipeline, start from our production RAG architecture guide.

A practical rollout order

Do not jump straight to binary. Move in steps and measure recall@k at each one against a fixed eval set. Start with int8 scalar quantization for a low-risk 4x win. If memory is still the constraint, add MRL truncation and find the smallest dimension count that holds your recall target. Only reach for binary quantization when the corpus is very large and you have a rescoring stage in place to catch the accuracy you give up. At every step, the question is not how small can the vectors get but how small can they get while your eval numbers hold.
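Measuring each step takes very little code. The sketch below assumes you have, for every eval query, a list of relevant ids (for example the full-precision top-k) and the ids each compressed configuration returned. The data, the configuration names, and the 0.95 target are all placeholders.

```python
from typing import Dict, List, Sequence

def recall_at_k(retrieved: Sequence[int], relevant: Sequence[int], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall(results: Dict[str, List[int]],
                truth: Dict[str, List[int]],
                k: int) -> float:
    """Average recall@k over every query in the eval set."""
    return sum(recall_at_k(results[q], truth[q], k) for q in truth) / len(truth)

# Placeholder eval data: ground truth plus results from two configurations.
truth = {"q1": [1, 2, 3], "q2": [4, 5]}
runs = {
    "int8": {"q1": [1, 2, 3, 9], "q2": [4, 5, 8]},
    "binary, no rescoring": {"q1": [1, 9, 2, 7], "q2": [8, 4, 6]},
}

TARGET = 0.95  # hypothetical recall target
for name, results in runs.items():
    score = mean_recall(results, truth, k=3)
    print(f"{name}: recall@3 = {score:.2f}, passes = {score >= TARGET}")
```

Keeping the eval set and `k` fixed across steps is what makes the numbers comparable from one configuration to the next.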

FAQ

How much recall do I actually lose?

It depends on the corpus, the model, and whether you rescore. Int8 quantization typically costs a point or two of recall@10. Binary quantization on its own can cost much more, but with over-fetching and full-precision rescoring the end-to-end recall usually lands within a point or two of the uncompressed baseline. Always measure on your own data; published numbers come from different corpora.

Does quantization make search slower?

Usually the opposite. Smaller vectors mean less memory bandwidth per comparison, and binary vectors use Hamming distance, which is faster than float dot products. The added rescoring stage touches only a small shortlist, so its cost is bounded. Net latency generally improves rather than regresses.

Which should I do first, quantization or truncation?

Scalar quantization first: it is the lowest-risk change, with a predictable 4x saving on raw vector storage and broad database support. Add Matryoshka truncation next if you need more, and treat binary quantization as the last and most aggressive step, only with rescoring in place.

Do I need to re-embed my whole corpus?

For quantization, no - it is applied to existing float32 vectors at index time. For Matryoshka truncation you need an MRL-trained embedding model. If your current model was trained that way, you can truncate the vectors you already have. If it was not, switching models does mean re-embedding, so factor that one-time cost into the decision.
