Qdrant Reduced RAM Without Memory Loss, but With a Gotcha

For a resource-starved project, I wanted to index some 600k PDF reports of variable length, from 10 to 200 pages, using a basic chunking technique to get started. Estimated vector usage with the good old all-MiniLM-L6-v2 was 90GB. Remember: Resource-starved project. Waaaaay too much.

With a basic function that skipped probable boilerplate, plus a few heuristics for deciding whether a chunk was worth keeping, I got the chunk count down to about a quarter. Still too much to keep in memory and keep it snappy.

Quantization helped, but with a small catch I hadn’t considered: Qdrant still keeps the original embeddings on disk. It holds on to them so that quantized candidates can be rescored against the full-precision vectors, trading a bit of speed for better recall. That meant I still had to check disk usage. Fortunately, the project was less resource-starved on disk than on memory.
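For a feel of the numbers, here is a back-of-envelope sketch. It assumes 384-dimensional float32 embeddings (the output size of all-MiniLM-L6-v2), one byte per dimension after int8 quantization, and that cutting the chunk count to a quarter cuts the vector footprint by the same factor. It ignores the HNSW graph, payloads and any per-vector overhead, so treat the results as rough lower bounds rather than measurements.

```python
# Back-of-envelope sizing under the assumptions stated above.
DIM = 384            # all-MiniLM-L6-v2 embedding size
FLOAT32_BYTES = 4    # bytes per dimension for the original vectors
INT8_BYTES = 1       # bytes per dimension after scalar quantization


def quantized_size_gb(float32_size_gb: float) -> float:
    """Scale a float32 vector footprint down to its int8 equivalent."""
    return float32_size_gb * INT8_BYTES / FLOAT32_BYTES


original_estimate_gb = 90.0                    # estimate quoted in the post
after_filtering_gb = original_estimate_gb / 4  # "about a quarter" of the chunks
in_ram_gb = quantized_size_gb(after_filtering_gb)

print(f"bytes per float32 vector: {DIM * FLOAT32_BYTES}")
print(f"float32 originals (disk): ~{after_filtering_gb:.1f} GB")
print(f"int8 copies (RAM):        ~{in_ram_gb:.1f} GB")
```

The point is only that both copies exist: the small one has to fit in RAM, the big one has to fit on disk.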

The config I ended up with was: always_ram=true for the int8 quantized vectors and on_disk=true for the float32 vectors.
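As a minimal sketch, that config could look like this as a request body for Qdrant's REST endpoint `PUT /collections/{name}`. The collection name `reports`, the localhost URL and the cosine distance are assumptions for illustration; the post does not say which were used.

```python
import json
import urllib.request


def collection_config(dim: int = 384) -> dict:
    """Build the request body for PUT /collections/{name}."""
    return {
        "vectors": {
            "size": dim,
            "distance": "Cosine",  # assumption: the metric is not named in the post
            "on_disk": True,       # float32 originals are read from disk
        },
        "quantization_config": {
            "scalar": {
                "type": "int8",
                "always_ram": True,  # quantized copies stay in memory
            }
        },
    }


def create_collection(base_url: str, name: str, body: dict) -> dict:
    """Send the config to a running Qdrant instance (not called here)."""
    req = urllib.request.Request(
        f"{base_url}/collections/{name}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


print(json.dumps(collection_config(), indent=2))
# Hypothetical usage against a local instance:
# create_collection("http://localhost:6333", "reports", collection_config())
```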

Tradeoffs

I tested whether I could just keep the original embeddings on disk and skip quantized search. You can do that per query by setting quantization.ignore=true in the search params.
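A rough sketch of what that looks like as a request body for `POST /collections/{name}/points/search`. The helper name `search_body` and the default limit of 10 are made up for illustration.

```python
def search_body(vector: list, ignore_quantization: bool, limit: int = 10) -> dict:
    """Build a search request that either uses or bypasses the quantized vectors."""
    return {
        "vector": vector,
        "limit": limit,
        "params": {
            # ignore=True makes Qdrant search the original vectors instead
            "quantization": {"ignore": ignore_quantization},
        },
    }


query = [0.0] * 384  # placeholder embedding, not a real query
quantized_request = search_body(query, ignore_quantization=False)
originals_request = search_body(query, ignore_quantization=True)
print(originals_request["params"])
```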

  1. Create a collection with vectors.on_disk: true and quantization enabled.
  2. Run the same query set with quantization.ignore: false.
  3. Run the same query set with quantization.ignore: true.
  4. Compare latency and, if you care, recall/top-k overlap.
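The steps above can be sketched as a small harness. This is not the script behind the numbers in this post, just one way to structure such a test. The `search` callable is hypothetical: it takes a query vector and an `ignore` flag and returns the ids of the hits, and `fake_search` stands in for a real Qdrant call so the example runs without a server.

```python
import statistics
import time


def median_latency_ms(search, queries, ignore: bool, warmup: int) -> float:
    """Run `warmup` untimed queries, then return the median latency of the rest."""
    for q in queries[:warmup]:
        search(q, ignore)
    timings = []
    for q in queries[warmup:]:
        start = time.perf_counter()
        search(q, ignore)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


def overlap_at_k(quantized_ids, original_ids) -> float:
    """Fraction of the original top-k that the quantized search also returned."""
    if not original_ids:
        return 0.0
    return len(set(quantized_ids) & set(original_ids)) / len(original_ids)


def mean_overlap(search, queries) -> float:
    """Average top-k overlap between quantized and original search."""
    return statistics.mean(
        overlap_at_k(search(q, False), search(q, True)) for q in queries
    )


def fake_search(vector, ignore):
    # Stand-in for a real search call; returns point ids only.
    exact = [1, 2, 3, 4, 5]
    return exact if ignore else [1, 2, 3, 4, 9]


queries = [[0.0] * 384 for _ in range(10)]
print("quantized ms:", median_latency_ms(fake_search, queries, False, warmup=5))
print("originals ms:", median_latency_ms(fake_search, queries, True, warmup=5))
print("overlap:", mean_overlap(fake_search, queries))
```

One caveat with any harness like this: a "cold" number is only cold if the originals are not already cached from an earlier run, so the order of runs matters.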

With 5 warmup queries, quantized search took 9.5ms and originals took 1776ms, a 187× gap. With 20 warmup queries, the originals dropped to 4.2ms versus 3.2ms for quantized, but only because the cache was doing the work: by then the original float32 vectors had already been pulled into RAM.

| Mode | Cold (5 warmup) | Warm (20 warmup) |
|---|---|---|
| Quantized | 9.5ms | 3.2ms |
| Originals | 1776ms | 4.2ms |
| Speedup | 187× | 1.3× |

The warm result is correct, but a bit misleading. If the same few topics get queried repeatedly, originals on disk can look fast. For a larger production workload spread across unseen parts of vector space, the cold case is the one worth paying attention to. In my test, the overlap between quantized and original results was 93%, which made the tradeoff easy for now: keep the quantized vectors hot in RAM and leave the originals on disk unless they are needed.

| Mode | RAM usage | Speed | Notes |
|---|---|---|---|
| originals in RAM | highest | fastest | best raw performance |
| originals on disk + quantized in RAM | medium | fast | balanced |
| originals on disk and searched directly | lower | slower | saves memory, pays in latency |
| everything on disk | lowest | slowest | smallest RAM footprint |

Qdrant version: v1.13.4