Be aware of tail latency in your search cluster (daily search tip)


Don't push complex ranking into the search engine. Layering operation on top of plugin on top of who-knows-what-else harms the user experience.

Why? Tail latency

In other words, in a distributed system, your query is as fast as your slowest node.

A rare event for a single node becomes frequent on the full cluster.

Consider a single node benchmark: p50 of 50 ms, p99 200 ms. Seems reasonable.

With 100 nodes, the odds that at least one node exceeds its p99 on a given request are 1 − 0.99^100 ≈ 63%. The cluster must wait for that slow node to complete the request, so users experience something close to the per-node p99 (200 ms) on most requests.
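That probability is easy to check yourself. A small sketch (the independence assumption and the helper name are mine, not from any search engine):

```python
def p_any_node_slow(n_nodes: int, per_node_quantile: float = 0.99) -> float:
    """Probability at least one node exceeds its per-node latency
    quantile, assuming node latencies are independent."""
    return 1 - per_node_quantile ** n_nodes

print(round(p_any_node_slow(100), 3))  # ~0.634
```

So roughly two out of three requests to a 100-node cluster include at least one node having a "rare" slow moment.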

We make this worse when we add complexity. The system needs to page in more memory, context-switch threads, and occasionally take a winding path through IO. Node execution becomes unpredictable and bursty. Now, perhaps, per-node p99 jumps to 1000 ms.

The tail stretches.

Since a node's p99 becomes roughly a 100-node cluster's p50, from the user's perspective:

  • Cluster p50 is 1000 ms
  • Cluster p99 is far out on the tail

So the larger the cluster, the simpler you should keep first-pass retrieval.
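You can see the amplification in a quick Monte Carlo sketch. The latency distribution below is a toy I made up (99% fast responses, 1% slow-path responses near 1000 ms), not a real benchmark:

```python
import random

random.seed(42)

def node_latency_ms() -> float:
    # Assumed toy distribution: 99% of node responses are fast,
    # 1% hit a slow path near 1000 ms.
    if random.random() < 0.99:
        return random.uniform(30, 80)    # fast path
    return random.uniform(800, 1200)     # tail event

def cluster_latency_ms(n_nodes: int = 100) -> float:
    # A fan-out query waits for its slowest shard.
    return max(node_latency_ms() for _ in range(n_nodes))

samples = sorted(cluster_latency_ms() for _ in range(5000))
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"cluster p50 ~ {p50:.0f} ms, p99 ~ {p99:.0f} ms")
```

Even though each node is fast 99% of the time, the cluster's median request lands in the slow bucket — the per-node tail has become the cluster's typical experience.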

-Doug

PS today at 12:30 PM prices increase for Cheat at Search with Agents: (http://maven.com/softwaredoug/cheat-at-search)

Events · Consulting · Training (use code search-tips)

You're subscribed to Doug Turnbull's daily search tips where I share tips, blog articles, events, and more. You can always manage your profile:

Doug Turnbull

I share search tips, blog articles, and free events I'm hosting about the search+retrieval industry, vector databases, information retrieval and more.
