Consider pairwise evals instead of pointwise (daily search tip)


If pointwise evals ask “How relevant is this result, from 1-5?”, pairwise evals ask “Which of these two results is more relevant - X or Y?”

Comparing two items at a time has some advantages:

  • Less chance of per-decision error - it's harder to get “one is better than the other” wrong than to pick the right grade
  • More precise results - comparisons capture fine-grained differences that can’t be squeezed into a 1-5 scale
  • Faster decisions - a single comparison can often be made more quickly than a rating

However, two major downsides remain:

  • Pairwise evals take more time overall - instead of rating 10 items 1-5, you need to compare each of the 10 items against the 9 others (45 unique pairs) to get a complete picture
  • Pairwise evals need to be transformed into pointwise - to use traditional search metrics or ranking data, we need a single score per document
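
To make the cost concrete, here is a minimal sketch that enumerates every unique pair for a 10-result list. The `doc_N` ids and the `all_pairs` helper are made up for illustration; the point is only that the number of comparisons grows as n(n-1)/2.

```python
from itertools import combinations


def all_pairs(doc_ids):
    """Return every unordered pair of documents (each pair appears once)."""
    return list(combinations(doc_ids, 2))


# Hypothetical document ids for a single query's top 10 results
docs = [f"doc_{i}" for i in range(10)]
pairs = all_pairs(docs)

print(len(docs))   # 10 pointwise ratings
print(len(pairs))  # 45 pairwise comparisons
```

In practice you may not need every pair. Judging a sample of pairs is a common shortcut, at the cost of a less complete picture.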

Luckily, both downsides can be mitigated:

  • LLMs can do a lot of the simpler evals / comparisons - such as my approach to LLM as a judge
  • A system like Elo - used in competitions like chess - can be used to turn 1 vs 1 competitions (like pairwise comparisons) into a pointwise rating
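
As a rough sketch of the Elo idea, the snippet below replays a list of pairwise judgments and keeps one rating per document. The starting rating of 1500, the K-factor of 32, the (winner, loser) input format, and the document ids are all assumptions chosen for illustration, not settings from this tip.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(comparisons, k=32.0, initial=1500.0):
    """Turn (winner, loser) pairwise judgments into one rating per document.

    k and initial are illustrative defaults, not tuned values.
    """
    ratings = defaultdict(lambda: initial)
    for winner, loser in comparisons:
        exp_win = expected_score(ratings[winner], ratings[loser])
        # The winner gains what the loser gives up, so the total stays constant
        delta = k * (1.0 - exp_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical judgments for one query: (more relevant, less relevant)
comparisons = [
    ("doc_a", "doc_b"),
    ("doc_a", "doc_c"),
    ("doc_b", "doc_c"),
]

ratings = elo_ratings(comparisons)
ranked = sorted(ratings, key=ratings.get, reverse=True)
print(ranked)
```

The resulting ratings are only meaningful relative to each other within a query. They also depend on the order in which comparisons are replayed, so to use them as pointwise labels you would likely want to shuffle and average over several passes, then bucket or rescale the ratings into grades.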

-Doug

Events · Consulting · Training (use code search-tips)


Doug Turnbull

I share search tips, blog articles, and free events I'm hosting about the search and retrieval industry, vector databases, information retrieval, and more.
