How pointwise evals fall apart (daily search tip)


A judgment list grades how relevant a document is for a query. So you get a label, say 1-5, for how relevant the movie First Blood is for the query Rambo.
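For concreteness, a pointwise judgment list is just (query, document, grade) rows, each labeled independently of the others. A minimal sketch (the movies and grades here are illustrative, not from any real dataset):

```python
from collections import defaultdict

# A pointwise judgment list: each row grades one document for one
# query, with no reference to any other row.
judgments = [
    {"query": "rambo", "doc": "First Blood", "grade": 5},
    {"query": "rambo", "doc": "Rambo III",   "grade": 5},
    {"query": "rambo", "doc": "Rocky",       "grade": 1},
]

# Group grades by query -- this is what an evaluator actually consumes.
by_query = defaultdict(dict)
for row in judgments:
    by_query[row["query"]][row["doc"]] = row["grade"]

print(by_query["rambo"])
# {'First Blood': 5, 'Rambo III': 5, 'Rocky': 1}
```

Nothing in this structure ties the grade for Rambo III to the grade for First Blood, which is exactly where the trouble below comes from.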

Here’s what happens though in practice:

  • First, a rater sees Rambo III and gives it a rating of 5/5
  • Next they see First Blood, the original Rambo movie, and also rate it 5/5
  • That rater might reflect: wait, should I go back and adjust my original rating for the sequel?

Even with careful coaching, raters often apply inconsistent rating criteria. Some raters, especially those less savvy in the domain, skew optimistic: looks like a Rambo movie, 5/5. Other raters, especially the very savvy, skew pessimistic, nit-picking far beyond what users think matters: “this specific BluRay isn’t the BEST edition of First Blood, it should get a 1/5.”
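That rater skew isn't just noise in the labels: it flows straight into the metric. A quick sketch, using hypothetical grades from an optimistic and a pessimistic rater for the same three ranked results, and plain DCG as the evaluation metric:

```python
import math

# Hypothetical grades for the same ranked results (illustrative numbers,
# not from a real study).
optimist  = [5, 5, 4]   # "looks like a Rambo movie, 5/5"
pessimist = [5, 2, 1]   # nitpicks sequels and BluRay editions

def dcg(grades):
    """Discounted cumulative gain: each grade discounted by
    log2(position + 1), positions starting at 1."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

print(round(dcg(optimist), 2))   # 10.15
print(round(dcg(pessimist), 2))  # 6.76
```

Same ranking, same documents, wildly different score -- the metric is measuring the rater as much as the search engine.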

You can mitigate this with coaching, feedback, and careful calibration between raters. But it’s not easy.

-Doug

Events · Consulting · Training (use code search-tips)

You're subscribed to Doug Turnbull's daily search tips where I share tips, blog articles, events, and more. You can always manage your profile:
