Measure your rater reliability (daily search tip)


In the previous tip, we discussed how pointwise 1-5 relevance labels fall apart. An expert rater gives only nit-picky ratings, far beyond the considerations of actual users. A naive rater, with little knowledge of the domain, tends to consider most results relevant.

How do we handle this situation?

We handle it by using multiple raters for the same document. We can’t rely on just one!

Then, once we have enough ratings, we can use a statistic like Fleiss's Kappa to measure whether raters tend to agree or disagree. You can apply it to your full dataset, or to a single query, to understand how aligned your raters are for that specific query.

Fleiss's Kappa approaches 1 when raters are in near-perfect agreement, falls to 0 when agreement is no better than chance, and can even go negative when raters systematically disagree.
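The statistic is simple enough to compute yourself. Here's a minimal Python sketch, assuming every document was labeled by the same number of raters (a requirement of Fleiss's Kappa):

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's Kappa for inter-rater agreement.

    ratings: one list per document, each holding that document's
    labels from the same number of raters, e.g. [[1, 1, 2], [3, 3, 3]].
    """
    n = len(ratings[0])   # raters per document
    N = len(ratings)      # number of documents
    categories = sorted({label for row in ratings for label in row})

    # n_ij: how many raters put document i into category j
    counts = [[Counter(row)[c] for c in categories] for row in ratings]

    # Per-document agreement: fraction of rater pairs that agree
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N

    # Chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in counts) / (N * n)
           for j in range(len(categories))]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)
```

With perfect agreement, such as `[[1, 1, 1], [2, 2, 2], [3, 3, 3]]`, this returns 1.0; with three raters each picking a different label per document, it goes negative. (If you'd rather not roll your own, `statsmodels.stats.inter_rater.fleiss_kappa` does the same computation on a counts matrix.)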

If you're using pointwise evals, use a statistic like Fleiss's Kappa. Otherwise, prepare to be surprised by the nitpicker's conflict with the naive rater.

-Doug

Events · Consulting · Training (use code search-tips)

You're subscribed to Doug Turnbull's daily search tips where I share tips, blog articles, events, and more. You can always manage your profile:
