Measure your rater reliability (daily search tip)


In the previous tip, we discussed how pointwise 1-5 relevance labels fall apart. An expert rater gives only nit-picky ratings, far beyond the considerations of actual users. A naive rater, with little knowledge of the domain, tends to consider most results relevant.

How do we handle this situation?

We handle it by using multiple raters for the same document. We can’t rely on just one!

Then, once we have enough ratings, we can use a statistic like Fleiss's Kappa to measure whether raters tend to agree or disagree. You can apply it to your full dataset, or to a single query, to understand how aligned your raters are for that specific query.

Fleiss's Kappa approaches 1 when raters are in near-perfect agreement, falls to 0 when agreement is no better than chance, and can even go negative when raters systematically disagree.
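The statistic is simple enough to compute yourself. Here's a minimal Python sketch, assuming every document was labeled by the same number of raters (a requirement of Fleiss's Kappa):

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's Kappa for inter-rater agreement.

    ratings: one list per document, each holding that document's
    labels from the same number of raters, e.g. [[1, 1, 2], [3, 3, 3]].
    """
    n = len(ratings[0])   # raters per document
    N = len(ratings)      # number of documents
    categories = sorted({label for row in ratings for label in row})

    # n_ij: how many raters put document i into category j
    counts = [[Counter(row)[c] for c in categories] for row in ratings]

    # Per-document agreement: fraction of rater pairs that agree
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N

    # Chance agreement from the marginal category proportions
    p_j = [sum(row[j] for row in counts) / (N * n)
           for j in range(len(categories))]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)
```

With perfect agreement, such as `[[1, 1, 1], [2, 2, 2], [3, 3, 3]]`, this returns 1.0; with three raters each picking a different label per document, it goes negative. (If you'd rather not roll your own, `statsmodels.stats.inter_rater.fleiss_kappa` does the same computation on a counts matrix.)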

If you're using pointwise evals, use a statistic like Fleiss's Kappa. Otherwise, prepare to be surprised by the nitpicker's conflict with the naive rater.

-Doug

Events · Consulting · Training (use code search-tips)

You're subscribed to Doug Turnbull's daily search tips where I share tips, blog articles, events, and more. You can always manage your profile:
