
Doug Turnbull

I share search tips, blog articles, and free events I'm hosting about the search + retrieval industry, vector databases, information retrieval, and more.

Featured Post

Why do single vector representations fail? (daily search tip)

This week we’ll talk a bit about late interaction. But to get there, we need to think about why single vector representations fail. Let’s think about restaurants. Here’s an article reviewing local restaurants. I have three Italian restaurants and two Chinese ones. What’s the average of these? Russian or something!? Maybe Middle Eastern food? If my document lists these restaurants, then that’s exactly what I’ll get in a single vector encoding. A confusing muddle somewhere in the middle of...
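The muddle is easy to see with toy numbers. Here's a minimal sketch (the two-dimensional "cuisine" vectors are made up for illustration; real sentence embeddings behave analogously): averaging the restaurant mentions yields a document vector that matches neither cuisine cleanly.

```python
# Toy illustration: averaging cuisine "embeddings" for a mixed restaurant
# review article. Vectors are invented for illustration; real embeddings
# behave analogously.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

italian = [1.0, 0.0]   # pretend axis 0 = "Italian-ness"
chinese = [0.0, 1.0]   # pretend axis 1 = "Chinese-ness"

# The article mentions three Italian restaurants and two Chinese ones.
mentions = [italian] * 3 + [chinese] * 2
doc_vector = [sum(dims) / len(mentions) for dims in zip(*mentions)]
print(doc_vector)  # [0.6, 0.4] -- neither cuisine: a muddle in the middle

print(round(cosine(doc_vector, italian), 3))
print(round(cosine(doc_vector, chinese), 3))
```

The single document vector sits between both cuisines: it is a weaker match for an "Italian restaurant" query than a pure Italian document would be, yet it also half-matches a "Chinese restaurant" query.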

In previous tips I talked about tail latency: your search cluster becomes as slow as the slowest node; a wide per-node latency distribution results in an even wider per-cluster distribution; and the higher the scale, the more sharded your data becomes, the more small node latency problems exaggerate cluster latencies. So when I think about graph-based vector retrieval at scale, like HNSW, I get nervous. With HNSW you're navigating a non-deterministic graph, depending on the order the graph is...

The higher the scale, the stronger the incentive to simplify your retrieval. There are two conflicting incentives. Improving relevance: requiring more complex retrieval to get all the best candidates. Improving reliability: consistent latency and throughput, plus easier for an infra engineer to manage and debug. What does "simpler retrieval" look like? Single vector retrieval with a few filters. A first-pass BM25 retrieval with a recency boost. An assumption you're fetching the top 1000 and reranking outside...
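The "simpler retrieval" pattern can be sketched in a few lines. This is a hedged illustration, not a real search engine API: `score_bm25ish` is a crude term-overlap stand-in for BM25, and `rerank_externally` is a hypothetical placeholder for whatever heavier model runs outside the engine.

```python
# Sketch of "simpler retrieval": a cheap first pass inside the engine
# (term match + recency boost), then rerank the top candidates outside it.
# score_bm25ish and rerank_externally are hypothetical stand-ins, not a
# real search engine's API.
from datetime import date

docs = [
    {"id": 1, "text": "italian pasta restaurant review", "published": date(2024, 6, 1)},
    {"id": 2, "text": "chinese restaurant review", "published": date(2020, 1, 1)},
    {"id": 3, "text": "best italian restaurant in town", "published": date(2023, 3, 15)},
]

def score_bm25ish(query, doc):
    # Crude term-overlap stand-in for a real BM25 score.
    terms = set(query.split())
    return sum(1.0 for t in doc["text"].split() if t in terms)

def recency_boost(doc, today=date(2024, 7, 1)):
    age_years = (today - doc["published"]).days / 365.0
    return 1.0 / (1.0 + age_years)

def first_pass(query, docs, k=1000):
    scored = [(score_bm25ish(query, d) + recency_boost(d), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]

def rerank_externally(query, candidates):
    # Placeholder: the heavy ranking model lives OUTSIDE the search engine.
    return candidates

results = rerank_externally("italian restaurant", first_pass("italian restaurant", docs))
print([d["id"] for d in results])  # [1, 3, 2]
```

The engine only does the cheap, predictable work; anything expensive happens on the top-k candidates after they leave the cluster.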

Don't push complex ranking into the search engine. Layering operation on top of plugin on top of who-knows-what-else harms user experience. Why? Tail latency. In other words, in a distributed system, your query is as fast as your slowest node. A rare event for a single node becomes frequent on the full cluster. Consider a single-node benchmark: p50 of 50 ms, p99 of 200 ms. Seems reasonable. With 100 nodes, on average one node hits p99 every request. The cluster must wait for this slow node to...
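A quick Monte Carlo sketch makes the fan-out effect concrete. The lognormal parameters below are an assumption chosen to roughly match the numbers above (single-node p50 ≈ 50 ms, p99 ≈ 200 ms); a query that waits on all 100 nodes experiences the *maximum* of 100 draws.

```python
# Monte Carlo sketch of fan-out tail latency: a query that fans out to N
# nodes and waits for all of them sees the max of N latency draws.
# Lognormal parameters are assumptions tuned to p50 ~= 50 ms, p99 ~= 200 ms.
import math
import random

random.seed(42)

MU = math.log(50)                   # median (p50) of 50 ms
SIGMA = math.log(200 / 50) / 2.326  # puts the 99th percentile near 200 ms

def node_latency():
    return random.lognormvariate(MU, SIGMA)

def cluster_latency(n_nodes):
    # The cluster must wait for its slowest node.
    return max(node_latency() for _ in range(n_nodes))

def p50(samples):
    return sorted(samples)[len(samples) // 2]

single = [node_latency() for _ in range(10_000)]
cluster = [cluster_latency(100) for _ in range(2_000)]

print(f"single-node p50: {p50(single):.0f} ms")
print(f"100-node cluster p50: {p50(cluster):.0f} ms")
```

Under these assumptions the cluster's *median* latency lands beyond the single node's p99: the "rare" 200 ms event happens on some node in nearly every fanned-out request.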

Here’s a fun spreadsheet that implements word2vec. Use it as a jumping-off point. It has a single small vocabulary of 9 words and a single example, "mary had a little lamb". We move positive vectors closer together (mary is IN the context window of had); we move negative vectors farther apart (toenail is NOT IN the context window of mary). In word2vec we maintain two embeddings per vocabulary entry: input and output vectors. They mean subtly different things. Let’s say two inputs are similar, mary...
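The spreadsheet's update rule can be sketched as a few lines of skip-gram-with-negative-sampling. This is a minimal illustration, not the spreadsheet itself: the vocabulary, dimension, and learning rate are assumptions, but the mechanics (separate input/output embeddings, positives pulled together, negatives pushed apart) are the same.

```python
# Minimal sketch of word2vec-style (skip-gram, negative sampling) updates,
# with separate input and output embeddings per word. Vocabulary, DIM, and
# LR are illustrative assumptions.
import math
import random

random.seed(0)
DIM, LR = 4, 0.1
vocab = ["mary", "had", "a", "little", "lamb", "toenail"]
inp = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in vocab}
out = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in vocab}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update(center, context, label):
    # label = 1 for a positive pair (co-occurring), 0 for a negative sample.
    grad = label - sigmoid(dot(inp[center], out[context]))
    for i in range(DIM):
        ci, co = inp[center][i], out[context][i]
        inp[center][i] += LR * grad * co   # move input toward/away from output
        out[context][i] += LR * grad * ci  # and vice versa

for _ in range(200):
    update("mary", "had", 1)      # "had" IS in mary's context window
    update("mary", "toenail", 0)  # "toenail" is NOT: a negative sample
```

After a couple hundred updates, mary's input vector scores higher against "had" than against "toenail", exactly the closer/farther movement described above.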

In my previous tip I introduced word2vec. I discussed it in terms of language: this word, mary, shared context with this other word, lamb, so their embeddings move closer. Why constrain ourselves to language? We could pretend that “Doug likes Star Wars” is the same kind of co-occurrence. We can make a table of users to the movies they like:

Anchor  Positive movie        Negative movie
doug    star wars             king kong
doug    star trek             cinderella
tom     star wars             citizen kane
tom     battlestar galactica  the aviator...
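Each row of that table drops straight into the same word2vec-style update: users play the role of input vectors, movies the role of output vectors. A hedged sketch (the training loop and hyperparameters are illustrative assumptions, and the triplets mirror the table above):

```python
# Sketch: reuse the word2vec trick on (user, movie) co-occurrence.
# Users act as "input" vectors, movies as "output" vectors; a liked movie
# is a positive, the table's disliked movie a negative. DIM/LR/loop count
# are illustrative assumptions.
import math
import random

random.seed(1)
DIM, LR = 8, 0.2
users = {u: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for u in ["doug", "tom"]}
movies = {m: [random.uniform(-0.5, 0.5) for _ in range(DIM)]
          for m in ["star wars", "king kong", "star trek", "cinderella",
                    "citizen kane", "battlestar galactica", "the aviator"]}

triplets = [  # (anchor, positive movie, negative movie) -- the table's rows
    ("doug", "star wars", "king kong"),
    ("doug", "star trek", "cinderella"),
    ("tom", "star wars", "citizen kane"),
    ("tom", "battlestar galactica", "the aviator"),
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nudge(user_vec, movie_vec, label):
    grad = label - sigmoid(dot(user_vec, movie_vec))
    for i in range(DIM):
        u, m = user_vec[i], movie_vec[i]
        user_vec[i] += LR * grad * m
        movie_vec[i] += LR * grad * u

for _ in range(300):
    for anchor, pos, neg in triplets:
        nudge(users[anchor], movies[pos], 1)  # pull liked movie closer
        nudge(users[anchor], movies[neg], 0)  # push negative movie away
```

After training, doug's vector scores higher against "star wars" than "king kong", and since doug and tom share a positive, their tastes start to align in the embedding space.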

In search, judgment lists fall apart. It’s, frankly, humbling and humiliating :) The work isn’t whatever tech returns search results. The work is the measurement - the evaluation. Everything else becomes incidental in search. Better evals have been my humbling lesson from industry. Humbling because it’s not exciting or sexy, and because it’s hard. And it’s the work nobody wants to do. Everyone wants to do . For this reason, almost everyone overreads the value of some naive judgment list...

With embedding similarity you train with an anchor, a positive, and a negative. You want to move the positive's embeddings closer to the anchor's, while moving the negative's farther apart. Enter good ole word2vec. Every word in the vocabulary starts with its own random embedding. When a word co-occurs with another word, it's a positive (training moves them together). A random word, sampled out of context, is a negative (training pushes them apart). From just the context, “mary had a little lamb”, we...

Final free class Tuesday: Tuesday will be the last Cheat at Search Essentials before we begin Cheat at Search with Agents next week. Have you ever had to evaluate a search application? To figure out if it's satisfying users? To distinguish your team's opinion from what users actually click on? You may hear terms like "NDCG" or "judgment list". WTF is that? If you want to backfill some basics and understand search evaluation, come to the final Cheat at Search Essentials class on Tuesday. I'll...

Have you been to a conversion-crazy site? It’s nuts. Their site screams at you. They probably have the modern version of the HTML blink tag. Popups everywhere just won't go away. Buy buy buy! It’s fun to go to a physical store when you can browse the shelves, talk to customer service, and get help. People avoid stores that lack information and offer only high-pressure salespeople in your face. If your search stinks of pressure, users will retreat. They’ll stay on Google. They win precisely because...