What Happens When Your Index Navigates by Smell?
Echo Field just ran its first away-field benchmark on 18,840 real arXiv sections. 100% recall parity with brute-force, 18x faster at 500K vectors, and churn-stable under continuous mutation. Here's what we found.
Every vector index in production faces the same three questions: can it find what you're looking for, can it do it fast, and what happens when the corpus changes underneath it?
We just ran Echo Field against open-rag-bench, a public RAG evaluation dataset built from 1,000 arXiv papers: 18,840 corpus sections and 3,045 real retrieval queries. It's the first time we've tested Echo Field on data we didn't generate. Here's what happened.
The Recall Story
Echo Field matched brute-force exact search perfectly. 0.817 recall@10 on both. Zero loss from the approximate index.
When we first saw the numbers, we assumed something was wrong. An approximate index matching exact search, to the third decimal place, on real embeddings? We ran a brute-force kNN baseline on the same corpus with the same 384-dimensional embeddings to confirm. The ceiling is 0.817. Echo Field hits the ceiling.
The 0.817 itself is the embedding model's limit (all-MiniLM-L6-v2, a lightweight 22M-parameter model). A larger model would push the ceiling higher. But the point is: Echo Field doesn't leave anything on the table. Whatever your embeddings can find, Echo Field finds.
Full retrieval metrics on open-rag-bench (18,840 sections, 384 dims, 3,045 queries):
| Metric | Echo Field | Brute-Force (exact) | Ratio |
|---|---|---|---|
| Recall@1 | 0.469 | 0.468 | 100.3% |
| Recall@5 | 0.745 | 0.744 | 100.0% |
| Recall@10 | 0.817 | 0.817 | 100.0% |
| MRR | 0.584 | 0.583 | 100.1% |
| NDCG@10 | 0.640 | 0.640 | 100.1% |
At this corpus size (18.8K), Echo Field correctly uses flat-scan mode. The metamorphosis threshold handles this automatically. No config tuning needed.
The Scaling Story
Echo Field's query time stays nearly flat as the corpus grows. 2.3ms at 10K. 7.6ms at 500K. Brute-force goes from 2.3ms to 138ms over the same range.
This is what sublinear scaling looks like in practice. Brute-force scans every vector on every query, so its latency is directly proportional to corpus size. Echo Field's multi-resolution LSH hashes the query into buckets that contain a tiny fraction of the corpus. More data means more buckets, not more work per query.
The crossover happens around 15K vectors. Below that, brute-force is efficient enough that the LSH overhead doesn't pay off. Above it, the gap widens fast. At 500K, Echo Field is 18x faster.
Scaling benchmark (384 dims, 200 queries, release build):
| Corpus size | Echo Field latency | Brute-Force latency | Speedup | Recall vs exact |
|---|---|---|---|---|
| 10,000 | 2.3ms | 2.3ms | 1.0x | 100% |
| 25,000 | 3.1ms | 6.8ms | 2.2x | 100% |
| 50,000 | 4.8ms | 13.6ms | 2.8x | 100% |
| 100,000 | 5.3ms | 27.5ms | 5.2x | 100% |
| 250,000 | 6.6ms | 69.0ms | 10.5x | 100% |
| 500,000 | 7.6ms | 138ms | 18.2x | 100% |
100% recall parity with brute-force through 500K vectors. The scent field's multi-resolution hashing finds the same nearest neighbors that brute-force does, just faster.
The Churn Story
This is where it gets interesting. We ran 50 ticks of continuous insert-and-delete on a live index. Recall held essentially flat: in the steady-state scenario, it ended within one point of where it started even after the entire corpus had turned over.
Static benchmarks (BEIR, MTEB, open-rag-bench) freeze the corpus and run queries against it. That's useful for measuring recall, but it's not how production vector databases work. Production indexes ingest new data, expire old data, and serve queries at the same time.
HNSW-based indexes accumulate tombstones when vectors are deleted, because the navigation graph can't be cheaply repaired. These ghost entries degrade recall over time, eventually requiring a full rebuild. Echo Field uses lazy deletes: the vector is removed from the resolver immediately, but its scent entries persist in the LSH buckets until they decay naturally.
The risk with lazy deletes is that ghost entries fill the candidate budget, crowding out live results. We fixed this with ghost-aware candidate budgeting: the query path counts only live candidates against the budget, and when a bucket contains ghosts, it iterates newest-first so fresh entries take priority.
Churn benchmark (50K corpus, 384 dims, release build):
| Scenario | Initial recall | Final recall | Min recall | Recall change | Speedup |
|---|---|---|---|---|---|
| Steady-state (+1K/-1K per tick, 50 ticks) | 97% | 96% | 90% | -1% (stable) | 2.8x |
| Growing (+1K/-500 per tick, 50 ticks) | 96% | 90% | 86% | -6% | 3.4x |
| Shrinking (+500/-1K per tick, 30 ticks) | 98% | 98% | 91% | 0% (stable) | 2.5x |
The steady-state result is the headline. After 50,000 inserts and 50,000 deletes, with the entire original corpus replaced by new data, recall sits within one point of where it started. The index doesn't accumulate debt.
The growing scenario shows modest degradation (96% to 90%) because the index outgrows its initial configuration. The shrinking scenario holds perfectly because fewer vectors means less competition for the candidate budget.
How the Fix Works
Two changes to the query path made churn stability possible: ghost-aware budgeting and recency-biased iteration.
When Echo Field deletes a vector, it removes the vector data from the resolver (O(1)) but leaves the scent entries in the hash buckets. These ghost entries decay naturally over time. The problem: before the fix, ghost entries counted against the candidate budget during queries. If half the entries in a bucket were ghosts, the budget filled with half as many live candidates as intended.
The fix separates the live candidate count from the total candidate count. Only live entries (those still in the resolver) count toward the budget. And when a bucket contains ghosts, we iterate from the newest entries first, since recently inserted vectors are more likely to be alive.
There's also a hard cap on total candidates (live plus ghost) to bound memory and rerank cost even under extreme ghost accumulation.
Concretely, this is ghost-aware candidate budgeting under lazy-delete semantics. The query path's candidate admission loop checks `resolver.contains(entry.vector_id)` before incrementing the live counter, and the budget break condition uses `live_count >= candidate_budget` instead of `all_candidates.len() >= candidate_budget`.
For buckets with detected ghosts, the iteration order reverses (`entries.iter().rev()`) so that newer entries fill the budget before stale ones get a chance. For clean buckets, forward order preserves LSH locality. This is a recency bias, not neutral ghost filtering. In practice it works because recently inserted vectors are overwhelmingly the ones you want to find.
The Full Stack
We didn't benchmark Echo Field alone. We ran all seven implemented Mutuus primitives on the same workload.
The open-rag-bench harness exercises Echo Field for retrieval, Mycelial Cache for query result caching, Diatom Bitmap for document ID set tracking, Nacre Array for ranked result storage, Dendrite Ring for latency time-series monitoring, Spike for event telemetry, and Waggle for multi-modal retrieval confidence fusion.
Every primitive participated. Dendrite caught 5 latency anomalies in the query stream. Spike logged all 3,045 events. Waggle fused per-query evidence across retrieval modalities. This is the first time the full Mutuus stack has run together on a real workload.
What's Next
Echo Field's recall and scaling are strong. The next targets are latency optimization (SIMD, better hashing), the WASM bridge, and the billion-vector memory story.
Echo Field's scent field architecture decouples navigation from data storage. At billion-vector scale, HNSW's navigation graph can consume 120-200GB of memory. Echo Field's LSH tables should be 2-3x smaller because they store lightweight scent entries (vector ID plus generation), not graph edges. We haven't benchmarked this yet. That's the next benchmark target, and it's where the architectural bet either pays off or it doesn't.
The churn stability result is already differentiating. No other approximate nearest-neighbor index we're aware of publishes mutation-stability benchmarks. The industry tests on frozen corpora. Production doesn't freeze.