Nacre Array Benchmarks: The Numbers
We promised the full benchmark numbers. Here they are: every operation, every scale, wins and losses. Vec is still faster at some things. Nacre Array is faster at others. The crossover points tell the real story.
In Why Nature Builds Better Arrays, we described the limitations of flat sequential storage. In Nacre Array: What If Your Vec Had a Spine?, we introduced the design and showed where segmented architecture should win. This post is the receipt. Every number, every scale, every operation where Vec still wins.
Setup
All numbers in this post come from the corrected benchmark harness run on 2026-03-14 using Criterion.rs on Windows 11 with Rust 1.92. The corrections matter: random access now measures lookup cost instead of lookup plus RNG overhead, and split now compares against Vec::split_off instead of a clone-both-halves baseline. The sizes tested are 1K, 10K, and 100K elements, with push also tested at 1M.
The baseline is Vec<T> from Rust's standard library. This is not a strawman. Vec is one of the most optimized data structures in any language runtime, backed by decades of allocator tuning, LLVM vectorization, and hardware prefetch alignment. Beating it at anything is non-trivial.
Nacre Array uses the current tuned default configuration: 4096-element segments, LZ4 compression, and segment locality cache enabled unless a cache-off variant is shown. For push, we show two Vec baselines: default growth and preallocated capacity. That distinction still matters.
Where Nacre Wins
Three operations where segmented architecture dominates: insert at scale, split at fracture planes, and metadata-only scanning. The margins are large enough to justify the overhead for workloads that hit these paths.
Insert at Middle
Vec shifts n - k elements on every mid-array insert. Nacre Array shifts within one segment, then updates cumulative suffix metadata across later segments. With the current 4096-element default, Nacre still loses at 1K and 10K, but the structural crossover at scale is decisive. At 100K elements, Nacre is 3.7x faster. Note: elements are 12 bytes with the SlimElementHeader (4-byte header + 8-byte u64), compared to 48 bytes under the previous ElementHeader. Smaller elements also make Vec shifts cheaper, reducing the factor from 8.3x to 3.7x; the structural advantage is unchanged.
| Size | NacreArray | Vec | Ratio |
|---|---|---|---|
| 1K | 786 ns | 733 ns | Near parity |
| 10K | 4,604 ns | 3,573 ns | Vec 1.3x faster |
| 100K | 19,723 ns | 72,047 ns | Nacre 3.7x faster |
The crossover now happens later but more decisively. At small and medium sizes, the segmented structure is still overhead. By 100K, the O(n) suffix shift in Vec dominates and Nacre's local movement plus suffix maintenance wins convincingly.
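The mechanics above can be sketched in a few lines. This is a hypothetical simplification, not the Nacre Array source: segments are plain `Vec<Vec<T>>` and the cumulative metadata is a prefix-sum table, with names (`SegArray`, `cum`) invented for illustration. It shows why a mid-array insert is a local shift plus a metadata walk rather than an O(n) suffix shift.

```rust
// Hypothetical sketch (not the Nacre source): segmented insert shifts within
// one segment, then bumps cumulative suffix metadata for later segments.
struct SegArray<T> {
    segs: Vec<Vec<T>>, // segments; fixed-capacity handling omitted
    cum: Vec<usize>,   // cum[i] = total element count of segs[0..=i]
}

impl<T> SegArray<T> {
    fn from_segments(segs: Vec<Vec<T>>) -> Self {
        let mut cum = Vec::with_capacity(segs.len());
        let mut total = 0;
        for s in &segs {
            total += s.len();
            cum.push(total);
        }
        SegArray { segs, cum }
    }

    /// Insert before `index` (index < total length in this sketch).
    fn insert(&mut self, index: usize, value: T) {
        // O(log S): binary search the prefix sums for the owning segment.
        let seg = self.cum.partition_point(|&c| c <= index);
        let base = if seg == 0 { 0 } else { self.cum[seg - 1] };
        // O(s): shift only within this segment, not the whole suffix.
        self.segs[seg].insert(index - base, value);
        // O(S): every later prefix sum grows by exactly one element.
        for c in &mut self.cum[seg..] {
            *c += 1;
        }
    }

    fn to_vec(self) -> Vec<T> {
        self.segs.into_iter().flatten().collect()
    }
}

fn main() {
    let mut a = SegArray::from_segments(vec![vec![0, 1, 2, 3], vec![4, 5, 6, 7]]);
    a.insert(3, 99); // shifts one element in segment 0, bumps two counters
    assert_eq!(a.to_vec(), vec![0, 1, 2, 99, 3, 4, 5, 6, 7]);
}
```

The O(s + S) row in the complexity table falls directly out of the two loops here: a local shift bounded by segment capacity, plus one counter increment per later segment.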
Split at Midpoint
This is the clearest structural win on the tuned branch. The old comparison used clone+truncate for Vec; the proper contiguous-array baseline is Vec::split_off. Against that baseline, Nacre now measures 12.1 microseconds (12,105 ns) versus 90.9 microseconds (90,944 ns). That is a real 7.5x win, not a benchmark artifact. The factor decreased from 23.6x because smaller elements (12 bytes vs 48 bytes) also make Vec::split_off cheaper; the memcpy cost that Nacre avoids is proportional to element size.
| Size | NacreArray (fracture) | Vec (split_off) | Ratio |
|---|---|---|---|
| 100K | 12,105 ns | 90,944 ns | Nacre 7.5x faster |
Segment Scan
This is the real metadata win. scan_segments evaluates about 25 segment summaries in 241 ns. A comparable Vec predicate scan over 100K elements takes about 324,578 ns. That is roughly a 1,350x difference, and it is not a micro-optimization of the same operation. Segment metadata lets you ask which regions are interesting before you touch payloads.
| Operation | Time (100K) | Elements Touched |
|---|---|---|
| scan_segments | 241 ns | 0 payloads |
| scan_headers_collect | 305,952 ns | 100K element-associated headers |
| Vec iter_filter_collect | 324,578 ns | 100K elements |
scan_headers lands near parity with the equivalent Vec collect because the current implementation still traverses live elements to inspect their headers. The real structural win is scan_segments: summary-level filtering at segment granularity.
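A minimal sketch of the summary-level idea, under stated assumptions: each segment keeps a (min, max) summary, and a range query rules whole segments in or out without reading any payload. The names (`Segment`, `scan_segments`) and the min/max summary are illustrative choices, not the Nacre metadata format.

```rust
// Hypothetical sketch: per-segment (min, max) summaries let a range query
// skip whole segments without touching element payloads.
struct Segment {
    data: Vec<u64>,
    min: u64,
    max: u64,
}

impl Segment {
    fn new(data: Vec<u64>) -> Self {
        let min = data.iter().copied().min().unwrap_or(u64::MAX);
        let max = data.iter().copied().max().unwrap_or(0);
        Segment { data, min, max }
    }
}

/// Return indices of segments whose summary overlaps [lo, hi]:
/// O(segments) comparisons, zero payload reads.
fn scan_segments(segs: &[Segment], lo: u64, hi: u64) -> Vec<usize> {
    segs.iter()
        .enumerate()
        .filter(|(_, s)| s.max >= lo && s.min <= hi)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let segs = vec![
        Segment::new((0..10).collect()),
        Segment::new((100..110).collect()),
        Segment::new((200..210).collect()),
    ];
    // Only segment 1 can possibly contain values in [100, 150].
    assert_eq!(scan_segments(&segs, 100, 150), vec![1]);
}
```

The 241 ns figure in the table is this kind of loop over roughly 25 summaries; the 324,578 ns figure is the cost of visiting all 100K payloads instead.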
Where Vec Wins
Radical honesty time. Vec is faster at random access, iteration, and small-scale pushes. These are real costs, not edge cases.
Random Access
Vec's get is a single pointer offset. Nacre's get resolves the containing segment, then dereferences into that segment. Once the harness stopped timing RNG work, the gap turned out to be much larger than the old draft claimed. For pure random access, disable the cache: it only adds overhead.
| Size | Nacre (cache) | Nacre (no-cache) | Vec | cache/Vec | no-cache/Vec |
|---|---|---|---|---|---|
| 1K | 2.14 ns | 2.72 ns | 0.76 ns | 2.8x | 3.6x |
| 10K | 3.52 ns | 3.82 ns | 0.73 ns | 4.8x | 5.2x |
| 100K | 6.73 ns | 5.96 ns | 0.75 ns | 9.0x | 7.9x |
The asymptotic story is still correct: segmented lookup is structurally slower than contiguous pointer arithmetic. What changed is the measured constant factor. With the benchmark now isolating lookup cost cleanly, random access is not a mild tax. It is a major tradeoff.
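The structural cost is visible in a sketch of the lookup path. This is a hypothetical simplification (invented `SegArray` and `cum` names, no locality cache): a binary search over cumulative lengths, then a second indirection into the owning segment, versus `Vec::get`'s single pointer offset.

```rust
// Hypothetical sketch of a segmented get(): O(log S) binary search over
// cumulative lengths, then one more indirection into the owning segment.
// Vec::get is a single pointer offset, hence the constant-factor gap.
struct SegArray<T> {
    segs: Vec<Vec<T>>,
    cum: Vec<usize>, // cum[i] = total element count of segs[0..=i]
}

impl<T> SegArray<T> {
    fn from_segments(segs: Vec<Vec<T>>) -> Self {
        let mut cum = Vec::new();
        let mut total = 0;
        for s in &segs {
            total += s.len();
            cum.push(total);
        }
        SegArray { segs, cum }
    }

    fn get(&self, index: usize) -> Option<&T> {
        // First pointer chase: binary search the prefix-sum table.
        let seg = self.cum.partition_point(|&c| c <= index);
        let base = if seg == 0 { 0 } else { self.cum[seg - 1] };
        // Second pointer chase: into the segment's own buffer.
        self.segs.get(seg)?.get(index - base)
    }
}

fn main() {
    let a = SegArray::from_segments(vec![vec![10, 11, 12], vec![13, 14, 15]]);
    assert_eq!(a.get(4), Some(&14));
    assert_eq!(a.get(6), None);
}
```

The segment locality cache in the measured build short-circuits the binary search when consecutive lookups hit the same segment, which is why it helps localized patterns and only adds overhead under pure random access.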
Iteration
Vec wins iteration decisively at every scale. Contiguous memory with sequential hardware prefetch is extremely hard to beat. Per-segment pointer indirection breaks the prefetch stream. This is a fundamental cost of segmentation, not an optimization we missed.
| Size | NacreArray | Vec | Ratio |
|---|---|---|---|
| 1K | 804 ns | 103 ns | Vec 7.8x faster |
| 10K | 10,671 ns | 1,295 ns | Vec 8.2x faster |
| 100K | 99,054 ns | 15,684 ns | Vec 6.3x faster |
With smaller elements, Vec iteration benefits more from cache-line density (more elements per cache line), widening the gap from the previous 2.9x to 6.3x at 100K. Contiguous memory and hardware prefetch dominate segment-by-segment traversal.
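The access-pattern difference reduces to a nested loop versus a single linear stream. A minimal sketch, assuming a plain `Vec<Vec<u64>>` stand-in for the segmented layout (the real Nacre iterator is more involved):

```rust
// Sketch: segmented iteration is a nested loop whose inner buffer changes at
// every segment boundary, breaking the single sequential stream that hardware
// prefetchers reward. (Hypothetical layout, not the Nacre iterator.)
fn sum_segmented(segs: &[Vec<u64>]) -> u64 {
    // Each segment boundary starts a fresh, unpredicted memory region.
    segs.iter().flat_map(|s| s.iter()).sum()
}

fn sum_contiguous(data: &[u64]) -> u64 {
    // One linear stream over one allocation.
    data.iter().sum()
}

fn main() {
    let flat: Vec<u64> = (0..8192).collect();
    let segs: Vec<Vec<u64>> = flat.chunks(4096).map(|c| c.to_vec()).collect();
    // Same result, very different memory access pattern.
    assert_eq!(sum_segmented(&segs), sum_contiguous(&flat));
}
```

Both functions are O(n); the 6.3x gap in the table is entirely the constant factor of cache-line density and prefetch behavior.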
Push Throughput
Push needs two baselines. Against default-growth Vec, Nacre briefly wins at 100K because Vec hits a reallocation cliff. Against preallocated Vec, Nacre loses at every tested size. That means push is not a structural reason to choose Nacre Array.
| Size | NacreArray | Vec (default) | Vec (prealloc) |
|---|---|---|---|
| 1K | 4,568 ns | 1,323 ns | 569 ns |
| 10K | 40,025 ns | 12,202 ns | 5,711 ns |
| 100K | 436,050 ns | 505,050 ns | 322,840 ns |
| 1M | 6,843,200 ns | 5,000,500 ns | 2,740,900 ns |
At 100K, Nacre records 436,050 ns versus 505,050 ns for default-growth Vec and 322,840 ns for preallocated Vec. Elements are 12 bytes with the SlimElementHeader, compared to 48 bytes under the previous ElementHeader; both Nacre and Vec benefit from smaller elements.
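The push path can be sketched to show where the fixed cost lives. This is a hedged illustration (invented `SegArray` name, no headers or compression): append to the last segment, allocate a fresh fixed-capacity segment when it fills. There is no O(n) reallocation cliff, but every push pays a capacity check, and every 4096th push pays a malloc that a preallocated Vec never does.

```rust
// Hypothetical sketch of segmented push: amortized O(1) with no reallocation
// cliff, but a per-element capacity check plus one allocation per full
// segment, which a preallocated Vec avoids entirely.
const SEG_CAP: usize = 4096; // the tuned default segment size from the post

struct SegArray<T> {
    segs: Vec<Vec<T>>,
}

impl<T> SegArray<T> {
    fn new() -> Self {
        SegArray { segs: vec![Vec::with_capacity(SEG_CAP)] }
    }

    fn push(&mut self, value: T) {
        if self.segs.last().map_or(true, |s| s.len() == SEG_CAP) {
            // Fresh segment: existing elements never move, unlike a Vec grow.
            self.segs.push(Vec::with_capacity(SEG_CAP));
        }
        self.segs.last_mut().unwrap().push(value);
    }

    fn len(&self) -> usize {
        self.segs.iter().map(|s| s.len()).sum()
    }
}

fn main() {
    let mut a = SegArray::new();
    for i in 0..10_000u32 {
        a.push(i);
    }
    assert_eq!(a.len(), 10_000);
    assert_eq!(a.segs.len(), 3); // 4096 + 4096 + 1808
}
```

The brief 100K win over default-growth Vec in the table is the absence of the grow-and-copy cliff, not a cheaper per-push path, which is why it disappears against the preallocated baseline.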
The Crossover Points
Segmentation starts paying for itself at different scales depending on the operation. Understanding these crossover points is the key to knowing when Nacre Array is the right choice.
For insert, the crossover is between 10K and 100K elements. At 10K, Vec still wins. By 100K, suffix shifting dominates and Nacre pulls ahead. At 100K, the medians are 19,723 ns for Nacre and 72,047 ns for Vec.
For push, the crossover against default-growth Vec is narrow and unreliable. Against preallocated Vec, there is no crossover. Do not choose Nacre Array for push throughput.
For split, the corrected baseline still shows a clear Nacre win because the operation is fundamentally different: segment-level fracture versus contiguous suffix relocation.
For random access and iteration, there is no crossover in the other direction. Vec wins at every tested size. The question is whether the structural wins elsewhere justify paying that tax.
The Optimization Story
The corrected harness forced a reset in how we talk about get(). The durable implementation ideas survived. The old headline ratios did not.
What still matters from the optimization work:
- The linear segment scan is gone. Lookup uses a cumulative index and binary search.
- The segment cache is real, but it is workload-sensitive. It helps sequential, localized, and Zipfian access patterns in the current benchmarks, and it hurts pure random access.
- Insert and remove now update cumulative suffix metadata incrementally. Split now carries only the moved right-half cumulative metadata and renormalizes that side, so the mutation-side upside is more faithful to the segmented design than it was earlier in the branch.
What no longer survives publication:
- The old 15.8x -> 2.55x random-access story. Those ratios were collected under the pre-fix harness and are not the numbers we should publish.
The corrected numbers are the ones above: at 100K, pure random access is 5.96 ns cache-off versus 0.75 ns for Vec. That is the honest current state.
Full Complexity Table
| Operation | Nacre | Vec | Winner |
|---|---|---|---|
| push | O(1) amortized | O(1) amortized | Mixed |
| get | O(log S) / O(1) cached | O(1) | Vec |
| insert | O(s + S) current impl | O(n) | Mixed, Nacre at scale |
| remove | O(s + S) current impl | O(n) | Mixed, Nacre at scale |
| split | O(S) current impl | O(n) | Nacre at scale |
| iter | O(n) | O(n) | Vec (cache locality) |
| scan_headers | O(n) | O(n) | Mixed |
| scan_segments | O(segments) | N/A | Nacre (novel) |
| tick | O(segments) | N/A | Nacre (novel) |
The s in O(s) is the segment capacity (current default 4096), not the total element count. The S in O(S) is the number of segments. The current implementation's mutation path is the composition of both: local per-segment work plus cumulative suffix maintenance or right-half metadata renormalization.
What the Numbers Mean
The Nacre Array is not a faster Vec. It is a different set of tradeoffs. You pay a tax on reads and iteration. You gain structural operations (split, insert, scan) that scale independently of collection size.
Think of it as a metabolic investment. The per-element header overhead, the segment metadata, the binary search on every get. These are ongoing costs. The returns come when you split a time-series partition without copying, when you scan segment summaries instead of iterating millions of elements, when you insert into a hot region without shifting the entire collection.
The workloads that benefit most: event logs, time-series stores, stream processors, and any system that partitions, scans, or restructures data at runtime. The workloads that should stay with Vec: tight inner loops over contiguous data, random-access-heavy indices, and small collections where the metadata cost never amortizes.
The biology was right about the architecture, but the implementation still matters. Layered structures with differentiated regions do handle certain stresses better than uniform ones. The corrected benchmarks show exactly where that principle pays off today and where the current code still needs work.
What's Next
Next up: Diatom Bitmap benchmarks, covering density-aware container selection and the cooperative thermoregulation model. We're also developing steady-state benchmarks that measure performance after tick cycles, capturing the long-term behavior of the thermal state machine rather than just initial throughput.
Related Posts
Nacre Array: What If Your Vec Had a Spine?
Mother-of-pearl is 3,000 times tougher than the crystals it's made of. The secret is layered organization with flexible mortar between rigid segments. We built an array that works the same way.
Why Nature Builds Better Arrays
Every system that stores sequential data eventually hits the same wall: cold data costs the same as hot, inserts require shifting everything, and there's no way to split without copying. What if the structure itself knew the difference?
Why Your Data Structure Doesn't Have a Metabolism
Data structures are benchmarked at birth, compared as static objects, and optimized for a single moment in time. But biological systems invest in costly metabolic machinery that pays off across a lifecycle. What if the overhead isn't the problem, but the investment?