
Commit 9ced7d0

Foo
1 parent 0b2e464 commit 9ced7d0

39 files changed

Lines changed: 370454 additions & 690 deletions

CLAUDE.md

Lines changed: 5 additions & 5 deletions
@@ -35,13 +35,13 @@ make test PYTHON=3.13 # Specific version
 ### Rust core (`src/`)
 
 - **`lib.rs`** — PyO3 module entry, exports `CachedFunction`, `SharedCachedFunction`, info types
-- **`store.rs`** — In-process backend: `CachedFunction` uses `parking_lot::RwLock` + `papaya::HashMap`. The `__call__` method does the entire cache lookup in Rust (hash → lookup → equality check → SIEVE visited update → return) in a single FFI crossing
+- **`store.rs`** — In-process backend: `CachedFunction` uses a sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for cache hits, write lock for misses/eviction). The `__call__` method does the entire cache lookup in Rust (hash → shard select → read lock → lookup → equality check → SIEVE visited update → return) in a single FFI crossing
 - **`serde.rs`** — Fast-path binary serialization for common primitives (None, bool, int, float, str, bytes, flat tuples); avoids pickle overhead for the shared backend
-- **`shared_store.rs`** — Cross-process backend: `SharedCachedFunction` serializes via serde.rs (with pickle fallback), stores in mmap'd shared memory
+- **`shared_store.rs`** — Cross-process backend: `SharedCachedFunction` holds `ShmCache` directly (no Mutex), with cached `max_key_size`/`max_value_size` fields and a pre-built `ahash::RandomState`. Serializes via serde.rs (with pickle fallback), stores in mmap'd shared memory
 - **`entry.rs`** — `CacheEntry` { value, created_at, visited }
 - **`key.rs`** — `CacheKey` wraps `Py<PyAny>` + precomputed hash; uses raw `ffi::PyObject_RichCompareBool` for equality (safe because called inside `#[pymethods]` where GIL is held)
 - **`shm/`** — Shared memory infrastructure:
-  - `mod.rs` — `ShmCache`: create/open, get/set with serialized bytes
+  - `mod.rs` — `ShmCache`: create/open, get/set with serialized bytes. Uses interior mutability (`&self` methods): reads are lock-free (seqlock), writes acquire seqlock internally. `next_unique_id` is `AtomicU64`
   - `layout.rs` — Header + SlotHeader structs, memory offsets
   - `region.rs` — `ShmRegion`: mmap file management (`$TMPDIR/warp_cache/{name}.cache`)
   - `lock.rs` — `ShmSeqLock`: seqlock (optimistic reads + TTAS spinlock) in shared memory
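
A conceptual Python sketch of the single-FFI-crossing lookup that the `store.rs` entry above describes (hash, shard select, read lock, lookup, SIEVE visited update). The shard count, field names, and the plain `threading.Lock` stand-in are assumptions for illustration; the real path is Rust and uses a per-shard reader-writer lock:

```python
import threading

NUM_SHARDS = 8  # assumed power of two; the real shard count is not shown in this diff


class _Shard:
    def __init__(self):
        # Stand-in for hashbrown::HashMap + parking_lot::RwLock; Python's stdlib has
        # no reader-writer lock, so a plain lock approximates the read/write split.
        self.lock = threading.Lock()
        self.entries = {}


class LookupSketch:
    """Conceptual model of CachedFunction.__call__ (the real logic is Rust, one FFI crossing)."""

    def __init__(self, fn, num_shards=NUM_SHARDS):
        self.fn = fn
        self.shards = [_Shard() for _ in range(num_shards)]

    def __call__(self, *args):
        key = args                                                # CacheKey: hashable args + precomputed hash
        shard = self.shards[hash(key) & (len(self.shards) - 1)]   # shard select by hash
        with shard.lock:                                          # hit path: Rust takes only a shard *read* lock
            entry = shard.entries.get(key)                        # lookup + equality check
            if entry is not None:
                entry["visited"] = True                           # SIEVE visited bit: single store, no reordering
                return entry["value"]
        value = self.fn(*args)                                    # miss: call the wrapped function outside the lock
        with shard.lock:                                          # Rust takes a *write* lock for insert/eviction
            shard.entries[key] = {"value": value, "visited": False}
        return value
```
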
@@ -50,15 +50,15 @@ make test PYTHON=3.13 # Specific version
 
 ### Python layer (`warp_cache/`)
 
-- **`_decorator.py`** — `cache()` factory: dispatches to `CachedFunction` (memory) or `SharedCachedFunction` (shared). Auto-detects async functions and wraps with `AsyncCachedFunction` (cache hit in Rust, only misses `await` the coroutine). Also exports `lru_cache()` — a convenience shorthand
+- **`_decorator.py`** — `cache()` factory: dispatches to `CachedFunction` (memory) or `SharedCachedFunction` (shared). Auto-detects async functions and wraps with `AsyncCachedFunction` (cache hit in Rust, only misses `await` the coroutine)
 - **`_strategies.py`** — `Backend(IntEnum)`: MEMORY=0, SHARED=1
 
 ### Key design decisions
 
 - **Single FFI crossing**: entire cache lookup happens in Rust `__call__`, no Python wrapper overhead
 - **Release profile**: fat LTO + `codegen-units=1` for cross-crate inlining of PyO3 wrappers
 - **SIEVE eviction**: unified across both backends. On hit, sets `visited=1` (single-word store). On evict, hand scans for unvisited entry. Lock-free reads on both backends
-- **Thread safety**: `parking_lot::RwLock` (~8ns uncontended) for in-process backend; seqlock (optimistic reads + TTAS spinlock) for shared backend. Enables true parallel reads under free-threaded Python (3.13t+)
+- **Thread safety**: sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for hits, write lock for misses) for in-process backend; seqlock (optimistic reads + TTAS spinlock) for shared backend — no Mutex, `ShmCache` uses `&self` methods with interior mutability. Cache hits only acquire a cheap per-shard read lock (memory) or are fully lock-free (shared). Enables true parallel reads across shards under free-threaded Python (3.13t+)
 
 ## Critical Invariants
 
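The SIEVE bullet above is compact enough to model directly. A minimal Python sketch of the policy (a visited bit per entry plus a hand that scans for an unvisited victim), following the published SIEVE algorithm rather than warp_cache's Rust data layout:

```python
from collections import OrderedDict


class SieveCache:
    """Toy SIEVE model: hits set a visited bit, the hand walks toward newer entries to evict."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()    # insertion order: oldest first, newest last
        self.visited = {}
        self.hand = None             # key the eviction hand currently points at

    def get(self, key):
        if key in self.data:
            self.visited[key] = True          # on hit: single bit set, no reordering (unlike LRU)
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.data[key] = value
            self.visited[key] = True
            return
        if len(self.data) >= self.maxsize:
            self._evict()
        self.data[key] = value                # new entries enter at the newest end
        self.visited[key] = False

    def _evict(self):
        keys = list(self.data)                # oldest -> newest
        i = keys.index(self.hand) if self.hand in self.data else 0
        while self.visited[keys[i]]:          # hand walks toward newer entries,
            self.visited[keys[i]] = False     # clearing visited bits as it goes
            i = (i + 1) % len(keys)
        victim = keys[i]
        self.hand = keys[(i + 1) % len(keys)] if len(keys) > 1 else None  # hand keeps its position
        del self.data[victim]
        del self.visited[victim]
```
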

Cargo.lock

Lines changed: 24 additions & 30 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ crate-type = ["cdylib"]
 [dependencies]
 pyo3 = { version = "0.28.1", features = ["extension-module"] }
 parking_lot = "0.12"
-papaya = "0.2"
+hashbrown = "0.15"
 
 [target.'cfg(not(target_os = "windows"))'.dependencies]
 memmap2 = "0.9"

Makefile

Lines changed: 4 additions & 1 deletion
@@ -1,4 +1,4 @@
-.PHONY: help fmt lint typecheck build build-debug test test-rust test-only bench bench-quick bench-all bench-report clean publish publish-test setup all
+.PHONY: help fmt lint typecheck build build-debug test test-rust test-only bench bench-quick bench-all bench-report bench-sieve clean publish publish-test setup all
 
 # Optional: specify Python version, e.g. make build PYTHON=3.14
 PYTHON ?=
@@ -62,6 +62,9 @@ bench-quick: build ## Quick benchmarks (skip sustained/TTL)
 bench-all: ## Run benchmarks across Python versions + generate report
 	bash benchmarks/bench_all.sh
 
+bench-sieve: build ## Run SIEVE eviction quality benchmarks
+	uv run $(UV_PYTHON) python benchmarks/bench_sieve.py
+
 bench-report: ## Generate report from existing results
 	uv run python benchmarks/_report_generator.py
 

README.md

Lines changed: 31 additions & 13 deletions
@@ -1,21 +1,21 @@
 # warp_cache
 
 A thread-safe Python caching decorator backed by a Rust extension. Uses
-**SIEVE eviction** for scan-resistant, near-optimal hit rates with lock-free
-reads. The entire cache lookup happens in a single Rust `__call__` — no Python
-wrapper overhead. **7-15M ops/s** single-threaded, **13x** faster than
-`cachetools`, with a cross-process shared memory backend reaching **8.9M ops/s**.
+**SIEVE eviction** for scan-resistant, near-optimal hit rates with per-shard
+read locks. The entire cache lookup happens in a single Rust `__call__` — no Python
+wrapper overhead. **13-20M ops/s** single-threaded, **22x** faster than
+`cachetools`, with a cross-process shared memory backend reaching **9.2M ops/s**.
 
 ## Features
 
 - **Drop-in replacement for `functools.lru_cache`** — same decorator pattern and hashable-argument requirement, with added thread safety, TTL, and async support
-- **SIEVE eviction** — a simple, scan-resistant algorithm with near-optimal hit rates and O(1) overhead per access
-- **Thread-safe** out of the box (`parking_lot::RwLock` in Rust)
+- **[SIEVE eviction](https://junchengyang.com/publication/nsdi24-SIEVE.pdf)** — a simple, scan-resistant algorithm with near-optimal hit rates and O(1) overhead per access
+- **Thread-safe** out of the box (sharded `RwLock` + `AtomicBool` for SIEVE visited bit)
 - **Async support**: works with `async def` functions — zero overhead on sync path
 - **Shared memory backend**: cross-process caching via mmap with fully lock-free reads
 - **TTL support**: optional time-to-live expiration
 - **Single FFI crossing**: entire cache lookup happens in Rust, no Python wrapper overhead
-- **7-15M ops/s** single-threaded, **10M ops/s** under concurrent load, **13x** faster than `cachetools`
+- **13-20M ops/s** single-threaded, **17M+ ops/s** under concurrent load, **22x** faster than `cachetools`
 
 ## Installation
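
For orientation on the decorator these README changes describe, a minimal usage sketch. The keyword names (`maxsize`, `ttl`) are assumptions for illustration, not confirmed by this diff; the project's usage guide is the authority for the real signature:

```python
# Hypothetical usage sketch; `maxsize`/`ttl` keyword names are assumed, not taken from this diff.
from warp_cache import cache


@cache(maxsize=256, ttl=60.0)        # thread-safe, SIEVE-evicted, optional TTL
def fetch_user(user_id: int) -> dict:
    ...                              # expensive work; arguments must be hashable, as with lru_cache


@cache(maxsize=1024)                 # async functions are auto-detected; cache hits skip the await
async def fetch_remote(url: str) -> bytes:
    ...
```
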

@@ -58,17 +58,35 @@ Like `lru_cache`, all arguments must be hashable. See the [usage guide](docs/usa
 
 | Metric | warp_cache | cachetools | lru_cache |
 |---|---|---|---|
-| Single-threaded (cache=256) | 10.5M ops/s | 819K ops/s | 29.6M ops/s |
-| Multi-threaded (8T) | 10.4M ops/s | 788K ops/s (with Lock) | 12.1M ops/s (with Lock) |
-| Shared memory (single proc) | 8.9M ops/s (mmap) | No | No |
-| Shared memory (4 procs) | 7.7M ops/s total | No | No |
-| Thread-safe | Yes (RwLock) | No (manual Lock) | No |
+| Single-threaded (cache=256) | 18.1M ops/s | 814K ops/s | 32.1M ops/s |
+| Multi-threaded (8T) | 17.9M ops/s | 774K ops/s (with Lock) | 12.3M ops/s (with Lock) |
+| Shared memory (single proc) | 9.2M ops/s (mmap) | No | No |
+| Shared memory (4 procs) | 7.5M ops/s total | No | No |
+| Thread-safe | Yes (sharded RwLock) | No (manual Lock) | No |
 | Async support | Yes | No | No |
 | TTL support | Yes | Yes | No |
 | Eviction | SIEVE (scan-resistant) | LRU, LFU, FIFO, RR | LRU only |
 | Implementation | Rust (PyO3) | Pure Python | C (CPython) |
 
-`warp_cache` is the fastest *thread-safe* cache — **13x** faster than `cachetools` and **2.8x** faster than `moka_py`. The shared memory backend reaches 89% of in-process speed with fully lock-free reads. See [full benchmarks](docs/performance.md) for details.
+`warp_cache` is the fastest *thread-safe* cache — **22x** faster than `cachetools` and **4.9x** faster than `moka_py`. Under multi-threaded load, it's **1.5x faster** than `lru_cache + Lock`. See [full benchmarks](docs/performance.md) for details.
+
+<picture>
+<source media="(prefers-color-scheme: dark)" srcset="benchmarks/results/comparison_mt_scaling_dark.svg">
+<img src="benchmarks/results/comparison_mt_scaling_light.svg" alt="Multi-thread scaling: GIL vs no-GIL">
+</picture>
+
+## Eviction quality: SIEVE vs LRU
+
+Beyond throughput, SIEVE delivers **up to 21.6% miss reduction** vs LRU. From the [NSDI'24 paper](https://junchengyang.com/publication/nsdi24-SIEVE.pdf), key findings reproduced in `benchmarks/bench_sieve.py` (1M requests, Zipf-distributed keys):
+
+| Workload | SIEVE | LRU | Miss Reduction |
+|---|---:|---:|---:|
+| Zipf, 10% cache | 74.5% | 67.5% | +21.6% |
+| Scan resistance (70% hot) | 69.9% | 63.5% | +17.6% |
+| One-hit wonders (25% unique) | 53.9% | 43.7% | +18.1% |
+| Working set shift | 75.5% | 69.7% | +16.6% |
+
+SIEVE's visited-bit design protects hot entries from sequential scans and filters out one-hit wonders that would pollute LRU. See [eviction quality benchmarks](docs/performance.md#sieve-eviction-quality) for the full breakdown.
 
 ## Documentation
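
The eviction-quality table above is produced by `benchmarks/bench_sieve.py`, which this commit adds but whose source is not shown on this page. A simplified sketch of that style of measurement (Zipf-distributed keys, hit ratio per policy); the workload parameters and the tiny LRU baseline are illustrative, not the benchmark's actual code:

```python
import random
from collections import OrderedDict


class LRUCache:
    """Tiny LRU baseline with the same get/put interface as the SIEVE sketch earlier on this page."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)       # LRU reorders on every hit
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)    # evict the least recently used entry


def zipf_workload(n_requests=1_000_000, n_keys=10_000, alpha=1.0, seed=42):
    rng = random.Random(seed)
    weights = [1.0 / rank ** alpha for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_requests)


def hit_ratio(cache, requests):
    hits = 0
    for key in requests:
        if cache.get(key) is None:
            cache.put(key, key)              # miss: "compute" the value and insert it
        else:
            hits += 1
    return hits / len(requests)


requests = zipf_workload()
print("LRU hit ratio:", hit_ratio(LRUCache(1_000), requests))  # cache sized at 10% of the key space
# Swap in the SieveCache sketch from the CLAUDE.md section above to compare policies.
```
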
