Merged
20 changes: 10 additions & 10 deletions CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

## Project Overview

**warp_cache** — a thread-safe Python caching decorator backed by a Rust extension (PyO3 + maturin). Provides LRU/MRU/FIFO/LFU eviction, TTL support, async awareness, and a cross-process shared memory backend.
**warp_cache** — a thread-safe Python caching decorator backed by a Rust extension (PyO3 + maturin). Uses SIEVE eviction, with TTL support, async awareness, and a cross-process shared memory backend.

## Build & Test Commands

@@ -35,30 +35,30 @@ make test PYTHON=3.13 # Specific version
### Rust core (`src/`)

- **`lib.rs`** — PyO3 module entry, exports `CachedFunction`, `SharedCachedFunction`, info types
- **`store.rs`** — In-process backend: `CachedFunction` wraps `parking_lot::RwLock<CacheStoreInner>`. The `__call__` method does the entire cache lookup in Rust (hash → lookup → equality check → LRU reorder → return) in a single FFI crossing
- **`store.rs`** — In-process backend: `CachedFunction` uses a sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for cache hits, write lock for misses/eviction). The `__call__` method does the entire cache lookup in Rust (hash → shard select → read lock → lookup → equality check → SIEVE visited update → return) in a single FFI crossing
- **`serde.rs`** — Fast-path binary serialization for common primitives (None, bool, int, float, str, bytes, flat tuples); avoids pickle overhead for the shared backend
- **`shared_store.rs`** — Cross-process backend: `SharedCachedFunction` serializes via serde.rs (with pickle fallback), stores in mmap'd shared memory
- **`entry.rs`** — `CacheEntry` { value, created_at, frequency }
- **`shared_store.rs`** — Cross-process backend: `SharedCachedFunction` holds `ShmCache` directly (no Mutex), with cached `max_key_size`/`max_value_size` fields and a pre-built `ahash::RandomState`. Serializes via serde.rs (with pickle fallback), stores in mmap'd shared memory
- **`entry.rs`** — `CacheEntry` { value, created_at, visited }
- **`key.rs`** — `CacheKey` wraps `Py<PyAny>` + precomputed hash; uses raw `ffi::PyObject_RichCompareBool` for equality (safe because called inside `#[pymethods]` where GIL is held)
- **`strategies/`** — Enum-based static dispatch (`StrategyEnum`) over LRU/MRU/FIFO/LFU (avoids `Box<dyn>` overhead). LRU uses `hashlink::LruCache`
- **`shm/`** — Shared memory infrastructure:
- `mod.rs` — `ShmCache`: create/open, get/set with serialized bytes
- `mod.rs` — `ShmCache`: create/open, get/set with serialized bytes. Uses interior mutability (`&self` methods): reads are lock-free (seqlock), writes acquire seqlock internally. `next_unique_id` is `AtomicU64`
- `layout.rs` — Header + SlotHeader structs, memory offsets
- `region.rs` — `ShmRegion`: mmap file management (`$TMPDIR/warp_cache/{name}.cache`)
- `lock.rs` — `ShmSeqLock`: seqlock (optimistic reads + TTAS spinlock) in shared memory
- `hashtable.rs` — Open-addressing with linear probing (power-of-2 capacity, bitmask)
- `ordering.rs` — Eviction ordering state in shared memory
- `ordering.rs` — SIEVE eviction: intrusive linked list + `sieve_evict()` hand scan
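
The seqlock protocol named in `lock.rs` can be modeled in a few lines. Below is a minimal single-threaded sketch for orientation only; the real implementation lives in shared memory and uses atomic operations plus a TTAS spinlock to serialize writers, and the names `SeqLock`, `version`, and `data` here are illustrative:

```python
class SeqLock:
    """Toy model of the seqlock read/write protocol."""

    def __init__(self):
        self.version = 0   # even = stable snapshot, odd = write in progress
        self.data = None

    def write(self, value):
        self.version += 1          # now odd: concurrent readers will retry
        self.data = value
        self.version += 1          # even again: snapshot is consistent

    def read(self):
        while True:
            v1 = self.version
            if v1 % 2:             # a writer is mid-update, retry
                continue
            snapshot = self.data   # optimistic read, no lock taken
            if self.version == v1: # version unchanged: read was consistent
                return snapshot
```

Readers never block writers and take no lock at all, which is what makes the shared backend's hit path lock-free.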

### Python layer (`warp_cache/`)

- **`_decorator.py`** — `cache()` factory: dispatches to `CachedFunction` (memory) or `SharedCachedFunction` (shared). Auto-detects async functions and wraps with `AsyncCachedFunction` (cache hit in Rust, only misses `await` the coroutine). Also exports `lru_cache()` — a convenience shorthand for `cache(strategy=Strategy.LRU, ...)`
- **`_strategies.py`** — `Strategy(IntEnum)`: LRU=0, MRU=1, FIFO=2, LFU=3
- **`_decorator.py`** — `cache()` factory: dispatches to `CachedFunction` (memory) or `SharedCachedFunction` (shared). Auto-detects async functions and wraps with `AsyncCachedFunction` (cache hit in Rust, only misses `await` the coroutine)
- **`_strategies.py`** — `Backend(IntEnum)`: MEMORY=0, SHARED=1
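
The async wrapping described above (cache consulted first, coroutine awaited only on a miss) can be sketched in pure Python. This is a hypothetical model, not `AsyncCachedFunction` itself; in the real package the hit path runs inside Rust, and `async_cached` plus its dict store are illustrative names:

```python
import asyncio
import functools

def async_cached(func):
    store = {}

    @functools.wraps(func)
    async def wrapper(*args):
        key = args                 # same hashable-args requirement as lru_cache
        if key in store:
            return store[key]      # hit: the user coroutine is never awaited
        value = await func(*args)  # miss: run the coroutine once and store
        store[key] = value
        return value

    return wrapper

calls = 0

@async_cached
async def fetch(x):
    global calls
    calls += 1
    return x * 2

async def main():
    a = await fetch(3)
    b = await fetch(3)  # served from cache; fetch's body does not re-run
    return a, b, calls

print(asyncio.run(main()))  # prints (6, 6, 1)
```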

### Key design decisions

- **Single FFI crossing**: entire cache lookup happens in Rust `__call__`, no Python wrapper overhead
- **Release profile**: fat LTO + `codegen-units=1` for cross-crate inlining of PyO3 wrappers
- **Thread safety**: `parking_lot::RwLock` (~8ns uncontended) for in-process backend; seqlock (optimistic reads + TTAS spinlock) for shared backend. Enables true parallel reads under free-threaded Python (3.13t+)
- **SIEVE eviction**: unified across both backends. On hit, sets `visited=1` (single-word store). On evict, hand scans for unvisited entry. Lock-free reads on both backends
- **Thread safety**: sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for hits, write lock for misses) for in-process backend; seqlock (optimistic reads + TTAS spinlock) for shared backend — no Mutex, `ShmCache` uses `&self` methods with interior mutability. Cache hits only acquire a cheap per-shard read lock (memory) or are fully lock-free (shared). Enables true parallel reads across shards under free-threaded Python (3.13t+)
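
The SIEVE behavior named above (hit sets a visited bit, eviction hand-scans for an unvisited entry) can be sketched as a pure-Python reference. This is illustrative only; the real implementation is in Rust with an atomic visited bit and no Python-level locking:

```python
class _Node:
    __slots__ = ("key", "value", "visited", "prev", "next")

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.visited = False
        self.prev = self.next = None

class SieveCache:
    """Toy SIEVE cache: FIFO insertion order + visited bits + eviction hand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.table = {}    # key -> node
        self.head = None   # newest entry
        self.tail = None   # oldest entry
        self.hand = None   # eviction hand, scans from tail toward head

    def get(self, key):
        node = self.table.get(key)
        if node is None:
            return None
        node.visited = True        # the only bookkeeping a hit performs
        return node.value

    def put(self, key, value):
        if key in self.table:
            node = self.table[key]
            node.value = value
            node.visited = True
            return
        if len(self.table) >= self.capacity:
            self._evict()
        node = _Node(key, value)
        node.next = self.head      # insert at head; no reordering ever after
        if self.head:
            self.head.prev = node
        self.head = node
        if self.tail is None:
            self.tail = node
        self.table[key] = node

    def _evict(self):
        node = self.hand or self.tail
        while node.visited:               # visited entries get a second chance
            node.visited = False
            node = node.prev or self.tail # move toward head, wrap at the end
        self.hand = node.prev             # resume the scan here next time
        if node.prev:                     # unlink the victim
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self.tail = node.prev
        del self.table[node.key]
```

Because a hit only flips a flag and never moves the node, hits need no write lock, which is what the per-shard read-lock design above relies on.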

## Critical Invariants

37 changes: 24 additions & 13 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -10,7 +10,7 @@ crate-type = ["cdylib"]
[dependencies]
pyo3 = { version = "0.28.1", features = ["extension-module"] }
parking_lot = "0.12"
hashlink = "0.9"
hashbrown = "0.15"

[target.'cfg(not(target_os = "windows"))'.dependencies]
memmap2 = "0.9"
5 changes: 4 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
.PHONY: help fmt lint typecheck build build-debug test test-rust test-only bench bench-quick bench-all bench-report clean publish publish-test setup all
.PHONY: help fmt lint typecheck build build-debug test test-rust test-only bench bench-quick bench-all bench-report bench-sieve clean publish publish-test setup all

# Optional: specify Python version, e.g. make build PYTHON=3.14
PYTHON ?=
@@ -62,6 +62,9 @@ bench-quick: build ## Quick benchmarks (skip sustained/TTL)
bench-all: ## Run benchmarks across Python versions + generate report
bash benchmarks/bench_all.sh

bench-sieve: build ## Run SIEVE eviction quality benchmarks
uv run $(UV_PYTHON) python benchmarks/bench_sieve.py

bench-report: ## Generate report from existing results
uv run python benchmarks/_report_generator.py

54 changes: 36 additions & 18 deletions README.md
@@ -1,22 +1,21 @@
# warp_cache

A thread-safe Python caching decorator backed by a Rust extension. Through a
series of optimizations — eliminating serialization, moving the call wrapper
into Rust, applying link-time optimization, and using direct C API calls — we
achieve **0.55-0.66x** of `lru_cache`'s single-threaded throughput while
providing native thread safety that delivers **1.3-1.4x** higher throughput
under concurrent load — and **18-24x** faster than pure-Python `cachetools`.
A thread-safe Python caching decorator backed by a Rust extension. Uses
**SIEVE eviction** for scan-resistant, near-optimal hit rates, guarded by
cheap per-shard read locks. The entire cache lookup happens in a single Rust
`__call__` — no Python wrapper overhead. **13-20M ops/s** single-threaded,
**22x** faster than `cachetools`, with a cross-process shared memory backend
reaching **9.2M ops/s**.
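
The per-shard locking mentioned above follows a standard pattern: hash the key, pick a shard with a bitmask, and lock only that shard. A rough Python model (the shard count and class names here are assumptions; the core uses `hashbrown` maps behind `parking_lot::RwLock`s):

```python
import threading

class ShardedCache:
    """Toy sharded cache: contention is confined to one shard per key."""

    SHARDS = 8  # power of two, so shard selection is a bitmask, not a modulo

    def __init__(self):
        self.maps = [{} for _ in range(self.SHARDS)]
        self.locks = [threading.Lock() for _ in range(self.SHARDS)]

    def _shard(self, key):
        return hash(key) & (self.SHARDS - 1)  # bitmask shard select

    def get(self, key):
        i = self._shard(key)
        with self.locks[i]:   # threads on different shards never contend
            return self.maps[i].get(key)

    def put(self, key, value):
        i = self._shard(key)
        with self.locks[i]:
            self.maps[i][key] = value
```

Keys spread across shards, so under concurrent load most lookups touch disjoint locks.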

## Features

- **Drop-in replacement for `functools.lru_cache`** — same decorator pattern and hashable-argument requirement, with added thread safety, TTL, eviction strategies, and async support
- **Thread-safe** out of the box (`parking_lot::RwLock` in Rust)
- **Drop-in replacement for `functools.lru_cache`** — same decorator pattern and hashable-argument requirement, with added thread safety, TTL, and async support
- **[SIEVE eviction](https://junchengyang.com/publication/nsdi24-SIEVE.pdf)** — a simple, scan-resistant algorithm with near-optimal hit rates and O(1) overhead per access
- **Thread-safe** out of the box (sharded `RwLock` + `AtomicBool` for SIEVE visited bit)
- **Async support**: works with `async def` functions — zero overhead on sync path
- **Shared memory backend**: cross-process caching via mmap
- **Multiple eviction strategies**: LRU, MRU, FIFO, LFU
- **Shared memory backend**: cross-process caching via mmap with fully lock-free reads
- **TTL support**: optional time-to-live expiration
- **Single FFI crossing**: entire cache lookup happens in Rust, no Python wrapper overhead
- **12-18M ops/s** single-threaded, **16M ops/s** under concurrent load, **18-24x** faster than `cachetools`
- **13-20M ops/s** single-threaded, **17M+ ops/s** under concurrent load, **22x** faster than `cachetools`
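
The TTL feature above amounts to a timestamp comparison at lookup time. A minimal model under assumed semantics (the real check compares an entry's `created_at` inside Rust; `TTLCache` and its `get(key, compute)` signature are illustrative):

```python
import time

class TTLCache:
    """Toy TTL cache: an entry older than `ttl` seconds counts as a miss."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}  # key -> (value, created_at)

    def get(self, key, compute):
        hit = self.store.get(key)
        now = time.monotonic()
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]              # fresh: serve cached value
        value = compute()              # stale or missing: recompute
        self.store[key] = (value, now)
        return value
```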

## Installation

@@ -59,20 +59,39 @@ Like `lru_cache`, all arguments must be hashable. See the [usage guide](docs/usa

| Metric | warp_cache | cachetools | lru_cache |
|---|---|---|---|
| Single-threaded | 12-18M ops/s | 0.6-1.2M ops/s | 21-40M ops/s |
| Multi-threaded (8T) | 16M ops/s | 770K ops/s (with Lock) | 12M ops/s (with Lock) |
| Thread-safe | Yes (RwLock) | No (manual Lock) | No |
| Single-threaded (cache=256) | 18.1M ops/s | 814K ops/s | 32.1M ops/s |
| Multi-threaded (8T) | 17.9M ops/s | 774K ops/s (with Lock) | 12.3M ops/s (with Lock) |
| Shared memory (single proc) | 9.2M ops/s (mmap) | No | No |
| Shared memory (4 procs) | 7.5M ops/s total | No | No |
| Thread-safe | Yes (sharded RwLock) | No (manual Lock) | No |
| Async support | Yes | No | No |
| Cross-process (shared) | ~7.8M ops/s (mmap) | No | No |
| TTL support | Yes | Yes | No |
| Eviction strategies | LRU, MRU, FIFO, LFU | LRU, LFU, FIFO, RR | LRU only |
| Eviction | SIEVE (scan-resistant) | LRU, LFU, FIFO, RR | LRU only |
| Implementation | Rust (PyO3) | Pure Python | C (CPython) |

Under concurrent load, `warp_cache` delivers **1.3-1.4x** higher throughput than `lru_cache + Lock` and **18-24x** higher than `cachetools`. See [full benchmarks](docs/performance.md) for details.
`warp_cache` is the fastest *thread-safe* cache — **22x** faster than `cachetools` and **4.9x** faster than `moka_py`. Under multi-threaded load, it's **1.5x faster** than `lru_cache + Lock`. See [full benchmarks](docs/performance.md) for details.

<picture>
<source media="(prefers-color-scheme: dark)" srcset="benchmarks/results/comparison_mt_scaling_dark.svg">
<img src="benchmarks/results/comparison_mt_scaling_light.svg" alt="Multi-thread scaling: GIL vs no-GIL">
</picture>

## Eviction quality: SIEVE vs LRU

Beyond throughput, SIEVE delivers **up to 21.6% miss reduction** vs LRU. Key findings from the [NSDI'24 paper](https://junchengyang.com/publication/nsdi24-SIEVE.pdf), reproduced in `benchmarks/bench_sieve.py` (1M requests, Zipf-distributed keys):

| Workload | SIEVE hit rate | LRU hit rate | Miss reduction |
|---|---:|---:|---:|
| Zipf, 10% cache | 74.5% | 67.5% | +21.6% |
| Scan resistance (70% hot) | 69.9% | 63.5% | +17.6% |
| One-hit wonders (25% unique) | 53.9% | 43.7% | +18.1% |
| Working set shift | 75.5% | 69.7% | +16.6% |

SIEVE's visited-bit design protects hot entries from sequential scans and filters out one-hit wonders that would pollute LRU. See [eviction quality benchmarks](docs/performance.md#sieve-eviction-quality) for the full breakdown.
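
For readers who want to reproduce this kind of workload, a Zipf-distributed key trace can be generated in a few lines. The parameters and function name below are assumptions for illustration, not necessarily what `benchmarks/bench_sieve.py` uses:

```python
import random

def zipf_keys(n_keys, n_requests, s=1.0, seed=42):
    """Sample `n_requests` keys where key rank r has weight 1/r**s."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_requests)

trace = zipf_keys(1000, 100_000)
# A handful of hot keys dominate the trace, which is why a small cache
# can still achieve high hit rates on Zipf-skewed workloads.
```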

## Documentation

- **[Usage guide](docs/usage.md)** — eviction strategies, async, TTL, shared memory, decorator parameters
- **[Usage guide](docs/usage.md)** — SIEVE eviction, async, TTL, shared memory, decorator parameters
- **[Performance](docs/performance.md)** — benchmarks, architecture deep-dive, optimization journey
- **[Alternatives](docs/alternatives.md)** — comparison with cachebox, moka-py, cachetools, lru_cache
- **[Examples](examples/)** — runnable scripts for every feature (`uv run examples/<name>.py`)