8 changes: 5 additions & 3 deletions CLAUDE.md
@@ -35,11 +35,11 @@ make test PYTHON=3.13 # Specific version
### Rust core (`src/`)

- **`lib.rs`** — PyO3 module entry, exports `CachedFunction`, `SharedCachedFunction`, info types
-- **`store.rs`** — In-process backend: `CachedFunction` uses a sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for cache hits, write lock for misses/eviction). The `__call__` method does the entire cache lookup in Rust (hash → shard select → read lock → lookup → equality check → SIEVE visited update → return) in a single FFI crossing
+- **`store.rs`** — In-process backend: `CachedFunction` uses sharded `hashbrown::HashMap` with passthrough hasher (avoids re-hashing Python's precomputed hash) + GIL-conditional locking (`GilCell` under GIL for zero-cost, `parking_lot::RwLock` under free-threaded Python). The `__call__` hot path uses `BorrowedArgs` to look up via borrowed pointer (no `CacheKey` allocation on hits), with `CacheKey` only materialized on cache miss for storage
- **`serde.rs`** — Fast-path binary serialization for common primitives (None, bool, int, float, str, bytes, flat tuples); avoids pickle overhead for the shared backend
- **`shared_store.rs`** — Cross-process backend: `SharedCachedFunction` holds `ShmCache` directly (no Mutex), with cached `max_key_size`/`max_value_size` fields and a pre-built `ahash::RandomState`. Serializes via serde.rs (with pickle fallback), stores in mmap'd shared memory
- **`entry.rs`** — `CacheEntry` { value, created_at, visited }
-- **`key.rs`** — `CacheKey` wraps `Py<PyAny>` + precomputed hash; uses raw `ffi::PyObject_RichCompareBool` for equality (safe because called inside `#[pymethods]` where GIL is held)
+- **`key.rs`** — `CacheKey` wraps `Py<PyAny>` + precomputed hash; uses raw `ffi::PyObject_RichCompareBool` for equality. Also provides `BorrowedArgs` (zero-alloc borrowed key for hit-path lookups via hashbrown's `Equivalent` trait)
- **`shm/`** — Shared memory infrastructure:
- `mod.rs` — `ShmCache`: create/open, get/set with serialized bytes. Uses interior mutability (`&self` methods): reads are lock-free (seqlock), writes acquire seqlock internally. `next_unique_id` is `AtomicU64`
- `layout.rs` — Header + SlotHeader structs, memory offsets
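The seqlock pattern used for lock-free reads can be sketched in a few lines of std-only Rust — an illustrative toy, not `ShmCache`'s actual code (the `SeqLocked` name and single-word payload are hypothetical): writers bump a sequence counter to odd before mutating and back to even after; readers retry if they observe an odd or changed sequence.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct SeqLocked {
    seq: AtomicU64,   // even = stable, odd = write in progress
    value: AtomicU64, // stand-in for the slot payload
}

impl SeqLocked {
    fn new(v: u64) -> Self {
        SeqLocked { seq: AtomicU64::new(0), value: AtomicU64::new(v) }
    }

    fn write(&self, v: u64) {
        // The real backend would serialize writers with a TTAS spinlock first.
        self.seq.fetch_add(1, Ordering::Release); // now odd: readers will retry
        self.value.store(v, Ordering::Relaxed);
        self.seq.fetch_add(1, Ordering::Release); // even again: stable
    }

    fn read(&self) -> u64 {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // writer active, retry
            }
            let v = self.value.load(Ordering::Relaxed);
            let s2 = self.seq.load(Ordering::Acquire);
            if s1 == s2 {
                return v; // no writer interleaved: snapshot is consistent
            }
        }
    }
}
```

Readers never write shared state, so concurrent readers scale without contention; only a racing writer forces a retry.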
@@ -58,7 +58,9 @@ make test PYTHON=3.13 # Specific version
- **Single FFI crossing**: entire cache lookup happens in Rust `__call__`, no Python wrapper overhead
- **Release profile**: fat LTO + `codegen-units=1` for cross-crate inlining of PyO3 wrappers
- **SIEVE eviction**: unified across both backends. On hit, sets `visited=1` (single-word store). On evict, hand scans for unvisited entry. Lock-free reads on both backends
-- **Thread safety**: sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for hits, write lock for misses) for in-process backend; seqlock (optimistic reads + TTAS spinlock) for shared backend — no Mutex, `ShmCache` uses `&self` methods with interior mutability. Cache hits only acquire a cheap per-shard read lock (memory) or are fully lock-free (shared). Enables true parallel reads across shards under free-threaded Python (3.13t+)
+- **Thread safety**: GIL-conditional locking — `GilCell` (zero-cost `UnsafeCell` wrapper) under GIL-enabled Python, `parking_lot::RwLock` under free-threaded Python (`#[cfg(Py_GIL_DISABLED)]`). Shared backend uses seqlock (optimistic reads + TTAS spinlock) — no Mutex. Under free-threaded Python, per-shard `RwLock` enables true parallel reads across cores
+- **Borrowed key lookup**: hit path uses `BorrowedArgs` (raw pointer + precomputed hash) via hashbrown's `Equivalent` trait — no `CacheKey` allocation, no refcount churn on hits
+- **Passthrough hasher**: `PassthroughHasher` feeds Python's precomputed hash directly to hashbrown, avoiding foldhash re-hashing (~1-2ns saved per lookup). Shard count is power-of-2 for bitmask indexing
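The `GilCell` idea above can be sketched with std only — a minimal, hypothetical illustration (not the crate's actual type): when the GIL serializes all access, a bare `UnsafeCell` wrapper is sufficient and costs nothing; the real code swaps in `parking_lot::RwLock` under `#[cfg(Py_GIL_DISABLED)]`.

```rust
use std::cell::UnsafeCell;

// Sketch only: soundness rests on an external guarantee (the GIL) that
// accesses are serialized. Under free-threaded Python this type would be
// replaced by an RwLock via cfg(Py_GIL_DISABLED).
struct GilCell<T>(UnsafeCell<T>);

// Shared across threads only because the GIL serializes access.
unsafe impl<T> Sync for GilCell<T> {}

impl<T> GilCell<T> {
    fn new(v: T) -> Self {
        GilCell(UnsafeCell::new(v))
    }

    // SAFETY: caller must guarantee exclusive access (i.e. hold the GIL).
    // No atomic, no lock — compiles down to a plain pointer dereference.
    unsafe fn get_mut(&self) -> &mut T {
        &mut *self.0.get()
    }
}
```

The point of the design is that the zero-cost path is chosen at compile time, so GIL-enabled builds pay no synchronization overhead at all.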
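A passthrough hasher is small enough to sketch fully — here with `std::collections::HashMap` and a `u64` key standing in for Python's precomputed hash (the real store uses hashbrown directly; names here are illustrative):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// The key's hash is already computed (by Python, in the real cache), so the
// table's hasher just forwards it instead of re-hashing.
#[derive(Default)]
struct PassthroughHasher(u64);

impl Hasher for PassthroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unreachable!("only u64 keys are supported; use write_u64");
    }
    fn write_u64(&mut self, n: u64) {
        self.0 = n; // pass the precomputed hash straight through
    }
}

type PrehashedMap<V> = HashMap<u64, V, BuildHasherDefault<PassthroughHasher>>;
```

With a power-of-2 bucket count, the shard index is then just `hash & (shards - 1)` — a single AND instead of a modulo.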

## Critical Invariants

3 changes: 3 additions & 0 deletions Cargo.toml
@@ -17,6 +17,9 @@ memmap2 = "0.9"
libc = "0.2"
ahash = "0.8"

+[lints.rust]
+unexpected_cfgs = { level = "warn", check-cfg = ['cfg(Py_GIL_DISABLED)'] }

[profile.release]
lto = "fat"
codegen-units = 1
24 changes: 12 additions & 12 deletions README.md
@@ -1,21 +1,21 @@
# warp_cache

A thread-safe Python caching decorator backed by a Rust extension. Uses
-**SIEVE eviction** for scan-resistant, near-optimal hit rates with per-shard
-read locks. The entire cache lookup happens in a single Rust `__call__` — no Python
-wrapper overhead. **13-20M ops/s** single-threaded, **22x** faster than
-`cachetools`, with a cross-process shared memory backend reaching **9.2M ops/s**.
+**SIEVE eviction** for scan-resistant, near-optimal hit rates with zero-cost
+locking under the GIL. The entire cache lookup happens in a single Rust `__call__` — no Python
+wrapper overhead. **16-23M ops/s** single-threaded, **25x** faster than
+`cachetools`, with a cross-process shared memory backend reaching **9.7M ops/s**.

## Features

- **Drop-in replacement for `functools.lru_cache`** — same decorator pattern and hashable-argument requirement, with added thread safety, TTL, and async support
- **[SIEVE eviction](https://junchengyang.com/publication/nsdi24-SIEVE.pdf)** — a simple, scan-resistant algorithm with near-optimal hit rates and O(1) overhead per access
-- **Thread-safe** out of the box (sharded `RwLock` + `AtomicBool` for SIEVE visited bit)
+- **Thread-safe** out of the box (zero-cost `GilCell` under GIL, sharded `RwLock` under free-threaded Python)
- **Async support**: works with `async def` functions — zero overhead on sync path
- **Shared memory backend**: cross-process caching via mmap with fully lock-free reads
- **TTL support**: optional time-to-live expiration
- **Single FFI crossing**: entire cache lookup happens in Rust, no Python wrapper overhead
-- **13-20M ops/s** single-threaded, **17M+ ops/s** under concurrent load, **22x** faster than `cachetools`
+- **16-23M ops/s** single-threaded, **20M+ ops/s** under concurrent load, **25x** faster than `cachetools`
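The SIEVE algorithm in the feature list above is simple enough to sketch in a few lines of Rust — a deliberately naive toy (linear scan, no hash table, no sharding), not warp_cache's actual implementation: each entry carries a visited bit set on hit, and on eviction a "hand" sweeps the entries, clearing visited bits until it finds an unvisited victim.

```rust
struct Entry<K, V> {
    key: K,
    value: V,
    visited: bool,
}

struct Sieve<K: PartialEq, V> {
    entries: Vec<Entry<K, V>>, // insertion order; real code uses a hash map
    hand: usize,
    capacity: usize,
}

impl<K: PartialEq, V> Sieve<K, V> {
    fn new(capacity: usize) -> Self {
        Sieve { entries: Vec::new(), hand: 0, capacity }
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        for e in self.entries.iter_mut() {
            if &e.key == key {
                e.visited = true; // the only bookkeeping on a hit
                return Some(&e.value);
            }
        }
        None
    }

    fn insert(&mut self, key: K, value: V) {
        if self.entries.len() == self.capacity {
            self.evict();
        }
        self.entries.push(Entry { key, value, visited: false });
    }

    fn evict(&mut self) {
        // The hand sweeps, clearing visited bits, until it finds a victim.
        loop {
            if self.hand >= self.entries.len() {
                self.hand = 0;
            }
            if self.entries[self.hand].visited {
                self.entries[self.hand].visited = false;
                self.hand += 1;
            } else {
                self.entries.remove(self.hand);
                return;
            }
        }
    }
}
```

Unlike LRU, a hit never reorders the structure — it is a single bit store — which is what makes lock-free (or read-locked) hits possible.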

## Installation

@@ -58,17 +58,17 @@ Like `lru_cache`, all arguments must be hashable. See the [usage guide](docs/usa

| Metric | warp_cache | cachetools | lru_cache |
|---|---|---|---|
-| Single-threaded (cache=256) | 18.1M ops/s | 814K ops/s | 32.1M ops/s |
-| Multi-threaded (8T) | 17.9M ops/s | 774K ops/s (with Lock) | 12.3M ops/s (with Lock) |
-| Shared memory (single proc) | 9.2M ops/s (mmap) | No | No |
-| Shared memory (4 procs) | 7.5M ops/s total | No | No |
-| Thread-safe | Yes (sharded RwLock) | No (manual Lock) | No |
+| Single-threaded (cache=256) | 20.4M ops/s | 826K ops/s | 31.0M ops/s |
+| Multi-threaded (8T) | 20.4M ops/s | 793K ops/s (with Lock) | 12.6M ops/s (with Lock) |
+| Shared memory (single proc) | 9.7M ops/s (mmap) | No | No |
+| Shared memory (4 procs) | 8.1M ops/s total | No | No |
+| Thread-safe | Yes (GilCell / sharded RwLock) | No (manual Lock) | No |
| Async support | Yes | No | No |
| TTL support | Yes | Yes | No |
| Eviction | SIEVE (scan-resistant) | LRU, LFU, FIFO, RR | LRU only |
| Implementation | Rust (PyO3) | Pure Python | C (CPython) |

-`warp_cache` is the fastest *thread-safe* cache — **22x** faster than `cachetools` and **4.9x** faster than `moka_py`. Under multi-threaded load, it's **1.5x faster** than `lru_cache + Lock`. See [full benchmarks](docs/performance.md) for details.
+`warp_cache` is the fastest *thread-safe* cache — **25x** faster than `cachetools` and **5.3x** faster than `moka_py`. Under multi-threaded load, it's **1.6x faster** than `lru_cache + Lock`. See [full benchmarks](docs/performance.md) for details.

<picture>
<source media="(prefers-color-scheme: dark)" srcset="benchmarks/results/comparison_mt_scaling_dark.svg">