
Commit 9ced7d0

Foo
1 parent 0b2e464 commit 9ced7d0

39 files changed

Lines changed: 370454 additions & 690 deletions

CLAUDE.md

Lines changed: 5 additions & 5 deletions
@@ -35,13 +35,13 @@ make test PYTHON=3.13 # Specific version
 ### Rust core (`src/`)
 
 - **`lib.rs`** — PyO3 module entry, exports `CachedFunction`, `SharedCachedFunction`, info types
-- **`store.rs`** — In-process backend: `CachedFunction` uses `parking_lot::RwLock` + `papaya::HashMap`. The `__call__` method does the entire cache lookup in Rust (hash → lookup → equality check → SIEVE visited update → return) in a single FFI crossing
+- **`store.rs`** — In-process backend: `CachedFunction` uses a sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for cache hits, write lock for misses/eviction). The `__call__` method does the entire cache lookup in Rust (hash → shard select → read lock → lookup → equality check → SIEVE visited update → return) in a single FFI crossing
 - **`serde.rs`** — Fast-path binary serialization for common primitives (None, bool, int, float, str, bytes, flat tuples); avoids pickle overhead for the shared backend
-- **`shared_store.rs`** — Cross-process backend: `SharedCachedFunction` serializes via serde.rs (with pickle fallback), stores in mmap'd shared memory
+- **`shared_store.rs`** — Cross-process backend: `SharedCachedFunction` holds `ShmCache` directly (no Mutex), with cached `max_key_size`/`max_value_size` fields and a pre-built `ahash::RandomState`. Serializes via serde.rs (with pickle fallback), stores in mmap'd shared memory
 - **`entry.rs`** — `CacheEntry` { value, created_at, visited }
 - **`key.rs`** — `CacheKey` wraps `Py<PyAny>` + precomputed hash; uses raw `ffi::PyObject_RichCompareBool` for equality (safe because called inside `#[pymethods]` where GIL is held)
 - **`shm/`** — Shared memory infrastructure:
-  - `mod.rs` — `ShmCache`: create/open, get/set with serialized bytes
+  - `mod.rs` — `ShmCache`: create/open, get/set with serialized bytes. Uses interior mutability (`&self` methods): reads are lock-free (seqlock), writes acquire seqlock internally. `next_unique_id` is `AtomicU64`
   - `layout.rs` — Header + SlotHeader structs, memory offsets
   - `region.rs` — `ShmRegion`: mmap file management (`$TMPDIR/warp_cache/{name}.cache`)
   - `lock.rs` — `ShmSeqLock`: seqlock (optimistic reads + TTAS spinlock) in shared memory
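
A conceptual Python sketch of the single-FFI-crossing lookup that the `store.rs` entry above describes (hash, shard select, read lock, lookup, SIEVE visited update). The shard count, field names, and the plain `threading.Lock` stand-in are assumptions for illustration; the real path is Rust and uses a per-shard reader-writer lock:

```python
import threading

NUM_SHARDS = 8  # assumed power of two; the real shard count is not shown in this diff


class _Shard:
    def __init__(self):
        # Stand-in for hashbrown::HashMap + parking_lot::RwLock; Python's stdlib has
        # no reader-writer lock, so a plain lock approximates the read/write split.
        self.lock = threading.Lock()
        self.entries = {}


class LookupSketch:
    """Conceptual model of CachedFunction.__call__ (the real logic is Rust, one FFI crossing)."""

    def __init__(self, fn, num_shards=NUM_SHARDS):
        self.fn = fn
        self.shards = [_Shard() for _ in range(num_shards)]

    def __call__(self, *args):
        key = args                                                # CacheKey: hashable args + precomputed hash
        shard = self.shards[hash(key) & (len(self.shards) - 1)]   # shard select by hash
        with shard.lock:                                          # hit path: Rust takes only a shard *read* lock
            entry = shard.entries.get(key)                        # lookup + equality check
            if entry is not None:
                entry["visited"] = True                           # SIEVE visited bit: single store, no reordering
                return entry["value"]
        value = self.fn(*args)                                    # miss: call the wrapped function outside the lock
        with shard.lock:                                          # Rust takes a *write* lock for insert/eviction
            shard.entries[key] = {"value": value, "visited": False}
        return value
```
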
@@ -50,15 +50,15 @@ make test PYTHON=3.13 # Specific version
 
 ### Python layer (`warp_cache/`)
 
-- **`_decorator.py`** — `cache()` factory: dispatches to `CachedFunction` (memory) or `SharedCachedFunction` (shared). Auto-detects async functions and wraps with `AsyncCachedFunction` (cache hit in Rust, only misses `await` the coroutine). Also exports `lru_cache()` — a convenience shorthand
+- **`_decorator.py`** — `cache()` factory: dispatches to `CachedFunction` (memory) or `SharedCachedFunction` (shared). Auto-detects async functions and wraps with `AsyncCachedFunction` (cache hit in Rust, only misses `await` the coroutine)
 - **`_strategies.py`** — `Backend(IntEnum)`: MEMORY=0, SHARED=1
 
 ### Key design decisions
 
 - **Single FFI crossing**: entire cache lookup happens in Rust `__call__`, no Python wrapper overhead
 - **Release profile**: fat LTO + `codegen-units=1` for cross-crate inlining of PyO3 wrappers
 - **SIEVE eviction**: unified across both backends. On hit, sets `visited=1` (single-word store). On evict, hand scans for unvisited entry. Lock-free reads on both backends
-- **Thread safety**: `parking_lot::RwLock` (~8ns uncontended) for in-process backend; seqlock (optimistic reads + TTAS spinlock) for shared backend. Enables true parallel reads under free-threaded Python (3.13t+)
+- **Thread safety**: sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for hits, write lock for misses) for in-process backend; seqlock (optimistic reads + TTAS spinlock) for shared backend — no Mutex, `ShmCache` uses `&self` methods with interior mutability. Cache hits only acquire a cheap per-shard read lock (memory) or are fully lock-free (shared). Enables true parallel reads across shards under free-threaded Python (3.13t+)
 
 ## Critical Invariants
 
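The SIEVE bullet above is compact enough to model directly. A minimal Python sketch of the policy (a visited bit per entry plus a hand that scans for an unvisited victim), following the published SIEVE algorithm rather than warp_cache's Rust data layout:

```python
from collections import OrderedDict


class SieveCache:
    """Toy SIEVE model: hits set a visited bit, the hand walks toward newer entries to evict."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()    # insertion order: oldest first, newest last
        self.visited = {}
        self.hand = None             # key the eviction hand currently points at

    def get(self, key):
        if key in self.data:
            self.visited[key] = True          # on hit: single bit set, no reordering (unlike LRU)
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.data[key] = value
            self.visited[key] = True
            return
        if len(self.data) >= self.maxsize:
            self._evict()
        self.data[key] = value                # new entries enter at the newest end
        self.visited[key] = False

    def _evict(self):
        keys = list(self.data)                # oldest -> newest
        i = keys.index(self.hand) if self.hand in self.data else 0
        while self.visited[keys[i]]:          # hand walks toward newer entries,
            self.visited[keys[i]] = False     # clearing visited bits as it goes
            i = (i + 1) % len(keys)
        victim = keys[i]
        self.hand = keys[(i + 1) % len(keys)] if len(keys) > 1 else None  # hand keeps its position
        del self.data[victim]
        del self.visited[victim]
```
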

Cargo.lock

Lines changed: 24 additions & 30 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ crate-type = ["cdylib"]
 [dependencies]
 pyo3 = { version = "0.28.1", features = ["extension-module"] }
 parking_lot = "0.12"
-papaya = "0.2"
+hashbrown = "0.15"
 
 [target.'cfg(not(target_os = "windows"))'.dependencies]
 memmap2 = "0.9"

Makefile

Lines changed: 4 additions & 1 deletion
@@ -1,4 +1,4 @@
-.PHONY: help fmt lint typecheck build build-debug test test-rust test-only bench bench-quick bench-all bench-report clean publish publish-test setup all
+.PHONY: help fmt lint typecheck build build-debug test test-rust test-only bench bench-quick bench-all bench-report bench-sieve clean publish publish-test setup all
 
 # Optional: specify Python version, e.g. make build PYTHON=3.14
 PYTHON ?=
@@ -62,6 +62,9 @@ bench-quick: build ## Quick benchmarks (skip sustained/TTL)
 bench-all: ## Run benchmarks across Python versions + generate report
 	bash benchmarks/bench_all.sh
 
+bench-sieve: build ## Run SIEVE eviction quality benchmarks
+	uv run $(UV_PYTHON) python benchmarks/bench_sieve.py
+
 bench-report: ## Generate report from existing results
 	uv run python benchmarks/_report_generator.py
 

README.md

Lines changed: 31 additions & 13 deletions
@@ -1,21 +1,21 @@
 # warp_cache
 
 A thread-safe Python caching decorator backed by a Rust extension. Uses
-**SIEVE eviction** for scan-resistant, near-optimal hit rates with lock-free
-reads. The entire cache lookup happens in a single Rust `__call__` — no Python
-wrapper overhead. **7-15M ops/s** single-threaded, **13x** faster than
-`cachetools`, with a cross-process shared memory backend reaching **8.9M ops/s**.
+**SIEVE eviction** for scan-resistant, near-optimal hit rates with per-shard
+read locks. The entire cache lookup happens in a single Rust `__call__` — no Python
+wrapper overhead. **13-20M ops/s** single-threaded, **22x** faster than
+`cachetools`, with a cross-process shared memory backend reaching **9.2M ops/s**.
 
 ## Features
 
 - **Drop-in replacement for `functools.lru_cache`** — same decorator pattern and hashable-argument requirement, with added thread safety, TTL, and async support
-- **SIEVE eviction** — a simple, scan-resistant algorithm with near-optimal hit rates and O(1) overhead per access
-- **Thread-safe** out of the box (`parking_lot::RwLock` in Rust)
+- **[SIEVE eviction](https://junchengyang.com/publication/nsdi24-SIEVE.pdf)** — a simple, scan-resistant algorithm with near-optimal hit rates and O(1) overhead per access
+- **Thread-safe** out of the box (sharded `RwLock` + `AtomicBool` for SIEVE visited bit)
 - **Async support**: works with `async def` functions — zero overhead on sync path
 - **Shared memory backend**: cross-process caching via mmap with fully lock-free reads
 - **TTL support**: optional time-to-live expiration
 - **Single FFI crossing**: entire cache lookup happens in Rust, no Python wrapper overhead
-- **7-15M ops/s** single-threaded, **10M ops/s** under concurrent load, **13x** faster than `cachetools`
+- **13-20M ops/s** single-threaded, **17M+ ops/s** under concurrent load, **22x** faster than `cachetools`
 
 ## Installation
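
For orientation on the decorator these README changes describe, a minimal usage sketch. The keyword names (`maxsize`, `ttl`) are assumptions for illustration, not confirmed by this diff; the project's usage guide is the authority for the real signature:

```python
# Hypothetical usage sketch; `maxsize`/`ttl` keyword names are assumed, not taken from this diff.
from warp_cache import cache


@cache(maxsize=256, ttl=60.0)        # thread-safe, SIEVE-evicted, optional TTL
def fetch_user(user_id: int) -> dict:
    ...                              # expensive work; arguments must be hashable, as with lru_cache


@cache(maxsize=1024)                 # async functions are auto-detected; cache hits skip the await
async def fetch_remote(url: str) -> bytes:
    ...
```
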

@@ -58,17 +58,35 @@ Like `lru_cache`, all arguments must be hashable. See the [usage guide](docs/usa
 
 | Metric | warp_cache | cachetools | lru_cache |
 |---|---|---|---|
-| Single-threaded (cache=256) | 10.5M ops/s | 819K ops/s | 29.6M ops/s |
-| Multi-threaded (8T) | 10.4M ops/s | 788K ops/s (with Lock) | 12.1M ops/s (with Lock) |
-| Shared memory (single proc) | 8.9M ops/s (mmap) | No | No |
-| Shared memory (4 procs) | 7.7M ops/s total | No | No |
-| Thread-safe | Yes (RwLock) | No (manual Lock) | No |
+| Single-threaded (cache=256) | 18.1M ops/s | 814K ops/s | 32.1M ops/s |
+| Multi-threaded (8T) | 17.9M ops/s | 774K ops/s (with Lock) | 12.3M ops/s (with Lock) |
+| Shared memory (single proc) | 9.2M ops/s (mmap) | No | No |
+| Shared memory (4 procs) | 7.5M ops/s total | No | No |
+| Thread-safe | Yes (sharded RwLock) | No (manual Lock) | No |
 | Async support | Yes | No | No |
 | TTL support | Yes | Yes | No |
 | Eviction | SIEVE (scan-resistant) | LRU, LFU, FIFO, RR | LRU only |
 | Implementation | Rust (PyO3) | Pure Python | C (CPython) |
 
-`warp_cache` is the fastest *thread-safe* cache — **13x** faster than `cachetools` and **2.8x** faster than `moka_py`. The shared memory backend reaches 89% of in-process speed with fully lock-free reads. See [full benchmarks](docs/performance.md) for details.
+`warp_cache` is the fastest *thread-safe* cache — **22x** faster than `cachetools` and **4.9x** faster than `moka_py`. Under multi-threaded load, it's **1.5x faster** than `lru_cache + Lock`. See [full benchmarks](docs/performance.md) for details.
+
+<picture>
+<source media="(prefers-color-scheme: dark)" srcset="benchmarks/results/comparison_mt_scaling_dark.svg">
+<img src="benchmarks/results/comparison_mt_scaling_light.svg" alt="Multi-thread scaling: GIL vs no-GIL">
+</picture>
+
+## Eviction quality: SIEVE vs LRU
+
+Beyond throughput, SIEVE delivers **up to 21.6% miss reduction** vs LRU. From the [NSDI'24 paper](https://junchengyang.com/publication/nsdi24-SIEVE.pdf), key findings reproduced in `benchmarks/bench_sieve.py` (1M requests, Zipf-distributed keys):
+
+| Workload | SIEVE | LRU | Miss Reduction |
+|---|---:|---:|---:|
+| Zipf, 10% cache | 74.5% | 67.5% | +21.6% |
+| Scan resistance (70% hot) | 69.9% | 63.5% | +17.6% |
+| One-hit wonders (25% unique) | 53.9% | 43.7% | +18.1% |
+| Working set shift | 75.5% | 69.7% | +16.6% |
+
+SIEVE's visited-bit design protects hot entries from sequential scans and filters out one-hit wonders that would pollute LRU. See [eviction quality benchmarks](docs/performance.md#sieve-eviction-quality) for the full breakdown.
 
 ## Documentation
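
The eviction-quality table above is produced by `benchmarks/bench_sieve.py`, which this commit adds but whose source is not shown on this page. A simplified sketch of that style of measurement (Zipf-distributed keys, hit ratio per policy); the workload parameters and the tiny LRU baseline are illustrative, not the benchmark's actual code:

```python
import random
from collections import OrderedDict


class LRUCache:
    """Tiny LRU baseline with the same get/put interface as the SIEVE sketch earlier on this page."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = OrderedDict()

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)       # LRU reorders on every hit
            return self.data[key]
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)    # evict the least recently used entry


def zipf_workload(n_requests=1_000_000, n_keys=10_000, alpha=1.0, seed=42):
    rng = random.Random(seed)
    weights = [1.0 / rank ** alpha for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_requests)


def hit_ratio(cache, requests):
    hits = 0
    for key in requests:
        if cache.get(key) is None:
            cache.put(key, key)              # miss: "compute" the value and insert it
        else:
            hits += 1
    return hits / len(requests)


requests = zipf_workload()
print("LRU hit ratio:", hit_ratio(LRUCache(1_000), requests))  # cache sized at 10% of the key space
# Swap in the SieveCache sketch from the CLAUDE.md section above to compare policies.
```
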
