# warp_cache

A thread-safe Python caching decorator backed by a Rust extension. Uses
**SIEVE eviction** for scan-resistant, near-optimal hit rates with per-shard
read locks. The entire cache lookup happens in a single Rust `__call__` — no Python
wrapper overhead. **13-20M ops/s** single-threaded, **22x** faster than
`cachetools`, with a cross-process shared memory backend reaching **9.2M ops/s**.

## Features

- **Drop-in replacement for `functools.lru_cache`** — same decorator pattern and hashable-argument requirement, with added thread safety, TTL, and async support
- **[SIEVE eviction](https://junchengyang.com/publication/nsdi24-SIEVE.pdf)** — a simple, scan-resistant algorithm with near-optimal hit rates and O(1) overhead per access
- **Thread-safe** out of the box (sharded `RwLock` + `AtomicBool` for the SIEVE visited bit)
- **Async support**: works with `async def` functions — zero overhead on the sync path
- **Shared memory backend**: cross-process caching via mmap with fully lock-free reads
- **TTL support**: optional time-to-live expiration
- **Single FFI crossing**: the entire cache lookup happens in Rust, with no Python wrapper overhead
- **13-20M ops/s** single-threaded, **17M+ ops/s** under concurrent load, **22x** faster than `cachetools`
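
To make the decorator contract concrete, here is a pure-Python stand-in (illustrative only: the `cache` name is hypothetical, the real decorator is implemented in Rust, and the toy FIFO eviction below is *not* SIEVE):

```python
import threading
from functools import wraps

def cache(maxsize=256):
    """Pure-Python stand-in for the Rust-backed decorator (illustrative only).
    Same contract as lru_cache: positional and keyword args must be hashable."""
    def decorator(fn):
        store = {}
        lock = threading.Lock()

        @wraps(fn)
        def wrapper(*args, **kwargs):
            key = (args, tuple(sorted(kwargs.items())))  # hashable cache key
            with lock:
                if key in store:
                    return store[key]
            result = fn(*args, **kwargs)  # compute outside the lock
            with lock:
                if key not in store:
                    if len(store) >= maxsize:
                        store.pop(next(iter(store)))  # toy FIFO, not SIEVE
                    store[key] = result
            return result
        return wrapper
    return decorator

@cache(maxsize=128)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

The real decorator additionally covers TTL expiry, `async def` functions, and sharded locking, and does all of the above in a single FFI crossing.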

## Installation

…

Like `lru_cache`, all arguments must be hashable. See the [usage guide](docs/usa

| Metric | warp_cache | cachetools | lru_cache |
|---|---|---|---|
| Single-threaded (cache=256) | 18.1M ops/s | 814K ops/s | 32.1M ops/s |
| Multi-threaded (8T) | 17.9M ops/s | 774K ops/s (with Lock) | 12.3M ops/s (with Lock) |
| Shared memory (single proc) | 9.2M ops/s (mmap) | No | No |
| Shared memory (4 procs) | 7.5M ops/s total | No | No |
| Thread-safe | Yes (sharded RwLock) | No (manual Lock) | No |
| Async support | Yes | No | No |
| TTL support | Yes | Yes | No |
| Eviction | SIEVE (scan-resistant) | LRU, LFU, FIFO, RR | LRU only |
| Implementation | Rust (PyO3) | Pure Python | C (CPython) |

`warp_cache` is the fastest *thread-safe* cache in this comparison — **22x** faster than `cachetools` and **4.9x** faster than `moka_py`. Under multi-threaded load, it is **1.5x** faster than `lru_cache` behind a `Lock`. See [full benchmarks](docs/performance.md) for details.
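
To make the shared memory backend concrete, here is a toy fixed-slot sketch built on Python's `multiprocessing.shared_memory` (the slot layout, class, and method names are invented for illustration; warp_cache's actual mmap format is not shown here):

```python
import struct
from multiprocessing import shared_memory

SLOTS, SLOT = 64, 16  # per slot: 8-byte key + 8-byte value (toy layout)

class ShmCache:
    """Toy direct-mapped shared-memory cache. Readers attach by name and
    read without taking any lock; colliding keys overwrite each other.
    Keys must be nonzero signed 64-bit ints (zero marks an empty slot)."""

    def __init__(self, name=None):
        if name is None:
            self.shm = shared_memory.SharedMemory(create=True, size=SLOTS * SLOT)
        else:
            self.shm = shared_memory.SharedMemory(name=name)  # attach existing

    def put(self, key: int, value: int) -> None:
        off = (key % SLOTS) * SLOT
        self.shm.buf[off:off + SLOT] = struct.pack("<qq", key, value)

    def get(self, key: int):
        off = (key % SLOTS) * SLOT
        k, v = struct.unpack("<qq", self.shm.buf[off:off + SLOT])
        return v if k == key else None  # slot empty or holds another key

    def close(self, unlink=False) -> None:
        self.shm.close()
        if unlink:
            self.shm.unlink()
```

A second process would attach with `ShmCache(name=...)` and call `get` on the shared buffer directly — the lock-free-read idea in miniature.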

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="benchmarks/results/comparison_mt_scaling_dark.svg">
  <img src="benchmarks/results/comparison_mt_scaling_light.svg" alt="Multi-thread scaling: GIL vs no-GIL">
</picture>

## Eviction quality: SIEVE vs LRU

Beyond throughput, SIEVE delivers **up to a 21.6% miss-ratio reduction** over LRU. Key findings from the [NSDI'24 paper](https://junchengyang.com/publication/nsdi24-SIEVE.pdf), reproduced in `benchmarks/bench_sieve.py` (1M requests, Zipf-distributed keys):

| Workload | SIEVE hit rate | LRU hit rate | Miss reduction |
|---|---:|---:|---:|
| Zipf, 10% cache | 74.5% | 67.5% | +21.6% |
| Scan resistance (70% hot) | 69.9% | 63.5% | +17.6% |
| One-hit wonders (25% unique) | 53.9% | 43.7% | +18.1% |
| Working set shift | 75.5% | 69.7% | +19.1% |

SIEVE's visited-bit design protects hot entries from sequential scans and filters out one-hit wonders that would pollute LRU. See [eviction quality benchmarks](docs/performance.md#sieve-eviction-quality) for the full breakdown.
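
The mechanism fits in a few lines of pure Python (an illustrative sketch of the algorithm from the paper, not the shipped Rust implementation): entries sit in FIFO order with one visited bit each, a hit only sets that bit, and a roving hand evicts the first unvisited entry it finds, clearing bits as it passes.

```python
class SieveCache:
    """Minimal SIEVE sketch: FIFO order, one visited bit per entry, a hand."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self.data = {}     # key -> value
        self.visited = {}  # key -> visited bit
        self.order = []    # insertion order; index 0 = oldest
        self.hand = 0      # scans from oldest toward newest, then wraps

    def get(self, key):
        if key in self.data:
            self.visited[key] = True  # a hit only flips a bit: no list moves
            return self.data[key]
        return None

    def put(self, key, value):
        if key in self.data:
            self.data[key] = value
            self.visited[key] = True
            return
        if len(self.data) >= self.maxsize:
            self._evict()
        self.order.append(key)
        self.visited[key] = False  # new entries start unvisited
        self.data[key] = value

    def _evict(self):
        while True:
            if self.hand >= len(self.order):
                self.hand = 0  # wrap back to the oldest entry
            key = self.order[self.hand]
            if self.visited[key]:
                self.visited[key] = False  # second chance: clear and move on
                self.hand += 1
            else:
                del self.order[self.hand]  # evict; hand lands on the next entry
                del self.data[key]
                del self.visited[key]
                return
```

Because a hit is just a bit flip, a long sequential scan leaves hot (visited) entries untouched while the hand sweeps the scanned, never-revisited keys straight out.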

## Documentation
