Merged
20 changes: 10 additions & 10 deletions CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

## Project Overview

**warp_cache** — a thread-safe Python caching decorator backed by a Rust extension (PyO3 + maturin). Provides LRU/MRU/FIFO/LFU eviction, TTL support, async awareness, and a cross-process shared memory backend.
**warp_cache** — a thread-safe Python caching decorator backed by a Rust extension (PyO3 + maturin). Uses SIEVE eviction, with TTL support, async awareness, and a cross-process shared memory backend.

## Build & Test Commands

@@ -35,30 +35,30 @@ make test PYTHON=3.13 # Specific version
### Rust core (`src/`)

- **`lib.rs`** — PyO3 module entry, exports `CachedFunction`, `SharedCachedFunction`, info types
- **`store.rs`** — In-process backend: `CachedFunction` wraps `parking_lot::RwLock<CacheStoreInner>`. The `__call__` method does the entire cache lookup in Rust (hash → lookup → equality check → LRU reorder → return) in a single FFI crossing
- **`store.rs`** — In-process backend: `CachedFunction` uses a sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for cache hits, write lock for misses/eviction). The `__call__` method does the entire cache lookup in Rust (hash → shard select → read lock → lookup → equality check → SIEVE visited update → return) in a single FFI crossing
- **`serde.rs`** — Fast-path binary serialization for common primitives (None, bool, int, float, str, bytes, flat tuples); avoids pickle overhead for the shared backend
- **`shared_store.rs`** — Cross-process backend: `SharedCachedFunction` serializes via serde.rs (with pickle fallback), stores in mmap'd shared memory
- **`entry.rs`** — `CacheEntry` { value, created_at, frequency }
- **`shared_store.rs`** — Cross-process backend: `SharedCachedFunction` holds `ShmCache` directly (no Mutex), with cached `max_key_size`/`max_value_size` fields and a pre-built `ahash::RandomState`. Serializes via serde.rs (with pickle fallback), stores in mmap'd shared memory
- **`entry.rs`** — `CacheEntry` { value, created_at, visited }
- **`key.rs`** — `CacheKey` wraps `Py<PyAny>` + precomputed hash; uses raw `ffi::PyObject_RichCompareBool` for equality (safe because called inside `#[pymethods]` where GIL is held)
- **`strategies/`** — Enum-based static dispatch (`StrategyEnum`) over LRU/MRU/FIFO/LFU (avoids `Box<dyn>` overhead). LRU uses `hashlink::LruCache`
- **`shm/`** — Shared memory infrastructure:
- `mod.rs` — `ShmCache`: create/open, get/set with serialized bytes
- `mod.rs` — `ShmCache`: create/open, get/set with serialized bytes. Uses interior mutability (`&self` methods): reads are lock-free (seqlock), writes acquire seqlock internally. `next_unique_id` is `AtomicU64`
- `layout.rs` — Header + SlotHeader structs, memory offsets
- `region.rs` — `ShmRegion`: mmap file management (`$TMPDIR/warp_cache/{name}.cache`)
- `lock.rs` — `ShmSeqLock`: seqlock (optimistic reads + TTAS spinlock) in shared memory
- `hashtable.rs` — Open-addressing with linear probing (power-of-2 capacity, bitmask)
- `ordering.rs` — Eviction ordering state in shared memory
- `ordering.rs` — SIEVE eviction: intrusive linked list + `sieve_evict()` hand scan
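
The seqlock protocol named in `lock.rs` can be modeled in a few lines. Below is a minimal single-threaded sketch for orientation only; the real implementation lives in shared memory and uses atomic operations plus a TTAS spinlock to serialize writers, and the names `SeqLock`, `version`, and `data` here are illustrative:

```python
class SeqLock:
    """Toy model of the seqlock read/write protocol."""

    def __init__(self):
        self.version = 0   # even = stable snapshot, odd = write in progress
        self.data = None

    def write(self, value):
        self.version += 1          # now odd: concurrent readers will retry
        self.data = value
        self.version += 1          # even again: snapshot is consistent

    def read(self):
        while True:
            v1 = self.version
            if v1 % 2:             # a writer is mid-update, retry
                continue
            snapshot = self.data   # optimistic read, no lock taken
            if self.version == v1: # version unchanged: read was consistent
                return snapshot
```

Readers never block writers and take no lock at all, which is what makes the shared backend's hit path lock-free.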

### Python layer (`warp_cache/`)

- **`_decorator.py`** — `cache()` factory: dispatches to `CachedFunction` (memory) or `SharedCachedFunction` (shared). Auto-detects async functions and wraps with `AsyncCachedFunction` (cache hit in Rust, only misses `await` the coroutine). Also exports `lru_cache()` — a convenience shorthand for `cache(strategy=Strategy.LRU, ...)`
- **`_strategies.py`** — `Strategy(IntEnum)`: LRU=0, MRU=1, FIFO=2, LFU=3
- **`_decorator.py`** — `cache()` factory: dispatches to `CachedFunction` (memory) or `SharedCachedFunction` (shared). Auto-detects async functions and wraps with `AsyncCachedFunction` (cache hit in Rust, only misses `await` the coroutine)
- **`_strategies.py`** — `Backend(IntEnum)`: MEMORY=0, SHARED=1
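
The async wrapping described above (cache consulted first, coroutine awaited only on a miss) can be sketched in pure Python. This is a hypothetical model, not `AsyncCachedFunction` itself; in the real package the hit path runs inside Rust, and `async_cached` plus its dict store are illustrative names:

```python
import asyncio
import functools

def async_cached(func):
    store = {}

    @functools.wraps(func)
    async def wrapper(*args):
        key = args                 # same hashable-args requirement as lru_cache
        if key in store:
            return store[key]      # hit: the user coroutine is never awaited
        value = await func(*args)  # miss: run the coroutine once and store
        store[key] = value
        return value

    return wrapper

calls = 0

@async_cached
async def fetch(x):
    global calls
    calls += 1
    return x * 2

async def main():
    a = await fetch(3)
    b = await fetch(3)  # served from cache; fetch's body does not re-run
    return a, b, calls

print(asyncio.run(main()))  # prints (6, 6, 1)
```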

### Key design decisions

- **Single FFI crossing**: entire cache lookup happens in Rust `__call__`, no Python wrapper overhead
- **Release profile**: fat LTO + `codegen-units=1` for cross-crate inlining of PyO3 wrappers
- **Thread safety**: `parking_lot::RwLock` (~8ns uncontended) for in-process backend; seqlock (optimistic reads + TTAS spinlock) for shared backend. Enables true parallel reads under free-threaded Python (3.13t+)
- **SIEVE eviction**: unified across both backends. On hit, sets `visited=1` (single-word store). On evict, hand scans for unvisited entry. Lock-free reads on both backends
- **Thread safety**: sharded `hashbrown::HashMap` + `parking_lot::RwLock` per shard (read lock for hits, write lock for misses) for in-process backend; seqlock (optimistic reads + TTAS spinlock) for shared backend — no Mutex, `ShmCache` uses `&self` methods with interior mutability. Cache hits only acquire a cheap per-shard read lock (memory) or are fully lock-free (shared). Enables true parallel reads across shards under free-threaded Python (3.13t+)
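
The SIEVE behavior named above (hit sets a visited bit, eviction hand-scans for an unvisited entry) can be sketched as a pure-Python reference. This is illustrative only; the real implementation is in Rust with an atomic visited bit and no Python-level locking:

```python
class _Node:
    __slots__ = ("key", "value", "visited", "prev", "next")

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.visited = False
        self.prev = self.next = None

class SieveCache:
    """Toy SIEVE cache: FIFO insertion order + visited bits + eviction hand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.table = {}    # key -> node
        self.head = None   # newest entry
        self.tail = None   # oldest entry
        self.hand = None   # eviction hand, scans from tail toward head

    def get(self, key):
        node = self.table.get(key)
        if node is None:
            return None
        node.visited = True        # the only bookkeeping a hit performs
        return node.value

    def put(self, key, value):
        if key in self.table:
            node = self.table[key]
            node.value = value
            node.visited = True
            return
        if len(self.table) >= self.capacity:
            self._evict()
        node = _Node(key, value)
        node.next = self.head      # insert at head; no reordering ever after
        if self.head:
            self.head.prev = node
        self.head = node
        if self.tail is None:
            self.tail = node
        self.table[key] = node

    def _evict(self):
        node = self.hand or self.tail
        while node.visited:               # visited entries get a second chance
            node.visited = False
            node = node.prev or self.tail # move toward head, wrap at the end
        self.hand = node.prev             # resume the scan here next time
        if node.prev:                     # unlink the victim
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self.tail = node.prev
        del self.table[node.key]
```

Because a hit only flips a flag and never moves the node, hits need no write lock, which is what the per-shard read-lock design above relies on.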

## Critical Invariants

37 changes: 24 additions & 13 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -10,7 +10,7 @@ crate-type = ["cdylib"]
[dependencies]
pyo3 = { version = "0.28.1", features = ["extension-module"] }
parking_lot = "0.12"
hashlink = "0.9"
hashbrown = "0.15"

[target.'cfg(not(target_os = "windows"))'.dependencies]
memmap2 = "0.9"
5 changes: 4 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
.PHONY: help fmt lint typecheck build build-debug test test-rust test-only bench bench-quick bench-all bench-report clean publish publish-test setup all
.PHONY: help fmt lint typecheck build build-debug test test-rust test-only bench bench-quick bench-all bench-report bench-sieve clean publish publish-test setup all

# Optional: specify Python version, e.g. make build PYTHON=3.14
PYTHON ?=
@@ -62,6 +62,9 @@ bench-quick: build ## Quick benchmarks (skip sustained/TTL)
bench-all: ## Run benchmarks across Python versions + generate report
bash benchmarks/bench_all.sh

bench-sieve: build ## Run SIEVE eviction quality benchmarks
uv run $(UV_PYTHON) python benchmarks/bench_sieve.py

bench-report: ## Generate report from existing results
uv run python benchmarks/_report_generator.py

54 changes: 36 additions & 18 deletions README.md
@@ -1,22 +1,21 @@
# warp_cache

A thread-safe Python caching decorator backed by a Rust extension. Through a
series of optimizations — eliminating serialization, moving the call wrapper
into Rust, applying link-time optimization, and using direct C API calls — we
achieve **0.55-0.66x** of `lru_cache`'s single-threaded throughput while
providing native thread safety that delivers **1.3-1.4x** higher throughput
under concurrent load — and **18-24x** faster than pure-Python `cachetools`.
A thread-safe Python caching decorator backed by a Rust extension. Uses
**SIEVE eviction** for scan-resistant, near-optimal hit rates, guarded by
cheap per-shard read locks. The entire cache lookup happens in a single Rust
`__call__` — no Python wrapper overhead. **13-20M ops/s** single-threaded,
**22x** faster than `cachetools`, with a cross-process shared memory backend
reaching **9.2M ops/s**.
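
The per-shard locking mentioned above follows a standard pattern: hash the key, pick a shard with a bitmask, and lock only that shard. A rough Python model (the shard count and class names here are assumptions; the core uses `hashbrown` maps behind `parking_lot::RwLock`s):

```python
import threading

class ShardedCache:
    """Toy sharded cache: contention is confined to one shard per key."""

    SHARDS = 8  # power of two, so shard selection is a bitmask, not a modulo

    def __init__(self):
        self.maps = [{} for _ in range(self.SHARDS)]
        self.locks = [threading.Lock() for _ in range(self.SHARDS)]

    def _shard(self, key):
        return hash(key) & (self.SHARDS - 1)  # bitmask shard select

    def get(self, key):
        i = self._shard(key)
        with self.locks[i]:   # threads on different shards never contend
            return self.maps[i].get(key)

    def put(self, key, value):
        i = self._shard(key)
        with self.locks[i]:
            self.maps[i][key] = value
```

Keys spread across shards, so under concurrent load most lookups touch disjoint locks.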

## Features

- **Drop-in replacement for `functools.lru_cache`** — same decorator pattern and hashable-argument requirement, with added thread safety, TTL, eviction strategies, and async support
- **Thread-safe** out of the box (`parking_lot::RwLock` in Rust)
- **Drop-in replacement for `functools.lru_cache`** — same decorator pattern and hashable-argument requirement, with added thread safety, TTL, and async support
- **[SIEVE eviction](https://junchengyang.com/publication/nsdi24-SIEVE.pdf)** — a simple, scan-resistant algorithm with near-optimal hit rates and O(1) overhead per access
- **Thread-safe** out of the box (sharded `RwLock` + `AtomicBool` for SIEVE visited bit)
- **Async support**: works with `async def` functions — zero overhead on sync path
- **Shared memory backend**: cross-process caching via mmap
- **Multiple eviction strategies**: LRU, MRU, FIFO, LFU
- **Shared memory backend**: cross-process caching via mmap with fully lock-free reads
- **TTL support**: optional time-to-live expiration
- **Single FFI crossing**: entire cache lookup happens in Rust, no Python wrapper overhead
- **12-18M ops/s** single-threaded, **16M ops/s** under concurrent load, **18-24x** faster than `cachetools`
- **13-20M ops/s** single-threaded, **17M+ ops/s** under concurrent load, **22x** faster than `cachetools`
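
The TTL feature above amounts to a timestamp comparison at lookup time. A minimal model under assumed semantics (the real check compares an entry's `created_at` inside Rust; `TTLCache` and its `get(key, compute)` signature are illustrative):

```python
import time

class TTLCache:
    """Toy TTL cache: an entry older than `ttl` seconds counts as a miss."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.store = {}  # key -> (value, created_at)

    def get(self, key, compute):
        hit = self.store.get(key)
        now = time.monotonic()
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]              # fresh: serve cached value
        value = compute()              # stale or missing: recompute
        self.store[key] = (value, now)
        return value
```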

## Installation

@@ -59,20 +59,39 @@ Like `lru_cache`, all arguments must be hashable. See the [usage guide](docs/usa

| Metric | warp_cache | cachetools | lru_cache |
|---|---|---|---|
| Single-threaded | 12-18M ops/s | 0.6-1.2M ops/s | 21-40M ops/s |
| Multi-threaded (8T) | 16M ops/s | 770K ops/s (with Lock) | 12M ops/s (with Lock) |
| Thread-safe | Yes (RwLock) | No (manual Lock) | No |
| Single-threaded (cache=256) | 18.1M ops/s | 814K ops/s | 32.1M ops/s |
| Multi-threaded (8T) | 17.9M ops/s | 774K ops/s (with Lock) | 12.3M ops/s (with Lock) |
| Shared memory (single proc) | 9.2M ops/s (mmap) | No | No |
| Shared memory (4 procs) | 7.5M ops/s total | No | No |
| Thread-safe | Yes (sharded RwLock) | No (manual Lock) | No |
| Async support | Yes | No | No |
| Cross-process (shared) | ~7.8M ops/s (mmap) | No | No |
| TTL support | Yes | Yes | No |
| Eviction strategies | LRU, MRU, FIFO, LFU | LRU, LFU, FIFO, RR | LRU only |
| Eviction | SIEVE (scan-resistant) | LRU, LFU, FIFO, RR | LRU only |
| Implementation | Rust (PyO3) | Pure Python | C (CPython) |

Under concurrent load, `warp_cache` delivers **1.3-1.4x** higher throughput than `lru_cache + Lock` and **18-24x** higher than `cachetools`. See [full benchmarks](docs/performance.md) for details.
`warp_cache` is the fastest *thread-safe* cache — **22x** faster than `cachetools` and **4.9x** faster than `moka_py`. Under multi-threaded load, it's **1.5x faster** than `lru_cache + Lock`. See [full benchmarks](docs/performance.md) for details.

<picture>
<source media="(prefers-color-scheme: dark)" srcset="benchmarks/results/comparison_mt_scaling_dark.svg">
<img src="benchmarks/results/comparison_mt_scaling_light.svg" alt="Multi-thread scaling: GIL vs no-GIL">
</picture>

## Eviction quality: SIEVE vs LRU

Beyond throughput, SIEVE delivers **up to 21.6% miss reduction** vs LRU. Key findings from the [NSDI'24 paper](https://junchengyang.com/publication/nsdi24-SIEVE.pdf), reproduced in `benchmarks/bench_sieve.py` (1M requests, Zipf-distributed keys):

| Workload | SIEVE hit rate | LRU hit rate | Miss reduction |
|---|---:|---:|---:|
| Zipf, 10% cache | 74.5% | 67.5% | +21.6% |
| Scan resistance (70% hot) | 69.9% | 63.5% | +17.6% |
| One-hit wonders (25% unique) | 53.9% | 43.7% | +18.1% |
| Working set shift | 75.5% | 69.7% | +16.6% |

SIEVE's visited-bit design protects hot entries from sequential scans and filters out one-hit wonders that would pollute LRU. See [eviction quality benchmarks](docs/performance.md#sieve-eviction-quality) for the full breakdown.
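
For readers who want to reproduce this kind of workload, a Zipf-distributed key trace can be generated in a few lines. The parameters and function name below are assumptions for illustration, not necessarily what `benchmarks/bench_sieve.py` uses:

```python
import random

def zipf_keys(n_keys, n_requests, s=1.0, seed=42):
    """Sample `n_requests` keys where key rank r has weight 1/r**s."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_requests)

trace = zipf_keys(1000, 100_000)
# A handful of hot keys dominate the trace, which is why a small cache
# can still achieve high hit rates on Zipf-skewed workloads.
```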

## Documentation

- **[Usage guide](docs/usage.md)** — eviction strategies, async, TTL, shared memory, decorator parameters
- **[Usage guide](docs/usage.md)** — SIEVE eviction, async, TTL, shared memory, decorator parameters
- **[Performance](docs/performance.md)** — benchmarks, architecture deep-dive, optimization journey
- **[Alternatives](docs/alternatives.md)** — comparison with cachebox, moka-py, cachetools, lru_cache
- **[Examples](examples/)** — runnable scripts for every feature (`uv run examples/<name>.py`)