Cache Simulation
The Cache Simulation framework in GraphBrew provides detailed cache performance analysis for graph algorithms, letting you understand memory access patterns and cache efficiency at the L1, L2, and L3 levels.
Graph algorithms are memory-intensive and cache performance significantly impacts execution time. The cache simulation framework instruments graph algorithms to track every memory access, simulating how data flows through a three-level cache hierarchy.
Key Features:
- Multi-level cache hierarchy (L1/L2/L3)
- Configurable cache parameters (size, associativity, line size)
- Nine eviction policies (LRU, FIFO, RANDOM, LFU, PLRU, SRRIP, GRASP, P-OPT, ECG)
- Detailed per-level statistics
- JSON export for analysis integration
- Feature vector output for perceptron training
```bash
# Build all cache simulation binaries
make all-sim

# Run PageRank with cache simulation
./bench/bin_sim/pr -g 15 -n 1

# Run with custom cache configuration
CACHE_L1_SIZE=65536 CACHE_POLICY=LFU ./bench/bin_sim/bfs -g 15 -n 1
```

```
bench/
├── include/cache_sim/
│   ├── cache_sim.h     # Cache simulator core
│   └── graph_sim.h     # Graph wrappers for tracking
├── src_sim/            # Instrumented algorithms
│   ├── pr.cc           # PageRank (Gauss-Seidel, in-place updates)
│   ├── pr_spmv.cc      # PageRank SpMV (Jacobi, double-buffered)
│   ├── bfs.cc          # Breadth-First Search
│   ├── bc.cc           # Betweenness Centrality
│   ├── cc.cc           # Connected Components (Afforest)
│   ├── cc_sv.cc        # Connected Components (Shiloach-Vishkin)
│   ├── sssp.cc         # Single-Source Shortest Paths
│   └── tc.cc           # Triangle Counting
└── bin_sim/            # Compiled binaries
```
| Variable | Default | Description |
|---|---|---|
| `CACHE_L1_SIZE` | 32768 | L1 cache size (bytes) |
| `CACHE_L1_WAYS` | 8 | L1 associativity |
| `CACHE_L2_SIZE` | 262144 | L2 cache size (bytes) |
| `CACHE_L2_WAYS` | 4 | L2 associativity |
| `CACHE_L3_SIZE` | 8388608 | L3 cache size (bytes) |
| `CACHE_L3_WAYS` | 16 | L3 associativity |
| `CACHE_LINE_SIZE` | 64 | Line size (bytes) |
| `CACHE_POLICY` | LRU | Eviction policy |
| `CACHE_OUTPUT_JSON` | - | JSON output file path |
| `ECG_MODE` | DBG_PRIMARY | ECG eviction mode: DBG_PRIMARY, POPT_PRIMARY, DBG_ONLY |
| `ECG_MASK_WIDTH` | 8 | ECG mask bits per edge (2, 4, 8, 16, 32) |
| `ECG_NUM_BUCKETS` | 11 | Number of degree buckets (2-16) |
| `ECG_RRPV_BITS` | 3 | RRPV bit width (max RRPV = 2^bits - 1) |
| `ECG_PER_VERTEX` | 0 | Per-vertex masks (1) vs. per-edge masks (0) |
| `ECG_DEGREE_MODE` | OUT | Degree mode: OUT, IN, BOTH |
The default configuration models a typical modern CPU:
| Level | Size | Associativity | Lines | Sets |
|---|---|---|---|---|
| L1 | 32 KB | 8-way | 512 | 64 |
| L2 | 256 KB | 4-way | 4096 | 1024 |
| L3 | 8 MB | 16-way | 131072 | 8192 |

Note: The `FastCacheHierarchy` (per-thread simulation) uses 8-way L2 associativity by default, reflecting modern per-core L2 caches.
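The Lines and Sets columns follow directly from size, associativity, and the 64-byte line size. A quick sketch of the arithmetic (plain Python, not part of the framework):

```python
def cache_geometry(size_bytes, ways, line_size=64):
    """Derive line and set counts from a cache's size and associativity."""
    lines = size_bytes // line_size   # total cache lines
    sets = lines // ways              # lines grouped into associative sets
    return lines, sets

print(cache_geometry(32768, 8))      # L1: (512, 64)
print(cache_geometry(262144, 4))     # L2: (4096, 1024)
print(cache_geometry(8388608, 16))   # L3: (131072, 8192)
```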
| Policy | Description | Use Case |
|---|---|---|
| LRU | Evicts least recently used line | General purpose, best for temporal locality |
| FIFO | Evicts oldest line | Simple, deterministic behavior |
| RANDOM | Random eviction | Baseline comparison |
| LFU | Evicts least frequently used | Good for hot data |
| PLRU | Tree-based pseudo-LRU | Hardware-efficient LRU approximation |
| SRRIP | Re-reference interval prediction | Scan-resistant, good for streaming |
| GRASP | Graph-aware degree-based RRIP (Faldu et al., HPCA'20) | Power-law graphs with DBG reordering |
| P-OPT | Graph-transpose Belady approximation (Balaji et al., HPCA'21) | Best miss reduction; requires rereference matrix |
| ECG | Layered SRRIP + DBG + dynamic P-OPT (Mughrabi et al., GrAPL) | Combines structural + oracle signals; 3 modes |
ECG uses a 3-level layered eviction strategy. All modes start with SRRIP aging (Level 1), then apply mode-dependent tiebreakers:
| Mode | Level 2 | Level 3 | Env Variable |
|---|---|---|---|
| DBG_PRIMARY (default) | DBG tier (coldest vertex) | Dynamic P-OPT `findNextRef()` | `ECG_MODE=DBG_PRIMARY` |
| POPT_PRIMARY | Dynamic P-OPT (furthest future) | DBG tier | `ECG_MODE=POPT_PRIMARY` |
| DBG_ONLY | DBG tier only | None (fast path) | `ECG_MODE=DBG_ONLY` |

- Insert: RRPV comes from `bucketToRRPV(dbg_tier)`; structural priority only, P-OPT is not baked in at insert
- Hit: top bucket → RRPV=0 (aggressive keep), others → decrement
- Evict: P-OPT is consulted dynamically via `findNextRef()` at eviction time (avoids stale snapshots)
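To make the layering concrete, here is a simplified Python sketch of the DBG_PRIMARY victim search: SRRIP aging at level 1, with the coldest DBG tier as tiebreaker at level 2. Names and data layout are illustrative only; the real simulator additionally consults `findNextRef()` at level 3.

```python
RRPV_BITS = 3
MAX_RRPV = (1 << RRPV_BITS) - 1  # default ECG_RRPV_BITS=3 gives max RRPV of 7

def srrip_victim(rrpvs, dbg_tiers):
    """Pick an eviction victim from one cache set.

    rrpvs: per-way re-reference prediction values (mutated by aging).
    dbg_tiers: per-way DBG degree-bucket tier (higher = hotter vertex).
    """
    while True:
        candidates = [i for i, r in enumerate(rrpvs) if r == MAX_RRPV]
        if candidates:
            # Level-2 tiebreak: evict the line in the coldest DBG bucket
            return min(candidates, key=lambda i: dbg_tiers[i])
        rrpvs[:] = [r + 1 for r in rrpvs]  # Level-1 SRRIP aging
```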
```
================================================================================
CACHE SIMULATION RESULTS
================================================================================
Configuration:
  L1 Cache: 32768 bytes, 8-way, 64-byte lines (64 sets)
  L2 Cache: 262144 bytes, 4-way, 64-byte lines (1024 sets)
  L3 Cache: 8388608 bytes, 16-way, 64-byte lines (8192 sets)
  Eviction Policy: LRU

Statistics by Level:
--------------------------------------------------------------------------------
Level      Accesses        Hits      Misses   Hit Rate  Miss Rate
--------------------------------------------------------------------------------
L1              ...         ...         ...     XX.XX%     XX.XX%
L2              ...         ...         ...     XX.XX%     XX.XX%
L3              ...         ...         ...     XX.XX%     XX.XX%
--------------------------------------------------------------------------------

Overall Statistics:
  Total Memory Accesses:        ...
  Total Cache Hits:             ... (XX.XX% of memory)
  Memory Accesses (L3 Misses):  ... (XX.XX%)
================================================================================
```
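One way to condense per-level hit rates like these into a single figure of merit is an average memory access time (AMAT) estimate. The latencies below are illustrative assumptions, not values reported by the simulator:

```python
def amat(l1_hit, l2_hit, l3_hit,
         l1_lat=4, l2_lat=12, l3_lat=40, dram_lat=200):
    """Estimate average memory access time in cycles.

    Hit rates are fractions (0-1) of accesses *reaching that level* that hit;
    latency defaults are rough figures for a modern desktop CPU.
    """
    return (l1_lat
            + (1 - l1_hit) * (l2_lat
            + (1 - l2_hit) * (l3_lat
            + (1 - l3_hit) * dram_lat)))

# Example: 90% L1, 50% L2, 80% L3 hit rates
print(amat(0.9, 0.5, 0.8))  # 9.2 cycles
```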
```bash
CACHE_OUTPUT_JSON=results.json ./bench/bin_sim/pr -g 15 -n 1
```

Output format:

```json
{
  "total_accesses": ...,
  "memory_accesses": ...,
  "L1": {
    "size_bytes": 32768,
    "ways": 8,
    "sets": 64,
    "line_size": 64,
    "policy": "LRU",
    "hits": ...,
    "misses": ...,
    "hit_rate": ...,
    "evictions": ...,
    "writebacks": 0
  },
  "L2": {
    "size_bytes": 262144,
    "ways": 4,
    "sets": 1024,
    "line_size": 64,
    "policy": "LRU",
    "hits": ...,
    "misses": ...,
    "hit_rate": ...,
    "evictions": ...,
    "writebacks": 0
  },
  "L3": {
    "size_bytes": 8388608,
    "ways": 16,
    "sets": 8192,
    "line_size": 64,
    "policy": "LRU",
    "hits": ...,
    "misses": ...,
    "hit_rate": ...,
    "evictions": ...,
    "writebacks": 0
  }
}
```

Compare cache efficiency across different reorderings:
```bash
# Test ORIGINAL ordering
./bench/bin_sim/pr -f graph.mtx -o 0 -n 1

# Test RabbitOrder (uses csr variant by default)
./bench/bin_sim/pr -f graph.mtx -o 8 -n 1

# Test RabbitOrder with explicit boost variant
./bench/bin_sim/pr -f graph.mtx -o 8:boost -n 1

# Test GraphBrewOrder
./bench/bin_sim/pr -f graph.mtx -o 12 -n 1
```

Study how cache sizes affect performance:

```bash
for size in 16384 32768 65536 131072; do
  CACHE_L1_SIZE=$size ./bench/bin_sim/bfs -g 18 -n 1
done
```

The cache simulator provides features for the ML-based algorithm selector. The unified experiment script automatically collects these:
```bash
# Run cache simulation as part of the full pipeline
python3 scripts/graphbrew_experiment.py --full --size small

# Or just the cache phase
python3 scripts/graphbrew_experiment.py --phase cache
```

Cache features are stored in `results/cache_*.json` and integrated into perceptron weights:
```json
{
  "ALGORITHM_NAME": {
    "bias": ...,
    "w_modularity": ...,
    "cache_l1_impact": ...,
    "cache_l2_impact": ...,
    "cache_l3_impact": ...,
    "cache_dram_penalty": ...
  }
}
```

Run `--phase cache` on your graphs to generate actual cache impact values. See Perceptron-Weights for the full weight schema.
The perceptron uses these weights to factor cache performance into algorithm selection.
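As an illustration of how such weights could drive selection, here is a simplified linear-scoring sketch. The weight names mirror the schema above, but the feature names and the exact scoring form are assumptions, not the framework's actual model:

```python
def score(w, feats):
    """Linear score: bias plus weighted structural and cache features."""
    return (w["bias"]
            + w["w_modularity"] * feats["modularity"]
            + w["cache_l1_impact"] * feats["l1_hit_rate"]
            + w["cache_l2_impact"] * feats["l2_hit_rate"]
            + w["cache_l3_impact"] * feats["l3_hit_rate"]
            - w["cache_dram_penalty"] * feats["dram_miss_rate"])

def select_algorithm(weights_by_algo, feats):
    """Pick the reordering algorithm with the highest score."""
    return max(weights_by_algo, key=lambda a: score(weights_by_algo[a], feats))
```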
Run comprehensive cache benchmarks using the unified script:
```bash
# Run cache simulation on all graphs
python3 scripts/graphbrew_experiment.py --phase cache

# Skip expensive simulations (BC, SSSP) on large graphs
python3 scripts/graphbrew_experiment.py --phase cache --skip-expensive

# Full pipeline includes cache simulation automatically
python3 scripts/graphbrew_experiment.py --full --size medium
```

Results are saved to `results/cache_*.json`:
```json
{
  "graph": "email-Enron",
  "algorithm_id": 12,
  "algorithm_name": "GraphBrewOrder",
  "benchmark": "pr",
  "l1_hit_rate": 85.2,
  "l2_hit_rate": 92.1,
  "l3_hit_rate": 98.5,
  "l1_misses": 0,
  "l2_misses": 0,
  "l3_misses": 0,
  "success": true,
  "error": ""
}
```

Cache characteristics help predict optimal algorithms:
| Cache Pattern | Recommended Algorithm |
|---|---|
| High L1 miss, low L3 miss | Random walks tolerable, BFS-based works |
| High L3 miss rate | Needs locality-focused reordering |
| Uniform access pattern | Simple reordering sufficient |
| Skewed hub access | Hub-clustering beneficial |
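The mapping above can be scripted over the per-run result files. A sketch, using the field names from the `results/cache_*.json` format shown earlier; the thresholds are illustrative, not calibrated:

```python
import json

def classify(path):
    """Map one cache result file to a coarse access-pattern category."""
    with open(path) as f:
        r = json.load(f)
    if r["l3_hit_rate"] < 90.0:
        # Misses reach DRAM often: reordering for locality pays off most
        return "needs locality-focused reordering"
    if r["l1_hit_rate"] < 80.0:
        # L1-hostile but L3-friendly: BFS-based traversals still work
        return "high L1 miss, low L3 miss"
    return "simple reordering sufficient"
```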
All eight benchmarks (`pr`, `pr_spmv`, `bfs`, `cc`, `cc_sv`, `sssp`, `bc`, `tc`) have instrumented versions in `bench/src_sim/` (the two SpMV/SV variants share the base simulation logic). Key insights:
- PR: Pull-based has predictable outgoing edge access
- BFS: Direction-optimizing switch point affects cache behavior
- BC: Forward/backward passes have different access patterns
- TC: Sorted adjacency lists improve intersection cache efficiency
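The TC point can be illustrated with a merge-style intersection of two sorted adjacency lists (plain Python, not the instrumented C++): both lists are scanned strictly left to right, so each fetched cache line is fully consumed before being evicted.

```python
def intersect_sorted(a, b):
    """Count common neighbors in two sorted adjacency lists.

    Sequential two-pointer scan: no random jumps, so spatial locality
    is maximal and each cache line of neighbor IDs is used exactly once.
    """
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count
```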
- Performance: ~10-100x slower than uninstrumented code
- Threading: Accesses are serialized for accurate counting
- Simplifications: No cache coherence, prefetching, or speculation modeling
- Scope: Data cache only (no instruction cache or TLB)
Cache features integrate with the perceptron-based algorithm selector. See Perceptron-Weights for the full feature vector.
Cache-specific features (`cache_l1_impact`, `cache_l2_impact`, `cache_l3_impact`, `cache_dram_penalty`) are automatically collected during `--phase cache` and integrated into weight files.
```bash
# Run cache simulation as part of full pipeline
python3 scripts/graphbrew_experiment.py --full --size small

# Or just the cache phase
python3 scripts/graphbrew_experiment.py --phase cache
```

C++ access: `cache_sim::GlobalCache().getFeatures()` or the `CACHE_FEATURES()` macro.
- Perceptron-Weights - Using cache features for algorithm selection
- Correlation-Analysis - Correlating cache stats with performance
- Benchmark-Suite - Running performance experiments