Cache Simulation
The Cache Simulation framework in GraphBrew provides detailed cache performance analysis for graph algorithms, letting you understand memory access patterns and cache efficiency at the L1, L2, and L3 levels.
Graph algorithms are memory-intensive and cache performance significantly impacts execution time. The cache simulation framework instruments graph algorithms to track every memory access, simulating how data flows through a three-level cache hierarchy.
Key Features:
- Multi-level cache hierarchy (L1/L2/L3)
- Configurable cache parameters (size, associativity, line size)
- Nine eviction policies (LRU, FIFO, RANDOM, LFU, PLRU, SRRIP, GRASP, P-OPT, ECG)
- Detailed per-level statistics
- JSON export for analysis integration
- Feature vector output for perceptron training
```bash
# Build all cache simulation binaries
make all-sim

# Run PageRank with cache simulation
./bench/bin_sim/pr -g 15 -n 1

# Run with custom cache configuration
CACHE_L1_SIZE=65536 CACHE_POLICY=LFU ./bench/bin_sim/bfs -g 15 -n 1
```

```
bench/
├── include/cache_sim/
│   ├── cache_sim.h     # Cache simulator core
│   └── graph_sim.h     # Graph wrappers for tracking
├── src_sim/            # Instrumented algorithms
│   ├── pr.cc           # PageRank (Gauss-Seidel, in-place updates)
│   ├── pr_spmv.cc      # PageRank SpMV (Jacobi, double-buffered)
│   ├── bfs.cc          # Breadth-First Search
│   ├── bc.cc           # Betweenness Centrality
│   ├── cc.cc           # Connected Components (Afforest)
│   ├── cc_sv.cc        # Connected Components (Shiloach-Vishkin)
│   ├── sssp.cc         # Single-Source Shortest Paths
│   └── tc.cc           # Triangle Counting
└── bin_sim/            # Compiled binaries
```
| Variable | Default | Description |
|---|---|---|
| `CACHE_L1_SIZE` | 32768 | L1 cache size (bytes) |
| `CACHE_L1_WAYS` | 8 | L1 associativity |
| `CACHE_L2_SIZE` | 262144 | L2 cache size (bytes) |
| `CACHE_L2_WAYS` | 4 | L2 associativity |
| `CACHE_L3_SIZE` | 8388608 | L3 cache size (bytes) |
| `CACHE_L3_WAYS` | 16 | L3 associativity |
| `CACHE_LINE_SIZE` | 64 | Line size (bytes) |
| `CACHE_POLICY` | LRU | Eviction policy |
| `CACHE_OUTPUT_JSON` | - | JSON output file path |
| `ECG_MODE` | DBG_PRIMARY | ECG eviction mode: DBG_PRIMARY, POPT_PRIMARY, DBG_ONLY |
| `ECG_MASK_WIDTH` | 8 | ECG mask bits per edge (2, 4, 8, 16, 32) |
| `ECG_NUM_BUCKETS` | 11 | Number of degree buckets (2-16) |
| `ECG_RRPV_BITS` | 3 | RRPV bit width (max RRPV = 2^bits - 1) |
| `ECG_PER_VERTEX` | 0 | Per-vertex masks (1) vs. per-edge masks (0) |
| `ECG_DEGREE_MODE` | OUT | Degree mode: OUT, IN, BOTH |
The default configuration models a typical modern CPU:
| Level | Size | Associativity | Lines | Sets |
|---|---|---|---|---|
| L1 | 32 KB | 8-way | 512 | 64 |
| L2 | 256 KB | 4-way | 4096 | 1024 |
| L3 | 8 MB | 16-way | 131072 | 8192 |

Note: The `FastCacheHierarchy` (per-thread simulation) uses 8-way L2 associativity by default, reflecting modern per-core L2 caches.
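The Lines and Sets columns follow directly from size, associativity, and the 64-byte line size. A quick sketch of the arithmetic (plain Python, not part of the framework):

```python
def cache_geometry(size_bytes, ways, line_size=64):
    """Derive line and set counts from a cache's size and associativity."""
    lines = size_bytes // line_size   # total cache lines
    sets = lines // ways              # lines grouped into associative sets
    return lines, sets

print(cache_geometry(32768, 8))      # L1: (512, 64)
print(cache_geometry(262144, 4))     # L2: (4096, 1024)
print(cache_geometry(8388608, 16))   # L3: (131072, 8192)
```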
| Policy | Description | Use Case |
|---|---|---|
| LRU | Evicts least recently used line | General purpose, best for temporal locality |
| FIFO | Evicts oldest line | Simple, deterministic behavior |
| RANDOM | Random eviction | Baseline comparison |
| LFU | Evicts least frequently used | Good for hot data |
| PLRU | Tree-based pseudo-LRU | Hardware-efficient LRU approximation |
| SRRIP | Re-reference interval prediction | Scan-resistant, good for streaming |
| GRASP | Graph-aware degree-based RRIP (Faldu et al., HPCA'20) | Power-law graphs with DBG reordering |
| P-OPT | Graph-transpose Belady approximation (Balaji et al., HPCA'21) | Best miss reduction; requires rereference matrix |
| ECG | Layered SRRIP + DBG + dynamic P-OPT (Mughrabi et al., GrAPL) | Combines structural + oracle signals; 3 modes |
ECG uses a 3-level layered eviction strategy. All modes start with SRRIP aging (Level 1), then apply mode-dependent tiebreakers:
| Mode | Level 2 | Level 3 | Env Variable |
|---|---|---|---|
| DBG_PRIMARY (default) | DBG tier (coldest vertex) | Dynamic P-OPT `findNextRef()` | `ECG_MODE=DBG_PRIMARY` |
| POPT_PRIMARY | Dynamic P-OPT (furthest future) | DBG tier | `ECG_MODE=POPT_PRIMARY` |
| DBG_ONLY | DBG tier only | None (fast path) | `ECG_MODE=DBG_ONLY` |

- Insert: RRPV comes from `bucketToRRPV(dbg_tier)`; structural priority only, P-OPT is not baked in at insert
- Hit: top bucket → RRPV=0 (aggressive keep), others → decrement
- Evict: P-OPT is consulted dynamically via `findNextRef()` at eviction time (avoids stale snapshots)
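To make the layering concrete, here is a simplified Python sketch of the DBG_PRIMARY victim search: SRRIP aging at level 1, with the coldest DBG tier as tiebreaker at level 2. Names and data layout are illustrative only; the real simulator additionally consults `findNextRef()` at level 3.

```python
RRPV_BITS = 3
MAX_RRPV = (1 << RRPV_BITS) - 1  # default ECG_RRPV_BITS=3 gives max RRPV of 7

def srrip_victim(rrpvs, dbg_tiers):
    """Pick an eviction victim from one cache set.

    rrpvs: per-way re-reference prediction values (mutated by aging).
    dbg_tiers: per-way DBG degree-bucket tier (higher = hotter vertex).
    """
    while True:
        candidates = [i for i, r in enumerate(rrpvs) if r == MAX_RRPV]
        if candidates:
            # Level-2 tiebreak: evict the line in the coldest DBG bucket
            return min(candidates, key=lambda i: dbg_tiers[i])
        rrpvs[:] = [r + 1 for r in rrpvs]  # Level-1 SRRIP aging
```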
```
================================================================================
CACHE SIMULATION RESULTS
================================================================================
Configuration:
  L1 Cache: 32768 bytes, 8-way, 64-byte lines (64 sets)
  L2 Cache: 262144 bytes, 4-way, 64-byte lines (1024 sets)
  L3 Cache: 8388608 bytes, 16-way, 64-byte lines (8192 sets)
  Eviction Policy: LRU

Statistics by Level:
--------------------------------------------------------------------------------
Level      Accesses        Hits      Misses   Hit Rate  Miss Rate
--------------------------------------------------------------------------------
L1              ...         ...         ...     XX.XX%     XX.XX%
L2              ...         ...         ...     XX.XX%     XX.XX%
L3              ...         ...         ...     XX.XX%     XX.XX%
--------------------------------------------------------------------------------

Overall Statistics:
  Total Memory Accesses:        ...
  Total Cache Hits:             ... (XX.XX% of memory)
  Memory Accesses (L3 Misses):  ... (XX.XX%)
================================================================================
```
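One way to condense per-level hit rates like these into a single figure of merit is an average memory access time (AMAT) estimate. The latencies below are illustrative assumptions, not values reported by the simulator:

```python
def amat(l1_hit, l2_hit, l3_hit,
         l1_lat=4, l2_lat=12, l3_lat=40, dram_lat=200):
    """Estimate average memory access time in cycles.

    Hit rates are fractions (0-1) of accesses *reaching that level* that hit;
    latency defaults are rough figures for a modern desktop CPU.
    """
    return (l1_lat
            + (1 - l1_hit) * (l2_lat
            + (1 - l2_hit) * (l3_lat
            + (1 - l3_hit) * dram_lat)))

# Example: 90% L1, 50% L2, 80% L3 hit rates
print(amat(0.9, 0.5, 0.8))  # 9.2 cycles
```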
```bash
CACHE_OUTPUT_JSON=results.json ./bench/bin_sim/pr -g 15 -n 1
```

Output format:

```json
{
  "total_accesses": ...,
  "memory_accesses": ...,
  "L1": {
    "size_bytes": 32768,
    "ways": 8,
    "sets": 64,
    "line_size": 64,
    "policy": "LRU",
    "hits": ...,
    "misses": ...,
    "hit_rate": ...,
    "evictions": ...,
    "writebacks": 0
  },
  "L2": {
    "size_bytes": 262144,
    "ways": 4,
    "sets": 1024,
    "line_size": 64,
    "policy": "LRU",
    "hits": ...,
    "misses": ...,
    "hit_rate": ...,
    "evictions": ...,
    "writebacks": 0
  },
  "L3": {
    "size_bytes": 8388608,
    "ways": 16,
    "sets": 8192,
    "line_size": 64,
    "policy": "LRU",
    "hits": ...,
    "misses": ...,
    "hit_rate": ...,
    "evictions": ...,
    "writebacks": 0
  }
}
```

Compare cache efficiency across different reorderings:
```bash
# Test ORIGINAL ordering
./bench/bin_sim/pr -f graph.mtx -o 0 -n 1

# Test RabbitOrder (uses csr variant by default)
./bench/bin_sim/pr -f graph.mtx -o 8 -n 1

# Test RabbitOrder with explicit boost variant
./bench/bin_sim/pr -f graph.mtx -o 8:boost -n 1

# Test GraphBrewOrder
./bench/bin_sim/pr -f graph.mtx -o 12 -n 1
```

Study how cache sizes affect performance:

```bash
for size in 16384 32768 65536 131072; do
  CACHE_L1_SIZE=$size ./bench/bin_sim/bfs -g 18 -n 1
done
```

The cache simulator provides features for the ML-based algorithm selector. The unified experiment script automatically collects these:
```bash
# Run cache simulation as part of the full pipeline
python3 scripts/graphbrew_experiment.py --full --size small

# Or just the cache phase
python3 scripts/graphbrew_experiment.py --phase cache
```

Cache features are stored in `results/cache_*.json` and integrated into perceptron weights:
```json
{
  "ALGORITHM_NAME": {
    "bias": ...,
    "w_modularity": ...,
    "cache_l1_impact": ...,
    "cache_l2_impact": ...,
    "cache_l3_impact": ...,
    "cache_dram_penalty": ...
  }
}
```

Run `--phase cache` on your graphs to generate actual cache impact values. See Perceptron-Weights for the full weight schema.
The perceptron uses these weights to factor cache performance into algorithm selection.
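As an illustration of how such weights could drive selection, here is a simplified linear-scoring sketch. The weight names mirror the schema above, but the feature names and the exact scoring form are assumptions, not the framework's actual model:

```python
def score(w, feats):
    """Linear score: bias plus weighted structural and cache features."""
    return (w["bias"]
            + w["w_modularity"] * feats["modularity"]
            + w["cache_l1_impact"] * feats["l1_hit_rate"]
            + w["cache_l2_impact"] * feats["l2_hit_rate"]
            + w["cache_l3_impact"] * feats["l3_hit_rate"]
            - w["cache_dram_penalty"] * feats["dram_miss_rate"])

def select_algorithm(weights_by_algo, feats):
    """Pick the reordering algorithm with the highest score."""
    return max(weights_by_algo, key=lambda a: score(weights_by_algo[a], feats))
```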
Run comprehensive cache benchmarks using the unified script:
```bash
# Run cache simulation on all graphs
python3 scripts/graphbrew_experiment.py --phase cache

# Skip expensive simulations (BC, SSSP) on large graphs
python3 scripts/graphbrew_experiment.py --phase cache --skip-expensive

# Full pipeline includes cache simulation automatically
python3 scripts/graphbrew_experiment.py --full --size medium
```

Results are saved to `results/cache_*.json`:
```json
{
  "graph": "email-Enron",
  "algorithm_id": 12,
  "algorithm_name": "GraphBrewOrder",
  "benchmark": "pr",
  "l1_hit_rate": 85.2,
  "l2_hit_rate": 92.1,
  "l3_hit_rate": 98.5,
  "l1_misses": 0,
  "l2_misses": 0,
  "l3_misses": 0,
  "success": true,
  "error": ""
}
```

Cache characteristics help predict optimal algorithms:
| Cache Pattern | Recommended Algorithm |
|---|---|
| High L1 miss, low L3 miss | Random walks tolerable, BFS-based works |
| High L3 miss rate | Needs locality-focused reordering |
| Uniform access pattern | Simple reordering sufficient |
| Skewed hub access | Hub-clustering beneficial |
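The mapping above can be scripted over the per-run result files. A sketch, using the field names from the `results/cache_*.json` format shown earlier; the thresholds are illustrative, not calibrated:

```python
import json

def classify(path):
    """Map one cache result file to a coarse access-pattern category."""
    with open(path) as f:
        r = json.load(f)
    if r["l3_hit_rate"] < 90.0:
        # Misses reach DRAM often: reordering for locality pays off most
        return "needs locality-focused reordering"
    if r["l1_hit_rate"] < 80.0:
        # L1-hostile but L3-friendly: BFS-based traversals still work
        return "high L1 miss, low L3 miss"
    return "simple reordering sufficient"
```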
All eight benchmarks (`pr`, `pr_spmv`, `bfs`, `cc`, `cc_sv`, `sssp`, `bc`, `tc`) have instrumented versions in `bench/src_sim/` (the two SpMV/SV variants share the base simulation logic). Key insights:
- PR: Pull-based has predictable outgoing edge access
- BFS: Direction-optimizing switch point affects cache behavior
- BC: Forward/backward passes have different access patterns
- TC: Sorted adjacency lists improve intersection cache efficiency
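The TC point can be illustrated with a merge-style intersection of two sorted adjacency lists (plain Python, not the instrumented C++): both lists are scanned strictly left to right, so each fetched cache line is fully consumed before being evicted.

```python
def intersect_sorted(a, b):
    """Count common neighbors in two sorted adjacency lists.

    Sequential two-pointer scan: no random jumps, so spatial locality
    is maximal and each cache line of neighbor IDs is used exactly once.
    """
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count
```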
- Performance: ~10-100x slower than uninstrumented code
- Threading: Accesses are serialized for accurate counting
- Simplifications: No cache coherence, prefetching, or speculation modeling
- Scope: Data cache only (no instruction cache or TLB)
Cache features integrate with the perceptron-based algorithm selector. See Perceptron-Weights for the full feature vector.
Cache-specific features (`cache_l1_impact`, `cache_l2_impact`, `cache_l3_impact`, `cache_dram_penalty`) are automatically collected during `--phase cache` and integrated into weight files.
```bash
# Run cache simulation as part of full pipeline
python3 scripts/graphbrew_experiment.py --full --size small

# Or just the cache phase
python3 scripts/graphbrew_experiment.py --phase cache
```

C++ access: `cache_sim::GlobalCache().getFeatures()` or the `CACHE_FEATURES()` macro.
- Perceptron-Weights - Using cache features for algorithm selection
- Correlation-Analysis - Correlating cache stats with performance
- Benchmark-Suite - Running performance experiments