CTDA-CPA (Cross-Thread Dependency-Aware Critical Path Analysis) is an advanced performance analysis tool designed for large-scale model training. It accurately identifies performance hotspots by constructing complete critical paths across CPU multi-threading and GPU multi-stream execution.
- Cross-Thread Dependency Modeling: Captures inter-thread dependencies caused by Python GIL, ensuring complete critical path construction
- Accurate Hotspot Identification: Avoids "false hotspots" caused by computation overlap in traditional time-accumulation methods
- Non-Intrusive Analysis: Works with standard PyTorch Profiler data without requiring code modifications
- Framework Support: Compatible with DeepSpeed, Megatron-LM, FSDP, and native PyTorch training
- Visualization Tools: Includes computation dependency graph visualization for better understanding
CTDA-CPA has successfully identified bottlenecks and guided optimizations in:
- ResNet-18 Training: 47.06% performance improvement by optimizing data loading bottleneck
- Llama 2 Distributed Fine-tuning: 7.21% performance improvement by optimizing communication bottleneck
```bash
# Clone the repository
git clone https://github.com/yourusername/CTDA-CPA.git
cd CTDA-CPA

# Install dependencies
pip install -r requirements.txt
```

Add PyTorch Profiler to your training code:
```python
import torch
import torch.profiler as profiler

criterion = torch.nn.CrossEntropyLoss()  # example loss function

def train_one_epoch(model, dataloader, optimizer, device):
    model.train()
    for batch_idx, (data, target) in enumerate(dataloader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx >= 10:  # Profile first 10 batches
            break
```
```python
# Wrap training with the profiler
with profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
    profile_memory=False,
) as prof:
    train_one_epoch(model, train_loader, optimizer, device)

# Export trace file
prof.export_chrome_trace("trace.json")
```

```bash
# Basic analysis
python ctda_cpa/hotspot_analysis.py --input trace.json --output results/
```
```bash
# With custom options
python ctda_cpa/hotspot_analysis.py \
    --input trace.json \
    --top_functions 10
```

Key arguments:

- `--input`: Path to trace file (JSON or Parquet)
- `--top_functions`: Number of top hotspots to report (default: 10)
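Before running the analyzer, a trace can be sanity-checked directly: the `traceEvents`, `ph`, `cat`, and `dur` fields are part of the Chrome trace format that `export_chrome_trace` emits. The aggregation helper below is an illustrative sketch, not part of CTDA-CPA:

```python
import json
from collections import Counter

def total_duration_by_category(path):
    """Sum event durations per category in a Chrome trace file."""
    with open(path) as f:
        trace = json.load(f)
    totals = Counter()
    for ev in trace.get("traceEvents", []):
        if ev.get("ph") == "X":  # "complete" events carry a duration
            totals[ev.get("cat", "unknown")] += ev.get("dur", 0)
    return totals
```

Note that naively summing durations like this double-counts overlapped work across threads and streams, which is exactly the "false hotspot" problem the critical-path analysis avoids.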
The analysis generates a hotspot report:
```
=== CTDA-CPA Hotspot Analysis Report ===

Critical Path Total Time: 1000 ms

Top 10 Hotspots:
┌───┬───────────────────────┬─────────────────┬──────────┬────────────┬───────────────┐
│ # │ name                  │ category        │ duration │ proportion │ prop_in_total │
├───┼───────────────────────┼─────────────────┼──────────┼────────────┼───────────────┤
│ 1 │ resize of ImagingCore │ python function │ 100ms    │ 50%        │ 10%           │
│ 2 │ ncclAllReduce         │ kernel          │ 60ms     │ 30%        │ 6%            │
│ 3 │ aten::conv2d          │ cpu_op          │ 40ms     │ 20%        │ 4%            │
└───┴───────────────────────┴─────────────────┴──────────┴────────────┴───────────────┘

Optimization Recommendations:
• DataLoader is the primary bottleneck (50% of critical path)
  → Consider: data preprocessing, num_workers tuning, prefetching
• Communication overhead is significant (30% of critical path)
  → Consider: gradient bucketing, communication overlap
```
Example - Data Loading Bottleneck:

```python
# Optimized DataLoader
train_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # Increase workers
    pin_memory=True,          # Faster GPU transfer
    persistent_workers=True,  # Reuse workers
)
```

See `optimizations/` for complete case studies.
Existing critical path analysis tools assume CPU threads are independent, but Python's GIL (Global Interpreter Lock) means only one thread executes at a time. In PyTorch training:
- Main Thread: Handles forward computation
- Autograd Thread: Executes backward propagation
- GIL Constraint: These threads execute serially with strict ordering
CTDA-CPA models these cross-thread dependencies by treating multi-threaded events as sequentially ordered based on timestamps, ensuring complete critical path construction across thread boundaries.
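That timestamp-based serialization can be sketched as follows. The event dicts mimic Chrome-trace fields (`tid`, `ts`, `dur`, `name`), and `cross_thread_edges` is an illustrative helper, not the tool's actual API:

```python
# Sketch: deriving GIL-induced cross-thread ordering from profiler events.
def cross_thread_edges(events):
    """Return (predecessor, successor) name pairs for consecutive CPU
    events that ran on different threads, ordered by start timestamp."""
    ordered = sorted(events, key=lambda e: e["ts"])
    edges = []
    for prev, curr in zip(ordered, ordered[1:]):
        if prev["tid"] != curr["tid"]:  # GIL hands off between threads
            edges.append((prev["name"], curr["name"]))
    return edges

events = [
    {"tid": 1, "ts": 0,  "dur": 5, "name": "forward"},
    {"tid": 2, "ts": 6,  "dur": 8, "name": "backward"},
    {"tid": 1, "ts": 15, "dur": 3, "name": "optimizer.step"},
]
print(cross_thread_edges(events))
# [('forward', 'backward'), ('backward', 'optimizer.step')]
```

Each emitted pair becomes a cross-thread edge in the dependency graph, so the critical path can cross from the main thread into the autograd thread and back.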
- Data Collection: Uses PyTorch Profiler to collect trace data (CPU ops, CUDA kernels, timing)
- Dependency Graph Construction: Builds a Computation Dependency Graph (CDG) with four types of dependencies:
- Sequential execution (same thread/stream)
- Invocation (CPU→GPU calls)
- Synchronization (CUDA sync operations)
- Cross-thread (GIL-induced ordering) ← our innovation
- Critical Path Identification: Uses timestamp-based backtracking to find the true critical path
- Hotspot Extraction: Identifies events on critical path that actually impact end-to-end time
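The critical-path step can be illustrated with a toy dependency graph. Node names, durations, and the `deps` predecessor map are made-up inputs, and the real implementation works over the full CDG with timestamp-based backtracking rather than this simple longest-path recursion:

```python
# Sketch: the critical path of a DAG is its longest path by total duration.
def critical_path(durations, deps):
    """Return (total_time, path) for the longest path through a DAG,
    given per-node durations and predecessor lists."""
    memo = {}

    def longest(node):
        if node not in memo:
            best_time, best_path = 0, []
            for pred in deps.get(node, []):
                t, path = longest(pred)
                if t > best_time:
                    best_time, best_path = t, path
            memo[node] = (best_time + durations[node], best_path + [node])
        return memo[node]

    # The critical path ends at whichever node finishes last.
    end = max(durations, key=lambda n: longest(n)[0])
    return longest(end)

durations = {"load": 100, "h2d": 10, "fwd": 40, "bwd": 60, "allreduce": 60}
deps = {"h2d": ["load"], "fwd": ["h2d"], "bwd": ["fwd"], "allreduce": ["bwd"]}
print(critical_path(durations, deps))
# (270, ['load', 'h2d', 'fwd', 'bwd', 'allreduce'])
```

Only events on this path contribute to end-to-end time; work that overlaps with it (e.g. a kernel hidden behind data loading) is correctly excluded from the hotspot report.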
| Feature | HTA | CTDA-CPA |
|---|---|---|
| Handles Computation Overlap | ❌ | ✅ |
| Cross-Thread Dependencies | ❌ | ✅ |
| Complete Critical Path | Partial | ✅ |
| Multi-threaded Training | Limited | ✅ |
Large trace files?

```bash
# Convert to Parquet for better performance
python ctda_cpa/utils/json2parquet.py trace.json trace.parquet
python ctda_cpa/hotspot_analysis.py --input trace.parquet --top_functions 10
```

```bash
cd examples/deepspeed
bash run.sh
```

Demonstrates profiling and analyzing distributed fine-tuning of Llama 2 13B with DeepSpeed ZeRO optimization.

```bash
cd examples/megatron
bash run.sh
```

Shows how to profile and analyze GPT-2 Large pretraining with Megatron-LM pipeline parallelism.
Optimized data loading pipeline achieving 47.06% speedup.
Optimized communication bottlenecks achieving 7.21% improvement.
This project is licensed under the MIT License - see the LICENSE file for details.
- Tong Yang - East China Normal University
- Ning Li - East China Normal University
- Bo Huang - East China Normal University
- Jianmei Guo - East China Normal University
Star ⭐ this repository if you find it helpful!