CTDA-CPA: Cross-Thread Dependency-Aware Critical Path Analysis

License: MIT · Python 3.8+

CTDA-CPA (Cross-Thread Dependency-Aware Critical Path Analysis) is an advanced performance analysis tool designed for large-scale model training. It accurately identifies performance hotspots by constructing complete critical paths across CPU multi-threading and GPU multi-stream execution.

🌟 Key Features

  • Cross-Thread Dependency Modeling: Captures inter-thread dependencies caused by Python GIL, ensuring complete critical path construction
  • Accurate Hotspot Identification: Avoids "false hotspots" caused by computation overlap in traditional time-accumulation methods
  • Non-Intrusive Analysis: Works with standard PyTorch Profiler data without requiring code modifications
  • Framework Support: Compatible with DeepSpeed, Megatron-LM, FSDP, and native PyTorch training
  • Visualization Tools: Includes computation dependency graph visualization for better understanding

📊 Performance Improvements

CTDA-CPA has successfully identified and optimized:

  • ResNet-18 Training: 47.06% performance improvement by optimizing data loading bottleneck
  • Llama 2 Distributed Fine-tuning: 7.21% performance improvement by optimizing communication bottleneck

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/CTDA-CPA.git
cd CTDA-CPA

# Install dependencies
pip install -r requirements.txt

Basic Usage

Step 1: Collect Profile Data

Add PyTorch Profiler to your training code:

import torch
import torch.nn as nn
import torch.profiler as profiler

criterion = nn.CrossEntropyLoss()  # loss function used in the loop below

def train_one_epoch(model, dataloader, optimizer, device):
    model.train()
    for batch_idx, (data, target) in enumerate(dataloader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        if batch_idx >= 9:  # profile the first 10 batches only
            break

# Wrap training with profiler
with profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    with_stack=True,
    profile_memory=False,
) as prof:
    train_one_epoch(model, train_loader, optimizer, device)

# Export trace file
prof.export_chrome_trace("trace.json")

Step 2: Run CTDA-CPA Analysis

# Basic analysis
python ctda_cpa/hotspot_analysis.py --input trace.json --output results/

# With custom options
python ctda_cpa/hotspot_analysis.py \
    --input trace.json \
    --top_functions 10

Key Arguments:

  • --input: Path to trace file (JSON or Parquet)
  • --top_functions: Number of top hotspots to report (default: 10)

Step 3: Interpret Results

The analysis generates a hotspot report:

=== CTDA-CPA Hotspot Analysis Report ===

Critical Path Total Time: 1000 ms

Top 10 Hotspots:
┌────┬────────────────────────┬─────────────────┬────────────┬────────────┬───────────────┐
│ #  │ name                   │ category        │ duration   │ proportion │ prop_in_total │
├────┼────────────────────────┼─────────────────┼────────────┼────────────┼───────────────┤
│ 1  │ resize of ImagingCore  │ python function │ 100ms      │ 50%        │ 10%           │
│ 2  │ ncclAllReduce          │ kernel          │ 60ms       │ 30%        │ 6%            │
│ 3  │ aten::conv2d           │ cpu_op          │ 40ms       │ 20%        │ 4%            │
└────┴────────────────────────┴─────────────────┴────────────┴────────────┴───────────────┘

Optimization Recommendations:
• DataLoader is the primary bottleneck (50% of critical path)
  → Consider: data preprocessing, num_workers tuning, prefetching
• Communication overhead is significant (30% of critical path)
  → Consider: gradient bucketing, communication overlap
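
The two proportion columns in the sample report can be derived with a simple aggregation. The sketch below is illustrative only; the column semantics (share of the reported hotspots' combined time vs. share of the critical path total) are our reading of the sample output, not the tool's documented schema.

```python
from collections import defaultdict

def hotspot_rows(cp_events, critical_path_total_ms):
    """cp_events: (name, duration_ms) pairs for events on the critical path."""
    by_name = defaultdict(float)
    for name, dur in cp_events:
        by_name[name] += dur            # aggregate repeated calls by name
    hotspot_sum = sum(by_name.values())
    rows = sorted(by_name.items(), key=lambda kv: -kv[1])
    # (name, duration, share of reported hotspots, share of critical path)
    return [(name, dur, dur / hotspot_sum, dur / critical_path_total_ms)
            for name, dur in rows]

rows = hotspot_rows(
    [("resize of ImagingCore", 100), ("ncclAllReduce", 60), ("aten::conv2d", 40)],
    critical_path_total_ms=1000)
assert rows[0] == ("resize of ImagingCore", 100, 0.5, 0.1)
```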

Step 4: Optimize Based on Findings

Example - Data Loading Bottleneck:

# Optimized DataLoader
train_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,          # Increase workers
    pin_memory=True,        # Faster GPU transfer
    persistent_workers=True # Reuse workers
)

See optimizations/ for complete case studies.

💡 Methodology

Core Innovation: Cross-Thread Dependency Awareness

Existing critical path analysis tools assume CPU threads are independent, but Python's GIL (Global Interpreter Lock) means only one thread executes Python code at a time. In PyTorch training:

  • Main Thread: Handles forward computation
  • Autograd Thread: Executes backward propagation
  • GIL Constraint: These threads execute serially with strict ordering

CTDA-CPA models these cross-thread dependencies by treating multi-threaded events as sequentially ordered based on timestamps, ensuring complete critical path construction across thread boundaries.
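
This timestamp-based ordering can be sketched in a few lines. The toy model below is our illustration, not CTDA-CPA's actual data structures; the `Event` fields and the edge representation are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    name: str
    tid: int      # thread id
    ts: float     # start timestamp (us)
    dur: float    # duration (us)
    deps: list = field(default_factory=list)  # predecessor events

def add_cross_thread_edges(cpu_events):
    """Sort CPU-side events by start time; because the GIL lets only one
    thread run Python at a time, consecutive events in this global order
    depend on each other even when they sit on different threads."""
    ordered = sorted(cpu_events, key=lambda e: e.ts)
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.tid != nxt.tid:   # add only the cross-thread edges
            nxt.deps.append(prev)
    return ordered

# Forward op on the main thread, backward op on the autograd thread:
fwd = Event("aten::conv2d", tid=1, ts=0.0, dur=50.0)
bwd = Event("autograd::ConvBackward", tid=2, ts=60.0, dur=40.0)
events = add_cross_thread_edges([bwd, fwd])
assert events[1].deps == [fwd]  # backward now depends on forward
```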

How It Works

  1. Data Collection: Uses PyTorch Profiler to collect trace data (CPU ops, CUDA kernels, timing)
  2. Dependency Graph Construction: Builds a Computation Dependency Graph (CDG) with four types of dependencies:
    • Sequential execution (same thread/stream)
    • Invocation (CPU→GPU calls)
    • Synchronization (CUDA sync operations)
    • Cross-thread (GIL-induced ordering) ← Our innovation
  3. Critical Path Identification: Uses timestamp-based backtracking to find the true critical path
  4. Hotspot Extraction: Identifies events on critical path that actually impact end-to-end time
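
Steps 3 and 4 can be illustrated on a toy dependency graph. This is a hypothetical sketch under assumed data structures (the `Node` model is ours): walk backwards from the last-finishing event, at each step jumping to the predecessor that finished latest, since that predecessor is what gated the current event's start.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    ts: float                    # start time (ms)
    dur: float                   # duration (ms)
    deps: list = field(default_factory=list)

def critical_path(events):
    """Backtrack from the event that finishes last, always following the
    predecessor with the latest finish time (the one that gated us)."""
    cur = max(events, key=lambda e: e.ts + e.dur)
    path = [cur]
    while cur.deps:
        cur = max(cur.deps, key=lambda e: e.ts + e.dur)
        path.append(cur)
    return [n.name for n in reversed(path)]

# Toy iteration: dataloader -> forward -> backward, plus a kernel that
# overlaps with other work and therefore never gates end-to-end time.
load = Node("dataloader", ts=0, dur=100)
fwd  = Node("forward", ts=100, dur=50, deps=[load])
kern = Node("overlapped_kernel", ts=110, dur=20, deps=[fwd])
bwd  = Node("backward", ts=150, dur=80, deps=[fwd])
assert critical_path([load, fwd, kern, bwd]) == ["dataloader", "forward", "backward"]
```

Note how the overlapped kernel contributes time in a naive accumulation view but never appears on the critical path: this is the "false hotspot" the tool is designed to avoid.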

Why CTDA-CPA vs Existing Tools?

| Feature                      | HTA     | CTDA-CPA |
|------------------------------|---------|----------|
| Handles Computation Overlap  | ✅      | ✅       |
| Cross-Thread Dependencies    | ❌      | ✅       |
| Complete Critical Path       | Partial | ✅       |
| Multi-threaded Training      | Limited | ✅       |

Troubleshooting

Large trace files?

# Convert to Parquet for better performance
python ctda_cpa/utils/json2parquet.py trace.json trace.parquet
python ctda_cpa/hotspot_analysis.py --input trace.parquet --top_functions 10

🔬 Example Use Cases

1. DeepSpeed + Llama 2 Fine-tuning

cd examples/deepspeed
bash run.sh

Demonstrates profiling and analyzing distributed fine-tuning of Llama 2 13B with DeepSpeed ZeRO optimization.

2. Megatron-LM + GPT-2 Pretraining

cd examples/megatron
bash run.sh

Shows how to profile and analyze GPT-2 Large pretraining with Megatron-LM pipeline parallelism.

3. ResNet-18 Data Loading Optimization

Optimized data loading pipeline achieving 47.06% speedup.

4. Llama 2 Communication Optimization

Optimized communication bottlenecks achieving 7.21% improvement.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

👥 Authors

  • Tong Yang - East China Normal University
  • Ning Li - East China Normal University
  • Bo Huang - East China Normal University
  • Jianmei Guo - East China Normal University

Star ⭐ this repository if you find it helpful!
