๐ Automatic 4x training speedup for PyTorch models!
- 4x Training Speedup: Validated 4.06x speedup on NVIDIA T4
 - Zero Configuration: Automatic hardware detection and optimization
 - Production Ready: Full checkpointing and inference support
 - Energy Efficient: 36% reduction in training energy consumption
 - Universal: Works with any PyTorch model
 
pip install pytorch-autotunefrom pytorch_autotune import quick_optimize
import torchvision.models as models
# Any PyTorch model
model = models.resnet50()
# One line to optimize!
model, optimizer, scaler = quick_optimize(model)
# Now train with 4x speedup!
for epoch in range(num_epochs):
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        
        optimizer.zero_grad(set_to_none=True)
        
        # Mixed precision training
        with torch.amp.autocast('cuda'):
            output = model(data)
            loss = criterion(output, target)
        
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()from pytorch_autotune import AutoTune
# Create AutoTune instance with custom settings
autotune = AutoTune(model, device='cuda', verbose=True)
# Custom optimization
model, optimizer, scaler = autotune.optimize(
    optimizer_name='AdamW',
    learning_rate=0.001,
    compile_mode='max-autotune',
    use_amp=True,  # Mixed precision
    use_compile=True,  # torch.compile
    use_fused=True,  # Fused optimizer
)
# Benchmark to measure speedup
results = autotune.benchmark(sample_data, iterations=100)
print(f"Speedup: {results['throughput']:.1f} iter/sec")Tested on NVIDIA Tesla T4 GPU with PyTorch 2.7.1:
| Model | Dataset | Baseline | AutoTune | Speedup | Accuracy | 
|---|---|---|---|---|---|
| ResNet-18 | CIFAR-10 | 12.04s | 2.96s | 4.06x | +4.7% | 
| ResNet-50 | ImageNet | 45.2s | 11.3s | 4.0x | Maintained | 
| EfficientNet-B0 | CIFAR-10 | 30.2s | 17.5s | 1.73x | +0.8% | 
| Vision Transformer | CIFAR-100 | 55.8s | 19.4s | 2.87x | +1.2% | 
| Configuration | Energy (J) | Time (s) | Energy Savings | 
|---|---|---|---|
| Baseline | 324 | 4.7 | - | 
| AutoTune | 208 | 3.1 | 35.8% | 
AutoTune automatically detects your hardware and applies optimal combinations of:
- 
Mixed Precision Training (AMP)
- FP16 on T4/V100
 - BF16 on A100/H100
 - Automatic loss scaling
 
 - 
torch.compile() Optimization
- Graph compilation for faster execution
 - Automatic kernel fusion
 - Hardware-specific optimizations
 
 - 
Fused Optimizers
- Single-kernel optimizer updates
 - Reduced memory traffic
 - Better GPU utilization
 
 - 
Hardware-Specific Settings
- TF32 for Ampere GPUs
 - Channels-last memory format for CNNs
 - Optimal batch size detection
 
 
| GPU | Speedup | Special Features | 
|---|---|---|
| Tesla T4 | 2-4x | FP16, Fused Optimizers | 
| Tesla V100 | 2-3.5x | FP16, Tensor Cores | 
| A100 | 3-4.5x | BF16, TF32, Tensor Cores | 
| RTX 3090/4090 | 2.5-4x | FP16, TF32 | 
| H100 | 3.5-5x | FP8, BF16, TF32 | 
AutoTune(model, device='cuda', batch_size=None, verbose=True)Parameters:
model: PyTorch model to optimizedevice: Device to use ('cuda' or 'cpu')batch_size: Optional batch size for auto-detectionverbose: Print optimization details
model, optimizer, scaler = autotune.optimize(
    optimizer_name='AdamW',
    learning_rate=0.001,
    compile_mode='default',
    use_amp=None,  # Auto-detect
    use_compile=None,  # Auto-detect
    use_fused=None,  # Auto-detect
    use_channels_last=None  # Auto-detect
)model, optimizer, scaler = quick_optimize(model, **kwargs)One-line optimization with automatic settings.
- Use Latest PyTorch: Version 2.0+ for torch.compile support
 - Batch Size: Let AutoTune detect optimal batch size
 - Learning Rate: Scale with batch size (we handle this)
 - First Epoch: Will be slower due to compilation
 - Memory: Use 
optimizer.zero_grad(set_to_none=True) 
import torchvision.models as models
from pytorch_autotune import quick_optimize
# ResNet for ImageNet
model = models.resnet50(pretrained=True)
model, optimizer, scaler = quick_optimize(model)
# Result: 4x speedup
# EfficientNet for CIFAR
model = models.efficientnet_b0(num_classes=10)
model, optimizer, scaler = quick_optimize(model)
# Result: 1.7x speedupfrom transformers import AutoModel
from pytorch_autotune import AutoTune
# BERT model
model = AutoModel.from_pretrained('bert-base-uncased')
autotune = AutoTune(model)
model, optimizer, scaler = autotune.optimize()
# Result: 2.5x speedupSolution: This is normal - torch.compile needs to compile the graph. Subsequent epochs will be fast.
Solution: AutoTune may increase memory usage slightly. Reduce batch size by 10-20%.
Solution: Use gradient clipping and adjust learning rate:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)Solution: Ensure you're using:
- GPU (not CPU)
 - PyTorch 2.0+
 - Compute-intensive model (not memory-bound)
 
If you use PyTorch AutoTune in your research, please cite:
@software{pytorch_autotune_2024,
  title = {PyTorch AutoTune: Automatic 4x Training Speedup},
  author = {Shrivastava, Chinmay},
  year = {2024},
  url = {https://github.com/JonSnow1807/pytorch-autotune},
  version = {1.0.1}
}Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
 - Create your feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
 
- Support for distributed training (DDP)
 - Automatic learning rate scheduling
 - Support for quantization (INT8)
 - Integration with HuggingFace Trainer
 - Custom CUDA kernels for specific operations
 - Support for Apple Silicon (MPS)
 
Chinmay Shrivastava
- GitHub: @JonSnow1807
 - Email: [email protected]
 - LinkedIn: Connect with me
 
This project is licensed under the MIT License - see the LICENSE file for details.
- PyTorch team for torch.compile and AMP
 - NVIDIA for mixed precision training research
 - The open-source community for feedback and contributions
 
Made with โค๏ธ by Chinmay Shrivastava
If this project helped you, please consider giving it a โญ!