A powerful toolkit for streamlining and scheduling ML training workflows on ROCm
Features • Installation • Quick Start • Examples • Contributing
Late is a comprehensive Python toolkit that provides a unified interface for managing the entire lifecycle of training large language models on AMD GPUs. It combines a powerful Command-Line Interface (CLI), a batch job runner, and a user-friendly web dashboard to simplify environment setup, job scheduling, and execution.
- ROCm Optimized: Built specifically for AMD GPUs with optimized defaults
- Zero Config: Smart defaults that just work out of the box
- Hyperparameter Sweeps: Built-in sweep engine with automatic reporting
- Queue Management: Batch processing with pause/resume capabilities
- Real-time Monitoring: Web dashboard for tracking training progress
- Reproducible: Every run is tracked with configs and outputs saved
| Feature | Late | Other Tools |
|---|---|---|
| ROCm Native Support | ✅ First-class | |
| Hyperparameter Sweeps | ✅ Built-in with reports | ❌ External tools needed |
| Queue Management | ✅ Pause/Resume/Priority | ❌ Basic or none |
| Memory Management | ✅ Automatic VRAM clearing | ❌ Manual |
| Training Configs | ✅ Simple YAML | |
| Web Dashboard | ✅ Included | ❌ Separate setup |
| Reproducibility | ✅ Auto-saved runs | |
- Automated Environment Patching: A single command (`late patch`) installs and configures Flash Attention for ROCm, targeting either your global environment or a specific Python virtual environment. By default it uses pre-built wheels, with optional source building.
- Declarative Training Jobs: Define all aspects of your training runs, from model choice to hyperparameters, in simple, readable YAML configuration files.
- Direct Training: Run training jobs immediately with `late train config.yml`; no queue required for single experiments.
- Batch Queue Management: Group multiple training configs into `.qml` queue files to run them sequentially. Perfect for running a series of experiments overnight.
- Versatile CLI: Manage every aspect of your workflow from the terminal, including patching, training, and queue management.
- Web Dashboard: Launch a web server (`late serve`) to visually create, manage, and monitor your training queues from any browser.
- Built for ROCm: Optimized with defaults like `adamw_torch_fused` and `bfloat16` to get the best performance out of AMD hardware.
- Unsloth Integration: Optional 2-5x speedup on both AMD and NVIDIA GPUs with optimized kernels
- Configurable Loss Masking: Choose between "full" (default, simpler) or "assistant_only" loss masking strategies
- Push Notifications: Get ntfy.sh notifications on checkpoint saves, training completion, and errors
- LoRA Merging: Built-in command to merge LoRA adapters with base models and upload to the Hub
- Comprehensive Testing: 80%+ test coverage with pytest, ready for CI/CD
- Better Documentation: Complete testing guide and example configurations
- Python: 3.8 or higher
- ROCm: 6.0+ (for GPU training)
- Git: For installation and version control
- OS: Linux (Ubuntu/RHEL recommended)
# Install the package
pip install late-training
# Verify installation
late version

# Clone the repository
git clone https://github.com/TesslateAI/Late.git
cd Late
# Install in development mode
pip install -e .
# Verify installation
late version

# Install core training dependencies
pip install late-training[training]
# For Unsloth support on AMD GPUs (optional but recommended for 2-5x speedup)
pip install --no-deps unsloth unsloth-zoo
pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"
# For Unsloth support on NVIDIA GPUs (optional)
pip install unsloth

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Export your huggingface token
export HF_TOKEN=<your huggingface token>
# Clone the test dataset
git clone https://huggingface.co/datasets/yahma/alpaca-cleaned
# Patch environment for ROCm (installs Flash Attention, etc.)
late patch --arch gfx942  # Replace with your GPU architecture

# Create a simple LoRA training config
cat > quick_lora.yml << EOF
base_model: "meta-llama/Llama-3.2-3B-Instruct"
dataset_name: "alpaca-cleaned"
output_model_name: "my-first-lora"
output_dir: "./outputs/quick-test/"
training_type: "lora"
max_seq_length: 2048
batch_size: 4
gradient_accumulation: 4
epochs: 1
learning_rate: 2e-4
lora:
  r: 32
  lora_alpha: 64
EOF
# Start training immediately
late train quick_lora.yml

# Find the best learning rate in minutes
late sweep quick_lora.yml \
--params learning_rate=1e-4,2e-4,3e-4 \
--percent-epoch 10  # Train only 10% of an epoch for a quick evaluation

# Start the web UI
late serve
# Open http://localhost:8080 in your browser

Late is built around two simple file types: YAML for defining what to run, and QML for defining the order to run it in.
A YAML file defines a single training job. It contains all the necessary parameters, such as the model, dataset, hyperparameters, and training type (e.g., SFT or LoRA).
Example sft_config.yml:
# Model and Dataset
base_model: "Qwen/Qwen2-1.5B-Instruct"
dataset_name: "Tesslate/UIGENT"
output_model_name: "your-hf-username/UIGENT-1.5B-SFT"
# Paths
output_dir: "/scratch/outputs/sft/"
cache_dir: "./model_cache"
# Training Type ('sft' or 'lora')
training_type: "sft"
# Hyperparameters
max_seq_length: 4096
batch_size: 2
gradient_accumulation: 8
epochs: 1
learning_rate: 2.0e-5
# Control Flags
report_to_wandb: true
upload_to_hub: false

A QML file is a plain text file that lists the absolute paths to your YAML configuration files. The `late` runner will execute these jobs one by one, in the order they appear.
Example experiment_queue.qml:
/home/user/projects/late/configs/sft_run_1.yml
/home/user/projects/late/configs/lora_run_alpha.yml
/home/user/projects/late/configs/sft_run_2.yml

The `late` CLI is the primary way to interact with the library.
Prepares a Python environment for ROCm training. It installs Flash Attention and other core ML libraries, and can optionally install a PyTorch ROCm build.
IMPORTANT: PyTorch installation is now OPTIONAL by default. You must explicitly use --install-pytorch to install PyTorch ROCm versions.
Usage:
# Patch current environment WITHOUT PyTorch (default)
late patch --arch gfx942
# Install PyTorch stable ROCm version
late patch --arch gfx942 --install-pytorch stable
# Install PyTorch nightly ROCm 6.4 version
late patch --arch gfx942 --install-pytorch nightly
# Install PyTorch from specific wheel URL
late patch --arch gfx942 --install-pytorch https://example.com/torch-2.0-rocm.whl
# Build Flash Attention from source
late patch --arch gfx942 --from-source
# Create a venv and patch it specifically (recommended)
python3 -m venv my_env
late patch --venv ./my_env
# Multiple GPU architectures
late patch --arch "gfx942;gfx90a"
# Skip MIOpen kernel installation
late patch --no-kernels
# Full example: venv + PyTorch nightly + Flash from source
late patch --venv ./my_env --arch gfx942 --install-pytorch nightly --from-source

Run a training job directly from a YAML config file without using queues. Perfect for quick experiments or single runs.
Usage:
# Run a single training job
late train config.yml
# Run with absolute path
late train /workspace/configs/my_training.yml
# The training will start immediately and show live output

Note: Memory is automatically cleared before and after the run, and all outputs are saved in `training_runs/`, just like queue runs.
Run systematic hyperparameter sweeps to find optimal training configurations. Supports sweeping ANY parameter with early stopping and automatic report generation.
Key Features:
- Sweep any parameter (learning rate, batch size, LoRA rank, etc.)
- Override parameters for faster sweeps (e.g., shorter context length)
- Early stopping by percentage of epoch or step count
- Automatic Excel reports with loss graphs
- W&B integration with custom sweep IDs
- Add sweeps to queues for batch processing
Command Line Usage:
# Basic learning rate sweep with 25% epoch early stopping
late sweep config.yml --params learning_rate=1e-4,2e-4,3e-4 --percent-epoch 25
# Multiple parameters with overrides for efficiency
late sweep config.yml \
--params learning_rate=1e-4,2e-4 lora.r=64,128 \
--override max_seq_length=2048 \
--percent-epoch 25
# Use custom sweep ID for W&B grouping
late sweep config.yml \
--params learning_rate=1e-4,2e-4,3e-4 \
--sweep-id "lr_search_v1" \
--max-steps 100
# Add sweep to queue instead of running immediately
late sweep config.yml \
--sweep-file lr_sweep.sweep \
--add-to-queue overnight.qml

Sweep File Format (.sweep):
# lr_sweep.sweep
sweep_id: "lr_optimization_v1"
sweep_parameters:
  learning_rate: [1e-5, 2e-5, 5e-5, 1e-4]
  lora.r: [64, 128]        # Can sweep nested parameters
overrides:
  max_seq_length: 2048     # Use shorter context for faster sweeps
  batch_size: 1            # Fix batch size for this sweep
early_stop:
  percent_epoch: 25        # OR use max_steps: 100

Sweep Reports: After completion, an Excel report is generated with:
- Summary table of all configurations and results
- Best configuration highlighting
- Combined loss curves graph
- Individual graphs for top 5 configurations
- Saved in `sweep_runs/{sweep_id}/sweep_report_{sweep_id}.xlsx`
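To make the sweep expansion concrete, here is a minimal sketch of how a parameter grid like the one above could turn into individual runs, assuming a simple Cartesian product over the listed values (the actual runner may construct its job list differently):

```python
from itertools import product

# Values copied from the example sweep file above; "lora.r" is a nested key
# written as section.key, as in the .sweep format.
sweep_parameters = {
    "learning_rate": [1e-5, 2e-5, 5e-5, 1e-4],
    "lora.r": [64, 128],
}

# One job per combination of parameter values: 4 x 2 = 8 configurations.
keys = list(sweep_parameters)
jobs = [dict(zip(keys, combo)) for combo in product(*sweep_parameters.values())]

for job in jobs:
    print(job)  # e.g. {'learning_rate': 1e-05, 'lora.r': 64}
```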
A suite of commands to manage .qml batch files. By default, these commands operate on a ./queues/ directory.
- `late queue ls`: List all available queues.
- `late queue create <name>.qml`: Create a new, empty queue.
- `late queue add <queue>.qml <config>.yml`: Add a training config path to a queue.
- `late queue delete <name>.qml`: Delete a queue.
- `late queue start <name>.qml`: Start executing a queue. The runner processes each job sequentially.
  - Automatically saves progress and can resume if interrupted
  - Use `--restart` to start from the beginning
  - Memory is cleared between each job
  - Sweep queues automatically generate reports upon completion
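A typical queue session using the commands above (the config paths are illustrative):

```bash
late queue create overnight.qml
late queue add overnight.qml configs/sft_qwen.yml
late queue add overnight.qml configs/lora_llama.yml
late queue start overnight.qml
```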
Securely store your Weights & Biases or Hugging Face tokens for automatic use in training runs.
Usage:
late set wandb YOUR_WANDB_API_KEY
late set hf_token YOUR_HUGGINGFACE_TOKEN

Clears GPU VRAM, runs Python garbage collection, and empties PyTorch caches across all platforms (AMD ROCm, NVIDIA CUDA, CPU).
Usage:
# Clear all training memory
late clear

This command:
- Clears GPU VRAM (CUDA/ROCm automatically detected)
- Runs Python garbage collection
- Clears PyTorch caches
- Displays memory stats before/after with freed amounts
When to use:
- Between training runs to free up VRAM
- When encountering out-of-memory errors
- Before starting a new experiment
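Conceptually, the cleanup amounts to something like the following garbage-collection and PyTorch calls (a simplified sketch, not the exact implementation; `torch.cuda` covers both CUDA and ROCm/HIP builds):

```python
import gc

import torch

gc.collect()                   # Python-level garbage collection

if torch.cuda.is_available():  # True on both NVIDIA CUDA and AMD ROCm builds
    torch.cuda.empty_cache()   # Release cached GPU memory back to the driver
    torch.cuda.ipc_collect()   # Clean up stale inter-process memory handles
```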
Starts a local web server to manage queues through a graphical interface.
Usage:
late serve --port 8080

Now open http://localhost:8080 in your browser. If you run this on a remote server, you can forward the port or use a tool like ngrok to access it publicly.
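For example, to forward the dashboard from a remote training box to your local machine over SSH (the hostname is a placeholder):

```bash
# Run on your local machine, then browse to http://localhost:8080
ssh -L 8080:localhost:8080 user@remote-server
```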
Merge trained LoRA adapters with their base models and optionally upload to HuggingFace Hub.
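Conceptually, the merge is equivalent to loading the adapter with PEFT and folding it into the base weights, roughly as sketched below (paths and model names are illustrative; the actual command also handles tokenizer files and the optional Hub upload):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model, then attach the trained LoRA adapter checkpoint
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-32B", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "/path/to/checkpoint-100")

# Fold the adapter weights into the base model and save a standalone copy
merged = model.merge_and_unload()
merged.save_pretrained("./my_merged_model")
```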
Usage:
# Merge and upload to Hub
late merge /path/to/checkpoint-100 \
--base-model Qwen/Qwen3-32B \
--output username/merged-model
# Merge without uploading
late merge /path/to/checkpoint-100 \
--base-model Qwen/Qwen3-32B \
--output username/merged-model \
--no-upload
# Specify local save path
late merge /path/to/checkpoint-100 \
--base-model Qwen/Qwen3-32B \
--output username/merged-model \
--local-path ./my_merged_model
# Create private repository
late merge /path/to/checkpoint-100 \
--base-model Qwen/Qwen3-32B \
--output username/merged-model \
--private
# Use config file
late merge /path/to/checkpoint-100 --config merge_config.yml

Merge Config File Format:
adapter_path: "/path/to/checkpoint-100"
base_model: "Qwen/Qwen3-32B"
output_repo_id: "username/merged-model"
local_save_path: "./merged_output"
upload_to_hub: true
private_repo: false
hf_token: "" # Optional, uses HF_TOKEN env var if not setThe web dashboard provides a user-friendly way to:
- View all your training queues and the jobs within them.
- Create new queues.
- Delete existing queues.
- Add new training jobs to a queue by providing the path to the config file.
Note: Starting a queue is a CLI-only feature to ensure process stability.
Late supports two loss masking strategies for training:
What it does: Computes loss on the entire conversation (both user and assistant messages)
Benefits:
- ✅ Simpler and faster preprocessing
- ✅ Learns from complete conversations
- ✅ Good for most use cases
- ✅ Recommended for beginners
Configuration:
# Default - can be omitted
loss_masking_strategy: "full"What it does: Masks user prompts from loss computation (sets to -100), only learns from assistant responses
Benefits:
- ✅ More targeted training (only learn assistant behavior)
- ✅ Prevents the model from learning user patterns
- ✅ Traditional fine-tuning approach
- ✅ May be better for instruction-following tasks
Trade-offs:
- ⚠️ Slower preprocessing than full masking
- ⚠️ More complex implementation
Configuration:
# Must be explicitly set
loss_masking_strategy: "assistant_only"Full Masking (Default):
base_model: "meta-llama/Llama-3.2-3B-Instruct"
dataset_name: "yahma/alpaca-cleaned"
loss_masking_strategy: "full" # Simple preprocessing
# ... rest of config

Assistant-Only Masking:
base_model: "meta-llama/Llama-3.2-3B-Instruct"
dataset_name: "yahma/alpaca-cleaned"
loss_masking_strategy: "assistant_only" # Masked user prompts
# ... rest of config

See examples/loss_masking/ for complete examples.
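To make the difference concrete, here is a small, hypothetical illustration of how the two strategies affect training labels; positions set to -100 are ignored by PyTorch's cross-entropy loss, so they contribute no gradient (this is a sketch, not Late's actual preprocessing code):

```python
IGNORE_INDEX = -100  # label value ignored by torch.nn.CrossEntropyLoss

def build_labels(input_ids, assistant_mask, strategy="full"):
    """Return per-token labels for a tokenized conversation.

    assistant_mask[i] is True where token i belongs to an assistant turn.
    - "full": loss is computed on every token (labels copy input_ids)
    - "assistant_only": user/prompt tokens are masked out with IGNORE_INDEX
    """
    if strategy == "full":
        return list(input_ids)
    return [tok if is_assistant else IGNORE_INDEX
            for tok, is_assistant in zip(input_ids, assistant_mask)]
```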
Late now supports Unsloth, an open-source library that significantly speeds up LLM fine-tuning while reducing memory usage. This integration is optional and provides:
- 2-5x Faster Training: Optimized kernels for both AMD ROCm and NVIDIA CUDA GPUs
- Lower Memory Usage: More efficient memory management allows larger batch sizes
- Zero Configuration: Works out of the box with your existing configs
- AMD GPU Optimized: Automatically handles bitsandbytes instability on AMD GPUs
- Backward Compatible: Doesn't affect existing workflows - just add one flag
Simply add use_unsloth: true to your training configuration:
base_model: "unsloth/Llama-3.2-3B-Instruct" # Unsloth-optimized models recommended
dataset_name: "mlabonne/FineTome-100k"
output_model_name: "username/my-fast-model"
output_dir: "./outputs/"
training_type: "lora"
# Enable Unsloth for 2-5x speedup
use_unsloth: true
# Rest of your config...
max_seq_length: 2048
batch_size: 4
gradient_accumulation: 4
epochs: 3
learning_rate: 2e-4
lora:
  r: 64
  lora_alpha: 128

Unsloth requires special installation for AMD GPUs to ensure ROCm compatibility:
For AMD GPUs (ROCm):
# First, install Late with training dependencies
pip install late-training[training]
# Then, install Unsloth's AMD branch
pip install --no-deps unsloth unsloth-zoo
pip install --no-deps git+https://github.com/unslothai/unsloth-zoo.git
pip install "unsloth[amd] @ git+https://github.com/unslothai/unsloth"For NVIDIA GPUs (CUDA):
# Install Late with training dependencies
pip install late-training[training]
# Then, install standard Unsloth
pip install unsloth

Why the AMD-specific installation?
The AMD branch of Unsloth includes:
- ROCm-compatible kernels
- Automatic bitsandbytes workarounds for HSA_STATUS_ERROR
- Optimizations for AMD GPU architectures (gfx90a, gfx942, gfx950, etc.)
For best performance, use Unsloth-optimized model variants:
- `unsloth/Llama-3.2-3B-Instruct` (instead of `meta-llama/Llama-3.2-3B-Instruct`)
- `unsloth/Meta-Llama-3-8B-Instruct`
- `unsloth/Qwen2.5-7B-Instruct`
- See Unsloth Models for more
Unsloth automatically handles AMD-specific optimizations:
- Bitsandbytes Compatibility: Automatically uses 16-bit LoRA when `load_in_4bit=True` to avoid HSA_STATUS_ERROR exceptions
- ROCm Kernel Support: Optimized kernels for AMD architectures (gfx90a, gfx942, gfx950, etc.)
- No Configuration Needed: Works seamlessly on MI300X, RX 7900 XTX, and other AMD GPUs
Typical speedups observed:
| Model Size | Without Unsloth | With Unsloth | Speedup |
|---|---|---|---|
| Llama 3.2 3B | 1.0x | 2.5x | 2.5x faster |
| Llama 3 8B | 1.0x | 3.2x | 3.2x faster |
| Qwen 2.5 7B | 1.0x | 2.8x | 2.8x faster |
Benchmarks on AMD MI300X (192GB VRAM)
See examples/llama3/llama3.2_3b_lora.yml for a complete example with Unsloth enabled.
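For context, when `use_unsloth: true` is set, training runs through Unsloth's API, which in direct use looks roughly like the snippet below (a sketch of Unsloth's public interface, not Late's internal code; values mirror the example config above):

```python
from unsloth import FastLanguageModel

# Load an Unsloth-optimized base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=False,  # on AMD, 16-bit LoRA is preferred to avoid bitsandbytes issues
)

# Attach LoRA adapters through Unsloth's patched PEFT path
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```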
Get real-time notifications about your training runs on your phone or desktop!
- Install ntfy app on your phone (iOS / Android)
- Choose a unique topic (e.g., "mytraining-20250118")
- Subscribe to your topic in the app
- Add to config:
ntfy_topic: "mytraining-20250118"- π Training start
- πΎ Checkpoint saves (every
save_steps) - β Training completion
- π€ Model upload success/failure
- π Checkpoint resume events
- β Error notifications
base_model: "meta-llama/Llama-3.2-3B-Instruct"
dataset_name: "yahma/alpaca-cleaned"
output_model_name: "username/my-model"
output_dir: "./outputs/"
training_type: "lora"
# Enable notifications
ntfy_topic: "my-training-notifications"
# Checkpoints trigger notifications
save_steps: 50 # Notification every 50 steps
# ... rest of config

Send a test notification:
curl -d "Test from Late!" ntfy.sh/your-topic-nameSee examples/with_notifications.yml for a complete example.
Example 1: Full Supervised Fine-Tuning (SFT)
This configuration fine-tunes all the parameters of the base model.
sft_config.yml
base_model: "Qwen/Qwen2-1.5B-Instruct"
dataset_name: "Tesslate/UIGENT"
output_model_name: "your-hf-username/UIGENT-1.5B-SFT"
output_dir: "/scratch/outputs/sft/"
training_type: "sft"
max_seq_length: 4096
batch_size: 2
gradient_accumulation: 8 # Effective batch size: 16 (2 * 8)
gradient_checkpointing: true # Reduce VRAM usage (default: true)
epochs: 1
learning_rate: 2.0e-5
report_to_wandb: true
cache_dir: "~/.cache/late/models" # Model cache directoryExample 2: LoRA Fine-Tuning
This configuration uses Parameter-Efficient Fine-Tuning (PEFT) with LoRA to train only a small number of adapter weights, which is much faster and more memory-efficient.
lora_config.yml
base_model: "Qwen/Qwen2-7B-Instruct"
dataset_name: "smirki/UIGENT-9-6-25"
output_model_name: "your-hf-username/UIGENT-7B-Lora"
output_dir: "/scratch/outputs/lora/"
training_type: "lora" # Set to 'lora' to enable PEFT
# Training Hyperparameters
max_seq_length: 8192
batch_size: 1
gradient_accumulation: 16
epochs: 2
learning_rate: 2.0e-4 # Higher learning rate is common for LoRA
# Memory Optimization
gradient_checkpointing: true # Highly recommended for large models
# LoRA Specific Config
lora:
  r: 128
  lora_alpha: 256
  target_modules:
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"
# Control Flags
report_to_wandb: true
upload_to_hub: true

Check out the examples/ directory for ready-to-use configurations:
- Llama 3 Examples - Full configurations for Llama 3 8B and 3.2B models
- Sweep Templates - Pre-built sweep configurations for common use cases
- Production Queues - Example queue setups for real workflows
- `base_model`: HuggingFace model ID or path
- `dataset_name`: HuggingFace dataset ID (must have a "messages" field; see the sample record after this parameter reference)
- `output_model_name`: Name for the uploaded model (format: "username/model-name")
- `output_dir`: Local directory to save the model
- `training_type`: Either "sft" (full fine-tuning) or "lora" (parameter-efficient)
- `max_seq_length`: Maximum sequence length for training
- `batch_size`: Per-device batch size (default: 1)
- `gradient_accumulation`: Gradient accumulation steps (default: 16)
- `epochs`: Number of training epochs (default: 1)
- `learning_rate`: Learning rate (default: 2e-5)
- `lr_scheduler_type`: LR scheduler type (default: "linear")
- `optim`: Optimizer (default: "adamw_torch_fused", optimized for ROCm)
- `save_steps`: Save a checkpoint every N steps (default: 50)
- `use_unsloth`: Enable Unsloth for 2-5x faster training (default: false)
  - Recommended for both AMD and NVIDIA GPUs
  - Works best with Unsloth-optimized models (e.g., `unsloth/Llama-3.2-3B-Instruct`)
  - Automatically handles AMD GPU optimizations
- `gradient_checkpointing`: Enable gradient checkpointing to reduce VRAM (default: true)
- `torch_compile`: Enable torch.compile for faster training (default: true)
- `tf32`: Enable TF32 mode for matrix operations (default: true)
- `cache_dir`: Model cache directory (default: "~/.cache/late/models")

LoRA parameters (only used when training_type is "lora"):

lora:
  r: 128               # LoRA rank
  lora_alpha: 256      # LoRA alpha scaling parameter
  target_modules:      # Modules to apply LoRA to
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "o_proj"

- `loss_masking_strategy`: Choose the loss computation strategy (default: "full")
  - "full": Compute loss on the entire conversation (default, simpler, faster)
  - "assistant_only": Mask user prompts from loss (-100 tokens)
- `ntfy_topic`: ntfy.sh topic for push notifications (optional, default: none)
- `hf_token`: HuggingFace API token (optional, uses HF_TOKEN env var if not set)
- `report_to_wandb`: Enable Weights & Biases logging (default: false)
- `upload_to_hub`: Upload to HuggingFace Hub after training (default: false)
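As noted for `dataset_name`, the dataset is expected to contain a "messages" field. A typical chat-format record might look like this (a sketch assuming the common role/content schema; check your dataset's actual structure):

```json
{
  "messages": [
    {"role": "user", "content": "Summarize the benefits of LoRA fine-tuning."},
    {"role": "assistant", "content": "LoRA trains small adapter matrices instead of all model weights, which cuts memory use and training time."}
  ]
}
```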
# Model & Dataset
base_model: "meta-llama/Llama-3.2-3B-Instruct"
dataset_name: "your-dataset/chat-format"
output_model_name: "your-username/model-finetuned"
output_dir: "/workspace/outputs/my-model/"
# Training Setup
training_type: "lora" # or "sft" for full fine-tuning
max_seq_length: 8192
# Loss Masking Strategy (NEW - optional, defaults to "full")
loss_masking_strategy: "full" # or "assistant_only"
# Hyperparameters
batch_size: 1
gradient_accumulation: 16 # Effective batch size: 16
epochs: 3
learning_rate: 2e-4
lr_scheduler_type: "cosine"
optim: "adamw_torch_fused"
save_steps: 100
# Memory & Performance
gradient_checkpointing: true
torch_compile: true
tf32: true
cache_dir: "~/.cache/late/models"
# LoRA Config (only if training_type: "lora")
lora:
  r: 128
  lora_alpha: 256
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
# Notifications & Tokens (NEW - optional)
ntfy_topic: "my-training-runs" # Push notifications
hf_token: "" # Or uses HF_TOKEN env var
# Logging
report_to_wandb: true
upload_to_hub: true

Late automatically manages your training runs:
- Run Directory: Each training run creates a timestamped directory in `training_runs/`
- Config Preservation: The YAML config is saved as JSON for reproducibility
- Training Script: The generated Python script is saved for debugging
- Logs: Full training output is saved to `training.log`
- Memory Management: VRAM is automatically cleared between queue runs
Example run directory structure:
training_runs/
└── UIGENT-7B-Lora_lora_20240315_143022/
    ├── config.json          # Training configuration
    ├── training_script.py   # Generated training script
    └── training.log         # Full output log
Here's how to go from a fresh server to a running batch of experiments.
# 1. Create a dedicated Python virtual environment
python3 -m venv rocm_trainer_env
source rocm_trainer_env/bin/activate
# 2. Install the 'late' library inside the venv
# (Assuming you've cloned the repo)
pip install -e .
# 3. Patch this new environment. This will take a while!
# No need for --venv flag since we are already inside it.
late patch
# 4. Set your API keys
late set wandb <your_key>
late set hf_token <your_key>
# 5. Create YAML config files for your experiments
# (e.g., `configs/sft_qwen.yml`, `configs/lora_llama.yml`)
# 6. Create a new training queue
late queue create nightly_runs.qml
# 7. Add your experiments to the queue
late queue add nightly_runs.qml configs/sft_qwen.yml
late queue add nightly_runs.qml configs/lora_llama.yml
# 8. Start the queue and let it run!
late queue start nightly_runs.qml

We welcome contributions! Please see our Contributing Guidelines for details.
# Clone the repo
git clone https://github.com/TesslateAI/Late.git
cd Late
# Create development environment
python -m venv dev-env
source dev-env/bin/activate
# Install in editable mode with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/

This project is licensed under the MIT License - see the LICENSE file for details.
- Built with ❤️ for the ROCm community
- Powered by HuggingFace Transformers
- Flash Attention implementation from ROCm/flash-attention
- Email: [email protected]
- Issues: GitHub Issues
- Discord: Join our community

Made with 🔥 by TesslateAI



