# LM Engine

**A Hyper-Optimized Library for Pretraining, Finetuning, and Distillation of Large Language Models**

LM Engine is a research-grade, production-ready library for training large language models at scale. Built with performance and flexibility in mind, it provides native support for multiple accelerators, including NVIDIA GPUs, Google TPUs, and AWS Trainium.
## Features

- 🚀 Multi-Accelerator Support — Train on NVIDIA CUDA GPUs, Google Cloud TPUs, and AWS Trainium
- ⚡ Advanced Distributed Training — FSDP (1 & 2), Tensor Parallelism, Pipeline Parallelism, and ZeRO stages 1-3
- 🔧 Flexible Model Architectures — Transformer variants, MoE, Mamba2, RNNs, and hybrid architectures
- 📦 HuggingFace Integration — Seamless import/export with the HuggingFace ecosystem
- 🎯 Training Modes — Pretraining from scratch, full finetuning, and knowledge distillation
- 🔥 Custom Kernels — High-performance Triton, CUDA, and Pallas kernels via XMA
- 📊 Experiment Tracking — Native Weights & Biases and Aim integration
- 💾 Efficient Checkpointing — Async checkpointing with full state resumability
## Installation

### From source

```bash
# Clone the repository
git clone https://github.com/open-lm-engine/lm-engine.git
cd lm-engine

# Install with uv
uv sync --extra cuda   # For NVIDIA GPUs
uv sync --extra tpu    # For Google TPUs
```

### From PyPI

```bash
pip install lm-engine

# With optional dependencies
pip install "lm-engine[cuda]"    # NVIDIA GPU support
pip install "lm-engine[tpu]"     # Google TPU support
pip install "lm-engine[mamba2]"  # Mamba2 architecture support
pip install "lm-engine[data]"    # Data preprocessing utilities
pip install "lm-engine[dev]"     # Development dependencies
```

### Docker

```bash
# Build for TPU
docker build --build-arg EXTRA=tpu -t lm-engine:tpu -f docker/Dockerfile .

# Build for CUDA
docker build --build-arg EXTRA=cuda -t lm-engine:cuda -f docker/Dockerfile .
```

## Quick Start

Launch training using a sample pretraining config from the `configs` folder.
```bash
# Single GPU
python -m lm_engine.pretrain --config config.yml

# Multi-GPU with torchrun
torchrun --nproc_per_node=8 -m lm_engine.pretrain --config config.yml

# Multi-node
torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    -m lm_engine.pretrain --config config.yml
```

## Distributed Training

LM Engine provides comprehensive distributed training support:
### ZeRO and FSDP

```yaml
distributed_args:
  stage: 3           # ZeRO stage (0, 1, 2, or 3)
  fsdp_algorithm: 2  # FSDP-1 or FSDP-2
  zero_topology:
    data_parallel_replication_world_size: 2
    data_parallel_sharding_world_size: 4
```

### Tensor Parallelism

```yaml
distributed_args:
  tensor_parallel_world_size: 4
  sequence_parallel: true
  use_async_tensor_parallel: true
```

### Pipeline Parallelism

```yaml
distributed_args:
  pipeline_parallel_world_size: 2
  num_pipeline_stages: 4
  pipeline_parallel_schedule: "1F1B"
```

### Gradient Checkpointing

```yaml
distributed_args:
  gradient_checkpointing_method: block  # or "full"
  gradient_checkpointing_args:
    checkpoint_every_n_layers: 2
```

## HuggingFace Interoperability

Import a model from the HuggingFace Hub:

```python
from tools.import_from_hf import convert_hf_to_lm_engine

convert_hf_to_lm_engine(
    model_name="meta-llama/Llama-3.1-8B",
    output_path="./converted-model"
)
```

Export a trained checkpoint back to HuggingFace format:

```python
from tools.export_to_hf import convert_lm_engine_to_hf

convert_lm_engine_to_hf(
    checkpoint_path="./checkpoints/step-10000",
    output_path="./hf-model",
    model_type="llama"
)
```

Unshard a checkpoint:

```bash
python -m lm_engine.unshard --config unshard.yml
```

## Full Configuration Options
### Training

| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_training_steps` | int | required | Total training steps |
| `micro_batch_size` | int | required | Batch size per device |
| `gradient_accumulation_steps` | int | 1 | Gradient accumulation steps |
| `gradient_clipping` | float | 1.0 | Max gradient norm |
| `eval_during_training` | bool | true | Enable validation |
| `eval_interval` | int | — | Steps between evaluations |
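For intuition, `gradient_clipping` caps the global L2 norm of the gradients. The sketch below is a minimal pure-Python illustration of clipping by global norm (it is not LM Engine's actual implementation, which operates on accelerator tensors): all gradients are scaled uniformly whenever their combined norm exceeds the threshold.

```python
import math

def clip_by_global_norm(grads: list[float], max_norm: float = 1.0):
    """Scale grads uniformly so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# A gradient vector with global norm 5.0 is rescaled to norm 1.0.
clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Gradients whose global norm is already below the threshold pass through unchanged.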
### Optimizer

| Parameter | Type | Default | Description |
|---|---|---|---|
| `class_name` | str | "TorchAdamW" | Optimizer class |
| `lr` | float | 1e-5 | Learning rate |
| `weight_decay` | float | 0.1 | Weight decay |
| `betas` | list | [0.9, 0.95] | Adam betas |
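These defaults describe AdamW with decoupled weight decay. As a scalar sketch of what one update step does with the defaults above (for intuition only; judging by its name, `TorchAdamW` presumably wraps PyTorch's `torch.optim.AdamW`, which is what runs in practice):

```python
import math

def adamw_step(param, grad, m, v, step, lr=1e-5, betas=(0.9, 0.95),
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on a scalar parameter, with decoupled weight decay."""
    m = betas[0] * m + (1 - betas[0]) * grad         # first-moment EMA
    v = betas[1] * v + (1 - betas[1]) * grad * grad  # second-moment EMA
    m_hat = m / (1 - betas[0] ** step)               # bias correction
    v_hat = v / (1 - betas[1] ** step)
    param -= lr * weight_decay * param               # decay decoupled from grad
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)   # adaptive gradient step
    return param, m, v

# First step from param=1.0 with grad=0.5 nudges the weight down slightly.
p, m, v = adamw_step(1.0, 0.5, m=0.0, v=0.0, step=1)
```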
### Learning Rate Scheduler

| Parameter | Type | Default | Description |
|---|---|---|---|
| `lr_decay_style` | str | "cosine" | Decay schedule (linear, cosine, exponential) |
| `num_warmup_steps` | int | 200 | Warmup steps |
| `num_decay_steps` | int | — | Decay steps (defaults to remaining) |
| `lr_decay_factor` | float | 0.1 | Final LR ratio |
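Put together, the defaults above mean linear warmup followed by cosine decay down to `lr_decay_factor * lr`. A sketch of the resulting schedule (illustrative only, with `num_decay_steps=10_000` as a placeholder value; the real scheduler lives inside LM Engine):

```python
import math

def cosine_lr(step, max_lr=1e-5, num_warmup_steps=200,
              num_decay_steps=10_000, lr_decay_factor=0.1):
    """Linear warmup to max_lr, then cosine decay to lr_decay_factor * max_lr."""
    min_lr = lr_decay_factor * max_lr
    if step < num_warmup_steps:
        return max_lr * step / num_warmup_steps  # linear warmup
    progress = min((step - num_warmup_steps) / num_decay_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

After `num_warmup_steps + num_decay_steps`, the learning rate stays flat at `lr_decay_factor * max_lr`.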
### Mixed Precision

| Parameter | Type | Default | Description |
|---|---|---|---|
| `dtype` | str | "fp32" | Training dtype (fp32, bf16, fp16) |
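For intuition about what `bf16` trades away relative to `fp32`: bfloat16 keeps fp32's 8-bit exponent (same dynamic range) but shrinks the significand to 8 bits. A pure-Python sketch of the rounding, assuming round-to-nearest-even and ignoring NaN/inf handling (this is only an illustration of the format, not how LM Engine casts tensors):

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Round x to bfloat16 precision: keep the top 16 bits of its fp32
    encoding, rounding to nearest-even (NaN/inf not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # round-to-nearest-even on the 16 bits about to be dropped
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFFFFFF
    bits &= 0xFFFF0000  # drop the low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# pi keeps only ~3 significant decimal digits in bf16
print(to_bf16(math.pi))  # → 3.140625
```

Values that are exactly representable in bf16 (e.g. `1.0`, `-2.5`) round-trip unchanged.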
## Cloud Deployment

### Google Cloud TPU

```bash
# See scripts/gcp-tpu/ for TPU-specific launch scripts
./scripts/gcp-tpu/launch.sh --config config.yml --tpu-name my-tpu-v4
```

### AWS Trainium

```bash
# See scripts/aws-trainium/ for Trainium launch scripts
./scripts/aws-trainium/launch.sh --config config.yml
```

### Kubernetes (GKE)

```bash
# See scripts/gke/ for Kubernetes manifests
kubectl apply -f scripts/gke/training-job.yml
```

## Community

Join the Discord server to discuss LLM architecture research and distributed training, and to contribute to the project!
## Citation

If you use LM Engine in your research, please cite:

```bibtex
@software{mishra2024lmengine,
  title  = {LM Engine: A Hyper-Optimized Library for Pretraining and Finetuning},
  author = {Mishra, Mayank},
  year   = {2024},
  url    = {https://github.com/open-lm-engine/lm-engine}
}
```

## License

LM Engine is released under the Apache 2.0 License.
Built with ❤️ by Mayank Mishra