Key insight: The bottleneck for local 27B training is system RAM, not GPU VRAM. DeepSpeed ZeRO-3 + NVMe offload handles the GPU side, but deepspeed.initialize() materializes the full model + optimizer in RAM before swapping to NVMe. 96GB RAM is insufficient; 128GB is marginal; 192GB is comfortable.
Hardware Investigation Notes
What works
Qwen3-1.7B Repr-Align: Passes on single RTX 5090 with DeepSpeed ZeRO-3 + CPU offload (~4.8GB VRAM, ~15s/step). Full wandb logging. Anchor precompute takes ~30s for 1000 examples.
Qwen3.6-27B anchor precompute: Works across both GPUs (RTX 5090 + RTX 4000) with device_map=auto + CPU overflow. 1000 examples × 4 layers = 27GB cache in ~20 min.
DeepSpeed ZeRO-3 + NVMe optimizer offload: Successfully writes ~180GB of optimizer/param state to NVMe. The init completes; the RAM peak is the bottleneck.
What doesn't work (yet)
27B full-layer Repr-Align on 96GB RAM: OOM-killed during deepspeed.initialize(). The init peak (~170GB) exceeds available RAM + swap (104GB total). Two investigation paths remain open:
Upgrade to 192GB RAM: Replace 2×16GB sticks with 2×64GB. Cost: ~$2,000 AUD. Should comfortably fit the init peak.
Lazy init on the meta device (init_device: meta): Skip materializing params in RAM during model construction. DeepSpeed's zero.Init context with remote_device="cpu" still allocates params on CPU. A true meta-device init would defer allocation until DeepSpeed can partition and swap directly, so the full model is never held in RAM. This requires changes to the weight loading path in veomni/models/loader.py.
2-GPU ZeRO-3 with 96GB RAM: Each rank holds a partition (~27GB params), totaling 54GB in RAM. Optimizer adds another ~108GB. Total exceeds RAM even before considering DeepSpeed buffers.
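The arithmetic behind the 2-GPU estimate can be sketched as below. The byte counts are assumptions inferred from the figures quoted above (2 bytes per parameter gives the 54GB, 4 bytes per parameter of optimizer state gives the ~108GB); a real Adam setup may hold more state, and DeepSpeed buffers are not counted.

```python
# Rough host-RAM budget for the 2-GPU ZeRO-3 case, using the figures quoted above.
GB = 1e9


def init_ram_estimate_gb(n_params, param_bytes=2, optim_bytes_per_param=4):
    """Return (params_gb, optim_gb, total_gb) for state held in host RAM.

    param_bytes=2 and optim_bytes_per_param=4 are assumptions chosen to
    match the 54GB and ~108GB figures in these notes.
    """
    params_gb = n_params * param_bytes / GB
    optim_gb = n_params * optim_bytes_per_param / GB
    return params_gb, optim_gb, params_gb + optim_gb


params_gb, optim_gb, total_gb = init_ram_estimate_gb(27e9)
available_gb = 104  # RAM + swap, as stated in the notes

print(f"params ~{params_gb:.0f} GB, optimizer ~{optim_gb:.0f} GB, total ~{total_gb:.0f} GB")
print(f"fits in {available_gb} GB of RAM + swap: {total_gb <= available_gb}")
```

Under these assumptions the total comes to roughly 162GB, which is in line with the ~170GB init peak observed on the single-GPU run.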
Hobby RAM vs Cloud H100 — cost comparison
| Approach | Upfront Cost | Per-run Cost | Time for 10 steps | Notes |
| --- | --- | --- | --- | --- |
| RAM upgrade (96→192GB) | ~$2,000 AUD | $0 (electricity) | ~15-30 min | One-time, reusable for future models |
| Rent 8×H100 (Lambda) | $0 | ~$30-50/hr | <5 min | Faster, but per-experiment cost adds up |
| Rent 4×A100 (RunPod) | $0 | ~$12-20/hr | <10 min | Sweet spot for 27B |
| Rent 2×A6000 (Vast.ai) | $0 | ~$2-4/hr | ~20 min | Cheapest cloud option |
Break-even: At $2,000 AUD for RAM vs ~$30/hr for H100s, the RAM upgrade pays for itself after ~65 hours of rented H100 time (2,000 / 30 ≈ 67, ignoring any AUD/USD conversion). If you're iterating frequently (hyperparameter sweeps, multiple models), the RAM upgrade is more economical. If you only need a few one-off runs, cloud is cheaper.
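For the other rental options, the same break-even arithmetic can be run over the hourly ranges from the table. This is a rough sketch: it treats all prices as the same currency (an assumption, since the upgrade is quoted in AUD) and compares rental hours only, not the different step times of each setup.

```python
def break_even_hours(upfront_cost, hourly_rate):
    """Hours of rental at hourly_rate that cost as much as upfront_cost."""
    return upfront_cost / hourly_rate


RAM_UPGRADE_COST = 2000  # ~$2,000 AUD, from the table

# (low, high) hourly rates taken from the table above.
rental_options = {
    "8xH100 (Lambda)": (30, 50),
    "4xA100 (RunPod)": (12, 20),
    "2xA6000 (Vast.ai)": (2, 4),
}

for name, (low, high) in rental_options.items():
    # The higher hourly rate reaches break-even sooner, so it gives the lower bound.
    fewest = break_even_hours(RAM_UPGRADE_COST, high)
    most = break_even_hours(RAM_UPGRADE_COST, low)
    print(f"{name}: break-even after {fewest:.0f}-{most:.0f} rental hours")
```

The cheaper the rental, the longer the RAM upgrade takes to pay for itself, so the case for buying RAM is strongest when the alternative is H100 time.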
DeepSpeed NVMe offload gotchas
DS_SKIP_CUDA_CHECK=1: Required when system CUDA toolkit (13.1) doesn't match PyTorch's compiled CUDA (13.0). Without it, DeepSpeed can't compile async_io extensions needed for NVMe offload.
buffer_size: Must exceed the largest combined partition size. Default 100M elements is too small for 27B (622M for embed_tokens alone). Set to 2B elements: offload["buffer_size"] = 2_000_000_000.
Pre-build async_io: Run DS_SKIP_CUDA_CHECK=1 python -c "import deepspeed.ops.op_builder as b; b.AsyncIOBuilder().load()" once before training.
torch.empty on "nvme" device: DeepSpeed's _post_init patch in veomni/distributed/deepspeed_init.py must allocate on "cpu" not "nvme" — PyTorch doesn't recognize nvme as a device. The async_io swapper handles the NVMe transfer separately.
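Put together, the first two gotchas might look like the config fragment below. It is a minimal sketch, not the project's actual config: the NVMe path is a hypothetical mount point, the notes only say offload["buffer_size"] so placing it under offload_param is an assumption, and check_buffer_size is an illustrative helper rather than part of DeepSpeed. A full config would also need batch size, precision, and optimizer settings.

```python
import os

# Set before DeepSpeed builds or loads its ops (CUDA 13.1 toolkit vs 13.0 PyTorch).
os.environ.setdefault("DS_SKIP_CUDA_CHECK", "1")

NVME_PATH = "/mnt/nvme/ds_offload"  # hypothetical mount point, adjust locally
LARGEST_TENSOR_ELEMENTS = 622_000_000  # embed_tokens, from the notes above

# Fragment of a DeepSpeed config dict: ZeRO-3 with NVMe offload.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "buffer_size": 2_000_000_000,  # elements; assumed to live here
            "pin_memory": True,
        },
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": NVME_PATH,
            "pin_memory": True,
        },
    },
}


def check_buffer_size(config, largest_elements):
    """Illustrative helper: fail early if buffer_size cannot hold the largest tensor."""
    buf = config["zero_optimization"]["offload_param"]["buffer_size"]
    if buf <= largest_elements:
        raise ValueError(
            f"buffer_size={buf:,} must exceed largest tensor ({largest_elements:,} elements)"
        )
    return buf


check_buffer_size(ds_config, LARGEST_TENSOR_ELEMENTS)
```

Running the check before deepspeed.initialize() surfaces an undersized buffer as a clear error instead of a failure partway through the swap.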