fix imports in components/checkpoint.py #1844

saforem2 · 2025-10-09T15:44:48Z

I was seeing:

ModuleNotFoundError in components/checkpoint.py:

 Traceback (most recent call last):
   File "/opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0/lib/python3.10/runpy.py", line 196, in _run_module_as_main
     return _run_code(code, main_globals, None,
   File "/opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0/lib/python3.10/runpy.py", line 86, in _run_code
     exec(code, run_globals)
   File "/lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/tt/torchtitan/experiments/blendcorpus/train.py", line 21, in <module>
     from torchtitan.components.checkpoint import CheckpointManager
   File "/lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/tt/torchtitan/components/checkpoint.py", line 26, in <module>
     from torch.distributed.checkpoint._consolidate_hf_safetensors import (
 ModuleNotFoundError: No module named 'torch.distributed.checkpoint._consolidate_hf_safetensors'

ImportError issue in torchtitan/distributed/pipeline_parallel.py:

 Traceback (most recent call last):
   File "/opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0/lib/python3.10/runpy.py", line 187, in _run_module_as_main
     mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
   File "/opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0/lib/python3.10/runpy.py", line 110, in _get_module_details
     __import__(pkg_name)
   File "/lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/tt/torchtitan/experiments/blendcorpus/__init__.py", line 16, in <module>
     from torchtitan.experiments.blendcorpus.infra.pipeline import pipeline_llama
   File "/lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/tt/torchtitan/experiments/blendcorpus/infra/pipeline.py", line 22, in <module>
     from torchtitan.distributed.pipeline_parallel import (
   File "/lus/flare/projects/AuroraGPT/AuroraGPT-v1/Experiments/AuroraGPT-2B/tt/saforem2/tt/torchtitan/distributed/pipeline_parallel.py", line 15, in <module>
     from torch.distributed.pipelining.schedules import (
 ImportError: cannot import name 'ScheduleDualPipeV' from 'torch.distributed.pipelining.schedules' (/opt/aurora/25.190.0/frameworks/aurora_frameworks-2025.2.0/lib/python3.10/site-packages/torch/distributed/pipelining/schedules.py)
 ^C^C^C
 [1]    66162 interrupt  ezpz-launch python3 -m torchtitan.experiments.blendcorpus.train     |

PyTorch Config:

; python3 -c 'import torch; print(*torch.__config__.show().split("\n"), sep="\n")'
PyTorch built with:
  - GCC 13.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2025.2-Product Build 20250620 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - CPU capability usage: AVX512
XPU backend  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=RelWithDebInfo, COMMIT_SHA=ba56102387ef21a3b04b357e5b183d48f0afefc7, CXX_COMPILER=/opt/aurora/25.190.0/spack/unified/0.10.0/install/linux-sles15-x86_64/gcc-13.3.0/gcc-13.3.0-4enwbrb/bin/g++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=OFF -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -DUSE_XPU -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.8.0, USE_CUDA=0, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=1, USE_MPI=0, USE_NCCL=OFF, USE_NNPACK=0, USE_OPENMP=ON, USE_ROCM=0, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=1, USE_XPU=1,

Copilot Summary

This pull request introduces several updates to the checkpointing and pipeline parallel scheduling logic to improve compatibility, simplify configuration, and streamline checkpoint saving. The most significant changes involve refactoring how checkpoint staging and consolidation are handled, updating the pipeline schedule import logic for better fallback, and removing obsolete or redundant configuration options.

Checkpointing improvements and refactoring:

Refactored the import of DefaultStager and StagingOptions in checkpoint.py to use a fallback implementation if the primary import fails, improving compatibility across different PyTorch versions. (torchtitan/components/checkpoint.py)
Simplified the logic for creating HuggingFaceStorageWriter by removing conditional consolidation and always enabling consolidation with a fixed thread count, streamlining the checkpoint saving process. (torchtitan/components/checkpoint.py)
Removed the call to consolidate_safetensors_files_on_every_rank after saving, as consolidation is now always handled during the save operation itself. (torchtitan/components/checkpoint.py)
Removed the load_only option and all related checks, as well as redundant state management for staging, to simplify the checkpointing interface and behavior. (torchtitan/components/checkpoint.py) [1] [2] [3] [4]
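
As a rough illustration of the guarded import described in the first item, here is a minimal sketch (not the code from this PR) that checks whether an optional submodule can be located before importing from it. The helper name `module_available` and the constant `HF_CONSOLIDATION` are hypothetical; the module path is the one from the traceback above.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located.

    For a dotted name, find_spec imports the parent packages first,
    so a missing parent surfaces as ModuleNotFoundError.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # ModuleNotFoundError is a subclass of ImportError.
        return False


# Module path taken from the traceback above; it was not found in the
# PyTorch 2.8.0 build reported in this PR.
HF_CONSOLIDATION = "torch.distributed.checkpoint._consolidate_hf_safetensors"
```

A caller could branch on `module_available(HF_CONSOLIDATION)` and skip or replace the consolidation step when it returns False, instead of failing at import time.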

Pipeline parallel scheduling updates:

Updated the import logic for ScheduleDualPipeV to provide a fallback to ScheduleZBVZeroBubble if the primary schedule is unavailable, increasing robustness to upstream changes. (torchtitan/distributed/pipeline_parallel.py)
Simplified pipeline schedule construction by removing the loss rescaling wrapper, passing the loss function directly instead. (torchtitan/distributed/pipeline_parallel.py)
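
The schedule fallback described above could be sketched roughly as follows. This is an assumption-laden illustration, not the code from this PR: `first_available` and `pick_v_schedule` are hypothetical names, and whether `ScheduleZBVZeroBubble` is an acceptable substitute in a given setup is not something this sketch verifies.

```python
import importlib


def first_available(module_name, candidates):
    """Return (name, attr) for the first candidate found on the module.

    Returns (None, None) if the module cannot be imported or none of
    the candidate names exist on it.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None, None
    for name in candidates:
        attr = getattr(module, name, None)
        if attr is not None:
            return name, attr
    return None, None


def pick_v_schedule():
    """Prefer ScheduleDualPipeV, falling back to ScheduleZBVZeroBubble."""
    return first_available(
        "torch.distributed.pipelining.schedules",
        ["ScheduleDualPipeV", "ScheduleZBVZeroBubble"],
    )
```

Returning the resolved name alongside the class lets the caller log which schedule was actually selected, which makes a silent fallback easier to spot.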

meta-cla · 2025-10-09T15:44:53Z

Hi @saforem2!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

wwwjn

In torchtitan, we assume the PyTorch version is nightly. You can follow the README to download the latest PyTorch nightly.

saforem2 · 2025-10-09T17:14:56Z

ahhhh, okay. I saw from the README

To use the latest features of torchtitan, we recommend using the most recent PyTorch nightly.

but didn't know it was a strict requirement.

The only reason I bring it up is that I'm currently testing on Intel XPU devices, but our production (user-facing) environments are currently on PyTorch 2.8.

tianyu-l

have you tried rebasing onto latest pytorch nightly?

saforem2 added 2 commits October 9, 2025 10:43

fix imports in components/checkpoint.py

cfd29fc

fix distributed/pipeline_parallel.py

443d5a7

saforem2 requested review from fegin, tianyu-l, wconstab and wwwjn as code owners October 9, 2025 15:44

fix: Resolve conflicts in components/checkpoint.py

f9b2a83

meta-cla bot added the CLA Signed label Oct 9, 2025

Merge branch 'main' into saforem2/fix-pt28

99e4075

wwwjn reviewed Oct 9, 2025


tianyu-l requested changes Oct 9, 2025
