fix(tests): validate training completion in UT and capture torchrun exit code #620

Open
WangLingxun wants to merge 1 commit into main from fix/direct-capture-torchrun-exit-code

WangLingxun (Collaborator) commented Mar 20, 2026

  1. Shell fix (primus-cli-direct.sh):

    • Add set +e before eval "$CMD" so the script does not exit
      immediately on non-zero torchrun return code, allowing proper
      error logging before exit.
  2. UT validation (tests/utils.py → run_training_script):

    • Extract common training-script execution logic into a shared
      'run_training_script()' helper, replacing duplicated code across
      test_megatron_trainer, test_torchtitan_trainer, and
      test_maxtext_trainer.
    • In the success path (exit code 0), assert that the PrimusRuntime
      'Training completed.' marker is present in the log file. This
      catches silent training failures where torchrun returns 0 but
      training did not actually finish (e.g. AITER HIP errors).
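
The marker check in item 2 can be sketched roughly as follows. This is a minimal illustration, not the helper added in this PR: the signature `run_training_script(cmd, log_path)` and the error messages are assumptions, and only the 'Training completed.' marker string comes from the description above.

```python
import subprocess
from pathlib import Path

# Marker string quoted in the PR description.
COMPLETION_MARKER = "Training completed."


def run_training_script(cmd, log_path):
    """Run cmd, write its combined output to log_path, and validate completion.

    Hypothetical signature; the real helper in tests/utils.py may differ.
    """
    log_path = Path(log_path)
    with log_path.open("w") as log_file:
        # Send stdout and stderr to the same log file.
        proc = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    log_text = log_path.read_text()

    # Failure path: a non-zero exit code fails the test outright.
    if proc.returncode != 0:
        raise AssertionError(f"training exited with code {proc.returncode}")

    # Success path: exit code 0 is not enough, the marker must be in the log.
    if COMPLETION_MARKER not in log_text:
        raise AssertionError(
            f"exit code 0 but {COMPLETION_MARKER!r} not found in {log_path}"
        )
    return log_text
```

With this shape, a run that returns 0 without printing the marker fails the test instead of passing silently.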

…xit code

1. Shell fix (primus-cli-direct.sh):
   - Add set +e before eval "$CMD" so the script does not exit
     immediately on non-zero torchrun return code, allowing proper
     error logging before exit.

2. UT validation (tests/utils.py → run_training_script):
   - Extract common training-script execution logic into a shared
     'run_training_script()' helper, replacing duplicated code across
     test_megatron_trainer, test_torchtitan_trainer, and
     test_maxtext_trainer.
   - In the success path (exit code 0), assert that the PrimusRuntime
     'Training completed.' marker is present in the log file. This
     catches silent training failures where torchrun returns 0 but
     training did not actually finish (e.g. AITER HIP errors).
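
The effect of the shell change in item 1 can be reproduced with a small stand-in script. The sketch below assumes errexit (set -e) is active earlier in the script, which the description implies but does not show. The stand-in command, the log line, and `run_launcher` are hypothetical and are not taken from primus-cli-direct.sh.

```python
import subprocess

# Hypothetical stand-in for the launcher script: CMD always fails with status 3.
SCRIPT_TEMPLATE = """
set -e
CMD='sh -c "exit 3"'
{toggle}
eval "$CMD"
rc=$?
echo "[ERROR] command exited with code $rc"
exit $rc
"""


def run_launcher(disable_errexit):
    """Run the stand-in script with or without 'set +e' before eval."""
    # ':' is the shell no-op, used when errexit is left enabled.
    toggle = "set +e" if disable_errexit else ":"
    script = SCRIPT_TEMPLATE.format(toggle=toggle)
    proc = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    # Without 'set +e' the shell exits at eval, so the error line is never logged.
    print(run_launcher(disable_errexit=False))
    # With 'set +e' the exit code is captured, logged, and then propagated.
    print(run_launcher(disable_errexit=True))
```

In both cases the script ends with the failing command's status, but only the second variant reaches the logging line before exiting.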
WangLingxun force-pushed the fix/direct-capture-torchrun-exit-code branch from 8e38f26 to 1da90ca on March 20, 2026 11:03
WangLingxun changed the title from "fix(cli): move set +e before eval to capture torchrun exit code" to "fix(tests): validate training completion in UT and capture torchrun exit code" on Mar 20, 2026