Collaborator

@finbarrtimbers finbarrtimbers commented Nov 10, 2025

These worked for DPO!

The other flags are already defaults. With this, we should be able to remove the --env flags for all the DPO scripts, as all the defaults will be correct.
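One way to check, for a given script, which `--env` flags have become redundant is to compare them against the defaults. The sketch below is only illustrative: `redundant_env_flags` is a hypothetical helper, and the `defaults` dictionary is filled in by hand from the values quoted in this thread rather than read from mason.py.

```python
def redundant_env_flags(cli_env, defaults):
    """Return the names of NAME=VALUE flags whose value already matches the default.

    cli_env:  list of "NAME=VALUE" strings, as passed to --env.
    defaults: dict of NAME -> VALUE (hypothetical; filled in by hand here).
    """
    redundant = []
    for item in cli_env:
        # partition splits on the first "=" only, so values may contain "=".
        name, _, value = item.partition("=")
        if defaults.get(name) == value:
            redundant.append(name)
    return redundant


# Illustrative inputs only; the real defaults live in mason.py.
example_defaults = {
    "NCCL_PROTO": "Simple,LL128",
    "NCCL_LIB_DIR": "/var/lib/tcpxo/lib64",
}
example_flags = [
    "NCCL_PROTO=Simple,LL128",   # same as the default, so redundant
    "NCCL_LIB_DIR=/opt/other",   # differs from the default, so still needed
    "MY_CUSTOM_VAR=1",           # no default at all, so still needed
]
redundant = redundant_env_flags(example_flags, example_defaults)
```

Only flags whose name and value both match a default are reported, so a script that deliberately overrides a default keeps its flag.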


Note

Adds LD_LIBRARY_PATH and NCCL_LIB_DIR env vars to the GCP cluster NCCL configuration in mason.py.

  • Environment configuration (GCP clusters):
    • In get_env_vars GCP branch, add:
      • LD_LIBRARY_PATH to include /var/lib/tcpxo/lib64 plus existing value.
      • NCCL_LIB_DIR set to /var/lib/tcpxo/lib64.

Written by Cursor Bugbot for commit a07f8d9.
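As a rough illustration of the `LD_LIBRARY_PATH` change summarized above, the following sketch prepends a directory to an existing search path. `prepend_library_path` is a hypothetical helper, not something in mason.py. It differs from a plain f-string in two ways that are assumptions of this sketch: it avoids a trailing `:` when the existing value is empty (the dynamic linker treats an empty entry as the current directory), and it drops empty and duplicate entries.

```python
TCPXO_LIB_DIR = "/var/lib/tcpxo/lib64"


def prepend_library_path(new_dir, existing):
    """Put new_dir first in a colon-separated search path.

    Empty entries and any earlier copy of new_dir are dropped
    (a choice made for this sketch, not behavior taken from the PR).
    """
    parts = [new_dir]
    if existing:
        parts.extend(p for p in existing.split(":") if p and p != new_dir)
    return ":".join(parts)


# No existing value: the result has no trailing colon.
only_new = prepend_library_path(TCPXO_LIB_DIR, "")

# Existing value that already contains the directory: it is moved to the front.
merged = prepend_library_path(TCPXO_LIB_DIR, "/usr/local/lib:/var/lib/tcpxo/lib64")
```

In real use the `existing` argument would be `os.getenv("LD_LIBRARY_PATH", "")`, read on whichever machine builds the environment list.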

@finbarrtimbers finbarrtimbers marked this pull request as ready for review November 12, 2025 18:50
@finbarrtimbers finbarrtimbers requested review from saurabh111233212 and tyler-romero and removed request for saurabh111233212 November 12, 2025 18:50
# Add COLL here to log all collective operations. Extremely verbose, don't use for production.
beaker.BeakerEnvVar(name="NCCL_DEBUG_SUBSYS", value="INIT,NET"),
beaker.BeakerEnvVar(
    name="LD_LIBRARY_PATH", value=f"/var/lib/tcpxo/lib64:{os.getenv('LD_LIBRARY_PATH', '')}"
Contributor


LGTM but I'm also seeing these from DPO:

    --env NCCL_PROTO=Simple,LL128 \
    --env NCCL_TUNER_CONFIG_PATH=/var/lib/tcpxo/lib64/a3plus_tuner_config_ll128.textproto \
    --env NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/var/lib/tcpxo/lib64/a3plus_guest_config_ll128.textproto \

I'm not sure they're necessary, especially the NCCL_PROTO one, but just flagging for awareness.

We're also running source /var/lib/tcpxo/lib64/nccl-env-profile.sh before launching the actual accelerate command.

Contributor


Ah I see you mentioned that the other flags are already defaults

Collaborator Author


Yes, these are all set (except for nccl-env-profile.sh). I'm not sure how to source nccl-env-profile.sh programmatically. Do you have any ideas?
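One possible approach, offered as a sketch rather than a tested answer for this cluster: have bash source the script in a subprocess, dump the resulting environment, and read it back in Python. The function names below are hypothetical. The sketch assumes `bash` is on the PATH and that the script only needs to contribute exported variables; anything it sets without `export` would not be captured.

```python
import json
import os
import subprocess
import sys

# Variables bash itself changes on startup; ignored when diffing.
_SHELL_NOISE = {"_", "SHLVL", "PWD", "OLDPWD"}


def env_after_sourcing(script_path):
    """Return the environment a bash shell has after sourcing script_path."""
    dump = "import json, os; print(json.dumps(dict(os.environ)))"
    # $1 = script to source, $2 = Python interpreter, $3 = dump snippet.
    # Output of the sourced script is discarded so stdout holds only the JSON.
    shell_cmd = 'source "$1" >/dev/null && exec "$2" -c "$3"'
    result = subprocess.run(
        ["bash", "-c", shell_cmd, "bash", script_path, sys.executable, dump],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def changed_vars(script_path):
    """Return only the variables that sourcing the script added or modified."""
    before = dict(os.environ)
    after = env_after_sourcing(script_path)
    return {
        name: value
        for name, value in after.items()
        if name not in _SHELL_NOISE and before.get(name) != value
    }


if __name__ == "__main__":
    profile = "/var/lib/tcpxo/lib64/nccl-env-profile.sh"
    if os.path.exists(profile):
        for name, value in sorted(changed_vars(profile).items()):
            print(f"{name}={value}")
```

The resulting dictionary could then be merged into the job's environment, for example with `os.environ.update(...)` before launching the training command, or turned into additional environment variable entries.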


3 participants