kshitij12345
Collaborator

@kshitij12345 kshitij12345 commented Sep 24, 2025

Context:
When running test_dtensor.py the way we do now (pytest thunder/tests/distributed/test_dtensor.py), we see the error "Error from segmentation group 1: The singleton Communicator isn't available." for #2503.

Changes:

  • Move all the tests under class DTensorTest(DistributedParallelTestCase) to free functions with a test prefix, and run them with torchrun --nproc-per-node 2 --no-python pytest thunder/tests/distributed/test_dtensor.py (this is required to run the nvFuser tests correctly).
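
As a rough illustration of the free-function style described above (not the actual code in this PR), the sketch below shows a pytest-collectable test that discovers its place in the job from the environment. The only assumption about torchrun is that it exports RANK, LOCAL_RANK, and WORLD_SIZE to every worker process it starts; the names LaunchInfo and read_launch_info are hypothetical helpers invented for this example and are not part of Thunder.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LaunchInfo:
    """Hypothetical container for the process layout (not a Thunder type)."""

    rank: int
    local_rank: int
    world_size: int


def read_launch_info(env=None) -> LaunchInfo:
    """Read the layout that torchrun exports as environment variables.

    Falls back to a single-process layout when the variables are absent,
    for example when the file is run with plain pytest.
    """
    env = os.environ if env is None else env
    return LaunchInfo(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


def test_launch_info_is_consistent():
    # A free function with a `test` prefix: pytest collects it in every
    # process that torchrun starts, so each rank runs the same body.
    info = read_launch_info()
    assert 0 <= info.rank < info.world_size
    assert 0 <= info.local_rank < info.world_size
```

Under the torchrun command from the bullet above, each of the two processes would run this test body once with its own rank; a real DTensor test would additionally need to set up the process group and device, which this sketch leaves out.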

cc @Borda

@github-actions github-actions bot added the ci label Sep 24, 2025
@kshitij12345 kshitij12345 force-pushed the update-dtensor-test-run branch from 8d40b21 to 7cf40ee September 24, 2025 11:31
@kshitij12345 kshitij12345 changed the title from "[WIP] run test_dtensor with torchrun" to "Run test_dtensor with torchrun" Sep 26, 2025
@kshitij12345 kshitij12345 added the DTensor label (Issues about DTensor support in Thunder) Sep 26, 2025
@kshitij12345 kshitij12345 self-assigned this Sep 26, 2025
Collaborator

@wujingyue wujingyue left a comment

LGTM, but can you clarify the context? What makes it hard for nvFuser DTensor tests to work with DistributedParallelTestCase?

@kshitij12345
Collaborator Author

> but can you clarify the context?

(I have also added this to the PR description.)
When running test_dtensor.py the way we do now (pytest thunder/tests/distributed/test_dtensor.py), we see the error "Error from segmentation group 1: The singleton Communicator isn't available." for #2503.

> What makes it hard for nvFuser DTensor tests to work with DistributedParallelTestCase?

I was seeing hangs in some tests when DistributedParallelTestCase is used with torchrun --nproc-per-node 2 --no-python pytest thunder/tests/distributed/test_dtensor.py. I haven't found the exact cause yet.

@kshitij12345 kshitij12345 marked this pull request as ready for review September 26, 2025 17:13
Collaborator

@KaelanDt KaelanDt left a comment

Thank you @kshitij12345

@kshitij12345 kshitij12345 enabled auto-merge (squash) September 29, 2025 07:49
@crcrpar
Collaborator

crcrpar commented Sep 29, 2025

> I was seeing hangs in some tests when DistributedParallelTestCase is used with torchrun --nproc-per-node 2 --no-python pytest thunder/tests/distributed/test_dtensor.py. I haven't found the exact cause yet.

We can just run DistributedParallelTestCase-based tests with plain pytest, and I think that is the way to run such tests. The PyTorch distributed elastic launch command (torchrun) is not needed. (I know you have a problem with pytest and the DTensor tests.)
