Revamp testing. #289
Conversation
Pull request overview
This PR revamps the testing infrastructure to improve test organization and execution. The key changes include consolidating workflow files, adding support for separate test directories (examples, unittests, ccl), and improving test isolation through explicit preamble calls in all_reduce tests.
Key Changes
- Consolidated testing workflows into a single comprehensive `iris-tests.yml` file with a matrix strategy that tests each directory (examples, unittests, ccl) with different rank counts (1, 2, 4, 8)
- Updated test scripts to accept a test directory parameter, enabling more granular test execution
- Enhanced all_reduce tests to explicitly call preamble before running operations for better test isolation
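The "explicit preamble" pattern in the last bullet can be sketched as follows. This is a hypothetical illustration, not Iris's real API: `preamble`, `run_all_reduce`, and the dict-based state are stand-ins for whatever setup the actual tests perform.

```python
# Hypothetical sketch: each test calls its setup routine explicitly
# instead of relying on shared module-level state, so one variant
# cannot leak configuration into the next.

def preamble(state):
    """Reset per-test state before running a collective operation."""
    state.clear()
    state["initialized"] = True

def run_all_reduce(state, values):
    # Fail loudly if a test forgot its preamble call.
    assert state.get("initialized"), "preamble must run before all_reduce"
    return sum(values)

def test_all_reduce_atomic():
    state = {}
    preamble(state)  # explicit call = better isolation between variants
    assert run_all_reduce(state, [1, 2, 3]) == 6
```

The payoff is that each parametrized variant starts from a known-clean state, which matters when variants like "atomic" and "ring" configure the communicator differently.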
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Summary per file:

| File | Description |
|---|---|
| `.github/workflows/iris-tests.yml` | New consolidated workflow with matrix strategy for testing all directories across multiple GPU configurations |
| `.github/workflows/iris-tests-apptainer.yml` | Deleted old Apptainer-specific workflow; functionality now in `iris-tests.yml` |
| `.github/workflows/iris-pip-install-test.yml` | Deleted pip installation test workflow, consolidated into the main testing workflow |
| `.github/workflows/iris-performance-regression-test.yml` | Added `DOCKER_IMAGE_NAME` environment variable for consistency |
| `.github/workflows/iris-external-validation-test.yml` | Added `DOCKER_IMAGE_NAME` environment variable for consistency |
| `.github/scripts/run_tests.sh` | Enhanced to accept a test directory parameter and validate directory existence |
| `.github/scripts/container_run.sh` | Updated to use the `DOCKER_IMAGE_NAME` variable with fallback to `iris-dev` |
| `.github/scripts/container_exec.sh` | Updated to use the `DOCKER_IMAGE_NAME` variable with fallback to `iris-dev` |
| `.github/scripts/container_build.sh` | Updated to use the `DOCKER_IMAGE_NAME` variable with fallback to `iris-dev` |
| `tests/ccl/test_all_reduce.py` | Disabled "ring" variant testing and added explicit preamble calls for better test isolation |
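The script changes above follow two common shell patterns, sketched below with assumed names (this is not the repo's actual code): a `${VAR:-default}` fallback for the image name, and validation of the test-directory argument against the three directories the matrix exercises.

```shell
# Hypothetical sketch of the patterns described in the table.

resolve_image() {
  # ${VAR:-default} expands to "iris-dev" when DOCKER_IMAGE_NAME
  # is unset or empty -- the fallback the container_*.sh scripts use.
  echo "${DOCKER_IMAGE_NAME:-iris-dev}"
}

validate_test_dir() {
  # Accept only the directories the workflow matrix tests.
  case "$1" in
    examples|unittests|ccl) return 0 ;;
    *) echo "unknown test dir: $1" >&2; return 1 ;;
  esac
}
```

A workflow step would then set `DOCKER_IMAGE_NAME` once in `env:` and every script resolves the same image, while an invalid directory argument fails fast instead of silently running zero tests.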
```python
if variant == "ring":
    config.all_reduce_num_rings = min(2, config.comm_sms)
```
**Copilot AI** (Nov 22, 2025):
Dead code: The conditional block for the "ring" variant will never execute since "ring" has been commented out in the parametrize decorator above (line 19). Consider removing this block or uncommenting the "ring" variant in the test parameters.
Suggested change:

```diff
-if variant == "ring":
-    config.all_reduce_num_rings = min(2, config.comm_sms)
```
```yaml
matrix:
  include:
    # Test each subdirectory with each rank count
    - test_dir: examples
      num_ranks: 1
      gpu_devices: "0,1"
    - test_dir: examples
      num_ranks: 2
      gpu_devices: "2,3"
    - test_dir: examples
      num_ranks: 4
      gpu_devices: "4,5,6,7"
    - test_dir: examples
      num_ranks: 8
      gpu_devices: "0,1,2,3,4,5,6,7"
    - test_dir: unittests
      num_ranks: 1
      gpu_devices: "0,1"
    - test_dir: unittests
      num_ranks: 2
      gpu_devices: "2,3"
    - test_dir: unittests
      num_ranks: 4
      gpu_devices: "4,5,6,7"
    - test_dir: unittests
      num_ranks: 8
      gpu_devices: "0,1,2,3,4,5,6,7"
    - test_dir: ccl
      num_ranks: 1
      gpu_devices: "0,1"
    - test_dir: ccl
      num_ranks: 2
      gpu_devices: "2,3"
    - test_dir: ccl
      num_ranks: 4
      gpu_devices: "4,5,6,7"
    - test_dir: ccl
      num_ranks: 8
      gpu_devices: "0,1,2,3,4,5,6,7"
```
**Copilot AI** (Nov 22, 2025):
Potential GPU device conflict: the test matrix allows parallel execution (`fail-fast: false`) with overlapping GPU device assignments. For example, the 8-rank tests use GPUs 0-7, while the 1-rank tests use GPUs 0-1. If these jobs run concurrently, they may conflict. Consider one of the following:
- Making GPU assignments mutually exclusive across all matrix jobs
- Adding a resource lock mechanism to prevent concurrent access to the same GPUs
- Setting a concurrency group to serialize jobs that share GPU resources
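The third option can be sketched as a workflow-level concurrency group. The group name below is illustrative, not taken from the PR; GitHub Actions queues jobs that share a `concurrency.group` key, and `cancel-in-progress: false` makes them wait rather than cancel each other.

```yaml
# Sketch: serialize workflow runs that contend for the same GPU pool.
concurrency:
  group: iris-gpu-tests-${{ github.ref }}
  cancel-in-progress: false
```

Note this serializes whole workflow runs per ref; serializing individual matrix jobs against each other would need job-level locking instead.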
^ This is why I wasn't using a matrix and was manually serializing the tests. If this works (it looks like it does), it's fine though.
```diff
 [
     "atomic",
-    "ring",
+    # "ring",
```
**Copilot AI** (Nov 22, 2025):
Missing documentation: The "ring" variant is commented out without explanation. Consider adding a comment explaining why this variant is disabled (e.g., "# ring - disabled due to known issues" or "# ring - TODO: fix compatibility issues").
Suggested change:

```diff
-    # "ring",
+    # "ring",  # Disabled due to known issues with the ring algorithm (TODO: fix compatibility issues)
```
I would like us to continue testing against the three different installs:
```shell
pip install git+https://github.com/${{ github.repository }}.git@${{ github.sha }}
pip install -e .
pip install .
```
Sometimes source-structure and import assumptions are only captured by one install mode and not the others.
@mawad-amd Why don't we just make sure the installs work there?
You mean just `pip install` without running the tests?
There are issues that won't show up unless we actually run the tests. If we really have to run the full tests against one install and only smoke tests against the others, then I would prefer we always run the full test suite against `pip install git+https://github.com/${{ github.repository }}.git@${{ github.sha }}`.