
Conversation

@neoblizz
Member

No description provided.

@github-actions github-actions bot added in-progress We are working on it iris Iris project issue labels Nov 17, 2025
@neoblizz neoblizz marked this pull request as ready for review November 22, 2025 20:31
Copilot AI review requested due to automatic review settings November 22, 2025 20:31
Copilot finished reviewing on behalf of neoblizz November 22, 2025 20:34
Contributor

Copilot AI left a comment


Pull request overview

This PR revamps the testing infrastructure to improve test organization and execution. The key changes include consolidating workflow files, adding support for separate test directories (examples, unittests, ccl), and improving test isolation through explicit preamble calls in all_reduce tests.

Key Changes

  • Consolidated testing workflows into a single comprehensive iris-tests.yml file with a matrix strategy that tests each directory (examples, unittests, ccl) with different rank counts (1, 2, 4, 8)
  • Updated test scripts to accept a test directory parameter, enabling more granular test execution
  • Enhanced all_reduce tests to explicitly call preamble before running operations for better test isolation

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

File Description
.github/workflows/iris-tests.yml New consolidated workflow with matrix strategy for testing all directories across multiple GPU configurations
.github/workflows/iris-tests-apptainer.yml Deleted old Apptainer-specific workflow, functionality now in iris-tests.yml
.github/workflows/iris-pip-install-test.yml Deleted pip installation test workflow, consolidated into main testing workflow
.github/workflows/iris-performance-regression-test.yml Added DOCKER_IMAGE_NAME environment variable for consistency
.github/workflows/iris-external-validation-test.yml Added DOCKER_IMAGE_NAME environment variable for consistency
.github/scripts/run_tests.sh Enhanced to accept test directory parameter and validate directory existence
.github/scripts/container_run.sh Updated to use DOCKER_IMAGE_NAME variable with fallback to "iris-dev"
.github/scripts/container_exec.sh Updated to use DOCKER_IMAGE_NAME variable with fallback to "iris-dev"
.github/scripts/container_build.sh Updated to use DOCKER_IMAGE_NAME variable with fallback to "iris-dev"
tests/ccl/test_all_reduce.py Disabled "ring" variant testing and added explicit preamble calls for better test isolation
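The script changes described in the table above can be sketched roughly as follows. This is a hypothetical sketch of the two patterns (default-with-fallback for `DOCKER_IMAGE_NAME`, and test-directory validation in `run_tests.sh`); the variable and argument names are assumptions, not the actual script contents.

```shell
#!/bin/sh
# Sketch of the patterns described above; names are assumptions.

# container_*.sh pattern: use DOCKER_IMAGE_NAME if set, otherwise fall
# back to the default image name "iris-dev".
IMAGE_NAME="${DOCKER_IMAGE_NAME:-iris-dev}"
echo "image: ${IMAGE_NAME}"

# run_tests.sh pattern: accept a test directory argument and reject
# anything that is not one of the known test directories.
TEST_DIR="${1:-examples}"
case "$TEST_DIR" in
    examples|unittests|ccl)
        echo "running tests in tests/${TEST_DIR}"
        ;;
    *)
        echo "error: unknown test directory: ${TEST_DIR}" >&2
        exit 1
        ;;
esac
```

The `${VAR:-default}` expansion keeps the scripts backward compatible: existing callers that never set `DOCKER_IMAGE_NAME` get the old behavior unchanged.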

Comment on lines 76 to 77
if variant == "ring":
config.all_reduce_num_rings = min(2, config.comm_sms)

Copilot AI Nov 22, 2025


Dead code: The conditional block for the "ring" variant will never execute since "ring" has been commented out in the parametrize decorator above (line 19). Consider removing this block or uncommenting the "ring" variant in the test parameters.

Suggested change
if variant == "ring":
config.all_reduce_num_rings = min(2, config.comm_sms)

Comment on lines +46 to +84
matrix:
include:
# Test each subdirectory with each rank count
- test_dir: examples
num_ranks: 1
gpu_devices: "0,1"
- test_dir: examples
num_ranks: 2
gpu_devices: "2,3"
- test_dir: examples
num_ranks: 4
gpu_devices: "4,5,6,7"
- test_dir: examples
num_ranks: 8
gpu_devices: "0,1,2,3,4,5,6,7"
- test_dir: unittests
num_ranks: 1
gpu_devices: "0,1"
- test_dir: unittests
num_ranks: 2
gpu_devices: "2,3"
- test_dir: unittests
num_ranks: 4
gpu_devices: "4,5,6,7"
- test_dir: unittests
num_ranks: 8
gpu_devices: "0,1,2,3,4,5,6,7"
- test_dir: ccl
num_ranks: 1
gpu_devices: "0,1"
- test_dir: ccl
num_ranks: 2
gpu_devices: "2,3"
- test_dir: ccl
num_ranks: 4
gpu_devices: "4,5,6,7"
- test_dir: ccl
num_ranks: 8
gpu_devices: "0,1,2,3,4,5,6,7"

Copilot AI Nov 22, 2025


Potential GPU device conflict: The test matrix allows parallel execution (fail-fast: false) with overlapping GPU device assignments. For example, the 8-rank tests use GPUs 0-7, while 1-rank tests use GPUs 0-1. If these jobs run concurrently, they may conflict. Consider either:

  1. Making GPU assignments mutually exclusive across all matrix jobs
  2. Adding a resource lock mechanism to prevent concurrent access to the same GPUs
  3. Setting a concurrency group to serialize jobs that share GPU resources
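Option 3 could look something like the fragment below. This is an illustrative sketch only (the job id and group key are assumptions, not from the PR), and note the caveat that a key derived from `gpu_devices` only serializes jobs with identical device strings; overlapping-but-different sets would need a coarser key, such as one group per runner.

```yaml
# Hypothetical sketch: serialize matrix jobs that share GPUs by placing
# them in a shared concurrency group. Names here are illustrative.
jobs:
  run-tests:
    strategy:
      fail-fast: false
      matrix:
        include:
          - test_dir: examples
            num_ranks: 8
            gpu_devices: "0,1,2,3,4,5,6,7"
    # Jobs with the same group key queue rather than run concurrently.
    # Caveat: "gpus-0,1" and "gpus-0,1,2,3" are different groups, so
    # overlapping sets still need a coarser key (e.g. per-runner).
    concurrency:
      group: gpus-${{ matrix.gpu_devices }}
      cancel-in-progress: false
```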

Collaborator


^ this is why I wasn’t using a matrix and was manually serializing the tests. If this works (and it looks like it does), it’s fine though.

[
"atomic",
"ring",
# "ring",

Copilot AI Nov 22, 2025


Missing documentation: The "ring" variant is commented out without explanation. Consider adding a comment explaining why this variant is disabled (e.g., "# ring - disabled due to known issues" or "# ring - TODO: fix compatibility issues").

Suggested change
# "ring",
# "ring", # Disabled due to known issues with the ring algorithm (TODO: fix compatibility issues)
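An alternative to a comment is to keep the variant in the list but skip it with an explicit reason, so the skip shows up in pytest's output instead of silently disappearing. A minimal sketch, assuming the test is parametrized on a `variant` string (the list name and skip reason here are hypothetical):

```python
import pytest

# Hypothetical sketch: instead of commenting out "ring", mark it as
# skipped so the reason is visible in test reports.
VARIANTS = [
    "atomic",
    pytest.param(
        "ring",
        marks=pytest.mark.skip(reason="ring all_reduce disabled; see PR discussion"),
    ),
]

@pytest.mark.parametrize("variant", VARIANTS)
def test_all_reduce(variant):
    ...  # test body elided
```

With this form, `pytest -rs` lists the skipped "ring" case and its reason on every run, which also keeps the dead `if variant == "ring":` branch from looking orphaned.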

Collaborator


I would like us to continue testing against the three different installs:

  • pip install git+https://github.com/${{ github.repository }}.git@${{ github.sha }}
  • pip install -e .
  • pip install .

Sometimes source-structure and import assumptions are only caught by one install mode and not the others.

Member Author


@mawad-amd why don't we just make sure the installs work there?

Collaborator


Just pip install without running the tests you mean?

Collaborator

@mawad-amd mawad-amd Nov 24, 2025


There are issues that won't show up unless we actually run the tests. If we really have to run the full suite against one install mode and only smoke tests against the others, then I would prefer we always run the full test suite against pip install git+https://github.com/${{ github.repository }}.git@${{ github.sha }}.
