
Conversation

@neoblizz
Member

No description provided.

@github-actions github-actions bot added in-progress We are working on it iris Iris project issue labels Nov 17, 2025
@neoblizz neoblizz marked this pull request as ready for review November 22, 2025 20:31
Copilot AI review requested due to automatic review settings November 22, 2025 20:31
Copilot finished reviewing on behalf of neoblizz November 22, 2025 20:34
Contributor

Copilot AI left a comment


Pull request overview

This PR revamps the testing infrastructure to improve test organization and execution. The key changes include consolidating workflow files, adding support for separate test directories (examples, unittests, ccl), and improving test isolation through explicit preamble calls in all_reduce tests.

Key Changes

  • Consolidated testing workflows into a single comprehensive iris-tests.yml file with a matrix strategy that tests each directory (examples, unittests, ccl) with different rank counts (1, 2, 4, 8)
  • Updated test scripts to accept a test directory parameter, enabling more granular test execution
  • Enhanced all_reduce tests to explicitly call preamble before running operations for better test isolation

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

File Description
.github/workflows/iris-tests.yml New consolidated workflow with matrix strategy for testing all directories across multiple GPU configurations
.github/workflows/iris-tests-apptainer.yml Deleted old Apptainer-specific workflow, functionality now in iris-tests.yml
.github/workflows/iris-pip-install-test.yml Deleted pip installation test workflow, consolidated into main testing workflow
.github/workflows/iris-performance-regression-test.yml Added DOCKER_IMAGE_NAME environment variable for consistency
.github/workflows/iris-external-validation-test.yml Added DOCKER_IMAGE_NAME environment variable for consistency
.github/scripts/run_tests.sh Enhanced to accept test directory parameter and validate directory existence
.github/scripts/container_run.sh Updated to use DOCKER_IMAGE_NAME variable with fallback to "iris-dev"
.github/scripts/container_exec.sh Updated to use DOCKER_IMAGE_NAME variable with fallback to "iris-dev"
.github/scripts/container_build.sh Updated to use DOCKER_IMAGE_NAME variable with fallback to "iris-dev"
tests/ccl/test_all_reduce.py Disabled "ring" variant testing and added explicit preamble calls for better test isolation
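The script changes described in the table above can be sketched roughly as follows. This is a hypothetical sketch of the two patterns (default-with-fallback for `DOCKER_IMAGE_NAME`, and test-directory validation in `run_tests.sh`); the variable and argument names are assumptions, not the actual script contents.

```shell
#!/bin/sh
# Sketch of the patterns described above; names are assumptions.

# container_*.sh pattern: use DOCKER_IMAGE_NAME if set, otherwise fall
# back to the default image name "iris-dev".
IMAGE_NAME="${DOCKER_IMAGE_NAME:-iris-dev}"
echo "image: ${IMAGE_NAME}"

# run_tests.sh pattern: accept a test directory argument and reject
# anything that is not one of the known test directories.
TEST_DIR="${1:-examples}"
case "$TEST_DIR" in
    examples|unittests|ccl)
        echo "running tests in tests/${TEST_DIR}"
        ;;
    *)
        echo "error: unknown test directory: ${TEST_DIR}" >&2
        exit 1
        ;;
esac
```

The `${VAR:-default}` expansion keeps the scripts backward compatible: existing callers that never set `DOCKER_IMAGE_NAME` get the old behavior unchanged.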

Comment on lines 76 to 77
if variant == "ring":
config.all_reduce_num_rings = min(2, config.comm_sms)

Copilot AI Nov 22, 2025


Dead code: The conditional block for the "ring" variant will never execute since "ring" has been commented out in the parametrize decorator above (line 19). Consider removing this block or uncommenting the "ring" variant in the test parameters.

Suggested change
if variant == "ring":
config.all_reduce_num_rings = min(2, config.comm_sms)

Comment on lines +46 to +84
matrix:
include:
# Test each subdirectory with each rank count
- test_dir: examples
num_ranks: 1
gpu_devices: "0,1"
- test_dir: examples
num_ranks: 2
gpu_devices: "2,3"
- test_dir: examples
num_ranks: 4
gpu_devices: "4,5,6,7"
- test_dir: examples
num_ranks: 8
gpu_devices: "0,1,2,3,4,5,6,7"
- test_dir: unittests
num_ranks: 1
gpu_devices: "0,1"
- test_dir: unittests
num_ranks: 2
gpu_devices: "2,3"
- test_dir: unittests
num_ranks: 4
gpu_devices: "4,5,6,7"
- test_dir: unittests
num_ranks: 8
gpu_devices: "0,1,2,3,4,5,6,7"
- test_dir: ccl
num_ranks: 1
gpu_devices: "0,1"
- test_dir: ccl
num_ranks: 2
gpu_devices: "2,3"
- test_dir: ccl
num_ranks: 4
gpu_devices: "4,5,6,7"
- test_dir: ccl
num_ranks: 8
gpu_devices: "0,1,2,3,4,5,6,7"

Copilot AI Nov 22, 2025


Potential GPU device conflict: The test matrix allows parallel execution (fail-fast: false) with overlapping GPU device assignments. For example, the 8-rank tests use GPUs 0-7, while 1-rank tests use GPUs 0-1. If these jobs run concurrently, they may conflict. Consider either:

  1. Making GPU assignments mutually exclusive across all matrix jobs
  2. Adding a resource lock mechanism to prevent concurrent access to the same GPUs
  3. Setting a concurrency group to serialize jobs that share GPU resources
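Option 3 could look something like the fragment below. This is an illustrative sketch only (the job id and group key are assumptions, not from the PR), and note the caveat that a key derived from `gpu_devices` only serializes jobs with identical device strings; overlapping-but-different sets would need a coarser key, such as one group per runner.

```yaml
# Hypothetical sketch: serialize matrix jobs that share GPUs by placing
# them in a shared concurrency group. Names here are illustrative.
jobs:
  run-tests:
    strategy:
      fail-fast: false
      matrix:
        include:
          - test_dir: examples
            num_ranks: 8
            gpu_devices: "0,1,2,3,4,5,6,7"
    # Jobs with the same group key queue rather than run concurrently.
    # Caveat: "gpus-0,1" and "gpus-0,1,2,3" are different groups, so
    # overlapping sets still need a coarser key (e.g. per-runner).
    concurrency:
      group: gpus-${{ matrix.gpu_devices }}
      cancel-in-progress: false
```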

Collaborator


^ this is why I wasn’t using a matrix and was manually serializing the tests. If this works (and it looks like it does), it’s fine though.

[
"atomic",
"ring",
# "ring",

Copilot AI Nov 22, 2025


Missing documentation: The "ring" variant is commented out without explanation. Consider adding a comment explaining why this variant is disabled (e.g., "# ring - disabled due to known issues" or "# ring - TODO: fix compatibility issues").

Suggested change
# "ring",
# "ring", # Disabled due to known issues with the ring algorithm (TODO: fix compatibility issues)
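An alternative to a comment is to keep the variant in the list but skip it with an explicit reason, so the skip shows up in pytest's output instead of silently disappearing. A minimal sketch, assuming the test is parametrized on a `variant` string (the list name and skip reason here are hypothetical):

```python
import pytest

# Hypothetical sketch: instead of commenting out "ring", mark it as
# skipped so the reason is visible in test reports.
VARIANTS = [
    "atomic",
    pytest.param(
        "ring",
        marks=pytest.mark.skip(reason="ring all_reduce disabled; see PR discussion"),
    ),
]

@pytest.mark.parametrize("variant", VARIANTS)
def test_all_reduce(variant):
    ...  # test body elided
```

With this form, `pytest -rs` lists the skipped "ring" case and its reason on every run, which also keeps the dead `if variant == "ring":` branch from looking orphaned.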

Collaborator


I would like us to continue testing against the three different installs:

  • pip install git+https://github.com/${{ github.repository }}.git@${{ github.sha }}
  • pip install -e .
  • pip install .

Sometimes source-structure and import assumptions are only caught by one install mode and not the others.

Member Author


@mawad-amd why don't we just make sure the installs work there?

Collaborator


Just pip install without running the tests you mean?

Collaborator

@mawad-amd mawad-amd Nov 24, 2025


There are issues that won't show up unless we actually run the tests. If we really have to run the full suite against one install mode and only smoke tests against the others, then I would prefer we always run the full test suite against pip install git+https://github.com/${{ github.repository }}.git@${{ github.sha }}.
