[Feature] Add Experimental Iris Gluon Backend #185
Conversation
Co-authored-by: mawad-amd <[email protected]>
The backend aggregate now uses `self.cur_rank` internally, so users only need to specify the remote rank for operations. This makes the API much cleaner:

- `load(ptr, from_rank)` instead of `load(ptr, to_rank, from_rank)`
- `store(ptr, val, to_rank)` instead of `store(ptr, val, from_rank, to_rank)`
- `atomic_*(ptr, val, to_rank)` instead of `atomic_*(ptr, val, from_rank, to_rank)`
- `get(from_ptr, to_ptr, from_rank)` instead of `get(from_ptr, to_ptr, from_rank, to_rank)`
- `put(from_ptr, to_ptr, to_rank)` instead of `put(from_ptr, to_ptr, from_rank, to_rank)`

Updated all documentation and examples to reflect the simplified API.
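The effect of holding `cur_rank` inside the aggregate can be illustrated with a small plain-Python mock (illustration only — `MockDeviceCtx` and its symmetric-heap model are hypothetical; the real implementation is on-device Gluon code):

```python
# Hypothetical plain-Python mock: once cur_rank lives in the context,
# callers pass only the remote rank, matching the simplified signatures.

class MockDeviceCtx:
    def __init__(self, cur_rank, heaps):
        self.cur_rank = cur_rank   # held internally, no longer a call argument
        self.heaps = heaps         # per-rank "heap" memory, modeled as lists

    def _translate(self, offset, rank):
        # Symmetric-heap model: the same offset is valid on every rank's heap.
        return self.heaps[rank], offset

    def load(self, offset, from_rank):
        heap, off = self._translate(offset, from_rank)
        return heap[off]

    def store(self, offset, val, to_rank):
        heap, off = self._translate(offset, to_rank)
        heap[off] = val

# Two ranks sharing a symmetric heap layout of 4 slots each.
heaps = [[0] * 4, [0] * 4]
ctx0 = MockDeviceCtx(cur_rank=0, heaps=heaps)
ctx0.store(2, 99, to_rank=1)      # write into rank 1's heap
print(ctx0.load(2, from_rank=1))  # -> 99
```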
Following the example provided, the implementation now:

- Uses the `@gluon.jit` decorator for all device-side methods
- Uses `gl.*` language primitives instead of `tl.*`
- `IrisDeviceCtx.initialize()` decodes the context tensor: `[cur_rank, num_ranks, heap_bases...]`
- Host-side `get_device_context()` returns the encoded tensor
- Kernels receive the `IrisDeviceCtx` class and `context_tensor` as parameters
- Updated the producer-consumer example to use the proper Gluon pattern
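The encode/decode contract for the context tensor can be sketched with plain Python lists (an assumption-laden illustration — the real code packs a torch tensor on the host and decodes it inside a `@gluon.jit` method):

```python
# Hypothetical sketch of the context-tensor contract:
# host encodes [cur_rank, num_ranks, heap_bases...] into a flat buffer,
# device-side initialize() decodes it.

def encode_context(cur_rank, num_ranks, heap_bases):
    # Host side: pack scalars and per-rank heap base addresses.
    assert len(heap_bases) == num_ranks
    return [cur_rank, num_ranks, *heap_bases]

def decode_context(context_tensor):
    # Device side (mocked): mirrors what IrisDeviceCtx.initialize() does.
    cur_rank = context_tensor[0]
    num_ranks = context_tensor[1]
    heap_bases = context_tensor[2 : 2 + num_ranks]
    return cur_rank, num_ranks, heap_bases

ctx = encode_context(cur_rank=1, num_ranks=2, heap_bases=[0x1000, 0x2000])
print(decode_context(ctx))  # -> (1, 2, [4096, 8192])
```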
@copilot this is looking great now. I fixed a couple of bugs in both the message passing example and the Iris class in my commits. Read my changes (past 6 commits). Next task is to port example 07_gemm_all_scatter. This is more difficult -- consult the references for guidance.
Created gemm_all_scatter_gluon.py following the Gluon pattern:

- Uses the `@gluon.jit` decorator for the kernel
- `IrisDeviceCtx.initialize()` decodes the context tensor
- Uses `gl.*` language primitives (`gl.load`, `gl.store`, `gl.dot`, `gl.atomic_*`, etc.)
- Uses `gl.BlockedLayout` for `gl.arange` operations (64 threads per warp for AMD)
- Context methods (`ctx.store`) for inter-rank communication
- Maintains all GEMM optimizations from the original example
Done! Ported `07_gemm_all_scatter` to Gluon. The example demonstrates a complex GEMM computation with a distributed all-scatter communication pattern using proper Gluon. Commit: 8785f2a
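The all-scatter idea behind the example can be mocked in a few lines of plain Python (hypothetical illustration — the real kernel does this per GEMM tile on-device via `ctx.store`):

```python
# Hypothetical mock of all-scatter: every rank computes its own partial
# result and stores it into the result buffer of every rank, so all
# ranks end up holding the complete output.

def all_scatter(rank, partial, result_buffers):
    """Write this rank's partial into its slot on every rank's buffer."""
    for to_rank in range(len(result_buffers)):
        result_buffers[to_rank][rank] = partial

num_ranks = 3
# One result buffer per rank, one slot per contributing rank.
buffers = [[None] * num_ranks for _ in range(num_ranks)]
for r in range(num_ranks):
    all_scatter(r, partial=r * 10, result_buffers=buffers)

print(buffers)  # every rank's buffer is [0, 10, 20]
```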
Fixed W293 warning by removing trailing whitespace from a blank line (line 211) in the copy() method docstring in iris/experimental/iris_gluon.py.
Removed the reference to docs/api-comparison.md from the Documentation section, as this file was previously deleted.
Removed tests/unittests/test_iris_gluon.py as requested. The Gluon implementation is experimental and can be tested manually with the producer-consumer example and the README code snippet.
Pull Request Overview
This PR implements a comprehensive Gluon-based API for Iris using the proper @aggregate decorator with @gluon.jit methods. The feature adds a cleaner alternative to the traditional Iris API that encapsulates the backend state and eliminates the need to manually pass heap_bases around in device kernels.
- Adds a complete Gluon implementation with proper `@gluon.jit` decorators and `gl.*` language primitives
- Provides feature parity with the original Iris API, including all memory operations and atomics
- Includes comprehensive examples and documentation for the new experimental API
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| iris/experimental/iris_gluon.py | New 766-line Gluon implementation with IrisDeviceCtx aggregate and IrisGluon host class |
| iris/experimental/__init__.py | Experimental module initialization exposing iris_gluon |
| iris/__init__.py | Updated to expose experimental features with documentation |
| examples/06_message_passing/message_passing_gluon.py | Producer-consumer example using the new Gluon API |
| README.md | Added comprehensive Gluon API documentation with self-contained example |
| pyproject.toml | Updated package discovery to include iris.experimental |
| .github/workflows/iris-external-validation-test.yml | Added CI pipeline for Gluon validation testing |
| .github/scripts/* | Updated container scripts to use specific triton image |
@copilot this looks good now. Please update the PR title and description based on the diff.
Gluon Port for Iris - COMPLETE ✅
Successfully completed the Gluon port of Iris using the `@gluon.jit` decorator!

This PR implements a true Gluon-based API for Iris following the proper pattern with `@aggregate`, `@gluon.jit`, and `gl.*` language primitives. The implementation is located in the experimental directory to clearly indicate that this API may evolve in future releases.

📊 Implementation Summary
Files Created/Modified
✅ `iris/experimental/iris_gluon.py` - Complete Gluon implementation

- `IrisDeviceCtx` aggregate with `@gluon.jit` methods
- `IrisDeviceCtx.initialize()` decodes the context tensor
- `gl.*` language primitives
- `IrisGluon.get_device_context()` returns the encoded tensor
- `load()`, `store()`, `get()`, `put()`, `copy()`, and 10 atomic methods

✅ `iris/experimental/__init__.py` - Experimental module initialization
✅ `examples/06_message_passing/message_passing_gluon.py` - Producer-consumer example

- `@gluon.jit` decorator
- `gl.*` primitives (`gl.load`, `gl.store`, `gl.atomic_cas`, etc.)

✅ `iris/__init__.py` - Exposed experimental module
✅ README.md - Added experimental Gluon API section with self-contained, runnable example
🎯 Key Features
IrisDeviceCtx Aggregate with Gluon
- `@aggregate` decorator
- `initialize()` method with `@gluon.jit` decodes the context tensor
- `@gluon.jit` and `gl.*` primitives: `load()`, `store()`, `get()`, `put()`, `copy()`
- `atomic_add()`, `atomic_sub()`, `atomic_cas()`, `atomic_xchg()`, `atomic_xor()`, `atomic_and()`, `atomic_or()`, `atomic_min()`, `atomic_max()`

Examples
API Pattern
Host Side:
Device Side:
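A plain-Python mock of the host/device split can sketch the pattern (names like `get_device_context` and `IrisDeviceCtx.initialize` follow the PR description; the mock class and `kernel` function here are hypothetical stand-ins — the real versions are `@gluon.jit` device code):

```python
# Host side: IrisGluon.get_device_context() returns the encoded tensor.
def get_device_context(cur_rank, num_ranks, heap_bases):
    return [cur_rank, num_ranks, *heap_bases]

# Device side (mocked): the aggregate decodes the tensor once.
class IrisDeviceCtxMock:
    @staticmethod
    def initialize(context_tensor):
        ctx = IrisDeviceCtxMock()
        ctx.cur_rank = context_tensor[0]
        ctx.num_ranks = context_tensor[1]
        ctx.heap_bases = context_tensor[2:2 + ctx.num_ranks]
        return ctx

def kernel(DeviceCtx, context_tensor):
    # Kernels receive the class and the tensor, then initialize once.
    ctx = DeviceCtx.initialize(context_tensor)
    return ctx.cur_rank, ctx.num_ranks

tensor = get_device_context(cur_rank=0, num_ranks=2, heap_bases=[0x1000, 0x2000])
print(kernel(IrisDeviceCtxMock, tensor))  # -> (0, 2)
```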
✅ Benefits
- A single `initialize()` call decodes the context
- Kernels take the class through an `IrisDeviceCtx: gl.constexpr` contract

🧪 Testing Status
✅ Completed
⏳ Pending
🚀 Usage
Simply copy the self-contained example from the README and run it. The example includes all necessary imports, distributed initialization, and multiprocessing setup.
🎓 Technical Notes
- `@gluon.jit` for all device methods
- `gl.*` language primitives (`gl.load`, `gl.store`, `gl.atomic_*`, etc.)
- Context tensor layout: `[cur_rank, num_ranks, heap_base_0, heap_base_1, ...]`
- `IrisDeviceCtx.initialize()` decodes the tensor
- `gl.BlockedLayout` for `gl.arange()` operations (AMD: 64 threads/warp)

📈 Impact
This implementation properly uses Gluon's programming model, making it a true Gluon port rather than just a wrapper around the `@aggregate` decorator. It follows the established patterns from Triton's Gluon examples and provides simple, self-contained examples that users can immediately use.

By placing it in the experimental directory, users are clearly informed that this API may evolve while the stable Iris API remains unchanged.
✨ Ready for Testing
The implementation follows the proper Gluon pattern as requested and is now clearly marked as experimental. All code compiles and is ready for testing in a GPU environment with Gluon support. The README contains a self-contained, copy-paste ready example.
Fixes #184