Skip to content

Conversation

@yiz-liu
Copy link
Collaborator

@yiz-liu yiz-liu commented Nov 18, 2025

What this PR does / why we need it?

Add ACL graph capture/replay DP test, this is a imprved version of #3886

Restructures the multi-card ACL graph test for improved clarity, robustness, and accuracy.

Key improvements include:

  • Replaces fragile sys.settrace and manual patching with a clean, reusable spy installer using unittest.mock.patch.
  • Introduces more precise metrics by tracking NPUModelRunner.execute_model and _dummy_run calls directly.
  • Rewrites assertions to be more accurate and provides clear explanations for the expected counts of graph captures, replays, model executions, and dummy runs.
  • Simplifies the overall test structure by separating the worker logic into a dedicated function.
  • Removes a long, unnecessary sleep at the end of the test.
  • Expands test coverage by adding a larger max_tokens parameter.

Does this PR introduce any user-facing change?

None.

How was this patch tested?

None.

@github-actions
Copy link

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:‌‌

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests ‌to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewer and future developers understand.

If CI fails, you can run linting and testing checks locally according Contributing and Testing.

@yiz-liu yiz-liu added ready read for review ready-for-test start test by label for PR labels Nov 18, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new end-to-end test for ACL graph capture and replay with data parallelism. The changes are a significant improvement, using unittest.mock.patch for cleaner method spying and providing detailed, well-commented assertions for various metrics. The test structure is clear and robust.

My review focuses on ensuring test reliability. I've identified one area for improvement: the use of a hardcoded network port, which could lead to flaky tests in a parallel execution environment. I've suggested using a dynamic port to address this.

@yiz-liu yiz-liu force-pushed the graph-test branch 2 times, most recently from 3b53853 to 0af82f1 Compare November 18, 2025 17:13
Copilot AI review requested due to automatic review settings November 18, 2025 17:50
Copilot finished reviewing on behalf of yiz-liu November 18, 2025 17:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a comprehensive end-to-end test for ACL graph capture and replay functionality in data parallel (DP) mode. This is an improved version of PR #3886 that uses a cleaner testing approach with mock-based spies instead of fragile sys.settrace mechanisms.

Key improvements:

  • Implements thread-safe spy installation using unittest.mock.patch to track NPU method invocations
  • Adds precise metrics tracking for graph captures, replays, model executions, and dummy runs
  • Expands test coverage with multiple max_tokens values (4 and 36) to test different execution paths

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
tests/e2e/multicard/test_aclgraph_capture_replay.py New test file that validates ACL graph capture/replay behavior in DP mode with comprehensive metrics tracking and assertions
.github/workflows/_e2e_test.yaml Adds the new test to the full e2e test suite execution

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yiz-liu yiz-liu force-pushed the graph-test branch 6 times, most recently from 41e3554 to 3f0daa2 Compare November 19, 2025 09:07
@whx-sjtu
Copy link
Collaborator

In the future we should break down the template script for launching DP into more granular functions, and then all DP-related UTs can import these functions to reuse code. Specially we should enable the 'main' function to support passing extra parameters and additional patch functions.

@yiz-liu yiz-liu force-pushed the graph-test branch 5 times, most recently from aad0fb6 to 748f733 Compare November 20, 2025 04:32
Restructures the multi-card ACL graph test for improved clarity, robustness, and accuracy.

Key improvements include:
- Replaces fragile `sys.settrace` and manual patching with a clean, reusable spy installer using `unittest.mock.patch`.
- Introduces more precise metrics by tracking `NPUModelRunner.execute_model` and `_dummy_run` calls directly.
- Rewrites assertions to be more accurate and provides clear explanations for the expected counts of graph captures, replays, model executions, and dummy runs.
- Simplifies the overall test structure by separating the worker logic into a dedicated function.
- Removes a long, unnecessary sleep at the end of the test.
- Expands test coverage by adding a larger `max_tokens` parameter.

Signed-off-by: Yizhou Liu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:tests ready read for review ready-for-test start test by label for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants