Skip to content

Conversation

@imkero
Copy link
Contributor

@imkero imkero commented Oct 17, 2025

Purpose

Fix the buggy free logic of SingleWriterShmRingBuffer, it may break multimodal shm cache because of over-allocation.

details:

  1. correct the data_buffer_start computation in SingleWriterShmRingBuffer.free_buf
  2. SingleWriterShmRingBuffer.metadata should be empty on init

Test Plan

Test case added in tests.distributed.test_shm_buffer

python -m unittest tests.distributed.test_shm_buffer.TestSingleWriterShmRingBuffer.test_allocation_cycles

Test Result

Test Result (main): FAILED
python -m unittest tests.distributed.test_shm_buffer.TestSingleWriterShmRingBuffer.test_allocation_cycles
F
======================================================================
FAIL: test_allocation_cycles (tests.distributed.test_shm_buffer.TestSingleWriterShmRingBuffer.test_allocation_cycles)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspaces/vllm/tests/distributed/test_shm_buffer.py", line 172, in test_allocation_cycles
    ring_allocate(2)
  File "/workspaces/vllm/tests/distributed/test_shm_buffer.py", line 160, in ring_allocate
    mark_allocated_with_assertion(monotonic_id, addr, allocate_size_with_md)
  File "/workspaces/vllm/tests/distributed/test_shm_buffer.py", line 135, in mark_allocated_with_assertion
    self.assertEqual(count_allocated(allocated_bitmap[addr : addr + size]), 0)
AssertionError: 10 != 0

Ran 1 test in 0.002s

FAILED (failures=1)

Test Result (this PR): OK
python -m unittest tests.distributed.test_shm_buffer.TestSingleWriterShmRingBuffer.test_allocation_cycles
.
----------------------------------------------------------------------
Ran 1 test in 0.006s

OK


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@imkero imkero marked this pull request as ready for review October 17, 2025 19:28
@imkero imkero force-pushed the fix/shm-object-storage-allocation branch 4 times, most recently from 939b637 to a61fb77 Compare October 18, 2025 04:31
@ywang96
Copy link
Member

ywang96 commented Oct 18, 2025

/gemini-review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides a much-needed fix for the allocation and freeing logic in SingleWriterShmRingBuffer. The previous implementation had several bugs related to pointer management and wrap-around behavior, which could lead to memory over-allocation and incorrect state. The new implementation is a complete rewrite of the core logic, introducing a clearer and more robust state management for the ring buffer. Key improvements include treating start == end as an empty buffer, reserving one byte to distinguish empty from full, and resetting pointers to zero when the buffer is empty to maximize contiguous allocation. The allocation logic is now split into two clear cases (wrapped vs. non-wrapped buffer), which is much easier to reason about. The fix to the free_buf logic, which now correctly advances the start pointer, is also crucial. I have thoroughly reviewed the new logic and it appears to be sound and correct, addressing the described bugs effectively. I did not find any high or critical severity issues with the proposed changes.

@gemini-code-assist
Copy link
Contributor

Acknowledged. Initiating a comprehensive code review for this pull request.

@ywang96
Copy link
Member

ywang96 commented Oct 18, 2025

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@dongluw
Copy link
Contributor

dongluw commented Oct 19, 2025

would be nice if you can help adding your test case to vllm/tests/distributed/test_shm_buffer.py

@ywang96 ywang96 modified the milestone: v0.11.1 Oct 19, 2025
Co-authored-by: donglu <[email protected]>
Signed-off-by: Kero Liang <[email protected]>
@imkero imkero force-pushed the fix/shm-object-storage-allocation branch from e505b72 to 8165963 Compare October 25, 2025 10:28
Signed-off-by: Kero Liang <[email protected]>
Signed-off-by: Kero Liang <[email protected]>
@imkero
Copy link
Contributor Author

imkero commented Oct 25, 2025

would be nice if you can help adding your test case to vllm/tests/distributed/test_shm_buffer.py

Yeah, I put the test case into vllm/tests/distributed/test_shm_buffer.py (test_allocation_cycles) with necessary assertions on allocation / free ops.

@imkero imkero requested a review from dongluw October 25, 2025 11:12
@dongluw
Copy link
Contributor

dongluw commented Oct 27, 2025

looks good to me, thanks @imkero
maybe @ywang96 can have another look

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 27, 2025
Signed-off-by: Roger Wang <[email protected]>
Copy link
Member

@ywang96 ywang96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated an attribute error since np.bool has been deprecated, but otherwise LGTM!

@ywang96 ywang96 enabled auto-merge (squash) October 27, 2025 20:44
@ywang96 ywang96 merged commit 02af36d into vllm-project:main Oct 28, 2025
52 checks passed
@imkero imkero deleted the fix/shm-object-storage-allocation branch October 28, 2025 15:50
bhagyashrigai pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Oct 29, 2025
…lm-project#27117)

Signed-off-by: Kero Liang <[email protected]>
Signed-off-by: Roger Wang <[email protected]>
Co-authored-by: donglu <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Signed-off-by: Bhagyashri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants