rocm: sync branch with main#219
Draft
24601 wants to merge 86 commits into
Draft
Conversation
Implements the Responses API endpoint that Codex CLI (and other modern OpenAI tooling) speaks instead of /v1/chat/completions. The wire format is documented in OpenAI's Responses API; this implementation has been iterated against the Codex CLI binary's SSE parser shape until no remaining schema gaps were found. Request parsing (parse_responses_request, parse_responses_input): - Accepts the typed input array (message, function_call, function_call_output, reasoning, custom_tool_call(_output), local_shell_call(_output), web_search_call(_output), tool_search_call(_output), image_generation_call(_output), compaction, context_compaction). - Maps hosted-tool history to function_call/function_call_output so prior actions survive across turns; rejects unknown item types and non-completed status with 400 to avoid silent context loss. - Strict content-array parsing: only string|null|array of recognized text blocks (input_text/output_text/text/summary_text/ reasoning_text); rejects non-text modalities (input_image/file/ audio) instead of accepting an empty prompt. - Merges adjacent function_call items into the preceding assistant message so text + tool-call turns render as a single assistant block. - Honors reasoning.effort (incl. "minimal"/"none") and gates reasoning summary surface on reasoning.summary opt-in. - Rejects previous_response_id, conversation, and forced tool_choice explicitly (constrained decoding / persisted state not supported). Output (responses_sse_*, responses_final_response): - Emits the full streaming lifecycle: response.created, output_item.added/.done, reasoning_summary_part.added/.done, reasoning_summary_text.delta/.done, content_part.added/.done, output_text.delta/.done, function_call_arguments.delta/.done, response.completed. - Branches the terminal event by finish reason: response.failed for errors and response.incomplete with reason "max_tokens" for length. - Every event carries sequence_number; every output_text part carries annotations:[]; function_call output_item.added ships with an empty arguments string (full args arrive via function_call_arguments.done and output_item.done), and item ids are stable across added/done. - Tracks whether </think> was actually observed so a truncated stream marks the reasoning item incomplete instead of "completed". - Recovers gracefully when the DSML tool parse fails after the model was suppressed at the tool marker: the suppressed tail is flushed as additional output_text deltas so the streamed message matches output_item.done. Tested by 25 rounds of /codex:adversarial-review against the same client this is meant to feed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.
Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.\n\nMeasured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.\n\nCorrectness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.\n\nCorrectness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.
Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.
This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.
Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.
Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation. The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance. Tested on 0.180 with: - make cpu - make -B cuda-spark - make cuda-regression - ./ds4_test --server --metal-kernels - ./ds4_test --logprob-vectors --tool-call-quality - ds4-bench ctx-alloc 32768, 250000, and 1000000 - ds4-server --ctx 1000000 startup smoke (cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)
Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first. Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations. Fixes antirez#127.
Return a 400 error with error type "context_exceeded" when prompt tokens exceed
context size. The response includes both n_prompt_tokens and n_ctx fields so
clients can determine exactly why the request failed and how far over the limit
they went.
Error response format:
{
"error": {
"message": "Prompt tokens (N) exceeds context size (M)",
"type": "context_exceeded",
"n_prompt_tokens": N,
"n_ctx": M
}
}
dwarfstar is typoed to drawfstar
fix typo in readme
Add ds4-agent as a native terminal coding agent with session KV storage, DSML tool execution, edit/read/search/write/list/bash tools, async bash jobs, compaction support, history replay, and a linenoise-based interactive UI. The agent defaults to CRC-guarded edits, streams tool visualization, ignores tool calls emitted inside thinking, improves bash output observations for long-running commands, and styles prompts, spacing, and prefill progress for interactive use.
Use an ANSI scroll region for streamed agent output, resize it as linenoise input wraps or shrinks, and anchor the prompt/status block at the terminal bottom. This also fixes cursor placement at exact column boundaries and removes the prompt/output overwrite cases seen during multiline editing.
Long prefills (large prompts, no cache hit) can take minutes on local hardware. ds4-server was silent on the socket the whole time, so HTTP and TCP idle timeouts on the client side would close the connection before the first response byte was written -- see the "sse headers failed" log line that appeared at the very end of a multi-minute prefill in real agent runs. Stream the SSE response headers from the prefill_chunk progress callback, then emit a ":" comment line (ignored by SSE clients per the spec) at most every five seconds while prefill is still running. The keepalive is best-effort: a closed socket simply fails the writes and the outer code discovers the dead connection when it tries to stream a real event, matching the existing error path. The tool-checkpoint rebuild path pre-arms headers_sent because it only runs after the response stream is already in flight, so it never tries to re-send the SSE header line. Verified on macOS Metal, q2-imatrix GGUF, ctx=200000: - ./ds4_test --server passes - 35s fresh prefill: client receives ": prefill" lines at +6/+12/ +18/+25/+31s, then SSE content events at +35s, no client disconnect - 1s cached prompt: unchanged (sse_headers is still emitted from the request handler when prefill never fires) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…20260521-mergeable # Conflicts: # Makefile
e4617b5 to
e13de73
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This brings the
rocmbranch forward to currentmain(8d576642) while keeping the ROCm patch scoped to the existing community ROCm backend work.Current state before this change:
rocmwas based at7a751ebrocmwas 85 commits behindmainand 1 commit aheadWhat changed:
mainintorocmwith conflicts resolvedmake rocmtarget that builds the current main binaries throughhipccThis intentionally does not include the user-facing
--rocmnaming/MTP mapping work from #156 or the ROCm WMMA/indexer optimization work from #180.Relation to #16
This is related to #16, but it should not close #16.
The goal here is to make the existing upstream
rocmbranch current and testable again on currentmain. It is not the later ROCm optimization work discussed in #16, and the speed numbers below show why this should stay draft until the remaining correctness/performance gaps are understood.Scope notes
I checked prior ROCm-related PRs before keeping this narrowly scoped:
rocmbranchThis PR is therefore only a branch sync plus the minimum build fixes needed for current
main.Validation environment
gfx11517.2.53211-364a905cudaAPI path over ROCm/HIPDeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.ggufe13de731787799310e53bffe98b1f077f7c96b50Build and unit validation
Commands run:
Results:
ds4_test --server: PASS (server: OK,ds4 tests: ok)Negative target guard check:
make clean make ds4_test GPU_BACKEND=rocm ROCM_ARCH=gfx90a -j$(nproc)Result: expected compile failure with the new wave64 GCN/CDNA unsupported-target error.
Existing warnings observed during the builds:
ds4_server.c: const-discard warning instop_list_find_fromds4_agent.c: existingsnprintftruncation warningsModel-backed validation
Model load / inspect:
Result: PASS. DS4 loaded the official q2-imatrix GGUF and initialized the ROCm/HIP backend.
Short deterministic generation:
./ds4 --cuda -m ds4flash.gguf --ctx 4096 --nothink --temp 0 -n 32 -p 'Reply exactly: ROCm OK'Result: PASS. Output was
ROCm OK.Reported speed for this tiny prompt:
Tool-call quality:
Result: PASS in both fast and exact paths.
Long-context story recall:
Result: PASS (
long-context: OK,ds4 tests: ok).Runtime note: this run took about one hour wall time on the validation machine. Progress markers reached
8192/30474,16384/30474,24576/30474, and30474/30474; correctness passed, but throughput is not yet competitive with the best ROCm numbers discussed in #16.Official logprob vectors:
Result: FAIL with one mismatch:
Manual dump for the failing prompt showed ROCm selecting uppercase
Cafter the opening code fence while the fixture expects lowercasec:The exact/quality ROCm path also selected uppercase
Cfor the same prompt:Speed smoke
Command:
Result:
These are usable but below the better Strix Halo ROCm numbers reported in #16, which is expected for this branch-sync PR because it deliberately does not include the later ROCm optimization work.
Additional review
Read-only adversarial reviews were run with DeepSeek V4, Cursor Composer 2.5, and Gemini 3.5 Flash. They did not identify a branch-sync merge blocker. Their actionable feedback is reflected here:
Draft blockers / remaining gaps
./ds4_test --logprob-vectorsdoes not fully pass on ROCm because of theshort_code_completionCvscmismatch.