fix(native): support Metal custom V-cache SET_ROWS #9303
Fixes #9258.
Summary
- `plugins/plugin-local-inference/native/llama.cpp` at `6e83e4b9b808bc21100c7846fcc1acd0a0fa674c`, which adds Metal `SET_ROWS` and copy/dequant support for manually selected custom V-cache tensors: `tbq3_0`, `tbq4_0`, and `q4_polar`.
- `.github/issue-evidence/9258-metal-v-cache-set-rows.md`.
- `origin/develop` before final validation.

Root Cause
Manual custom V-cache selections could route cache updates through `GGML_OP_SET_ROWS`, but the Metal backend did not have destination kernels or dispatch wiring for `TBQ3_0`, `TBQ4_0`, or `Q4_POLAR`. With flash attention enabled, those custom cache tensors could also reach stock attention paths that need a backend-supported dequant/copy path first.

Validation
Native macOS Metal:
- `cmake --build build-metal-9258 --target test-backend-ops llama-cli llama-completion llama-server -j 12` -> passed
- `xcrun -sdk macosx metal ... ggml-metal.metal ...` -> passed, warnings only
- `test-backend-ops test -b MTL0 -o SET_ROWS -p "(tbq3_0|tbq4_0|q4_polar)"` -> 12/12 passed
- `test-backend-ops test -b MTL0 -o CPY -p "(tbq3_0|tbq4_0|q4_polar)"` -> 6/6 passed
- `llama-cli` smoke runs with `-fa on -ctv tbq3_0`, `tbq4_0`, and `q4_polar` -> all generated tokens and exited 0
- `llama-completion` smoke runs with the same three cache types -> all exited 0

Node/web HTTP path:
- `llama-server` built and served the real GGUF model on `127.0.0.1:19058`
- `/completion` HTTP requests for `tbq3_0`, `tbq4_0`, and `q4_polar` each returned JSON with `tokens_predicted: 4` and exited 0

iOS / Apple-platform packaging and runtime:
- `xcrun -sdk iphoneos metal ... ggml-metal.metal ...` -> passed, warnings only
- `xcrun -sdk iphonesimulator metal ... ggml-metal.metal ...` -> passed, warnings only
- `ELIZA_MTP_FORCE_REBUILD=1 node packages/app-core/scripts/build-llama-cpp-mtp.mjs --target ios-arm64-metal` -> passed
- `ELIZA_MTP_FORCE_REBUILD=1 node packages/app-core/scripts/build-llama-cpp-mtp.mjs --target ios-arm64-simulator-metal` -> passed
- `node packages/app-core/scripts/ios-xcframework/build-xcframework.mjs --output /tmp/LlamaCpp-9258.xcframework --verify` -> passed; device and simulator kernel/runtime symbol audits passed, slices `ios/arm64` and `ios-simulator/arm64`
- `bun run --cwd packages/app build:ios:local:sim` -> passed with `** BUILD SUCCEEDED **`
- `run-physical-device-smoke.mjs` -> passed on an iPhone 16 Pro Max; `testLibElizaInferenceAbiV1CallsMatchHeader`, `testLlamaKernelAndVoiceSymbolsResolve`, and `testMetalDeviceIsAvailableOnPhysicalIos` passed, optional benchmark skipped because no model was bundled

Repo/package gates:
- `bun run --cwd plugins/plugin-native-llama test` -> 4 files passed, 35 tests passed
- `bun run --cwd plugins/plugin-local-inference test` -> 201 files passed, 1 skipped; 2065 tests passed, 13 skipped
- `bun install` -> passed
- `bun run verify` -> passed, `509 successful, 509 total`

Evidence:
- `.github/issue-evidence/9258-metal-v-cache-set-rows.md`
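
For reference, the `/completion` check from the Node/web HTTP path above could be sketched roughly as follows. This is a minimal illustration, not the script used for this PR. The helper names (`buildCompletionRequest`, `isSmokePass`, `smokeCompletion`), the prompt text, and the `n_predict` value of 4 are assumptions chosen to match the reported `tokens_predicted: 4`. The `prompt` and `n_predict` request fields and the `tokens_predicted` response field follow llama-server's `/completion` endpoint.

```typescript
// Request body for llama-server's POST /completion endpoint.
interface CompletionRequest {
  prompt: string;
  n_predict: number;
}

// Hypothetical helper: the prompt text and token budget are assumptions.
function buildCompletionRequest(prompt: string, nPredict: number): CompletionRequest {
  return { prompt, n_predict: nPredict };
}

// Hypothetical helper: true only when the parsed JSON response reports
// exactly the expected number of predicted tokens.
function isSmokePass(response: unknown, expectedTokens: number): boolean {
  if (typeof response !== "object" || response === null) return false;
  const predicted = (response as Record<string, unknown>)["tokens_predicted"];
  return predicted === expectedTokens;
}

// The transport is injected so the sketch does not assume a specific HTTP client.
type PostJson = (url: string, body: CompletionRequest) => Promise<unknown>;

// Sends one request to <baseUrl>/completion and checks the reported token count.
async function smokeCompletion(
  post: PostJson,
  baseUrl: string,
  nPredict: number = 4,
): Promise<boolean> {
  const body = buildCompletionRequest("Hello", nPredict);
  const json = await post(`${baseUrl}/completion`, body);
  return isSmokePass(json, nPredict);
}
```

With a `fetch`-based `post` function that sends the body as JSON and parses the reply, `smokeCompletion(post, "http://127.0.0.1:19058")` would be run once per server launch, with the server started separately for each of the three cache types.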