Problem
The Vulkan allocator's memory type selection on discrete GPUs allows fallback to host-visible memory types (lines 320-331 in allocator.cpp). This causes critical performance degradation:
// Current problematic discrete GPU path:
preferred_memory_types = {
    vk::MemoryPropertyFlagBits::eDeviceLocal,      // 1st choice
    vk::MemoryPropertyFlagBits::eDeviceLocal |     // 2nd choice - PROBLEMATIC
        vk::MemoryPropertyFlagBits::eHostVisible |
        vk::MemoryPropertyFlagBits::eHostCoherent,
    vk::MemoryPropertyFlagBits::eHostVisible |     // 3rd choice - VERY SLOW
        vk::MemoryPropertyFlagBits::eHostCoherent,
    // ...
};
When device-local memory is exhausted, compute buffers (GEMM weights, KV-cache, activations) can end up in:
- BAR memory (device+host visible) - slower for GPU-only access
- Pure host-visible memory - PCIe bandwidth bottleneck
Impact on decode performance:
- GEMM weights in host-visible memory → PCIe-bound throughput
- KV-cache spill to host → decode throughput collapse
- The ~8.5 tok/s wall is partially caused by this
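To make the PCIe-bound claim concrete, here is a back-of-the-envelope sketch. It assumes batch-size-1 decode on a dense model, where each generated token reads roughly all the weights once, so throughput is bounded by link bandwidth divided by bytes read per token. The function name and the numbers in the usage note are illustrative assumptions, not measurements from this issue.

```cpp
// Rough upper bound on decode rate when every token must stream
// `bytes_per_token` bytes over a link with the given bandwidth.
// Hypothetical helper for illustration; it ignores compute time,
// caching, and latency, so real throughput will be lower.
double max_tokens_per_second(double bandwidth_bytes_per_s,
                             double bytes_per_token) {
  return bandwidth_bytes_per_s / bytes_per_token;
}
```

With an assumed 8 GB of weights, a link that sustains about 32 GB/s (the order of magnitude of a PCIe 4.0 x16 slot) caps decode at about 4 tok/s, while an assumed 512 GB/s of VRAM bandwidth would cap it at about 64 tok/s. The exact figures depend on the model and hardware, but the gap is more than an order of magnitude.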
Why This Matters
Both reference runtimes (ggml-vulkan, Zinc) explicitly prefer device-local VRAM for compute buffers:
- ggml-vulkan uses VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT for compute tensors
- Zinc allocates GPU-private memory for weights/activations, only using mapped memory for explicit transfers
MLX's fallback chain allows host-visible selection, which destroys sustained bandwidth on discrete GPUs.
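Tasks
- […] eDeviceLocal memory on discrete GPUs (no host-visible fallback)
- […] MLX_VULKAN_FORCE_DEVICE_LOCAL=1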
Acceptance Criteria
- All compute buffers (weights, KV-cache, activations) allocated in device-local VRAM on discrete GPUs
- Staging uploads/downloads still work correctly
- Qwen3 decode throughput improves measurably on discrete GPUs
- No correctness regressions
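One possible way to satisfy the first two criteria is to choose the preference list per buffer use. The sketch below is not the code in allocator.cpp: `BufferUse`, `Preference`, `preferences_for`, and `pick_memory_type` are hypothetical names, and plain integers stand in for the Vulkan flag types so the example compiles without the SDK. It also assumes that compute buffers should reject host-visible types outright and that staging buffers should prefer plain host memory over BAR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit values match VkMemoryPropertyFlagBits in the Vulkan specification.
constexpr uint32_t kDeviceLocal  = 0x1;  // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t kHostVisible  = 0x2;  // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t kHostCoherent = 0x4;  // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

// Hypothetical usage classes (assumption, not from allocator.cpp).
enum class BufferUse { Compute, Staging };

// A memory type matches when it has every `required` bit and no `forbidden` bit.
struct Preference {
  uint32_t required;
  uint32_t forbidden;
};

// Ordered preferences for a discrete GPU.
std::vector<Preference> preferences_for(BufferUse use) {
  if (use == BufferUse::Compute) {
    // Pure VRAM only: no BAR and no host-visible fallback.
    return {Preference{kDeviceLocal, kHostVisible}};
  }
  // Staging: plain host memory first, then any mappable coherent type.
  return {Preference{kHostVisible | kHostCoherent, kDeviceLocal},
          Preference{kHostVisible | kHostCoherent, 0}};
}

// type_flags[i] holds the property flags of memory type i, and type_bits is
// the memoryTypeBits mask from the buffer's memory requirements.
// Returns the first matching index for the earliest satisfiable preference,
// or -1 so the caller can report out-of-memory instead of spilling to host.
int pick_memory_type(const std::vector<uint32_t>& type_flags,
                     uint32_t type_bits,
                     const std::vector<Preference>& preferences) {
  for (const Preference& p : preferences) {
    for (std::size_t i = 0; i < type_flags.size() && i < 32; ++i) {
      const bool allowed = ((type_bits >> i) & 1u) != 0;
      const bool has_required = (type_flags[i] & p.required) == p.required;
      const bool has_forbidden = (type_flags[i] & p.forbidden) != 0;
      if (allowed && has_required && !has_forbidden) {
        return static_cast<int>(i);
      }
    }
  }
  return -1;
}
```

Returning -1 for a compute buffer, rather than falling through to a host-visible type, turns a silent slowdown into an explicit allocation failure that the caller can handle, for example by freeing cached buffers and retrying.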
Code References
mlx/backend/vulkan/allocator.cpp:312-331 (malloc() memory type selection)
mlx/backend/vulkan/allocator.cpp:357-361 (mapped_ptr handling)
mlx/backend/vulkan/device.cpp (staging arena management)
Related