[Vulkan] Force device-local memory allocation on discrete GPUs #25

@goniz

Problem

On discrete GPUs, the Vulkan allocator's memory type selection can fall back to host-visible memory types (lines 320-331 in allocator.cpp), which severely degrades performance. The current preference order is:

// Current problematic discrete GPU path:
preferred_memory_types = {
    vk::MemoryPropertyFlagBits::eDeviceLocal,                    // 1st choice
    vk::MemoryPropertyFlagBits::eDeviceLocal |                    // 2nd choice - PROBLEMATIC
        vk::MemoryPropertyFlagBits::eHostVisible |
        vk::MemoryPropertyFlagBits::eHostCoherent,
    vk::MemoryPropertyFlagBits::eHostVisible |                    // 3rd choice - VERY SLOW
        vk::MemoryPropertyFlagBits::eHostCoherent,
    // ...
};

When device-local memory is exhausted, compute buffers (GEMM weights, KV-cache, activations) can end up in:

  • BAR memory (device-local + host-visible) - can be slower for GPU-only access
  • Pure host-visible memory - PCIe bandwidth bottleneck
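To make the failure mode concrete, here is a minimal, Vulkan-free sketch of how a preference-ordered chain like the one above can land on a host-visible type once VRAM is full. The flag constants use the same numeric values as VkMemoryPropertyFlagBits, but `MemoryType`, its `heap_has_room` field, and `pick_with_fallback` are hypothetical stand-ins for illustration, not the code in allocator.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same numeric values as the corresponding VkMemoryPropertyFlagBits.
constexpr std::uint32_t kDeviceLocal  = 0x1;
constexpr std::uint32_t kHostVisible  = 0x2;
constexpr std::uint32_t kHostCoherent = 0x4;

// Hypothetical stand-in for a Vulkan memory type, plus a flag saying whether
// its backing heap can still satisfy the request.
struct MemoryType {
    std::uint32_t flags;
    bool heap_has_room;
};

// Walk the preference list in order and return the index of the first memory
// type that contains all requested bits and still has room, or -1 if none does.
int pick_with_fallback(const std::vector<MemoryType>& types,
                       const std::vector<std::uint32_t>& preferred) {
    assert(!preferred.empty());  // a chain must name at least one option
    for (std::uint32_t want : preferred) {
        for (std::size_t i = 0; i < types.size(); ++i) {
            if ((types[i].flags & want) == want && types[i].heap_has_room) {
                return static_cast<int>(i);
            }
        }
    }
    return -1;
}
```

With three types (pure device-local, BAR, and pure host-visible) and both device-local heaps marked full, the three-entry chain returns the host-visible type's index, while a chain containing only `kDeviceLocal` returns -1 and lets the caller report the out-of-memory condition instead of silently degrading.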

Impact on decode performance:

  • GEMM weights in host-visible memory → PCIe-bound throughput
  • KV-cache spill to host → decode throughput collapse
  • The ~8.5 tok/s wall is partially caused by this
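The bandwidth argument can be sketched as a back-of-envelope bound: if decode is memory-bound and each generated token streams the weights once, throughput cannot exceed memory bandwidth divided by bytes read per token. The function name and the figures in the note below are illustrative assumptions (a hypothetical 4 GB of weights, roughly 32 GB/s theoretical for a PCIe 4.0 x16 link, and a hypothetical 500 GB/s of VRAM bandwidth); none of them are measurements from this issue.

```cpp
#include <cassert>

// Upper bound on decode throughput (tokens/s) for a memory-bound model,
// assuming every token reads `bytes_per_token` bytes from memory exactly once.
inline double max_tokens_per_second(double bandwidth_bytes_per_s,
                                    double bytes_per_token) {
    assert(bytes_per_token > 0.0);
    return bandwidth_bytes_per_s / bytes_per_token;
}
```

Under those assumed figures, `max_tokens_per_second(32e9, 4e9)` gives 8 tokens/s over PCIe, against 125 tokens/s for `max_tokens_per_second(500e9, 4e9)` from VRAM. Real ceilings depend on model size, quantization, and hardware.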

Why This Matters

Both reference runtimes (ggml-vulkan, Zinc) explicitly prefer device-local VRAM for compute buffers:

  • ggml-vulkan uses VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT for compute tensors
  • Zinc allocates GPU-private memory for weights/activations, only using mapped memory for explicit transfers

MLX's fallback chain allows host-visible selection, which destroys sustained bandwidth on discrete GPUs.

Tasks

  • Force pure eDeviceLocal memory on discrete GPUs (no host-visible fallback)
  • Use separate staging buffers (in host-visible memory) only for explicit upload/download
  • Add environment variable override: MLX_VULKAN_FORCE_DEVICE_LOCAL=1
  • Add tracing to log memory type selection per allocation
  • Benchmark Qwen3 decode before/after on discrete GPU (AMD/NVIDIA)
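The first, third, and fourth tasks could fit together roughly as follows. This is a sketch under stated assumptions rather than the intended implementation: the helper names, the trace format, and the reading of MLX_VULKAN_FORCE_DEVICE_LOCAL (the value "1" forces the strict policy on any GPU, and an unset variable defaults to strict on discrete GPUs only) are all hypothetical. It avoids the Vulkan headers by reusing the numeric values of the VkMemoryPropertyFlagBits constants; `type_bits` plays the role of VkMemoryRequirements::memoryTypeBits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <vector>

constexpr std::uint32_t kDeviceLocal = 0x1;  // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr std::uint32_t kHostVisible = 0x2;  // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT

// Assumed semantics: "1" forces the strict policy everywhere; otherwise the
// policy is strict only on discrete GPUs.
bool resolve_force_device_local(const char* env_value, bool is_discrete_gpu) {
    if (env_value != nullptr && std::strcmp(env_value, "1") == 0) return true;
    return is_discrete_gpu;
}

// Convenience wrapper that reads the variable from the process environment.
bool force_device_local(bool is_discrete_gpu) {
    return resolve_force_device_local(
        std::getenv("MLX_VULKAN_FORCE_DEVICE_LOCAL"), is_discrete_gpu);
}

// Return the index of the first memory type permitted by `type_bits` that is
// device-local and not host-visible, or -1 if there is none. No fallback to
// host-visible types is attempted; each decision is written to `trace`.
int pick_pure_device_local(const std::vector<std::uint32_t>& type_flags,
                           std::uint32_t type_bits, std::ostream& trace) {
    assert(type_flags.size() <= 32);  // Vulkan exposes at most 32 memory types
    for (std::uint32_t i = 0; i < type_flags.size(); ++i) {
        const bool allowed = ((type_bits >> i) & 1u) != 0;
        const bool device_local = (type_flags[i] & kDeviceLocal) != 0;
        const bool host_visible = (type_flags[i] & kHostVisible) != 0;
        if (allowed && device_local && !host_visible) {
            trace << "[vulkan-alloc] selected memory type " << i << " (flags=0x"
                  << std::hex << type_flags[i] << std::dec << ")\n";
            return static_cast<int>(i);
        }
    }
    trace << "[vulkan-alloc] no pure device-local memory type available\n";
    return -1;
}
```

A caller would pass `std::cerr` (or a log stream) as `trace` and treat -1 as an allocation failure for compute buffers. Staging buffers would go through a separate path that requests host-visible memory explicitly.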

Acceptance Criteria

  • All compute buffers (weights, KV-cache, activations) allocated in device-local VRAM on discrete GPUs
  • Staging uploads/downloads still work correctly
  • Qwen3 decode throughput improves measurably on discrete GPUs
  • No correctness regressions

Code References

  • mlx/backend/vulkan/allocator.cpp:312-331 (malloc() memory type selection)
  • mlx/backend/vulkan/allocator.cpp:357-361 (mapped_ptr handling)
  • mlx/backend/vulkan/device.cpp (staging arena management)
