[Vulkan] Force device-local memory allocation on discrete GPUs #25

## Problem

On discrete GPUs, the Vulkan allocator's memory type selection can fall back to host-visible memory types (lines 320-331 in `allocator.cpp`), which severely degrades performance:

```cpp
// Current problematic discrete GPU path:
preferred_memory_types = {
    vk::MemoryPropertyFlagBits::eDeviceLocal,                    // 1st choice
    vk::MemoryPropertyFlagBits::eDeviceLocal |                    // 2nd choice - PROBLEMATIC
        vk::MemoryPropertyFlagBits::eHostVisible |
        vk::MemoryPropertyFlagBits::eHostCoherent,
    vk::MemoryPropertyFlagBits::eHostVisible |                    // 3rd choice - VERY SLOW
        vk::MemoryPropertyFlagBits::eHostCoherent,
    // ...
};
```

When device-local memory is exhausted, compute buffers (GEMM weights, KV-cache, activations) can end up in:
- BAR memory (device-local and host-visible): slower for GPU-only access
- Pure host-visible memory: bottlenecked by PCIe bandwidth

**Impact on decode performance:**
- GEMM weights in host-visible memory → PCIe-bound throughput
- KV-cache spill to host → decode throughput collapse
- This fallback is one contributor to the current ~8.5 tok/s decode wall
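
A rough way to see why placement matters is a bandwidth bound: if each decoded token has to stream a fixed number of bytes, throughput cannot exceed the sustained bandwidth divided by the bytes read per token. The sketch below is illustrative only. The function name is hypothetical, and the figures in the note after it are assumptions, not measurements from MLX.

```cpp
#include <cassert>

// Upper bound on decode throughput (tokens/s) when every generated token
// must stream `bytes_per_token` bytes (weights plus KV-cache reads) from
// memory that sustains `bandwidth_bytes_per_s`.
// Ignores compute time and latency, so real throughput will be lower.
inline double max_tokens_per_second(double bandwidth_bytes_per_s,
                                    double bytes_per_token) {
    assert(bandwidth_bytes_per_s > 0.0 && bytes_per_token > 0.0);
    return bandwidth_bytes_per_s / bytes_per_token;
}
```

With a hypothetical 8 GB read per token, a link sustaining about 32 GB/s (roughly the theoretical peak of PCIe 4.0 x16) caps decode at 4 tok/s, while a hypothetical 512 GB/s of VRAM bandwidth would allow up to 64 tok/s.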

## Why This Matters

Both reference runtimes (ggml-vulkan, Zinc) explicitly prefer device-local VRAM for compute buffers:

- ggml-vulkan uses `VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT` for compute tensors
- Zinc allocates GPU-private memory for weights/activations, only using mapped memory for explicit transfers

MLX's fallback chain allows host-visible memory to be selected, which destroys sustained bandwidth on discrete GPUs.

## Tasks

- [ ] Force pure `eDeviceLocal` memory on discrete GPUs (no host-visible fallback)
- [ ] Use separate host-visible staging buffers only for explicit upload/download
- [ ] Add environment variable override: `MLX_VULKAN_FORCE_DEVICE_LOCAL=1`
- [ ] Add tracing to log memory type selection per allocation
- [ ] Benchmark Qwen3 decode before/after on discrete GPU (AMD/NVIDIA)
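
The first and third tasks could be sketched as follows. This is a minimal sketch, not the code in `allocator.cpp`. The constants are stand-ins mirroring the `VkMemoryPropertyFlagBits` values so it compiles without the Vulkan headers. `MemoryType`, `find_type`, `select_compute_memory_type`, and `env_forces_device_local` are hypothetical names. It assumes that `MLX_VULKAN_FORCE_DEVICE_LOCAL=1` turns on the strict path, which the task list does not spell out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <optional>
#include <vector>

// Stand-ins for VkMemoryPropertyFlagBits (values as defined by the Vulkan spec).
constexpr uint32_t kDeviceLocal  = 0x1;  // VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
constexpr uint32_t kHostVisible  = 0x2;  // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr uint32_t kHostCoherent = 0x4;  // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

// Hypothetical simplified view of one entry of
// VkPhysicalDeviceMemoryProperties::memoryTypes.
struct MemoryType { uint32_t property_flags; };

// Assumed semantics: the override is active only when the variable is "1".
inline bool env_forces_device_local() {
    const char* v = std::getenv("MLX_VULKAN_FORCE_DEVICE_LOCAL");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// First memory type allowed by `type_bits` (as in
// VkMemoryRequirements::memoryTypeBits) that has every `required` flag
// and none of the `forbidden` flags.
inline std::optional<uint32_t> find_type(const std::vector<MemoryType>& types,
                                         uint32_t type_bits,
                                         uint32_t required,
                                         uint32_t forbidden) {
    assert(types.size() <= 32);  // Vulkan exposes at most 32 memory types
    for (uint32_t i = 0; i < types.size(); ++i) {
        const uint32_t flags = types[i].property_flags;
        const bool allowed = ((type_bits >> i) & 1u) != 0;
        if (allowed && (flags & required) == required && (flags & forbidden) == 0) {
            return i;
        }
    }
    return std::nullopt;
}

// strict == true: only pure device-local memory is acceptable; the caller
// gets std::nullopt and can report the failure instead of silently spilling.
// strict == false: keeps the fallback chain shown in the Problem section.
inline std::optional<uint32_t> select_compute_memory_type(
    const std::vector<MemoryType>& types, uint32_t type_bits, bool strict) {
    if (auto t = find_type(types, type_bits, kDeviceLocal, kHostVisible)) return t;
    if (strict) return std::nullopt;
    if (auto t = find_type(types, type_bits,
                           kDeviceLocal | kHostVisible | kHostCoherent, 0)) return t;
    return find_type(types, type_bits, kHostVisible | kHostCoherent, 0);
}
```

A caller might compute `strict` as `is_discrete_gpu || env_forces_device_local()`, where `is_discrete_gpu` would come from the physical device type. Returning `std::nullopt` rather than falling back makes exhaustion of VRAM visible to the caller, which then has to decide how to handle it.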

## Acceptance Criteria

- All compute buffers (weights, KV-cache, activations) allocated in device-local VRAM on discrete GPUs
- Staging uploads/downloads still work correctly
- Qwen3 decode throughput improves measurably on discrete GPUs
- No correctness regressions
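
To check the first criterion, each allocation could be logged with the property flags of the memory type it landed in. The helpers below are a hedged sketch of such tracing. `describe_memory_flags`, `is_pure_device_local`, `format_allocation_trace`, and the log format are all hypothetical, and the bit values are stand-ins mirroring `VkMemoryPropertyFlagBits`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// One named bit of VkMemoryPropertyFlagBits (values from the Vulkan spec).
struct FlagName {
    uint32_t bit;
    const char* text;
};

// Render memory property flags as a readable string for log output.
inline std::string describe_memory_flags(uint32_t flags) {
    static const FlagName names[] = {
        {0x1, "DEVICE_LOCAL"},
        {0x2, "HOST_VISIBLE"},
        {0x4, "HOST_COHERENT"},
        {0x8, "HOST_CACHED"},
    };
    std::string out;
    for (const FlagName& n : names) {
        if ((flags & n.bit) != 0) {
            if (!out.empty()) out += '|';
            out += n.text;
        }
    }
    return out.empty() ? "NONE" : out;
}

// True when the memory is device-local and not host-visible, which is
// what the acceptance criteria expect for compute buffers.
inline bool is_pure_device_local(uint32_t flags) {
    return (flags & 0x1) != 0 && (flags & 0x2) == 0;
}

// Build one trace line per allocation (hypothetical format).
inline std::string format_allocation_trace(std::size_t size_bytes,
                                           uint32_t type_index,
                                           uint32_t flags) {
    return "[vulkan-alloc] size=" + std::to_string(size_bytes) +
           " type=" + std::to_string(type_index) +
           " flags=" + describe_memory_flags(flags);
}
```

Counting the allocations for which `is_pure_device_local` returns false during a decode run would give a quick regression signal alongside the throughput benchmark.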

## Code References

- `mlx/backend/vulkan/allocator.cpp:312-331` (`malloc()` memory type selection)
- `mlx/backend/vulkan/allocator.cpp:357-361` (mapped_ptr handling)
- `mlx/backend/vulkan/device.cpp` (staging arena management)

## Related

- #3 (persistent staging allocator - related but this is about memory type selection, not reuse)
- #15 (resource-aware decode - this enables that work by ensuring resources are in fast memory)
- #19 (performance tracking issue)

