
RotorQuant / PlanarQuant / IsoQuant extremely slow on Metal with high graph splits (>2) #7

@lvirbalas

Hello,

I tested iso3 and polar3 on a build from your llama.cpp fork. I built it with the command provided in README.md for Metal.

The issue is that when loading different models, all of them report:
graph splits >> 2

This usually happens when an op isn't fully supported or fused on Metal, or when a kernel is missing. Parts of the compute graph then fall back to the CPU.
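To illustrate why a single unsupported op can inflate the split count, here is a toy model in Python. It is only a sketch, not the actual ggml scheduler: the op names and the `count_splits` helper are hypothetical, and the absolute numbers will not match what llama.cpp logs. It only shows that each op forced onto the CPU breaks the graph into more backend-specific segments.

```python
# Toy model only: this is NOT ggml's scheduler. It just counts how many
# contiguous runs of ops end up on the same backend.

def count_splits(ops, gpu_supported):
    """Return the number of contiguous same-backend segments in `ops`."""
    splits = 0
    previous_backend = None
    for op in ops:
        # An op without a GPU kernel is assumed to fall back to the CPU.
        backend = "gpu" if op in gpu_supported else "cpu"
        if backend != previous_backend:
            splits += 1
            previous_backend = backend
    return splits

# Hypothetical op names, chosen for illustration.
graph = ["embed", "attn", "kv_dequant", "matmul", "attn", "kv_dequant", "matmul"]

all_supported = {"embed", "attn", "kv_dequant", "matmul"}
missing_kernel = all_supported - {"kv_dequant"}

print(count_splits(graph, all_supported))   # 1
print(count_splits(graph, missing_kernel))  # 5
```

In this sketch every CPU segment also implies a transfer to and from the GPU, which is the kind of overhead that would explain the CPU load and slow prefill.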

The result is very high CPU load, GPU underutilization, and prefill speeds below 50 t/s, where I would expect more than 300 t/s.

If I run the same models on the original TurboQuant fork with turbo3, they all nicely show:
graph splits = 2
(and there is no noticeable CPU load)
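For anyone who wants to compare the two builds across several models, a small Python helper like the one below could pull the split counts out of saved logs. It is a sketch that assumes the log contains the text `graph splits = N` as quoted above; the `find_graph_splits` name and the sample log text are made up for illustration, and the second value is a placeholder, not a measurement.

```python
import re

# Matches the "graph splits = N" text quoted in this issue.
SPLITS_RE = re.compile(r"graph splits\s*=\s*(\d+)")

def find_graph_splits(log_text):
    """Return every split count found in the given log text."""
    return [int(m.group(1)) for m in SPLITS_RE.finditer(log_text)]

# Made-up sample text; the value 57 is a placeholder, not a real result.
sample_log = "graph splits = 2\nsome other line\ngraph splits = 57\n"

for n in find_graph_splits(sample_log):
    status = "ok" if n <= 2 else "possible CPU fallback"
    print(n, status)
```

Running it on the captured output of each build would make it easy to see which model and cache type combinations exceed 2 splits.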

Is this issue expected and simply not implemented yet? Or is it actually a bug to dig into further?

Hardware: MacBook Pro M3 Max
