RotorQuant / PlanarQuant / IsoQuant extremely slow on Metal with high graph splits (>2) #7

Hello,

I tested iso3 and polar3 on a build of your llama.cpp fork, built with the Metal command given in README.md.

The issue is that every model I load reports:
graph splits >> 2

This usually happens when an op isn't fully supported or fused on Metal, or when a kernel is missing. The compute graph then falls back to the CPU for those parts.
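For anyone who wants to compare runs, here is a minimal sketch of pulling the split count out of the load log. It assumes the log contains a line of the form `graph splits = N`, as in the output quoted in this issue. The helper names and the sample log strings are hypothetical and only for illustration, and the baseline of 2 is simply the value I see with turbo3.

```python
import re

# Assumption: the load log contains a line like "graph splits = N".
SPLITS_RE = re.compile(r"graph splits\s*=\s*(\d+)")


def graph_splits(log_text):
    """Return the last reported graph split count, or None if none is found."""
    matches = SPLITS_RE.findall(log_text)
    return int(matches[-1]) if matches else None


def looks_like_cpu_fallback(log_text, baseline=2):
    """True if the log reports more splits than the baseline seen with turbo3."""
    n = graph_splits(log_text)
    return n is not None and n > baseline


# Hypothetical log excerpts for illustration only (not real output).
sample_ok = "llama_context: graph nodes  = 1030\nllama_context: graph splits = 2\n"
sample_bad = "llama_context: graph splits = 66\n"

print(graph_splits(sample_ok), looks_like_cpu_fallback(sample_ok))
print(graph_splits(sample_bad), looks_like_cpu_fallback(sample_bad))
```

A count above the baseline does not prove a CPU fallback on its own, but together with high CPU load it is a strong hint that some ops are not running on the GPU.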

The result is very high CPU load, GPU underutilization, and prefill speeds below 50 t/s where they should be above 300 t/s.

If I run the same models on the original TurboQuant fork with turbo3, they all nicely show:
graph splits = 2
(and show no noticeable CPU utilization)

Is this expected because Metal support is simply not implemented yet, or is it a bug worth digging into further?

Hardware: MacBook Pro M3 Max
