Hello,
I tested iso3 and polar3 on a build from your llama.cpp fork. I built it with the command provided in README.md for Metal.
The issue is that when loading different models, all of them report:
graph splits >> 2
This usually happens if an op isn't fully supported or fused on Metal, or if a kernel is missing, so parts of the compute graph fall back to the CPU.
The result is very high CPU load, GPU underutilization, and prefill speeds below 50 t/s where they should be above 300 t/s.
If I run the same models on the original TurboQuant fork with turbo3, they all nicely show:
graph splits = 2
(and show no noticeable CPU load)
Is this expected, i.e. simply not implemented yet, or is it a bug worth digging into further?
Hardware: MacBook Pro M3 Max
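
In case it helps to compare runs, here is a minimal sketch for pulling the reported split counts out of saved logs. It assumes the log contains the text `graph splits = N` as quoted above. The function names and the baseline of 2 are only my own illustration, not part of llama.cpp.

```python
import re

# Assumption: the loader prints the count as "graph splits = N".
SPLITS_RE = re.compile(r"graph splits\s*=\s*(\d+)")


def graph_splits(log_text):
    """Return every 'graph splits = N' value found in the log text, in order."""
    return [int(m.group(1)) for m in SPLITS_RE.finditer(log_text)]


def has_fallback(log_text, expected=2):
    """Return True if any reported split count exceeds the expected baseline.

    The baseline of 2 is the value seen with turbo3 in this report;
    it is a hypothetical default, not a documented constant.
    """
    return any(n > expected for n in graph_splits(log_text))
```

Feeding it the captured stderr of one run per quantization type (iso3, polar3, turbo3) should make it easy to see which types trigger the extra splits.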