Note that whisper.cpp merged ggml-org/whisper.cpp#398 and ggml-org/whisper.cpp#2816 to add big-endian support with a similar theory of implementation.
If you intend to run the same test on your system, you should point the test at the big-endian test model.
We have an existing tool (gguf_convert_endian.py) that converts GGUF files between byte orders.
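For reference, the conversion script lives under gguf-py in the llama.cpp tree and takes the model path plus a target byte order, e.g. something like `python gguf_convert_endian.py model.gguf big` (I am going from memory here; check the script's `--help` for the exact path and arguments).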
This relates to #3552 and #3957.
I want to start a broader discussion about whether and how to support endianness (byte order) differences between host platforms and models, and to push for a coherent, consistent path forward.
While the codebase (as of at least 2376b77) compiles on big-endian targets (e.g. ppc64), it quickly runs into errors when loading little-endian .gguf models, as early as the built-in test suite.
This assertion ("garbage file version") fails due to a check that reads the file version in host byte order. The on-disk start of the file is the four-byte magic `GGUF` (`47 47 55 46`) followed by the version as a little-endian `uint32` (e.g. `03 00 00 00` for version 3).
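To illustrate the failure, here is a minimal standalone sketch (not the actual loader code) of what a native-endian read of that version field does:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    // First 8 bytes of a little-endian GGUF v3 file: magic "GGUF",
    // then version = 3 stored as 03 00 00 00.
    const uint8_t header[8] = {'G', 'G', 'U', 'F', 0x03, 0x00, 0x00, 0x00};

    uint32_t version;
    memcpy(&version, header + 4, sizeof(version)); // native-endian read

    // Prints 3 on a little-endian host, but 50331648 (0x03000000) on a
    // big-endian host -- a "garbage file version".
    printf("version = %u\n", version);
    return 0;
}
```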
If the proposal to use `GGUF` magic for LE files and `FUGG` magic for BE files is to be implemented (a detection sketch follows the list below):

- Transmitting and storing one copy of each would require twice the disk space while providing no new information.
- Conversion tools will require maintenance as the format evolves, and will require robust validation to ensure they work flawlessly.
- Robust file format specifications will need to be laid out, including whether all fields/values/parameters, or only a subset, are to be interpreted as LE or BE.

I cannot speak to the performance impact on model loading times, or to whether conversion should happen on the fly.
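For concreteness, a loader under that proposal might detect the file's byte order from the magic roughly as follows. This is a hypothetical sketch; none of these names exist in llama.cpp:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical: which byte order a file declares via its magic.
enum class gguf_file_order { little, big, unknown };

static gguf_file_order detect_order(const uint8_t magic[4]) {
    if (memcmp(magic, "GGUF", 4) == 0) return gguf_file_order::little;
    if (memcmp(magic, "FUGG", 4) == 0) return gguf_file_order::big;
    return gguf_file_order::unknown; // not a GGUF file; caller rejects
}

static bool host_is_little_endian() {
    const uint16_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

// The loader would then byte-swap every multi-byte field iff the
// file's declared order differs from the host's.
static bool needs_byteswap(gguf_file_order order) {
    return host_is_little_endian() != (order == gguf_file_order::little);
}
```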
If we instead patch GGUF files in place so that they match the host byte order, then it will be difficult to ensure such a conversion is done correctly, as the conversion tool(s) themselves will also need to be endian-correct whether they are run on an LE or BE host. This will not prevent cross-endian GGUF files from failing to run on the opposite host, and many tools will likely choke on them as well.
I believe ensuring that data are encoded and interpreted correctly is a software (inference engine) issue, not strictly a file format issue. The file format (the data exchange vehicle) should probably be consistent (effectively serialized) regardless of the byte order of the system that produced the GGUF; in other words, host byte order should not affect the data written to disk. The file format should specify how the data are to be interpreted.
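As a concrete example of what that looks like in code (a sketch, not a patch against the actual loader):

```cpp
#include <cstdint>

// Decode a uint32 that the format defines as little-endian on disk,
// byte by byte. The result is correct on any host, so the host's
// native byte order never leaks into the interpretation of the file.
static uint32_t read_u32_le(const uint8_t * p) {
    return (uint32_t) p[0]
         | (uint32_t) p[1] << 8
         | (uint32_t) p[2] << 16
         | (uint32_t) p[3] << 24;
}
```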
Similarly, llama.cpp should behave the same (modulo any performance penalties) regardless of the input file. Suppose we fix the loader so that it is endian-agnostic: further byte-order assumptions elsewhere in the codebase are then revealed.
While BE systems are nowhere near as common as LE systems, they do exist. My proposal is to fix all byte-order assumptions in the llama.cpp codebase, but I want to know whether this topic has already been explored in depth and whether/where to direct my effort and others'.