**AI inference engine for embedded and mobile platforms**
qvac-fabric-llm.cpp is a specialized fork of llama.cpp optimized for embedded systems, mobile devices, and enterprise deployment scenarios. It extends the excellent foundation of llama.cpp with additional capabilities focused on memory-based model loading, mobile GPU optimization, and flexible integration patterns.
Load models directly from memory buffers instead of files - essential for:
- Embedded systems where filesystem access is limited or unavailable
- WebAssembly deployments where models are fetched over network
- Encrypted model storage where models are decrypted in memory
- Streaming scenarios where models are received over network connections
#include "llama-cpp.h"
// Load model from memory buffer
std::vector<uint8_t> model_data = /* load from network, decrypt, etc. */;
auto model = llama_model_load_from_buffer(std::move(model_data), params);
// Or use split model loading with async fulfillment
auto model = llama_model_load_from_split_futures(paths, n_paths, context, tensor_list, params);
// ... later, fulfill splits as they become available
llama_model_load_fulfill_split_future(path, context, std::move(streambuf));Enhanced GPU support with specific optimizations for Qualcomm Adreno GPUs:
- Support for running quantized LLMs (Q4_0, Q8) on Adreno 700+ GPUs
- Both Vulkan and OpenCL backends supported on Adreno GPUs
- Adreno-specific shader variants for improved performance
- Q4_K-optimized `mul_mat_vec` operations
- Vulkan Memory Allocator (VMA) integration for efficient memory management, with specific improvements for Google Pixel devices
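Offloading work to these GPU backends uses the standard llama.cpp controls. A minimal sketch using the upstream C API; the model filename, layer count, and context size are illustrative placeholders:

```cpp
#include "llama.h"

int main() {
    llama_backend_init();

    // Load any dynamically built backends (Vulkan / OpenCL are picked up when compiled in)
    ggml_backend_load_all();

    // Offload as many layers as possible to the GPU
    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;

    llama_model * model = llama_model_load_from_file("model-q4_0.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 4096;
    llama_context * ctx = llama_init_from_model(model, cparams);

    // ... run inference as usual ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```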
```bash
git clone https://github.com/tetherto/qvac-fabric-llm.cpp.git
cd qvac-fabric-llm.cpp
```

```bash
# Standard build
cmake -B build
cmake --build build --config Release

# With Vulkan support (Android, Windows, Linux)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# With Metal support (macOS, iOS)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
```

For more detailed build instructions, see docs/build.md.
```bash
# Run a model with Vulkan GPU acceleration, 4096 context length, and a prompt
./build/bin/llama-cli -m model.gguf -ngl 99 -c 4096 -p "Explain quantum computing in simple terms"
```

| Platform | Backend | Status |
|---|---|---|
| Linux (x86_64, ARM64) | CPU, Vulkan, CUDA | ✅ Full support |
| macOS (Intel, Apple Silicon) | CPU, Metal | ✅ Full support |
| Windows (x86_64) | CPU, Vulkan, CUDA | ✅ Full support |
| Android (ARM64) | CPU, Vulkan, OpenCL | ✅ Full support |
| iOS | CPU, Metal | ✅ Full support |
qvac-fabric-llm.cpp is built on top of llama.cpp, an excellent open-source LLM inference engine. We regularly synchronize with upstream llama.cpp releases to incorporate the latest improvements, bug fixes, and model support.
Current upstream version: b7248
- Memory-based model loading API
- Vulkan Memory Allocator (VMA) integration (optimized for Pixel 9 devices)
- Adreno 700+ GPU support for quantized LLMs (Q4_0, Q8)
- Vulkan and OpenCL backends for Adreno GPUs
- Adreno GPU-specific shader optimizations
- Performance profiling tools
All standard llama.cpp functionality, models, and APIs remain fully compatible. You can:
- Use any GGUF model supported by llama.cpp
- Use all standard CLI tools (`llama-cli`, `llama-server`, etc.; see the example below)
- Follow llama.cpp documentation for general usage
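For instance, the upstream `llama-server` workflow is unchanged. A minimal sketch; the model path, context size, and port are placeholders:

```bash
# Start an OpenAI-compatible HTTP server with GPU offload
./build/bin/llama-server -m model.gguf -ngl 99 -c 4096 --port 8080

# Query the standard chat completions endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```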
qvac-fabric-llm.cpp supports all models compatible with llama.cpp. Models must be in GGUF format. Convert from other formats using the provided Python scripts or use pre-converted models from Hugging Face.
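The conversion flow follows upstream llama.cpp. A rough sketch, assuming a local Hugging Face model directory and the repository's Python requirements are installed (paths and filenames are placeholders):

```bash
# Convert a Hugging Face model to GGUF (FP16)
python convert_hf_to_gguf.py /path/to/hf-model --outtype f16 --outfile model-f16.gguf

# Quantize to Q4_0 (one of the formats supported on Adreno GPUs)
./build/bin/llama-quantize model-f16.gguf model-q4_0.gguf Q4_0
```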
We welcome contributions! Please see our development workflow:
- Fork the repository
- Create a feature branch from `master`
- Submit a pull request
MIT License - see LICENSE for details.
qvac-fabric-llm.cpp is built on llama.cpp by Georgi Gerganov and contributors.
For additional documentation, refer to the llama.cpp documentation.