This guide explains how to convert LLMs for optimal use with the Apple Neural Engine (ANE).
Converting models for ANE requires special optimization techniques to achieve maximum performance. Our conversion process splits the model into distinct components and applies architecture-specific optimizations.
The conversion pipeline involves several key steps, with detailed progress tracking for each stage:

1. **Model Configuration Analysis**:
   - Load and analyze the model's configuration
   - Auto-detect the architecture type
   - Estimate the parameter count and recommend a chunk size

2. **Model Loading**:
   - Load model weights with memory optimization
   - Apply architecture-specific preprocessing

3. **Model Splitting**: Split the model into three components:
   - Embeddings layer (token embeddings)
   - Feed-Forward Network (transformer layers)
   - LM Head (token prediction)

4. **KV Cache Optimization**:
   - Create prefill models for efficient processing of long contexts
   - Optimize attention mechanisms based on the architecture

5. **Multi-Function Chunks**:
   - Merge FFN and prefill models to optimize weight sharing (see the sketch after this list)
   - Reduce total model size by approximately 50%

6. **LUT Quantization**:
   - Apply Look-Up Table (LUT) quantization with architecture-specific settings
   - Use different precision for different model components

7. **Compilation**:
   - Convert to the MLModelC format for efficient on-device execution
   - Optimize for the Apple Neural Engine
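Step 5's merge maps onto Core ML's multifunction models, which store weights shared between functions only once. Here is a minimal sketch using the coremltools multifunction utilities (requires coremltools ≥ 8.0); the chunk file names and function names are illustrative, not the converter's actual output:

```python
# Sketch: merge an FFN chunk and its prefill variant into a single
# multifunction .mlpackage so shared weights are stored once.
# Requires coremltools >= 8.0. File and function names are hypothetical.
from coremltools.utils import MultiFunctionDescriptor, save_multifunction

desc = MultiFunctionDescriptor()
# Take the "main" function from each converted chunk and give it a
# distinct name in the merged model.
desc.add_function(
    "ffn_chunk_01of02.mlpackage",
    src_function_name="main",
    target_function_name="infer",
)
desc.add_function(
    "prefill_chunk_01of02.mlpackage",
    src_function_name="main",
    target_function_name="prefill",
)
desc.default_function_name = "infer"

# Identical weight blobs are deduplicated across functions, which is
# where the roughly 50% size reduction comes from.
save_multifunction(desc, "combined_chunk_01of02.mlpackage")
```

A runtime can then call the `prefill` function to process the prompt in batches and the `infer` function for token-by-token generation.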
The macOS app includes a built-in model conversion interface:

1. Launch the app:
   ```bash
   swift run ANEChat
   ```
2. Click on the "Convert Model" tab
3. Fill in the conversion options:
   - Model Path: Select the Hugging Face model directory
   - Output Path: Choose where to save the converted model
   - Architecture: Auto-detect or specify the model type
   - Context Length, Batch Size, Chunks, etc.
4. Click "Convert Model" to start the process with real-time progress tracking
You can also run the conversion from the command line:

```bash
swift run ANEModelConverter convert-hf \
  --model-id meta-llama/Llama-3.2-1B \
  --output-dir ./models \
  --context-length 1024 \
  --num-chunks 2 \
  --lut-bits 6
```

For direct access to the conversion process with detailed progress reporting:
```bash
python scripts/convert_hf_to_coreml.py \
  --model_path meta-llama/Llama-3.2-1B \
  --output_path ./converted_model \
  --max_seq_len 1024 \
  --batch_size 64 \
  --quantize_weights 6 \
  --verbose
```

You'll see detailed progress indicators showing:
- Overall completion percentage
- Current conversion step
- Estimated time remaining
- Architecture-specific optimizations being applied
| Parameter | Description | Recommended Value |
|---|---|---|
| `--context` | Maximum context length | 1024-4096 |
| `--batch-size` | Batch size for prefill mode | 64-128 |
| `--chunks` | Number of model chunks | 1-2 (1B), 4-8 (7B+) |
| `--lut` | LUT quantization bits | 6 (balanced), 4 (speed) |
| `--architecture` | Model architecture | Auto-detected if not specified |
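When `--architecture` is left unset, auto-detection generally works from the checkpoint's Hugging Face `config.json`. The sketch below shows the general idea; the mapping table and function name are illustrative, not the converter's actual internals:

```python
# Sketch: infer the architecture from a Hugging Face checkpoint's
# config.json, as an auto-detect step might. The mapping below is
# illustrative, not the converter's actual table.
import json
from pathlib import Path

KNOWN_ARCHITECTURES = {
    "llama": "llama",
    "qwen2": "qwen",
    "mistral": "mistral",
}

def detect_architecture(model_dir: str) -> str:
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # "model_type" is the standard HF field; "architectures" lists
    # class names like "LlamaForCausalLM" and serves as a fallback.
    model_type = config.get("model_type", "").lower()
    if model_type in KNOWN_ARCHITECTURES:
        return KNOWN_ARCHITECTURES[model_type]
    for cls in config.get("architectures", []):
        for key, arch in KNOWN_ARCHITECTURES.items():
            if key in cls.lower():
                return arch
    raise ValueError(f"Unsupported architecture: {model_type or config}")

print(detect_architecture("./Llama-3.2-1B"))  # -> "llama"
```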
Different model architectures benefit from specific optimizations:

**LLaMA-style models**:
- Optimized attention mechanism for Llama-style multi-head attention (MHA)
- Split points after attention blocks and FFN projections
- Recommended quantization: 6-bit LUT

**Qwen**:
- Specialized handling for Qwen's grouped-query attention
- Split points after attention outputs
- Optimized embedding handling with shared weights
- Recommended quantization: 6-bit LUT

**Pre-quantized models**:
- Custom quantization handling for already-quantized models
- Specialized KV cache for optimized memory usage
- Recommended setting: preserve existing quantization

**Mistral**:
- Optimized for Mistral's sliding window attention
- Special handling for Mistral's KV cache pattern
- Recommended chunks: 4 for 7B models
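The 6-bit LUT recommendation corresponds to Core ML weight palettization. A minimal sketch of palettizing one converted chunk with coremltools (the file name is illustrative; per-component precision, as in the pipeline's LUT step, would use per-op configs instead of a single global one):

```python
# Sketch: apply 6-bit look-up-table (palettization) quantization to a
# converted chunk with coremltools' built-in optimizer.
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("ffn_chunk_01of02.mlpackage")

# k-means clusters each weight tensor into 2**nbits centroids; the
# weights are then stored as 6-bit indices into that look-up table.
op_config = cto.OpPalettizerConfig(mode="kmeans", nbits=6)
config = cto.OptimizationConfig(global_config=op_config)

compressed = cto.palettize_weights(mlmodel, config)
compressed.save("ffn_chunk_01of02_lut6.mlpackage")
```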
Choose chunk counts based on model size:
| Model Size | Recommended Chunks | iOS | macOS |
|---|---|---|---|
| 1B | 1-2 | ✓ | ✓ |
| 3B | 2-4 | ✓ | ✓ |
| 7B | 4-8 | ✓ | ✓ |
| 13B | 8-16 | ✓ | ✓ |
| 70B+ | 32+ | ❌ | |
Note: iOS has a ~1GB file size limit per model chunk, while macOS can handle ~2GB; choose enough chunks to keep each file under the limit.
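A quick way to sanity-check a chunk count is to estimate per-chunk file size from the parameter count and LUT bits. A back-of-the-envelope sketch (the limits mirror the note above; the formula ignores per-file overhead and the embedding/LM-head components, so treat the result as a starting point):

```python
# Sketch: rough chunk-count estimate from parameter count and LUT bits.
# Ignores per-chunk overhead, so treat the output as a starting point.
import math

FILE_LIMIT_GB = {"ios": 1.0, "macos": 2.0}

def recommend_chunks(params_billions: float, lut_bits: int, platform: str) -> int:
    # 1e9 params * (bits / 8) bytes each ~= params_billions * bits / 8 GB
    total_gb = params_billions * lut_bits / 8
    return max(1, math.ceil(total_gb / FILE_LIMIT_GB[platform]))

# A 7B model at 6-bit LUT is ~5.25 GB of weights:
print(recommend_chunks(7, 6, "ios"))    # -> 6, within the 4-8 band above
print(recommend_chunks(7, 6, "macos"))  # -> 3
```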
After conversion, verify your model:
```bash
# Test in chat interface
swift run ANEChat

# Simple CLI test
swift run ANEToolCLI --model-path ./converted_model --prompt "Hello, world!"
```

With the enhanced progress tracking, you can now monitor each step of the conversion process:
1. **Configuration Analysis**:
   - Displays model details (parameters, architecture)
   - Shows the recommended chunk count based on architecture analysis

2. **Weight Loading**:
   - Progress indicators for loading large model files
   - Memory usage optimization notifications

3. **Optimization Phase**:
   - Shows which architecture-specific optimizations are being applied
   - Details on any compatibility adjustments being made

4. **Conversion and Quantization**:
   - Progress metrics for CoreML conversion
   - Notifications for each major conversion step
   - ETA based on processing speed
If conversion fails:
- Check the detailed error messages for specific step failures
- Verify input model format and completeness
- Try increasing chunk count for large models
- Use `--skip-check` if dependency checks are failing incorrectly
- Use `--verbose` to get more detailed progress and debugging information
- Check available memory (conversion requires significant RAM)
- For large models, ensure your system meets the minimum requirements
The conversion approach is based on techniques from:
- ANEMLL's architecture-aware model splitting
- Apple's CoreML quantization tools
- LitGPT optimization methods