# v0.5.0
The 0.5 release of ExecuTorch accompanies the release of PyTorch 2.6 and includes various updates and improvements to ExecuTorch's backend delegates, as well as slight improvements to the Python and C++ APIs. Most notably, dim order is now enabled by default in ExecuTorch export. For more details, please see this post.
On the Llama model support front, an eager runner has been added to the Llama example to allow running inference in eager mode; additionally, support for AttentionSink has been added for eager-mode execution.
## API Changes
- Introduced a C++ `TensorAccessor` class for ExecuTorch tensors, based on PyTorch's `TensorAccessor` class
- Introduced a Python `save(path: str)` method on `ExecutorchProgramManager` to reduce the boilerplate required to serialize a program to a `.pte` file (see the sketch after this list)
- Introduced the C++ `PlatformMemoryAllocator` class to allow kernel authors to provide their own memory allocation implementation
- Introduced the `num_instructions()` function on the C++ `Method` class
- Enabled direct serialization of `uint16` types in ExecuTorch programs
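
For illustration, here is a minimal sketch of the new `save()` method; the `AddOne` module and the output file name are placeholders made up for this example:

```python
import torch

from executorch.exir import to_edge


class AddOne(torch.nn.Module):  # placeholder module for the example
    def forward(self, x):
        return x + 1.0


# Export to an ExecutorchProgramManager, then serialize with the new
# save() method instead of manually writing the program buffer to disk.
program = to_edge(torch.export.export(AddOne(), (torch.randn(4),))).to_executorch()
program.save("add_one.pte")
```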
## Build
- ExecuTorch nightly binaries are now built only for Python `3.10`, `3.11`, and `3.12`
- Introduced nightly builds for Apple platforms, which are listed here
- Added support for NumPy 2
 
## Backends
### Arm
- Added support for the following operators: 1D convolution, Tanh activation, `select`, 2D max pooling, `upsample_nearest2d`, `cat`/`stack`, `rshift`, `concat`, `log_softmax`, `var`, `layer_norm`
- Improved support for reduction operators
- Extended softmax to handle `dim < 0`
- Added support for `keep_dims == True` for the `mean` and `var` operators
- Enabled reporting of Ethos-U PMU hardware counters in the Arm delegate executor
- Added support for multiple TOSA specifications
- Added model evaluation functionality to the AOT compiler
 
### Cadence
- Migrated most of the graph-level compiler from Meta's internal repository to the OSS repository
- The Cadence OSS flow now runs ~50 graph-level optimization passes
- Various improvements to the export workflow for Cadence chips
- Expanded operator support to 33 ATen operators and 11 quantized operators
- Integrated multiple optimized kernels for HiFi and Fusion chips, yielding large performance gains (from double-digit percentages to orders of magnitude)
- Enabled `mobilenet_v2` and `resnet50` as end-to-end tests
### CoreML
- Added the option to specify which CoreML compute unit to use in the Llama model export script
- Fixed a compilation crash on iOS < 16
- Added support for dim order
 
### Qualcomm
- Enabled batch prefill for Llama with the weight sharing feature
- Various improvements to Llama model support for both prefill and decode, including SHA, static_llama (KV cache as I/O), graph break reduction, and more
- Added an example for the `wav2letter` model
- Added support for the `retinanet_fpn` model
- Added support for the SA8295 SoC
- Added support for QAT
- Added support for dim order
- Added the `DrawGraph` utility for graph visualization
### MediaTek
- Integrated the MediaTek backend in the Android Llama application
- Added support for dim order
 
### MPS
- Added support for dim order
 
### Vulkan
- Improved support for Llama model architectures in the Vulkan backend:
  - Added an implementation of a fused SDPA + KV cache update operator
  - Added an implementation of rotary embeddings
- Various improvements to compute shader latency and memory footprint, including:
  - Introduced support for push constants in compute shaders, used to pass in tensor metadata (e.g. sizes)
  - Switched the default texture tiling setting from `VK_IMAGE_TILING_OPTIMAL` to `VK_IMAGE_TILING_LINEAR`, which greatly reduces the memory footprint of image textures used to store tensors
  - Reduced register pressure in compute shaders by using lower-precision integer types to store texture positions and tensor indices
- Added an export pass that automatically inserts transition ops to switch between optimal/required storage types or memory layouts between operators in the export graph
 
### XNNPACK
- Updated the XNNPACK version to commit hash `1ed874e65`, which includes the newest KleidiAI blockwise kernels and gives around a 20% performance improvement on Llama prefill
- Added support for delegating models quantized via torchao's `quantize_` API
- Added a new XNNPACK partitioner with configurable settings that give users greater control over how ops are partitioned
- Added support for `to_edge_transform_and_lower`; using this API with the partitioner provides more stable lowerings (see the sketch after this list)
- Allowed `addmm` and `mm` to call dynamic fp32 kernels
- Fixes to partitioning of unsupported operators
- Updated the `cpuinfo` dependency to resolve intermittent faults on UNISOC-based phones
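
For illustration, a minimal sketch of lowering a model to XNNPACK through `to_edge_transform_and_lower` with the new partitioner; `TinyMLP` and the output file name are placeholders made up for this example, and partitioner defaults may vary across versions:

```python
import torch

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


class TinyMLP(torch.nn.Module):  # placeholder model for the example
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.fc(x))


# Export, then partition and lower supported subgraphs to XNNPACK in one step.
program = to_edge_transform_and_lower(
    torch.export.export(TinyMLP().eval(), (torch.randn(1, 64),)),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
program.save("tiny_mlp_xnnpack.pte")
```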
## Devtools
- Added a public benchmark dashboard, offering insights into ExecuTorch model performance trends, commit-to-commit comparisons, and anomaly detection. Onboarded Llama3.2-1B to track performance with SpinQuant, QLoRA, and CoreML ANE.
- Added support for `uint16` in the devtools inspector
## Llama Model Support
- Swapped TorchTune attention for a custom export-friendly ExecuTorch attention
- Added the `llama3_2_vision` text decoder as a TorchTune exportable model
- Added a React Native LLaMA app for iOS devices
- Added support for the `bfloat16` dtype in the LLM runner binary and the `export_llama` script
- Added support for AttentionSink in the Llama example
- Added TorchAO MPS low-bit operators to the Llama runner
- Added support for KV cache quantization; currently only 8-bit per-token quantization is supported, with fp32 as the dequantized dtype. This can be enabled in the `export_llama` script using the `--quantize_kv_cache` option.
- Added support for quantized versions of Llama 3.2 1B/3B
 
## Kernel Libraries
- Implemented several portable operators: `pixel_unshuffle`, `gather`, `topk`, `convolution_backward`, `narrow_copy`, `masked_select`, `max.unary_out`, `min.unary_out`, `scatter.src_out`, `scatter.value_out`, `repeat_interleave.Tensor_out`
- Implemented the `tile_crop` custom operator
- Implemented the scalar `trunc` primitive operator
- Implemented BFloat16 support, focusing on LLM operator coverage (`op_to_copy`, `op_mul`, `op_mm`, `op_copy`, `op_slice_scatter`, `op_scalar_tensor`, `op_where`, `op_add`, CPUBLAS `gemm`)
- Fixed handling of rank-0 tensors in optimized `add`/`sub`/`div`/`mul`
- Fixed `_native_batch_norm_legit_no_stats_out`
## First Time Contributors
Thanks to the following contributors for making their first commit for this release!
@navsud, @meyering, @tugsbayasgalan, @Abhishek8394, @RahulK4102, @RdoubleA, @varunchariArm, @laithsakka, @limintang, @veselinp, @maggiemoss, @azad-meta, @anyj0527, @jainapurva, @suchir1, @ru-m8, @wdvr, @anijain2305, @tianxf99, @sxu, @f-meloni, @Vysarat, @georgehong, @lg-zhang, @h-friederich, @AIWintermuteAI, @itisgrisha, @ykhrustalev, @hietalajulius, @Nick-Wei, @Abhi-hpp, @KapJI, @YIWENX14, @clee2000, @Michiel-Olieslagers, @karthik-manju, @jakmro, @Aleksei-grovety
Full Changelog: v0.4.0...v0.5.0