# v0.5.0
The 0.5 release of ExecuTorch accompanies the release of PyTorch 2.6 and includes various updates and improvements to ExecuTorch's backend delegates, as well as slight improvements to the Python and C++ APIs. Most notably, dim order is now enabled by default in ExecuTorch export. For more details, please see this post.
On the Llama model support front, an eager runner has been added to the Llama example to allow running inference in eager mode; additionally, support for AttentionSink has been added for eager-mode execution.
## API Changes
- Introduced a C++ `TensorAccessor` class for ExecuTorch tensors, based on PyTorch's `TensorAccessor` class
- Introduced a Python `save(path: str)` method on `ExecutorchProgramManager` to reduce the boilerplate required to serialize a program to a `.pte` file (see the sketch after this list)
- Introduced the C++ `PlatformMemoryAllocator` class to allow kernel authors to provide their own memory allocation implementation
- Introduced the `num_instructions()` function on the C++ `Method` class
- Enabled direct serialization of `uint16` types in ExecuTorch programs
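
For illustration, here is a minimal sketch of the new `save()` method; the `AddOne` module and the output file name are placeholders made up for this example:

```python
import torch

from executorch.exir import to_edge


class AddOne(torch.nn.Module):  # placeholder module for the example
    def forward(self, x):
        return x + 1.0


# Export to an ExecutorchProgramManager, then serialize with the new
# save() method instead of manually writing the program buffer to disk.
program = to_edge(torch.export.export(AddOne(), (torch.randn(4),))).to_executorch()
program.save("add_one.pte")
```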
## Build
- ExecuTorch nightly binaries are now built only for Python `3.10`, `3.11`, and `3.12`
- Introduced nightly builds for Apple platforms, which are listed here
- Added support for NumPy 2
 
## Backends
### Arm
- Added support for the following operators: 1D convolution, Tanh activation, `select`, 2D max pooling, `upsample_nearest2d`, `cat`/`stack`, `rshift`, `concat`, `log_softmax`, `var`, `layer_norm`
- Improved support for reduction operators
- Extended softmax to handle `dim < 0`
- Added support for `keep_dims == True` for the `mean` and `var` operators
- Enabled reporting of Ethos-U PMU hardware counters in the Arm delegate executor
- Added support for multiple TOSA specifications
- Added model evaluation functionality to the AOT compiler
 
### Cadence
- Migrated most of the graph-level compiler from Meta's internal repository to the OSS repository
- The Cadence OSS flow now runs ~50 graph-level optimization passes
- Various improvements to the export workflow for Cadence chips
- Expanded operator support to 33 ATen operators and 11 quantized operators
- Integrated multiple optimized kernels for HiFi and Fusion chips, yielding large performance gains (from double-digit percentages to orders of magnitude)
- Enabled `mobilenet_v2` and `resnet50` as end-to-end tests
### CoreML
- Added the option to specify which CoreML compute unit to use in the Llama model export script
- Fixed a compilation crash on iOS < 16
- Added support for dim order
 
### Qualcomm
- Enabled batch prefill for Llama with the weight sharing feature
- Various improvements to Llama model support for both prefill and decode, including SHA, static_llama (KV cache as I/O), graph break reduction, and more
- Added an example for the `wav2letter` model
- Added support for the `retinanet_fpn` model
- Added support for the SA8295 SoC
- Added support for QAT
- Added support for dim order
- Added the `DrawGraph` utility for graph visualization
### MediaTek
- Integrated the MediaTek backend in the Android Llama application
- Added support for dim order
 
### MPS
- Added support for dim order
 
### Vulkan
- Improved support for Llama model architectures in the Vulkan backend:
  - Added an implementation of a fused SDPA + KV cache update operator
  - Added an implementation of rotary embeddings
- Various improvements to compute shader latency and memory footprint, including:
  - Introduced support for push constants in compute shaders, used to pass in tensor metadata (e.g. sizes)
  - Switched the default texture tiling setting from `VK_IMAGE_TILING_OPTIMAL` to `VK_IMAGE_TILING_LINEAR`, which greatly reduces the memory footprint of image textures used to store tensors
  - Reduced register pressure in compute shaders by using lower-precision integer types to store texture positions and tensor indices
- Added an export pass that automatically inserts transition ops to switch between optimal/required storage types or memory layouts between operators in the export graph
 
### XNNPACK
- Updated the XNNPACK version to commit hash `1ed874e65`, which includes the newest KleidiAI blockwise kernels and gives around a 20% performance improvement on Llama prefill
- Added support for delegating models quantized via torchao's `quantize_` API
- Added a new XNNPACK partitioner with configurable settings that give users greater control over how ops are partitioned
- Added support for `to_edge_transform_and_lower`; using this API with the partitioner provides more stable lowerings (see the sketch after this list)
- Allowed `addmm` and `mm` to call dynamic fp32 kernels
- Fixes to partitioning of unsupported operators
- Updated the `cpuinfo` dependency to resolve intermittent faults on UNISOC-based phones
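
For illustration, a minimal sketch of lowering a model to XNNPACK through `to_edge_transform_and_lower` with the new partitioner; `TinyMLP` and the output file name are placeholders made up for this example, and partitioner defaults may vary across versions:

```python
import torch

from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


class TinyMLP(torch.nn.Module):  # placeholder model for the example
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.fc(x))


# Export, then partition and lower supported subgraphs to XNNPACK in one step.
program = to_edge_transform_and_lower(
    torch.export.export(TinyMLP().eval(), (torch.randn(1, 64),)),
    partitioner=[XnnpackPartitioner()],
).to_executorch()
program.save("tiny_mlp_xnnpack.pte")
```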
## Devtools
- Added a public benchmark dashboard, offering insights into ExecuTorch model performance trends, commit-to-commit comparisons, and anomaly detection. Onboarded Llama3.2-1B to track performance with SpinQuant, QLoRA, and CoreML ANE.
- Added support for `uint16` in the devtools inspector
## Llama Model Support
- Swapped TorchTune attention for a custom export-friendly ExecuTorch attention
- Added the `llama3_2_vision` text decoder as a TorchTune exportable model
- Added a React Native LLaMA app for iOS devices
- Added support for the `bfloat16` dtype in the LLM runner binary and the `export_llama` script
- Added support for AttentionSink in the Llama example
- Added TorchAO MPS low-bit operators to the Llama runner
- Added support for KV cache quantization; currently only 8-bit per-token quantization is supported, with fp32 as the dequantized dtype. This can be enabled in the `export_llama` script using the `--quantize_kv_cache` option.
- Added support for quantized versions of Llama 3.2 1B/3B
 
## Kernel Libraries
- Implemented several portable operators: `pixel_unshuffle`, `gather`, `topk`, `convolution_backward`, `narrow_copy`, `masked_select`, `max.unary_out`, `min.unary_out`, `scatter.src_out`, `scatter.value_out`, `repeat_interleave.Tensor_out`
- Implemented the `tile_crop` custom operator
- Implemented the scalar `trunc` primitive operator
- Implemented BFloat16 support, focusing on LLM operator coverage (`op_to_copy`, `op_mul`, `op_mm`, `op_copy`, `op_slice_scatter`, `op_scalar_tensor`, `op_where`, `op_add`, CPUBLAS `gemm`)
- Fixed handling of rank-0 tensors in optimized `add`/`sub`/`div`/`mul`
- Fixed `_native_batch_norm_legit_no_stats_out`
## First Time Contributors
Thanks to the following contributors for making their first commit for this release!
@navsud, @meyering, @tugsbayasgalan, @Abhishek8394, @RahulK4102, @RdoubleA, @varunchariArm, @laithsakka, @limintang, @veselinp, @maggiemoss, @azad-meta, @anyj0527, @jainapurva, @suchir1, @ru-m8, @wdvr, @anijain2305, @tianxf99, @sxu, @f-meloni, @Vysarat, @georgehong, @lg-zhang, @h-friederich, @AIWintermuteAI, @itisgrisha, @ykhrustalev, @hietalajulius, @Nick-Wei, @Abhi-hpp, @KapJI, @YIWENX14, @clee2000, @Michiel-Olieslagers, @karthik-manju, @jakmro, @Aleksei-grovety
Full Changelog: v0.4.0...v0.5.0