
Orion: Characterizing and Programming Apple's Neural Engine for LLM Training and Inference


Ramchand Kumaresan

arXiv:2603.06728v1

Abstract: Over two billion Apple devices ship with a Neural Processing Unit (NPU) - the Apple Neural Engine (ANE) - yet this accelerator remains largely unused for large language model workloads. CoreML, Apple's public ML framework, imposes opaque abstractions that prevent direct ANE programming and do not support on-device training. We present Orion, to our knowledge the first open end-to-end system that combines direct ANE execution, a compiler pipeline, and stable multi-step training with checkpoint resume in a single native runtime, bypassing CoreML entirely via Apple's private _ANEClient and _ANECompiler APIs. Building on prior characterization work by maderix, we extend public knowledge of ANE constraints to a catalog of 20 restrictions on MIL IR programs, memory layout, compilation limits, and numerical behavior, including 14 previously undocumented constraints discovered during Orion development. Orion includes a compiler that lowers a graph IR through five optimization passes to ANE-native MIL and a runtime that manages IOSurface-backed zero-copy tensor I/O, program caching, and delta compilation for weight updates. Because the ANE bakes weights at compile time, naive training normally requires full recompilation per step (~4.2 s). We show that compiled programs can instead be updated by unloading, patching weight files, and reloading, bypassing ANECCompile() and reducing recompilation from 4,200 ms to 494 ms per step (8.5x), yielding a 3.8x training speedup. On an M4 Max, Orion achieves 170+ tokens/s for GPT-2 124M inference and demonstrates stable training of a 110M-parameter transformer on TinyStories for 1,000 steps in 22 minutes with zero NaN occurrences. We also present LoRA adapter-as-input, enabling hot-swap of adapters via IOSurface inputs without recompilation.
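The delta-compilation idea from the abstract can be pictured with a small sketch. This is an illustrative Python model of the per-step decision (the class and function names are hypothetical, not Orion's actual API); the millisecond costs are the figures reported in the paper:

```python
import hashlib

# Illustrative sketch of delta compilation (names are hypothetical,
# not Orion's API). The ANE bakes weights into the compiled program,
# so a naive trainer recompiles every step; Orion instead unloads the
# program, patches the weight file, and reloads, skipping the
# compiler when only weight values (not graph topology) changed.

FULL_COMPILE_MS = 4200   # per-step full recompilation cost (from the paper)
PATCH_RELOAD_MS = 494    # unload + patch + reload cost (from the paper)

def structure_hash(graph):
    """Hash the graph topology only, ignoring weight values."""
    topo = "|".join(f"{op['name']}:{op['type']}" for op in graph["ops"])
    return hashlib.sha256(topo.encode()).hexdigest()

def step_update_cost(prev_hash, graph):
    """Return (new_hash, cost_ms): full compile if the structure
    changed, cheap weight patch + reload otherwise."""
    h = structure_hash(graph)
    if h != prev_hash:
        return h, FULL_COMPILE_MS   # topology changed: must recompile
    return h, PATCH_RELOAD_MS       # weights-only update: patch and reload

graph = {"ops": [{"name": "fc1", "type": "matmul"},
                 {"name": "act", "type": "gelu"}]}
h, cost = step_update_cost(None, graph)      # first step: full compile
h, cost2 = step_update_cost(h, graph)        # later steps: patch only
print(cost, cost2, round(cost / cost2, 1))   # 4200 494 8.5
```

The 8.5x per-step reduction shown here is what the paper reports translating into an overall 3.8x training speedup, since each step also pays fixed execution and data-movement costs.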

Executive Summary

This article presents Orion, an open-source end-to-end system that unlocks the Apple Neural Engine (ANE) for large language model (LLM) training and inference. Orion bypasses Apple's CoreML framework via private APIs, enabling direct ANE programming and on-device training. Its compiler and runtime manage ANE execution, zero-copy tensor I/O, and program caching, and a delta-compilation technique cuts per-step recompilation from 4,200 ms to 494 ms, yielding a 3.8x training speedup. The authors demonstrate stable training of a 110M-parameter transformer and a LoRA adapter-as-input mechanism that hot-swaps adapters without recompilation, making Orion a valuable tool for researchers and practitioners working on on-device LLMs.
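The LoRA adapter-as-input mechanism can be illustrated with a tiny numerical sketch. Treating the low-rank factors A and B as ordinary runtime inputs, rather than weights baked into the compiled program, means a new adapter is just new input data. The sketch below uses plain Python lists as a hypothetical stand-in for Orion's IOSurface-backed tensors; none of these names are Orion's API:

```python
# Sketch of LoRA adapter-as-input (illustrative, not Orion's API).
# The frozen weight W is baked into the compiled program; the
# low-rank factors A (r x d_in) and B (d_out x r) arrive as runtime
# inputs, so swapping adapters needs no recompilation.

def linear(x, W):
    """x: n x d_in, W: d_out x d_in -> n x d_out (y = x @ W^T)."""
    return [[sum(xi * wi for xi, wi in zip(row, wrow)) for wrow in W]
            for row in x]

def lora_forward(x, W, A, B, scale=1.0):
    """y = x @ W^T + scale * (x @ A^T) @ B^T."""
    base = linear(x, W)              # frozen, compiled-in path
    low = linear(linear(x, A), B)    # adapter path, inputs only
    return [[b + scale * l for b, l in zip(br, lr)]
            for br, lr in zip(base, low)]

x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]         # frozen weight (identity here)
A = [[1.0, 1.0]]                     # rank-1 adapter, fed as input
B = [[0.5], [0.0]]
print(lora_forward(x, W, A, B))      # [[2.5, 2.0]]

B2 = [[0.0], [1.0]]                  # hot-swap: new B, same program
print(lora_forward(x, W, A, B2))     # [[1.0, 5.0]]
```

The second call reuses the same "compiled" function with different adapter data, which is the essence of hot-swap without recompilation.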

Key Points

  • Orion is an open-source, end-to-end system that unlocks the Apple Neural Engine (ANE) for LLM training and inference workloads.
  • Orion bypasses Apple's CoreML framework via private APIs (_ANEClient, _ANECompiler), enabling direct ANE programming and on-device training.
  • The system includes a compiler and runtime that manage ANE execution, zero-copy tensor I/O, memory, and program caching.
  • Delta compilation cuts per-step weight-update recompilation from 4,200 ms to 494 ms (8.5x), and LoRA adapter-as-input hot-swaps adapters without recompilation.
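The program-caching behavior mentioned above can be sketched as a memoization layer keyed on the lowered program text, so an identical graph never pays compilation twice. This is a hypothetical illustration, not Orion's actual implementation; `_compile` stands in for the real call into ANECCompile():

```python
import hashlib

# Hypothetical sketch of a compiled-program cache (not Orion's API).
# Keyed by a digest of the lowered MIL text: requesting an identical
# program a second time becomes a dictionary lookup, not a compile.

class ProgramCache:
    def __init__(self):
        self._cache = {}
        self.compile_calls = 0

    def _compile(self, mil_text):
        self.compile_calls += 1
        return f"compiled<{len(mil_text)} bytes>"  # stand-in for ANECCompile()

    def get(self, mil_text):
        key = hashlib.sha256(mil_text.encode()).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._compile(mil_text)
        return self._cache[key]

cache = ProgramCache()
p1 = cache.get("main(x): y = matmul(x, W); return y")
p2 = cache.get("main(x): y = matmul(x, W); return y")  # cache hit
print(cache.compile_calls, p1 is p2)  # 1 True
```

With full compilation measured in seconds on the ANE, avoiding even one redundant compile is a meaningful win for iterative workloads.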

Merits

Strength

Orion's ability to bypass CoreML and enable direct ANE programming and on-device training significantly expands the capabilities of the ANE, making it a valuable tool for LLM researchers and practitioners.

Strength

The system's compiler and runtime optimizations enable efficient ANE execution, memory management, and program caching, leading to improved training speed and stability.

Demerits

Limitation

Orion's reliance on Apple's private, undocumented APIs means it may break with OS updates and may not work uniformly across Apple devices and software versions, potentially restricting widespread adoption.

Limitation

The system's performance and stability may vary with specific hardware and software configurations, adding complexity and variability for users.

Expert Commentary

The article makes a significant contribution to on-device LLM training and deployment. Orion's measured results demonstrate that efficient LLM training and inference are feasible on Apple's Neural Engine. However, its reliance on private APIs and its sensitivity to hardware and software configurations introduce fragility that official support would not. Further research is needed to refine Orion's capabilities and explore its applications.

Recommendations

  • Researchers and practitioners should extend Orion's capabilities and optimize the system for wider adoption and use.
  • Apple and industry stakeholders should consider incorporating Orion's techniques and insights into future Neural Engine tooling, potentially informing official support for direct NPU programming and on-device training.
