With the evolution of artificial intelligence (AI), using small language models (SLMs) to run AI workloads on embedded devices has become a focus of industry attention. Small language models such as Llama, Gemma, and Phi-3 have gained wide recognition for their cost-effectiveness, high efficiency, and ease of deployment on devices with limited compute. Arm expects the number of such models to continue to grow in 2025.
Arm technology, with its advantages of high performance and low power consumption, provides an ideal operating environment for small language models, improving operational efficiency and further optimizing the user experience. To demonstrate the potential of endpoint AI in the Internet of Things and edge computing, the Arm technical team recently built a technical demonstration: when the user inputs a sentence, the system extends it into a children's story. The demonstration was inspired by Microsoft's "TinyStories" paper and Andrej Karpathy's TinyLlama2 project, which used 21 million stories to train a small language model to generate text.
The demonstration runs a small language model on embedded hardware equipped with the Arm Ethos-U85 NPU. Although large language models (LLMs) are more widely known, small language models are receiving increasing attention because they deliver strong performance with fewer resources and at lower cost, and are easier and cheaper to train.
Implementing a Transformer-Based Small Language Model on Embedded Hardware
Arm's demonstration showcases Ethos-U85 as a small, low-power platform capable of running generative AI, and highlights how well small language models can perform in specific domains. The TinyLlama2 model is far simpler than larger models from companies such as Meta, which makes it well suited for demonstrating the AI performance of Ethos-U85 and an ideal choice for endpoint AI workloads.
To develop the demonstration, Arm carried out extensive modeling work, including creating an all-integer INT8 (and INT8x16) TinyLlama2 model and converting it to a fixed-shape TensorFlow Lite format that fits the constraints of Ethos-U85.
Arm's quantization approach shows that an all-integer language model can strike a good balance between accuracy and output quality. By quantizing the activations, normalization functions, and matrix multiplications, Arm avoids floating-point operations entirely. Because floating-point hardware is costly in terms of chip area and energy consumption, this is a key consideration for resource-constrained embedded devices.
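To illustrate the idea, the following is a minimal sketch of the kind of affine (asymmetric) INT8 quantization that full-integer TensorFlow Lite models rely on. The helper names and the tensor are illustrative only, not Arm's actual tooling; the point is that every value is stored as an 8-bit integer plus a shared scale and zero-point, so the hardware never needs a floating-point unit.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine per-tensor INT8 quantization (illustrative sketch).

    Maps float values onto [-128, 127] via a scale and zero-point,
    in the style of TensorFlow Lite full-integer quantization.
    """
    lo, hi = float(x.min()), float(x.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)       # range must include zero
    scale = (hi - lo) / 255.0 or 1.0          # guard against a constant tensor
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return (q.astype(np.float32) - zero_point) * scale

# Round-trip a small tensor: the reconstruction error stays within
# half a quantization step per element, which is what makes all-integer
# inference accurate enough in practice.
x = np.array([-1.0, -0.5, 0.0, 0.25, 1.5], dtype=np.float32)
q, s, z = quantize_int8(x)
err = np.abs(dequantize(q, s, z) - x).max()
print(err <= s / 2 + 1e-6)  # True
```

The same scheme extends to matrix multiplications: with integer inputs and integer zero-points, the accumulation itself can be done entirely in integer arithmetic.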
Running the language model on an FPGA platform at 32 MHz, Ethos-U85 generates text at 7.5 to 8 tokens per second, comparable to human reading speed, while using only a quarter of its compute resources. In a real system-on-chip (SoC), this performance can improve by up to ten times, significantly enhancing the processing speed and energy efficiency of edge-side AI.
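The ten-times figure is consistent with simple clock scaling, under the assumption that throughput grows roughly linearly with frequency (memory behavior also matters, so this is a back-of-the-envelope estimate, not a guarantee). An SoC clocked at, say, 320 MHz rather than the FPGA's 32 MHz would then give:

```latex
\text{tokens/s}_{\text{SoC}} \approx \text{tokens/s}_{\text{FPGA}} \times \frac{f_{\text{SoC}}}{f_{\text{FPGA}}}
  = 7.5 \times \frac{320\,\text{MHz}}{32\,\text{MHz}} = 75 \;\text{tokens/s}
```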
The children's story generation feature uses the open-source version of Llama2 with the Ethos NPU backend, running the demonstration on TFLite Micro. Most of the inference logic is written in C++ at the application layer, and the coherence of the story is improved by optimizing the contents of the context window, ensuring the AI can tell the story smoothly.
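One common way to "optimize the context window contents" on a fixed-shape model is a sliding window that pins the user's prompt and keeps only the most recent story tokens. The sketch below shows the idea in Python; the function name, window sizes, and token values are all illustrative assumptions, not Arm's actual C++ implementation.

```python
def build_context(prompt_ids, story_ids, max_len=256, keep_prompt=48):
    """Sliding-window context management sketch for story generation.

    Keeps the user's prompt (up to `keep_prompt` tokens) pinned at the
    front so the story stays on topic, then fills the rest of the fixed
    context window with the most recent story tokens.
    """
    head = prompt_ids[:keep_prompt]
    budget = max_len - len(head)
    tail = story_ids[-budget:] if budget > 0 else []
    return head + tail

# A fixed-shape model (as Ethos-U85 requires) always sees at most
# `max_len` tokens, so generation can continue well past the window size.
prompt = list(range(10))          # 10 prompt tokens
story = list(range(100, 400))     # 300 generated tokens so far
ctx = build_context(prompt, story, max_len=64, keep_prompt=16)
print(len(ctx))  # 64
```

Pinning the prompt is what keeps a long story anchored to the user's original sentence even after the earliest generated tokens have been evicted from the window.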
Because of hardware limitations, the team needed to adapt the Llama2 model to run efficiently on the Ethos-U85 NPU, which required carefully balancing performance and accuracy. The mixed INT8/INT16 quantization techniques demonstrate the potential of all-integer models, and should encourage the AI community to optimize generative models for edge devices and bring neural networks to energy-efficient platforms such as Ethos-U85.
Arm Ethos-U85 Showcases Outstanding Performance
Ethos-U85 scales from 128 to 2048 multiply-accumulate (MAC) units and delivers a 20% improvement in energy efficiency over its predecessor, Ethos-U65. Another significant feature of Ethos-U85 compared to the previous generation is its native support for Transformer networks.
Ethos-U85 offers seamless migration for partners using previous-generation Ethos-U NPUs, letting them fully leverage their existing investment in Arm architecture-based machine learning (ML) tools. With its excellent energy efficiency and outstanding performance, Ethos-U85 is increasingly favored by developers.
In its 2048-MAC configuration on silicon, Ethos-U85 can achieve 4 TOPS of performance. For the demonstration, Arm used a smaller 512-MAC configuration on the FPGA platform and ran the 15-million-parameter TinyLlama2 small language model at 32 MHz.
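The 4 TOPS figure follows from the MAC count if one counts each multiply-accumulate as two operations and assumes a nominal 1 GHz clock (the clock frequency is an assumption here, not stated in this article):

```latex
2048\,\text{MACs} \times 2\,\frac{\text{ops}}{\text{MAC}} \times 10^{9}\,\frac{\text{cycles}}{\text{s}}
  = 4.096 \times 10^{12}\ \text{ops/s} \approx 4\,\text{TOPS}
```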
This capability highlights the possibility of embedding AI directly into devices. Despite limited memory (320 KB of SRAM for caching and 32 MB for storage), Ethos-U85 handles such workloads efficiently, laying the foundation for the widespread use of small language models and other AI applications in deeply embedded systems.
Introducing Generative AI into Embedded Devices
Developers need more advanced tools to cope with the complexity of edge-side AI, and Arm is committed to meeting this need with the launch of Ethos-U85 and its support for Transformer-based models. As edge-side AI grows in importance for embedded applications, Ethos-U85 is enabling a range of new use cases, from language models to advanced vision tasks.
The Ethos-U85 NPU provides the excellent performance and outstanding energy efficiency that innovative edge solutions require. Arm's demonstration marks significant progress in bringing generative AI to embedded devices and highlights how convenient and feasible it is to deploy small language models on the Arm platform.
Arm is opening up new opportunities for edge AI across a wide range of application fields, and Ethos-U85 has become a key driving force behind a new generation of intelligent, low-power devices.