In this interview, Michaël Uyttersprot, Market Segment Manager for Artificial Intelligence, Machine Learning, and Vision at Avnet Silica, talks about whether it makes sense to implement generative AI on embedded devices and, if so, what the potential benefits might be.
With 20 years of experience in the industry, Michaël currently focuses on supporting the development and promotion of embedded vision and deep learning solutions for engineers working on AI and machine learning projects. Having started his career as a robotics engineer, he has become a recognised voice in the field, presenting his research and ideas to large audiences at industry events such as Hardware Pioneers Max.
What is Generative AI, and how does it differ from traditional AI?
Generative AI involves generating something new, something that did not exist before. While most people are familiar with it through chatbots, which primarily focus on generating content, specifically text, it can also transform ideas into stunning artwork, create original music and write code. This differs from standard machine learning or deep learning, which typically focuses on pattern recognition or identifying objects in an image or video.
The next step beyond generative AI is agentic AI: AI systems composed of autonomous agents that can make decisions and act independently to achieve specific goals. These systems can interact with the real world and learn from those interactions to improve their performance.
What are Large Language Models (LLMs) in this context?
Large Language Models (LLMs) are a specific type of generative AI used for generating text in chatbot-type applications. The size of the language model, measured in parameters, is important. LLMs deployed in the Cloud typically have several hundred billion parameters.
A key idea is having the LLM stored in a single file on the embedded device, much like a zip file, that you can interact with locally. Cloud-scale models are far too large for that, so smaller LLMs, sometimes called Small Language Models (SLMs), are needed. These models range between one billion and 10 billion parameters. In addition to text-based LLMs, there are also multimodal LLMs that can process and understand information from other data types, or ‘modalities’, such as images, audio, and video. These models generate outputs that enable deeper, more nuanced understanding and responses.
What are the benefits of running LLMs locally on embedded devices?
Running generative AI locally on embedded devices offers several advantages over using cloud services like ChatGPT. Firstly, you do not need an Internet connection, or you can choose not to use one, and keeping inference on the device reduces latency compared to running it in the Cloud. Data privacy is also a key point: investigations have found that a significant percentage of people ask confidential questions of chatbots, and running locally keeps that data on the device. Additionally, if you run it locally with open-source, free LLMs, you incur no usage cost.
What are the challenges or constraints when deploying LLMs on embedded devices?
Running LLMs on embedded devices is quite complex. As mentioned before, the larger LLMs that run in the Cloud are far too big, so smaller models are needed. Even so, packing a capable model into a small file that an embedded device can run was not possible a few years ago. Embedded devices have limited RAM (typically 4 or 8 GB) and restricted processing power, and may not have a GPU, which is a problem for real-time applications: the user does not want to wait a minute for a response after typing a prompt. Some embedded systems do not run Linux or Windows, which presents integration challenges, and drivers can also be an issue. Finally, power consumption can be a restriction, especially for battery-powered devices.
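To put the RAM constraint in perspective, here is a rough, back-of-the-envelope sketch (illustrative numbers, not figures from the interview) of how a model's parameter count and weight precision translate into a memory footprint:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores activations, KV cache, and runtime overhead, so real usage is higher.

def weights_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for model weights, in gigabytes."""
    return num_params * (bits_per_param / 8) / 1e9

# Illustrative model sizes: a 2B and a 9B SLM versus a cloud-scale LLM.
for name, params in [("2B SLM", 2e9), ("9B SLM", 9e9), ("175B cloud LLM", 175e9)]:
    for bits in (16, 4):  # fp16 weights vs. 4-bit quantised weights
        print(f"{name:>15} at {bits:>2}-bit: ~{weights_gb(params, bits):5.1f} GB")
```

By this estimate a quantised 2-billion-parameter model needs on the order of 1–2 GB for its weights, which is why it can fit on a device with 4 or 8 GB of RAM, while cloud-scale models cannot fit on typical embedded hardware at all.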
What are some examples of use cases or applications for local Generative AI on embedded devices?
We demonstrated generative AI running on an embedded device with an implementation inspired by Google’s NotebookLM podcast feature, in which two avatars (a host and an expert) hold a conversation on a chosen topic, such as generative AI. A human user can also interrupt and ask questions. The content and the specific questions generated are always different because of the generative nature of the AI.
Beyond this demonstration, numerous other use cases are possible, spanning from home automation to retail. One good example that comes to mind is a conferencing system that can capture speech, guide meetings, take notes, and even translate between languages in real time. In an industrial environment, issuing voice commands to machines or cobots, or asking them questions, can be an effective approach.
What kind of hardware is used for these implementations?
In our investigation of running generative AI models on embedded devices, we used a box product from ASRock. This box contains an AMD Ryzen 8000 series processor with an integrated GPU and AI engines (although the AI engines were not used in this case), along with RAM and an SSD, and it runs Windows IoT.
Other hardware tested, or being tested, includes the NXP i.MX 95, the Renesas RZ/V2H, and Qualcomm devices. Whichever hardware is chosen, there is a trade-off between its capability, in terms of memory and processing power, and the size and quality of the SLM that can be run.
What core tools are needed to build such applications on embedded devices?
The software architecture involves a front-end (a web interface) and a back-end server written in Python, which comprises the three components below (a short code sketch combining them follows the list):
Speech-to-text engine: to understand what a human says and convert it to text. The Whisper model from OpenAI is a popular local model available in various sizes (tiny, base, small, medium), each with distinct trade-offs in terms of resources and speed. It includes features like noise reduction. A Cloud option from Google (a speech recognition library) was used for comparison.
Large Language Models: we’ve been using tools that can load models from sources like Hugging Face to generate text responses. Gemma 2 (2 billion parameters) was found to be quite impressive, fitting in 2 GB of RAM. A 9-billion-parameter model was also tested, showing a difference in word complexity compared to smaller models. Other SLMs that can be tested include Phi from Microsoft and Mistral from Mistral AI, amongst others. Cloud models such as Llama (Meta AI), ChatGPT (OpenAI), and Claude (Anthropic) were used for comparison in our investigations.
Text-to-speech engine: to convert the generated text into spoken language. Piper (Rhasspy TTS engine) was used for local implementation, supporting 30 different languages. For higher quality, the ElevenLabs Cloud service was used. While the local Piper quality was usable, it lacked the natural sound of ElevenLabs. However, improvements in local text-to-speech quality are expected soon.
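To make the pipeline concrete, here is a minimal, illustrative sketch of how these three components might be wired together in a Python back-end. It is not the actual Avnet Silica implementation: the packages (openai-whisper, transformers, the Piper CLI), the Hugging Face model id google/gemma-2-2b-it, and the Piper voice file en_US-lessac-medium.onnx are assumptions chosen to match the tools named above, and audio capture, the web front-end, and streaming are left out.

```python
# Minimal local voice-assistant loop: speech-to-text -> SLM -> text-to-speech.
# Package names, model ids, and CLI flags are illustrative assumptions,
# not the exact configuration described in the interview.
import subprocess

import whisper                      # pip install openai-whisper
from transformers import pipeline   # pip install transformers accelerate

# 1) Speech-to-text: a small Whisper model keeps RAM use and latency manageable.
stt = whisper.load_model("base")

# 2) Text generation: a ~2B-parameter SLM such as Gemma 2 2B (instruction-tuned).
#    Chat-style input requires a recent transformers version; the model licence
#    may need to be accepted on Hugging Face first.
llm = pipeline("text-generation", model="google/gemma-2-2b-it", device_map="auto")

# 3) Text-to-speech: pipe the reply into the Piper CLI with a downloaded voice.
def speak(text: str, wav_path: str = "reply.wav") -> None:
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", wav_path],
        input=text.encode("utf-8"),
        check=True,
    )

def answer(audio_path: str) -> str:
    question = stt.transcribe(audio_path)["text"]
    messages = [{"role": "user", "content": question}]
    reply = llm(messages, max_new_tokens=200)[0]["generated_text"][-1]["content"]
    speak(reply)
    return reply
```

In practice you would stream partial transcriptions and generated tokens to keep latency acceptable, and choose the Whisper size and SLM to match the device's RAM, as discussed above.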
In conclusion, implementing generative AI and LLMs on embedded devices is feasible, offering significant benefits such as enhanced data privacy, reduced costs, and lower latency. While optimisation is required to cope with resource constraints such as limited RAM and processing power, smaller models and specialised hardware make local deployment possible. This enables diverse applications, from conversational AI to industrial interactions, marking a promising future for ‘everywhere AI’.

Michaël Uyttersprot is Avnet Silica’s Market Segment Manager for Artificial Intelligence, Machine Learning and Vision. He has 20 years of experience in the industry, starting his career as an engineer in robotics. His current focus is on supporting the development and promotion of embedded vision and deep learning solutions to customers for projects involving AI and machine learning.