LiteRT-Optimized INT8 LLM for Raspberry Pi4 Deployment

Kihwan Yoon1*, Hyeon-Cheol Moon2*, Aeri Kim2, Sungjei Kim3, Sang-Seol Lee2, Sung-Joon Jang2, Ganzorig Gankuyag2†, Jinwoo Jeong2†
1BLUEDOT
2Korea Electronics Technology Institute
3Korea University of Technology and Education
*equal contribution
†corresponding author

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their high computational and memory requirements pose significant challenges for deployment on resource-constrained edge devices such as the Raspberry Pi.

In this work, we investigate post-training quantization techniques to reduce the computational burden of LLMs while preserving their quality. We evaluate several LLMs under different precision settings and show that 8-bit quantization—especially when combined with runtime-level optimizations like LiteRT—achieves up to 2× faster inference on Raspberry Pi, compared to framework-native formats, without relying on hardware-specific acceleration libraries (e.g., GPU, NNAPI, or EdgeTPU), and with negligible degradation in output quality.
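To make the idea of 8-bit post-training quantization concrete, the following is a minimal sketch of symmetric per-tensor INT8 quantization of a weight vector. It is an illustration under stated assumptions, not the pipeline used in this work: the function names (`quantize_int8`, `dequantize_int8`, `max_abs_error`) and the sample weights are hypothetical, and production runtimes typically use finer-grained schemes such as per-channel scales.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Assumption: a single scale for the whole tensor, chosen so that the
    largest-magnitude weight maps to +/-127.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range used here.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Map INT8 values back to approximate floating-point weights."""
    return [v * scale for v in q]


def max_abs_error(a: List[float], b: List[float]) -> float:
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical weights, for illustration only.
weights = [0.42, -1.27, 0.05, 0.9, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

print("quantized:", q)
print("scale:", scale)
print("max abs error:", max_abs_error(weights, restored))
```

Each stored weight shrinks from 32 bits to 8 bits, and the rounding error per weight is bounded by half a quantization step (`scale / 2`), which is one intuition for why output quality can remain close to the full-precision model.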

Our experiments highlight the practicality of lightweight LLM deployment on edge devices. These findings demonstrate the feasibility of real-time applications on low-power devices, enabling broader accessibility in edge environments.

Runtime Evaluation on the Raspberry Pi4


PyTorch PTQ vs LiteRT


Top: Response quality of the INT8-quantized model (Llama-3.2-1B) under LiteRT and PyTorch PTQ
Bottom: Comparison of response quality across LLMs
