Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their high computational and memory requirements pose significant challenges for deployment on resource-constrained edge devices such as the Raspberry Pi.
In this work, we investigate post-training quantization (PTQ) techniques to reduce the computational burden of LLMs while preserving their quality. We evaluate several LLMs under different precision settings and show that 8-bit quantization, especially when combined with runtime-level optimizations such as LiteRT, achieves up to 2× faster inference on a Raspberry Pi than framework-native formats. This speedup is obtained without relying on hardware-specific acceleration (e.g., GPU, NNAPI, or EdgeTPU) and with negligible degradation in output quality.
Our experiments highlight the practicality of lightweight LLM deployment on edge devices and show that real-time applications are feasible on low-power hardware, broadening access to LLMs in edge environments.
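As a rough illustration of what 8-bit post-training quantization does to a weight tensor, the following is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. It is not the pipeline evaluated in this work: the function names, the random weight matrix, and the per-tensor symmetric scheme are assumptions chosen for brevity, and real runtimes typically use finer-grained (e.g., per-channel) scales and calibrated activations.

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float weights to INT8 (illustrative)."""
    max_abs = float(np.max(np.abs(weights)))
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 tensor from INT8 values and a scale."""
    return q.astype(np.float32) * scale


# Hypothetical weight matrix standing in for one linear layer of an LLM.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))

print(f"storage: {w.nbytes} bytes (float32) -> {q.nbytes} bytes (int8)")
print(f"scale: {scale:.6g}, max abs reconstruction error: {max_err:.6g}")
```

The INT8 tensor takes a quarter of the float32 storage, and the rounding error of each weight is bounded by half the scale, which gives some intuition for why output quality can degrade only slightly under 8-bit precision.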
Top: Response quality of the INT8-quantized model (Llama-3.2-1B) under LiteRT and PyTorch PTQ.
Bottom: Comparison of response quality across LLMs.