Building a CPU-Only Voice AI Pipeline with Small Language Models
There is growing interest in edge AI, meaning systems that run directly on local devices rather than relying on large cloud infrastructure. As the hardware in tablets, phones, and other low-power devices continues to improve, an important question arises: how much useful AI functionality can we achieve with limited compute?
At the same time, we are seeing rapid progress in small language models, particularly models below one billion parameters. These models are not designed to do everything, but when they are well trained and narrowly applied, they are increasingly capable at specific tasks.
To explore this direction, I built and deployed a small but complete voice-based AI pipeline using three small models, each optimized for a single role.
The system consists of three components (a loading sketch follows the list):
Automatic Speech Recognition (ASR): openai/whisper-tiny. This model takes audio input from a microphone and converts it into text. It is lightweight and suitable for CPU-only environments.
Language model (text-to-text reasoning): HuggingFaceTB/SmolLM2-135M-Instruct. This instruction-tuned small language model takes the transcribed text and generates a textual answer.
Text-to-Speech (TTS): facebook/mms-tts-eng. This model converts the generated text back into spoken audio.
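To make this concrete, here is a minimal loading sketch using the standard transformers APIs for these checkpoints. The variable names (asr, llm, tts_model, tts_tokenizer) are my own shorthand, and the code is an illustration rather than the exact source of the deployed Space.

```python
# Minimal sketch: loading the three stages with transformers, CPU-only.
from transformers import pipeline, VitsModel, AutoTokenizer

# 1. ASR: microphone audio -> text
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# 2. Reasoning: transcribed question -> textual answer
llm = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M-Instruct")

# 3. TTS: answer text -> waveform (MMS-TTS is a VITS model)
tts_model = VitsModel.from_pretrained("facebook/mms-tts-eng")
tts_tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")
```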
The full pipeline is:
Microphone (audio) → ASR (text) → small LLM (answer) → TTS (audio output)
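Expressed as code, one pass through the loop looks roughly like the function below. It reuses the objects loaded above; the function name is illustrative, and the chat-message input format assumes a recent transformers version in which the text-generation pipeline accepts a list of messages and returns the assistant reply as the last message.

```python
import torch

def answer_spoken_question(audio_path: str):
    # Speech -> text
    question = asr(audio_path)["text"]

    # Text -> answer (kept short via max_new_tokens)
    messages = [{"role": "user", "content": question}]
    answer = llm(messages, max_new_tokens=128)[0]["generated_text"][-1]["content"]

    # Answer -> speech (16 kHz waveform from the VITS model)
    inputs = tts_tokenizer(answer, return_tensors="pt")
    with torch.no_grad():
        waveform = tts_model(**inputs).waveform[0].numpy()
    return tts_model.config.sampling_rate, waveform
```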
The entire system runs on CPU only, is deployed as a Hugging Face Gradio Space, and works on lightweight devices such as a tablet. No GPU is required.
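On the Gradio side, the wrapper can be as small as a single Interface. This is a sketch of the general pattern rather than the Space's exact layout; gr.Audio with a microphone source and a (sampling_rate, waveform) return value is the usual Gradio 4 recipe for this kind of app.

```python
import gradio as gr

demo = gr.Interface(
    fn=answer_spoken_question,  # defined in the sketch above
    inputs=gr.Audio(sources=["microphone"], type="filepath"),
    outputs=gr.Audio(label="Spoken answer"),
    title="CPU-only voice Q&A with small models",
)
demo.launch()
```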
Because these models are very small (well below 1B parameters), they also have limited context windows. As a result, this system works best when users ask short, focused questions, rather than long or highly detailed prompts. This constraint is not a flaw, but a design trade-off that comes with lower compute requirements and simpler deployment.
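One simple way to work within this constraint is to cap the number of input tokens before generation. The helper below is hypothetical (the deployed Space does not necessarily include it), and the 512-token budget is an arbitrary illustrative choice.

```python
def truncate_question(question: str, budget: int = 512) -> str:
    # Tokenize with truncation so the prompt fits a small context window;
    # llm.tokenizer is the text-generation pipeline's underlying tokenizer.
    ids = llm.tokenizer(question, truncation=True, max_length=budget)["input_ids"]
    return llm.tokenizer.decode(ids, skip_special_tokens=True)
```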
There is noticeable latency, so this is not yet a real-time conversational agent. However, that was not the objective. The objective was to validate the architecture and understand what becomes possible when small, specialized models are composed into a system.
Why this matters
As small language models continue to improve, especially in the sub-1B-parameter range, they enable:
lower compute and energy requirements
simpler deployment paths
CPU-only operation
experimentation on edge and constrained devices
This work suggests that useful AI systems do not necessarily require massive models or massive infrastructure. Instead, they require careful system design and appropriate model selection.
Next steps
Possible future improvements include:
reducing latency (e.g., faster ASR backends, shorter utterances)
streaming or incremental processing
conversational memory (within small context limits)
eventual edge or on-device deployment
As a starting point, this pipeline demonstrates that small models are already capable of supporting end-to-end voice applications, and that this approach is likely to become more practical as both models and devices continue to improve.
The application can be accessed as a Hugging Face Space here:
https://huggingface.co/spaces/Javedalam/Audio-text-audio
A Google Colab notebook is available here:
https://colab.research.google.com/drive/10JHWCSBYoZK6-6LHTyLJ1K3UPmh5rqIR?usp=sharing