Voice typing architecture: inside Loqua's three-model voice typing stack
Why we separate speech recognition, language intelligence, and screen context - and how we think about the internal numbers honestly.
TL;DR
This is a blog-level look at Loqua's voice typing architecture, and at why the right voice AI architecture for dictation isn't one big ASR model. Loqua is built as three cooperating layers: speech recognition, language intelligence, and multimodal context. The reason is simple: dictation quality is not just word error rate (WER). A useful dictation model stack must hear the words, understand technical names, and shape output for the destination app. Our internal targets are 200ms-class responsiveness, high technical-vocabulary recognition, and low single-digit WER in supported conditions; until we publish a benchmark page, treat those as internal measurements rather than third-party claims.
I'm Shuran. I run a small team of algorithm researchers who use voice typing every day. We built Loqua because the dictation tools we evaluated — most of them well-engineered, most of them genuinely useful — all flattened to the same ceiling once we pushed them on code, multilingual mixing, and app-aware formatting. The ceiling is structural, not a tuning problem. This is the architecture we ended up with after deciding the structure had to change.
This isn't a paper. It's a blog-level walkthrough — the intuition behind each layer, the numbers that came out, and what we actually use the stack for. If you want the deeper academic context, the further-reading section at the end points to the papers we drew from.
The wrap-Whisper trap
OpenAI's Whisper changed what's possible for English speech recognition. It's a remarkable model. The Whisper paper showed that scaling weakly-supervised audio training produces robust general ASR — across accents, conditions, and 99 languages — without per-domain fine-tuning. That's a structural win for the field.
But Whisper is an ASR model, not a dictation product. The gap between "transcribes speech accurately" and "produces text I can paste into an email without editing" is large. The wrap-Whisper approach — take Whisper's output and run it through a formatting layer — closes some of that gap. It fails in three places we cared about:
- Technical vocabulary. Whisper hears "react query" and gives you "react query." It doesn't know that in your codebase that's @tanstack/react-query and the package import is what you wanted. NER on technical vocab requires a model that sees the surrounding context, not a model that hears phonemes.
- App-aware formatting. Whisper transcribes; it doesn't know whether you're in Slack or in a Python file. Bolting a formatter on top requires either a heuristic, which is brittle, or a heavy LLM call per utterance, which is slow and cloud-dependent.
- Streaming under tight latency budgets. Whisper is excellent for batch transcription. Streaming, low-latency, on-device Whisper requires either compromise (smaller model, lower accuracy) or aggressive engineering that fights against the model's structure.
We tried the wrap-Whisper path for a research prototype. It was good enough to test ideas with. It was not good enough to use as our daily tool.
The three-model intuition: a different kind of speech recognition stack
The intuition is that voice typing is three jobs, and a single model is bad at doing all of them at once. The three jobs:
- Hear the words. Acoustic-to-token. This is what ASR is good at.
- Understand the intent. Clean up fillers, false starts, mid-sentence corrections. Recognize technical entities. Decide what the user actually meant to type.
- Place it correctly. Format the output for the destination app. Make the same intent come out as code in VS Code, as a Slack message in Slack, as a structured prompt in Cursor.
A single end-to-end model trying to do all three has a hard structural problem: the loss surface for "transcribe phonemes" and the loss surface for "format for Slack" point in different directions. Joint training compromises both. We can confirm this with our own ablations — we tried unifying layers 2 and 3 into one transformer trained on multi-task data, and accuracy on both fell.
Splitting into three lets each model do one job well. The cost is a small amount of latency for the hand-off between layers, which we recover by running everything on the Neural Engine in a single pipeline. That's the heart of Loqua's voice typing architecture: not a single monolithic transcriber, but a small dictation model stack where each layer is purpose-trained for its part of the problem.
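To make the hand-off concrete, here is a minimal sketch of three stages composed into one pipeline. Every name in it (`recognize`, `read_context`, `shape`, `dictate`) is hypothetical, and each stage is a toy stand-in for a trained model; it shows the shape of the interfaces, not Loqua's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    confidence: float


@dataclass
class FormatDirective:
    style: str  # hypothetical values: "chat" or "code_comment"


def recognize(audio_words: List[str]) -> List[Token]:
    """Layer 1 stand-in: pretend each incoming word was recognized."""
    return [Token(word, 0.9) for word in audio_words]


def read_context(active_app: str) -> FormatDirective:
    """Layer 3 stand-in: map the destination app to a format directive."""
    return FormatDirective("code_comment" if active_app == "vscode" else "chat")


def shape(tokens: List[Token], directive: FormatDirective) -> str:
    """Layer 2 stand-in: drop fillers, then shape text for the destination."""
    fillers = {"um", "uh"}
    words = [t.text for t in tokens if t.text.lower() not in fillers]
    sentence = " ".join(words)
    if directive.style == "code_comment":
        return "# " + sentence
    return sentence[:1].upper() + sentence[1:] + "."


def dictate(audio_words: List[str], active_app: str) -> str:
    """Run the three stages in order; each one only sees the previous output."""
    tokens = recognize(audio_words)
    directive = read_context(active_app)
    return shape(tokens, directive)
```

With this sketch, the same spoken words `["um", "cache", "this"]` come out as `# cache this` when the app is `"vscode"` and as `Cache this.` anywhere else, which is the point of keeping the context decision separate from recognition.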
Layer 1: speech recognition
Acoustic input → token sequence. This is a task-specific speech recognizer rather than a direct Whisper wrapper. Architecture choices were driven by three constraints we would not compromise on:
- Streaming-first. Output starts before the speaker finishes the utterance. This rules out non-causal attention over the full audio sequence as the default — we use a streaming-friendly variant.
- On-device Neural Engine compatibility. Model size and operator selection are constrained by what runs efficiently on Apple's Neural Engine via Core ML. This is a real constraint: operators that look fine in a paper can fall off the Neural Engine path and become CPU-bound.
- Low-amplitude robustness. Our training data deliberately includes whisper-volume input (people dictating quietly in cafés, late at night). Most general ASR training data is normal-volume; whisper-volume requires explicit coverage.
The output of this layer is a token sequence with timing and confidence scores. It's not your final text — it's the raw recognition that the next layer cleans up.
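As an illustration of what "a token sequence with timing and confidence scores" could look like, here is a small sketch. The field names, the `(text, duration_ms, confidence)` input tuples, and the 0.6 threshold are all assumptions made for the example; a real recognizer would derive them from audio.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class RecognizedToken:
    text: str
    start_ms: int
    end_ms: int
    confidence: float  # 0.0 to 1.0


def stream_tokens(
    chunks: Iterable[Tuple[str, int, float]],
) -> Iterator[RecognizedToken]:
    """Yield a token as soon as its chunk is processed (streaming-first).

    Each chunk is a hypothetical (text, duration_ms, confidence) tuple
    standing in for the acoustic model's output.
    """
    cursor_ms = 0
    for text, duration_ms, confidence in chunks:
        yield RecognizedToken(text, cursor_ms, cursor_ms + duration_ms, confidence)
        cursor_ms += duration_ms


def uncertain(
    tokens: Iterable[RecognizedToken], threshold: float = 0.6
) -> List[RecognizedToken]:
    """Tokens the next layer may want to look at more closely."""
    return [t for t in tokens if t.confidence < threshold]
```

Because `stream_tokens` is a generator, a consumer can start working on the first token before the last chunk exists, which mirrors the streaming constraint above in miniature.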
Layer 2: language intelligence
Token sequence → cleaned, intent-resolved text. This is where we make the biggest research investment, because it's where most of the user-visible quality lives.
The job of this layer: take what the speech model heard and produce what the user meant. Three things happen in parallel:
- Filler and false-start removal. "Um, so, basically, we should — actually wait, let me start over — we should cache this" becomes "We should cache this." The mid-sentence correction is honored; the throat-clearing is gone.
- NER for technical vocabulary. The layer learns the names of common libraries, frameworks, model families, file extensions, terminal commands, and idiomatic API surfaces. "React query" becomes @tanstack/react-query when the surrounding context is a JavaScript file. Our internal target is high-90s recognition on curated in-domain technical vocabulary, so identifiers come out right without forcing a personal dictionary for every common term.
- Structural shaping. Whether the output is a sentence, a bullet list, a markdown table, or a code comment is decided here based on what the next layer (multimodal context) says about the destination.
This layer is the smallest model in the stack by parameter count, but the highest-impact in user experience. We spent more team-hours here than on the other two combined.
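The behaviors above can be mimicked with a deliberately naive, rule-based sketch. The layer described here is a trained model, not a set of rules; the filler list, restart markers, and entity table below are hypothetical and exist only to show the input-to-output contract.

```python
import re

# Hypothetical rule tables, for illustration only.
FILLERS = {"um", "uh", "so", "basically"}
RESTART_MARKERS = ("let me start over", "scratch that")
ENTITY_MAP = {"javascript": {"react query": "@tanstack/react-query"}}


def clean_utterance(raw: str) -> str:
    """Honor a mid-sentence restart, strip leading fillers, and tidy casing."""
    text = raw.lower()
    for marker in RESTART_MARKERS:
        if marker in text:
            # Keep only what the speaker said after the last restart.
            text = text.rsplit(marker, 1)[1]
    words = re.findall(r"[a-z0-9']+", text)
    while words and words[0] in FILLERS:
        words.pop(0)
    if not words:
        return ""
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."


def resolve_entities(text: str, file_type: str) -> str:
    """Rewrite spoken names to identifiers only when the context calls for it."""
    for spoken, written in ENTITY_MAP.get(file_type, {}).items():
        text = text.replace(spoken, written)
    return text
```

Given "Um, so, basically, we should, actually wait, let me start over, we should cache this", `clean_utterance` returns "We should cache this." And `resolve_entities("use react query here", "javascript")` rewrites the library name, while the same phrase with a non-JavaScript `file_type` is left alone, which is the context dependence that a phoneme-only model cannot provide.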
Layer 3: multimodal context
App state + screen + cursor → format directive. This is the omni-modal layer — and it's the reason we don't just call this "a dictation app." Loqua's job isn't to transcribe; it's to write what you meant where you meant it.
The context layer reads the active app via macOS Accessibility, the selected text (if any), the visible adjacent text, and the structural cues of the destination (Gmail compose vs Slack thread vs VS Code Python file vs Cursor chat panel). It outputs a format directive that the language layer uses to shape the final text.
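A sketch of the context-to-directive step might look like the following. The `AppContext` fields, the app name strings, and the directive kinds are hypothetical, and the sketch takes the context as plain data rather than calling any macOS Accessibility API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AppContext:
    app: str                              # hypothetical app identifier
    file_extension: Optional[str] = None  # set when the destination is a file
    selected_text: str = ""


@dataclass(frozen=True)
class FormatDirective:
    kind: str
    comment_prefix: str = ""


def directive_for(ctx: AppContext) -> FormatDirective:
    """Pick an output format from the structural cues of the destination."""
    if ctx.app == "vscode" and ctx.file_extension == ".py":
        return FormatDirective("code_comment", comment_prefix="# ")
    if ctx.app == "cursor_chat":
        return FormatDirective("structured_prompt")
    if ctx.app == "slack":
        return FormatDirective("chat_message")
    if ctx.app == "gmail":
        return FormatDirective("email")
    # Unknown destinations fall back to unformatted text.
    return FormatDirective("plain_text")
```

Keeping the directive as a small, explicit value means the language layer never needs to know which app is open, only which shape the text should take.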
The deeper intuition — why omni-modal architectures change voice typing, not just enhance it — is its own piece. See voice meets vision: how omni-modal models unlock context-aware dictation if you want that thread.
The numbers we track
| Metric | Current internal target | Why it matters |
|---|---|---|
| End-to-end latency | 200ms-class on Apple Silicon | Below the point where dictation starts to feel like waiting |
| Time-to-first-token (TTFT) | Sub-200ms in common streaming cases | First words appear while longer utterances are still being spoken |
| NER accuracy | High 90s on curated in-domain technical vocab | Identifiers, libraries, and model names must come out right |
| Multilingual WER | Low single digits in supported test conditions | Mixed English / Chinese and accented English must work in real workflows |
These are internal benchmark and dogfooding numbers, not a third-party benchmark suite. The test sets include our own technical vocabulary, code-switching examples, accented English samples, and noisy everyday environments. Our next step is a public methodology page so these numbers can be cited without hand-waving.
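For readers unfamiliar with the WER row: word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence. Whitespace tokenization is an assumption that suits English; Mandarin and mixed-language text are usually scored per character or with a mixed scheme, and this sketch does not claim to match the internal scoring described above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "we should cash this" against the reference "we should cache this" gives one substitution over four words, a WER of 0.25. That single error is also why WER alone undersells the problem: the misrecognized word is the one that carries the technical meaning.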
What we use it for ourselves
The most important thing I can say about this stack is that we wrote it because we use it. Every member of the team dictates daily — for commits, PRs, internal Slack, longer technical writing, and (in my case) most of the prose for blog posts like this one. The decisions that shaped the architecture came from real use, not from a product spec.
Three places where the architecture-vs-feature distinction shows up in daily use:
- Code dictation. The NER quality is the difference between voice as a feasible interface for code and voice as a toy. See the code dictation guide for what this enables.
- Multilingual code-switching. Half our internal Slack is mixed Mandarin and English. The language layer is trained on code-switched data rather than around it; mid-sentence switches work without a mode toggle.
- App-aware formatting. The same voice phrase becoming a code comment in VS Code and a structured PR description on GitHub is the difference between voice typing and a useful product.
We're a small team. The honest version of this story is that we don't have the resources to maintain a wrapped-Whisper product AND a three-model research stack. We bet on the structural path because we needed the quality for ourselves, and we'd rather ship something narrower with quality than something broader without it.
How each layer was rebuilt in 2026
Layer 1 — speech recognition. The recognizer was tightened around streaming, low-amplitude speech, and technical vocabulary; the deeper walkthrough is in inside our omni-modal voice stack and the post-training details are in RL in our voice stack.
Layer 2 — language intelligence. The language layer now treats cleanup, entity preservation, and app-aware structure as one dictation model stack instead of separate post-processors. That is where reinforcement learning helps most: choosing the output users edit least.
Layer 3 — multimodal context. The context layer was rebuilt around local screen evidence: active app, selected text, visible identifiers, and cursor surroundings. See building a listener that sees what you see for the architecture.
The next frontier is non-word audio: audio event detection (AED) and audio captioning as optional, local-first context. We cover that prototype-stage work in sounds with meaning.
Further reading
If you want to go deeper than this blog-level treatment:
- The Whisper paper (Radford et al., 2022) — for the weakly-supervised audio training paradigm we drew from for layer 1.
- Apple Core ML documentation — for what Neural Engine deployment looks like in practice.
- Our companion note on voice meets vision: omni-modal dictation for the layer-3 reasoning.
- Our note on privacy by design with a hybrid architecture for what crosses the wire and what doesn't.
If you have questions about the architecture or want to dig into any of these layers in more depth, ping us. We're a small team and we read every email.