Voice typing architecture: inside Loqua's three-model voice typing stack

Why we separate speech recognition, language intelligence, and screen context - and how we think about the internal numbers honestly.

Shuran Zhou, Founder · 2026-04-13 · 8 min · Updated 2026-04-13

TL;DR

This is a blog-level look at Loqua's voice typing architecture, and at why the right voice AI architecture for dictation isn't one big ASR model. Loqua is built as three cooperating layers: speech recognition, language intelligence, and multimodal context. The reason is simple: dictation quality is not just word error rate (WER). A useful dictation model stack must hear the words, understand technical names, and shape output for the destination app. Our internal targets are 200ms-class responsiveness, high technical-vocabulary recognition, and low single-digit WER in supported conditions; until we publish a benchmark page, treat those as internal measurements rather than third-party claims.

I'm Shuran. I run a small team of algorithm researchers who use voice typing every day. We built Loqua because the dictation tools we evaluated — most of them well-engineered, most of them genuinely useful — all flattened to the same ceiling once we pushed them on code, multilingual mixing, and app-aware formatting. The ceiling is structural, not a tuning problem. This is the architecture we ended up with after deciding the structure had to change.

This isn't a paper. It's a blog-level walkthrough — the intuition behind each layer, the numbers that came out, and what we actually use the stack for. If you want the deeper academic context, the further-reading section at the end points to the papers we drew from.

The wrap-Whisper trap

OpenAI's Whisper changed what's possible for English speech recognition. It's a remarkable model. The Whisper paper showed that scaling weakly-supervised audio training produces robust general ASR — across accents, conditions, and 99 languages — without per-domain fine-tuning. That's a structural win for the field.

But Whisper is an ASR model, not a dictation product. The gap between "transcribes speech accurately" and "produces text I can paste into an email without editing" is large. The wrap-Whisper approach — take Whisper's output and run it through a formatting layer — closes some of that gap. It fails in three places we cared about:

Technical vocabulary. Whisper hears "react query" and gives you "react query." It doesn't know that in your codebase that's @tanstack/react-query and the package import is what you wanted. NER on technical vocab requires a model that sees the surrounding context, not a model that hears phonemes.
App-aware formatting. Whisper transcribes; it doesn't know whether you're in Slack or in a Python file. Bolting a formatter on top requires a heuristic — which is brittle — or a heavy LLM call per utterance, which is slow and cloud-dependent.
Streaming under tight latency budgets. Whisper is excellent for batch transcription. Streaming, low-latency, on-device Whisper requires either compromise (smaller model, lower accuracy) or aggressive engineering that fights against the model's structure.

We tried the wrap-Whisper path for a research prototype. It was good enough to test ideas with. It was not good enough to use as our daily tool.

The three-model intuition: a different kind of speech recognition stack

The intuition is that voice typing is three jobs, and a single model is bad at doing all three at once. The three jobs:

Hear the words. Acoustic-to-token. This is what ASR is good at.
Understand the intent. Clean up fillers, false starts, mid-sentence corrections. Recognize technical entities. Decide what the user actually meant to type.
Place it correctly. Format the output for the destination app. Make the same intent come out as code in VS Code, as a Slack message in Slack, as a structured prompt in Cursor.

A single end-to-end model trying to do all three has a hard structural problem: the loss surface for "transcribe phonemes" and the loss surface for "format for Slack" point in different directions. Joint training compromises both. Our own ablations point the same way: when we unified layers 2 and 3 into one transformer trained on multi-task data, accuracy on both tasks fell.

Splitting into three lets each model do one job well. The cost is a small amount of latency for the hand-off between layers, which we recover by running everything on the Neural Engine in a single pipeline. That's the heart of Loqua's voice typing architecture: not a single monolithic transcriber, but a small dictation model stack where each layer is purpose-trained for its part of the problem.
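One way to picture the hand-off is as three small functions with narrow interfaces. The sketch below is illustrative only: the function names, the string-valued directive, and the stub bodies are hypothetical stand-ins, not Loqua's models or API.

```python
from typing import List


def recognize(audio_chunks: List[str]) -> List[str]:
    """Layer 1 stand-in: acoustic input -> raw tokens.
    Each 'chunk' here is already a word; a real recognizer consumes audio."""
    return [w.lower() for w in audio_chunks]


def read_context(active_app: str) -> str:
    """Layer 3 stand-in: app state -> format directive (a plain string here)."""
    return "code_comment" if active_app == "vscode" else "chat"


def shape(tokens: List[str], directive: str) -> str:
    """Layer 2 stand-in: raw tokens + directive -> final text."""
    words = [t for t in tokens if t not in {"um", "uh"}]
    sentence = " ".join(words).capitalize()
    if directive == "code_comment":
        return "# " + sentence
    return sentence + "."


def dictate(audio_chunks: List[str], active_app: str) -> str:
    """Run the stages; the language layer gets both tokens and directive."""
    tokens = recognize(audio_chunks)
    directive = read_context(active_app)
    return shape(tokens, directive)
```

Calling `dictate(["Um", "cache", "this"], "vscode")` yields `# Cache this`, while the same input with `"slack"` yields `Cache this.`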

Layer 1: speech recognition

Acoustic input → token sequence. This is a task-specific speech recognizer rather than a direct Whisper wrapper. Architecture choices were driven by three constraints we would not compromise on:

Streaming-first. Output starts before the speaker finishes the utterance. This rules out non-causal attention over the full audio sequence as the default — we use a streaming-friendly variant.
On-device Neural Engine compatibility. Model size and operator selection are constrained by what runs efficiently on Apple's Neural Engine via Core ML. This is a real constraint: operators that look fine in a paper can fall off the Neural Engine path and become CPU-bound.
Low-amplitude robustness. Our training data deliberately includes whisper-volume input (people dictating quietly in cafés, late at night). Most general ASR training data is normal-volume; whisper-volume requires explicit coverage.
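To make the streaming constraint concrete, the sketch below builds a chunked attention mask in which each frame may attend to all earlier frames plus the rest of its own chunk, but never to later chunks. This is one common streaming-friendly pattern, shown as an assumption: the post does not say which variant Loqua uses, and `chunk_size` is a hypothetical parameter.

```python
from typing import List


def chunked_causal_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """mask[i][j] is True when frame i may attend to frame j.

    Frame i sees every frame up to the end of its own chunk, so look-ahead
    is bounded by chunk_size instead of by the utterance length.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    mask = []
    for i in range(num_frames):
        # Index one past the last frame of the chunk containing frame i.
        chunk_end = min(((i // chunk_size) + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask


def full_mask(num_frames: int) -> List[List[bool]]:
    """Non-causal attention: every frame sees the whole utterance."""
    return [[True] * num_frames for _ in range(num_frames)]
```

With `chunk_size=1` this reduces to a strictly causal mask; larger chunks trade a little latency for more right-context.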

The output of this layer is a token sequence with timing and confidence scores. It's not your final text — it's the raw recognition that the next layer cleans up.
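As a rough picture of that hand-off, each recognized token can carry its text, timing, and confidence. The field names and the 0.6 threshold below are hypothetical, chosen only to illustrate how a downstream layer might spot spans worth a second look.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecognizedToken:
    text: str
    start_ms: int      # offset from the start of the utterance
    end_ms: int
    confidence: float  # 0.0 to 1.0


def low_confidence_tokens(tokens: List[RecognizedToken],
                          threshold: float = 0.6) -> List[str]:
    """Return the text of tokens the recognizer was unsure about."""
    return [t.text for t in tokens if t.confidence < threshold]


# Made-up values for illustration, not real recognizer output.
tokens = [
    RecognizedToken("import", 0, 320, 0.97),
    RecognizedToken("react", 340, 610, 0.91),
    RecognizedToken("query", 620, 900, 0.52),
]
```

Here `low_confidence_tokens(tokens)` returns `["query"]`, the kind of span the language layer would want to re-examine against surrounding context.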

Layer 2: language intelligence

Token sequence → cleaned, intent-resolved text. This is where we make the biggest research investment, because it's where most of the user-visible quality lives.

The job of this layer: take what the speech model heard and produce what the user meant. Three things happen in parallel:

Filler and false-start removal. "Um, so, basically, we should — actually wait, let me start over — we should cache this" becomes "We should cache this." The mid-sentence correction is honored; the throat-clearing is gone.
NER for technical vocabulary. The layer learns the names of common libraries, frameworks, model families, file extensions, terminal commands, and idiomatic API surfaces. "React query" becomes @tanstack/react-query when the surrounding context is a JavaScript file. Our internal target is high-90s recognition on curated in-domain technical vocabulary, so identifiers come out right without forcing a personal dictionary for every common term.
Structural shaping. Whether the output is a sentence, a bullet list, a markdown table, or a code comment is decided here, based on what the multimodal context layer (Layer 3) reports about the destination.
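Loqua does the cleanup step with a trained model. Purely to pin down the input/output behavior described above, here is a rule-based toy: the filler list, the restart markers, and `clean_utterance` are all hypothetical, and a fixed word list like this would wrongly drop a legitimate "so".

```python
import re

FILLERS = {"um", "uh", "so", "basically"}
RESTART_MARKERS = ("let me start over", "scratch that")


def clean_utterance(raw: str) -> str:
    """Toy cleanup: honor explicit restarts, then drop filler words."""
    text = raw.lower()
    # Honor an explicit restart: keep only what follows the last marker.
    for marker in RESTART_MARKERS:
        if marker in text:
            text = text.rsplit(marker, 1)[1]
    # Tokenize on word characters, discarding punctuation and fillers.
    words = [w for w in re.findall(r"[a-z0-9']+", text) if w not in FILLERS]
    if not words:
        return ""
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."
```

Given "Um, so, basically, we should, actually wait, let me start over, we should cache this", the function returns "We should cache this."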

This layer is the smallest model in the stack by parameter count, but the highest-impact in user experience. We spent more team-hours here than on the other two combined.
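The context dependence in the NER step can be illustrated with a lookup keyed on both the spoken phrase and the destination type. A trained model generalizes well beyond any table; the entries and the `resolve_entities` helper below are hypothetical and exist only to show the same phrase resolving differently by context.

```python
from typing import Dict, Tuple

# (spoken phrase, context) -> written form. Entries are illustrative.
ENTITY_TABLE: Dict[Tuple[str, str], str] = {
    ("react query", "javascript"): "@tanstack/react-query",
    ("pie torch", "python"): "PyTorch",
}


def resolve_entities(text: str, context: str) -> str:
    """Replace known spoken phrases with their written form for this context."""
    result = text
    for (phrase, ctx), written in ENTITY_TABLE.items():
        if ctx == context:
            result = result.replace(phrase, written)
    return result
```

In a JavaScript context, "import react query" becomes "import @tanstack/react-query"; in an email context the phrase is left alone.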

Layer 3: multimodal context

App state + screen + cursor → format directive. This is the omni-modal layer — and it's the reason we don't just call this "a dictation app." Loqua's job isn't to transcribe; it's to write what you meant where you meant it.

The context layer reads the active app via macOS Accessibility, the selected text (if any), the visible adjacent text, and the structural cues of the destination (Gmail compose vs Slack thread vs VS Code Python file vs Cursor chat panel). It outputs a format directive that the language layer uses to shape the final text.
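A minimal sketch of what a format directive might look like, assuming a simplified context snapshot. The `ContextSnapshot` and `FormatDirective` types, their fields, and the app identifiers are hypothetical; the real layer gathers this information through macOS Accessibility and a trained model rather than a rule table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContextSnapshot:
    app: str                              # e.g. "vscode", "slack", "gmail"
    file_extension: Optional[str] = None  # e.g. ".py"
    selected_text: str = ""


@dataclass(frozen=True)
class FormatDirective:
    register: str                   # "code", "chat", or "email"
    language: Optional[str] = None  # programming language, if any
    replace_selection: bool = False


def directive_for(ctx: ContextSnapshot) -> FormatDirective:
    """Map a context snapshot to a directive for the language layer."""
    replace = bool(ctx.selected_text)
    if ctx.app == "vscode":
        lang = {".py": "python", ".ts": "typescript"}.get(ctx.file_extension or "")
        return FormatDirective("code", lang, replace)
    if ctx.app == "gmail":
        return FormatDirective("email", None, replace)
    return FormatDirective("chat", None, replace)
```

Keeping the directive small and explicit is a design choice in this sketch: the language layer only needs to know the register, the language, and whether it is overwriting a selection.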

The deeper intuition — why omni-modal architectures change voice typing, not just enhance it — is its own piece. See voice meets vision: how omni-modal models unlock context-aware dictation if you want that thread.

The numbers we track

Metric	Current internal target	Why it matters
End-to-end latency	200ms-class on Apple Silicon	Below the point where dictation starts to feel like waiting
Time-to-first-token (TTFT)	Sub-200ms in common streaming cases	First words appear while longer utterances are still being spoken
NER accuracy	High 90s on curated in-domain technical vocab	Identifiers, libraries, and model names must come out right
Multilingual WER	Low single digits in supported test conditions	Mixed English / Chinese and accented English must work in real workflows

These are internal benchmark and dogfooding numbers, not results from a third-party benchmark suite. The test sets include our own technical vocabulary, code-switching examples, accented English samples, and noisy everyday environments. The next step is to publish a methodology page so that these numbers can be cited with their test conditions attached.
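The difference between the first two rows of the table is easy to blur, so here is a sketch of how time-to-first-token and total latency could be measured separately against any streaming recognizer. `fake_stream` is a hypothetical stand-in that sleeps to simulate work; it is not Loqua's recognizer, and its timings are arbitrary.

```python
import time
from typing import Callable, Iterator, List, Tuple


def measure_stream(stream_fn: Callable[[], Iterator[str]]) -> Tuple[float, float, List[str]]:
    """Return (ttft_ms, total_ms, tokens) for one streamed utterance."""
    start = time.perf_counter()
    ttft_ms = None
    tokens: List[str] = []
    for token in stream_fn():
        if ttft_ms is None:
            # Time until the first token is available to display.
            ttft_ms = (time.perf_counter() - start) * 1000.0
        tokens.append(token)
    total_ms = (time.perf_counter() - start) * 1000.0
    if ttft_ms is None:
        ttft_ms = total_ms  # Nothing was emitted; report the full wait.
    return ttft_ms, total_ms, tokens


def fake_stream() -> Iterator[str]:
    """Simulated recognizer: emits three tokens with a small delay each."""
    for word in ["we", "should", "cache"]:
        time.sleep(0.01)
        yield word
```

For simplicity the sketch times from the start of the call. A production harness would more likely start the end-to-end clock when the speaker stops talking, which is an assumption about methodology rather than something the post specifies.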

What we use it for ourselves

The most important thing I can say about this stack is that we wrote it because we use it. Every member of the team dictates daily — for commits, PRs, internal Slack, longer technical writing, and (in my case) most of the prose for blog posts like this one. The decisions that shaped the architecture came from real use, not from a product spec.

Three places where the architecture-vs-feature distinction shows up in daily use:

Code dictation. The NER quality is the difference between voice as a feasible interface for code and voice as a toy. See the code dictation guide for what this enables.
Multilingual code-switching. Half our internal Slack is mixed Mandarin and English. The language layer is trained directly on code-switched data instead of treating language switches as an edge case to work around, so mid-sentence switches work without a mode toggle.
App-aware formatting. The same voice phrase becoming a code comment in VS Code and a structured PR description on GitHub is the difference between voice typing and a useful product.
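As an illustration of that last point, the sketch below renders one cleaned phrase for different destinations. The destination names and formatting rules are hypothetical simplifications, not Loqua's actual output rules.

```python
def render(text: str, destination: str) -> str:
    """Format one cleaned phrase for a destination (illustrative rules only)."""
    sentence = text.strip().rstrip(".")
    if destination == "python_file":
        # Code comment: hash prefix, no trailing period.
        return "# " + sentence
    if destination == "pr_description":
        # Structured description: heading plus a bullet.
        return "## Summary\n\n- " + sentence + "."
    # Default: plain sentence for chat or email.
    return sentence + "."
```

The phrase "Cache the tokenizer output." becomes `# Cache the tokenizer output` in a Python file and a "Summary" heading with one bullet in a PR description.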

We're a small team. The honest version of this story is that we don't have the resources to maintain a wrapped-Whisper product AND a three-model research stack. We bet on the structural path because we needed the quality for ourselves, and we'd rather ship something narrower with quality than something broader without it.

How each layer was rebuilt in 2026

Layer 1 — speech recognition. The recognizer was tightened around streaming, low-amplitude speech, and technical vocabulary; the deeper walkthrough is in inside our omni-modal voice stack and the post-training details are in RL in our voice stack.

Layer 2 — language intelligence. The language layer now treats cleanup, entity preservation, and app-aware structure as one dictation model stack instead of separate post-processors. That is where reinforcement learning helps most: choosing the output users edit least.
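One way to make "the output users edit least" measurable is to score each candidate output by its edit distance to the text the user finally kept, and to use the negated distance as a reward. This is an assumed formulation for illustration: the post does not specify Loqua's reward, and `edit_reward` and `best_candidate` are hypothetical names.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_reward(candidate: str, user_final: str) -> float:
    """Higher is better: zero when the user kept the output unchanged."""
    return -float(levenshtein(candidate, user_final))


def best_candidate(candidates: List[str], user_final: str) -> str:
    """Pick the candidate that needed the fewest edits."""
    return max(candidates, key=lambda c: edit_reward(c, user_final))
```

A reward of this shape favors outputs the user accepts as-is, which is the behavior the paragraph above describes.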

Layer 3 — multimodal context. The context layer was rebuilt around local screen evidence: active app, selected text, visible identifiers, and cursor surroundings. See building a listener that sees what you see for the architecture.

The next frontier is non-word audio: audio event detection (AED) and audio captioning as optional, local-first context. We cover that prototype-stage work in sounds with meaning.

Frequently asked questions

Why three models instead of one big LLM?

A single end-to-end model has a structural problem: the loss surface for "transcribe phonemes correctly" points in a different direction than the loss surface for "format output for Slack." We tried unifying layers 2 and 3 into one transformer trained on multi-task data, and accuracy on both fell. Three purpose-trained models, each doing one job well, beat one model trying to do all three.

Why not just wrap Whisper?

Whisper is a great ASR model but it's not a dictation product. The wrap-Whisper approach falls short on technical vocabulary (no in-context NER), app-aware formatting (requires a heavy post-processor), and on-device streaming (Whisper is optimized for batch). We needed all three for our own daily use.

Did you train models from scratch?

Yes, for the speech recognition and language intelligence layers. The multimodal context layer builds on omni-modal research patterns (see the omni-modal blog post for the lineage), with our own training data and task-specific fine-tuning for dictation.

How big is each model?

We don't publish exact parameter counts; they're tuned to fit the Neural Engine budget while hitting our latency and accuracy targets. All three layers run on-device. Cloud processing is reserved for specific cases (longer rewrites, certain translations) and is user-toggleable.

How do you evaluate accuracy?

With internal benchmark suites for each layer plus daily team dogfooding. Speech recognition is measured with WER-style protocols on supported conditions. NER is measured on curated technical vocabulary. Until we publish the methodology, none of these numbers should be treated as an external benchmark claim.
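For reference, word error rate is the word-level edit distance between hypothesis and reference divided by the reference length: WER = (S + D + I) / N, where S, D, and I count substitutions, deletions, and insertions and N is the number of reference words. A minimal sketch follows. The whitespace tokenization and lowercasing are simplifying assumptions (Chinese text, for example, is usually scored per character instead), and this is not Loqua's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

One substituted word in a four-word reference, such as "cash" for "cache" in "we should cache this", gives a WER of 0.25.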

Will you open-source any of this?

Not currently planned for the production stack — the team is small and the work to maintain a clean public release alongside the product would slow us down. We do publish notes like this one when there's something worth saying. If you want to collaborate on research, send us an email.