
Omni-modal voice typing: multimodal understanding, MoE, and streaming text output

A technical walkthrough of how Loqua combines listening, multimodal task understanding, and text rendering without making voice feel slow.

Shuran Zhou, Founder · 2026-05-27 · 8 min · Updated 2026-05-27

TL;DR

Omni-modal voice typing is not one giant audio model. Loqua is a Mac-native voice typing tool that separates speech recognition, multimodal instruction planning, and text rendering into stages of one streaming text-output pipeline. That split leaves room for an audio encoder MoE, local screen context, and 200ms-class interaction, and the stack is built in-house rather than wrapped around a third-party model.

I'm Shuran, founder of Loqua.ai. This is the deeper engineering note behind our omni-modal voice typing stack: why we separate acoustic recognition from multimodal task understanding and text rendering, where conditional experts helped, where they did not, and why streaming dictated almost every architecture choice.

Why monolithic audio LMs hit a Mac wall

A monolithic audio language model is attractive on paper: feed audio, screen frames, and text into one large model and ask it to emit the final dictation. In daily Mac use, that design hits three walls at once. First, the latency budget is brutal. If the first useful token arrives after the speaker pauses, voice typing feels like batch transcription rather than typing.

Here is the budget we hold ourselves to. From microphone to first visible character, we have about 200 milliseconds. Of that, roughly 40ms is platform overhead (Core Audio buffering, inter-process delivery, and final render), 60ms is acoustic feature extraction and front-end inference, 70ms is instruction-planner decoding for the first partial, and 30ms is left for the text renderer to emit a safe early commit. A single 1B+ parameter audio LM running every step in that window is plausible on a benchmark machine but unreliable on a real laptop with a browser, an IDE, and Zoom already running.
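The budget above can be written down as a small check that a latency trace either passes or fails. This is a minimal sketch: the stage names and the `over_budget` helper are illustrative assumptions, and only the millisecond figures come from the text.

```python
# Stage budgets in milliseconds, taken from the paragraph above.
# The stage names are illustrative labels, not Loqua's internal names.
BUDGET_MS = {
    "platform_overhead": 40,      # Core Audio buffering, IPC, final render
    "acoustic_front_end": 60,     # feature extraction + front-end inference
    "planner_first_partial": 70,  # instruction-planner decoding
    "renderer_early_commit": 30,  # safe early commit
}
TARGET_MS = 200


def over_budget(measured_ms, budget_ms=BUDGET_MS):
    """Return {stage: overshoot_ms} for every stage that exceeded its budget."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if measured_ms.get(stage, 0) > limit
    }


# The four stage budgets add up to the 200 ms target.
assert sum(BUDGET_MS.values()) == TARGET_MS
```

A per-stage check like this is more useful than a single end-to-end number, because it shows which stage ate the margin when a trace comes in late.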

Second, the user-visible job is not just automatic speech recognition. The same utterance must become a Slack paragraph, a Git commit subject, a Cursor prompt, or a Python comment. A monolithic model can be trained to do all of those, but it has to relearn the format conventions for every app and re-derive them every time the cursor moves. Third, the model has to stay inside a laptop thermal and memory envelope. A 7B audio-vision-text model loaded into unified memory next to an IDE index is a recipe for thermal throttling and a noisy fan. Public work such as Whisper and recent omni-model reports helped the field understand robust audio and multimodal modeling, but shipping a local product forces narrower, more opinionated choices.

Our conclusion was simple: omni-modal voice typing should be a pipeline, not a single all-purpose block. The pipeline lets each layer keep a small objective, a measurable failure mode, and a bounded runtime cost.

The multimodal instruction pipeline

Loqua has no speech-reply or TTS layer in this product path. The instruction planner consumes streaming audio features, selected screen context, active app metadata, the recent text around the cursor, and the active instruction prompt. It produces a compact text-output plan: intent, entities, edit mode, target format, and uncertainty.

The text renderer converts that plan into text. It does not generate audio, spoken replies, or TTS output. It decides whether the plan becomes a Markdown checklist, a sentence, a code comment, or a structured instruction. This separation keeps expensive cross-modal reasoning out of the hot path when the user is simply dictating prose.
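To make the planner/renderer contract concrete, here is a minimal sketch of what a text-output plan and a renderer could look like. The field names follow the plan described above; the types, the `content` field, and the format labels are assumptions made for illustration, not Loqua's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class TextOutputPlan:
    # Field names mirror the plan described above; the types are assumptions.
    intent: str                                   # e.g. "dictate" or "edit"
    entities: list = field(default_factory=list)
    edit_mode: str = "insert"
    target_format: str = "sentence"
    uncertainty: float = 0.0                      # 0.0 = confident, 1.0 = unsure
    content: list = field(default_factory=list)   # hypothetical: text units to render


def render(plan: TextOutputPlan) -> str:
    """Turn a plan into final text. The renderer never sees audio."""
    if plan.target_format == "markdown_checklist":
        return "\n".join(f"- [ ] {item}" for item in plan.content)
    if plan.target_format == "code_comment":
        return "\n".join(f"# {line}" for line in plan.content)
    # Default: a plain prose sentence with a capital and a terminator.
    text = " ".join(plan.content).strip()
    if text and text[-1] not in ".!?":
        text += "."
    return text[:1].upper() + text[1:]
```

Because the plan is a small structured value, revising it is cheap: the planner can change `target_format` or one entity without the renderer re-reading anything upstream.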

What each layer must not do

The discipline is in what we removed. The acoustic front end does not try to be smart about destination format; it surfaces audio tokens with timing and confidence. The instruction planner does not write final prose; it makes a structured plan small enough to revise cheaply. The text renderer does not re-read the audio or synthesize speech; it trusts the plan, destination context, and instruction prompt. Letting any layer cross those boundaries was the single most common cause of latency regressions in our early prototypes, because cross-layer feedback loops turn streaming into batch processing without anyone noticing until the trace is read.

Layer	Primary input	Output	Failure we watch
Acoustic front end	Streaming microphone frames	Audio tokens with timing	Low voice and noisy-room misses
Instruction planner	Audio tokens + screen context + instruction prompt	Intent and text-output plan	Wrong destination assumption
Text renderer	Plan + app constraints	Final text only	Over-formatting or late correction

The benefit is debuggability. When the output is wrong, we can tell whether the recognizer heard the wrong word, the instruction planner chose the wrong intent, or the text renderer wrote in the wrong register. That is much easier than interpreting one long hidden trajectory. In our internal triage, roughly two-thirds of post-launch quality fixes isolated cleanly to a single layer; the remainder required cross-layer changes, but only after diagnosis was clear.

MoE on the audio encoder

The audio encoder is where conditional computation paid off. Not every utterance needs the same acoustic treatment. Quiet speech, accented English, mixed English and Chinese, code identifiers, and meeting-room background noise stress different parts of the model. A small mixture-of-experts router lets the encoder spend more capacity on hard regions without making every frame expensive.

We kept the routing conservative. The router is conditioned on acoustic statistics and shallow lexical hints, not on personal user profiles. Experts specialize in patterns such as low-amplitude speech, fast technical dictation, and code-switching. We rejected a larger expert pool because routing instability made streaming behavior worse: a model that is accurate but changes style mid-utterance is not usable for typing.
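A conservative router of this kind can be sketched as top-1 gating with a default path and some stickiness. Everything specific here is an assumption for illustration: the three features, the hand-set weights, the expert names, and the `switch_margin` rule are not Loqua's trained router, and a real one would learn its weights.

```python
import math

# Hypothetical experts with hand-set weights over three assumed features:
# (mean_energy, speech_rate, language_switch_score), each scaled to 0..1.
EXPERT_WEIGHTS = {
    "default":        (0.0, 0.0, 0.0),
    "low_amplitude":  (-4.0, 0.0, 0.0),
    "fast_technical": (0.0, 3.0, 0.0),
    "code_switching": (0.0, 0.0, 3.0),
}
EXPERT_BIAS = {
    "default": 0.0,
    "low_amplitude": 1.0,
    "fast_technical": -1.5,
    "code_switching": -1.5,
}


def gate(features):
    """Softmax over linear expert scores."""
    scores = {
        name: EXPERT_BIAS[name] + sum(w * x for w, x in zip(weights, features))
        for name, weights in EXPERT_WEIGHTS.items()
    }
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - peak) for name, score in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def route(features, previous="default", switch_margin=0.15):
    """Top-1 routing that keeps the previous expert unless another wins clearly."""
    probs = gate(features)
    best = max(probs, key=probs.get)
    if best != previous and probs[best] - probs[previous] < switch_margin:
        return previous  # not a clear win: avoid flipping mid-utterance
    return best
```

The margin is one simple way to trade a little accuracy for stability: a borderline frame stays on whichever expert handled the previous frame instead of flipping back and forth.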

What we tried and dropped

Two ideas looked attractive in research and failed in product. First, per-user expert adaptation: training a small adapter per heavy user. It improved cold-start accuracy by a few points on our internal test set but doubled cold-launch memory footprint and made privacy boundaries muddier. Second, a router that took the active app identifier as a strong signal. It overfit to a handful of common apps and silently degraded in new ones. We replaced it with the current acoustic-and-lexical router and moved app-aware behavior up to the instruction planner and text renderer, where it belongs.

In practical terms, audio encoder MoE helps Loqua keep the common path fast while making the tail less fragile. That is the product value: not a benchmark trick, but fewer moments where a library name, a quiet phrase, or a bilingual fragment breaks the flow.

Streaming: token by token vs phrase by phrase

Streaming was the hardest tradeoff. Token-by-token output feels responsive, but it can expose premature guesses. Phrase-by-phrase output is cleaner, but it feels late. We use a hybrid: early partials for low-risk spans, delayed commits for entities and edits that need screen context.

For example, when you say "change the fetch profile function to return early if the auth client is missing," the system can stream ordinary words quickly. But the tokens around fetchProfile and authClient wait until screen context confirms that those identifiers are visible. This is why text rendering is separate from recognition: the renderer can revise a small span before committing text, without restarting the whole utterance.
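One simple way to confirm spoken phrases against visible identifiers is to compare the phrase with each identifier's "spoken form". The sketch below assumes the list of visible identifiers has already been collected from the screen; `spoken_form` and `resolve_identifiers` are hypothetical helpers, and production matching would need to tolerate recognition errors rather than require exact word matches.

```python
import re


def spoken_form(identifier):
    """Split camelCase or snake_case into lowercase words: fetchProfile -> 'fetch profile'."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return " ".join(word.lower() for word in words)


def resolve_identifiers(transcript, visible_identifiers):
    """Replace spoken phrases with matching on-screen identifiers, longest first."""
    table = {spoken_form(name): name for name in visible_identifiers}
    result = transcript
    for phrase in sorted(table, key=len, reverse=True):
        if " " not in phrase:
            continue  # skip single words: too easy to clobber ordinary prose
        pattern = r"\b" + re.escape(phrase) + r"\b"
        # A function replacement avoids any escape handling in the identifier.
        result = re.sub(pattern, lambda m, name=table[phrase]: name,
                        result, flags=re.IGNORECASE)
    return result


print(resolve_identifiers(
    "change the fetch profile function to return early if the auth client is missing",
    ["fetchProfile", "authClient", "render"],
))
```

Only the matched spans change, so the rest of the utterance can stream while these two spans are held back.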

Mid-sentence bilingual input is a related case. When you say "把这个 retry 函数改成指数退避," the instruction planner produces a text-output plan that interleaves Chinese spans with one English code token. The text renderer emits the Chinese characters as soon as they clear an internal confidence threshold and waits a few extra milliseconds on the English identifier to confirm it against the visible IDE buffer. The user sees the Chinese flow first, then the identifier, then the rest. Reordering output would have been faster but felt wrong; the eye expects left-to-right typing.
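The ordering rule in this example, where spans may be confirmed out of order but are shown strictly left to right, can be sketched as a small reordering buffer. The class name and the position-based interface are assumptions for illustration.

```python
class InOrderEmitter:
    """Release spans strictly left to right, even if later spans are ready first."""

    def __init__(self):
        self._ready = {}   # position -> text, confirmed but not yet shown
        self._next = 0     # next position the user should see

    def confirm(self, position, text):
        """Mark a span as confirmed and return the spans that can now be shown."""
        self._ready[position] = text
        released = []
        while self._next in self._ready:
            released.append(self._ready.pop(self._next))
            self._next += 1
        return released


emitter = InOrderEmitter()
print(emitter.confirm(0, "把这个 "))           # Chinese prefix shows immediately
print(emitter.confirm(2, " 函数改成指数退避"))  # ready early, but held back
print(emitter.confirm(1, "retry"))             # identifier confirmed: both release
```

The trailing Chinese span is ready before the identifier, but it is held until `retry` is confirmed, which matches the left-to-right order the reader expects.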

Commit boundaries are the other knob. A commit boundary is the moment after which the text renderer promises not to revise what it has emitted. We pin those to natural pauses, sentence terminators, and structural transitions such as paragraph, list item, or code block. Before a commit boundary the text renderer is free to revise; once committed, the text is final from the user's point of view. That contract is what makes streaming feel honest rather than jittery.
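The commit contract can be sketched as a buffer with a final part and a revisable tail. This is a simplified illustration under stated assumptions: the class, the `feed` helper, and the terminator list are hypothetical, and structural transitions (paragraph, list item, code block) are left out for brevity.

```python
class CommitBuffer:
    """Committed text is final; only the pending tail may be revised."""

    def __init__(self):
        self.committed = ""   # promised to the user, never rewritten
        self.pending = ""     # latest hypothesis, still revisable

    def update(self, text):
        """Replace the revisable tail with the renderer's latest hypothesis."""
        self.pending = text

    def commit(self):
        """Freeze the pending tail."""
        self.committed += self.pending
        self.pending = ""

    @property
    def visible(self):
        return self.committed + self.pending


# Assumed terminator set, covering the English and Chinese cases in this post.
TERMINATORS = (".", "!", "?", "。", "！", "？")


def feed(buffer, hypothesis, pause=False):
    """Apply a new hypothesis and commit on a pause or a sentence terminator."""
    buffer.update(hypothesis)
    if pause or hypothesis.rstrip().endswith(TERMINATORS):
        buffer.commit()
    return buffer.visible
```

Revisions only ever touch `pending`, so nothing the user has seen committed can change underneath them.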

The result is a streaming voice stack that feels immediate but does not treat every early acoustic guess as final. For voice typing, that compromise matters more than raw transcript speed.

Open research context

We follow open research closely because it sharpens our vocabulary for the problems we see in product. Papers and reports on audio-language modeling, multimodal encoders, and preference optimization show the field converging on split responsibilities, streaming constraints, and cross-modal alignment. For background, start with Whisper, the Qwen2 technical report, and the public Qwen2.5-Omni overview.

The important product boundary: these references are literature context. Loqua's production stack is original research trained and optimized in-house for Mac dictation. We cite open work to explain the field, not to suggest provenance.

What we built and what's next

What shipped from this direction is a narrow system: Mac voice typing that listens, reads local screen context, and writes app-aware text. The next work is more rigorous evaluation. We need public methodology for latency, technical NER, multilingual WER, and screen-context disambiguation so the claims are reviewable outside our dogfooding loop.

The second direction is better calibration. When the model is uncertain about a visible identifier, it should ask less from the user by choosing a safe output form or preserving the raw phrase. We are also exploring lightweight uncertainty markers in the text renderer so that downstream UI can highlight low-confidence spans for a quick keyboard correction without breaking the dictation rhythm.

The third direction is non-word audio. That work is still prototype-stage, and we cover it separately in "Sounds with meaning." Together with our work on reinforcement learning and the multimodal listener, it forms the agenda for the next year of our omni-modal voice typing stack.

Frequently asked questions

What is omni-modal voice typing?

Omni-modal voice typing means the dictation system considers more than audio. It can combine speech with local screen context, active app metadata, selected text, and cursor surroundings. In Loqua, that context helps the system decide whether the same spoken phrase should become prose, a code comment, a list, or an edit.

Why does Loqua split understanding from text rendering?

The split makes the pipeline faster and easier to debug. The instruction planner resolves intent, entities, destination context, and the active instruction prompt. The text renderer writes final text in the right shape. It does not generate speech or TTS output. If output fails, we can inspect whether hearing, task understanding, or text rendering was responsible instead of guessing inside one monolithic model.

Does MoE make local voice typing too heavy?

It can if the expert pool is large or routing is unstable. Loqua keeps the audio encoder MoE conservative: small experts, simple routing signals, and a fast default path. The goal is not maximum capacity. The goal is spending extra compute only on difficult acoustic regions.

Why not stream every token immediately?

Immediate streaming can expose guesses that are expensive to correct, especially around identifiers and app-specific formatting. Loqua streams low-risk spans quickly and delays risky spans until context confirms them. That feels responsive while avoiding the most frustrating mid-utterance corrections.

Is this architecture only useful for Mac?

The general split can work elsewhere, but Mac is where we can make strong product assumptions: Apple Silicon, desktop app context, Accessibility APIs, and keyboard-driven workflows. Narrowing the platform lets the model and runtime optimize for one daily environment instead of a generic cross-platform compromise.

How should engineers evaluate a voice stack like this?

Do not only measure word error rate. Also measure time to first usable text, entity preservation, app-aware formatting accuracy, correction rate, and privacy boundary behavior. A voice typing stack can have good ASR and still be poor at writing useful text in the destination app.
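Two of these metrics are easy to sketch side by side. The word error rate below is the standard word-level edit distance divided by reference length; `entity_preservation` is a deliberately simple, hypothetical definition (exact substring match), not a published metric.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    row = list(range(len(hyp) + 1))  # distances for the previous reference word
    for i, ref_word in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,         # deletion
                row[j - 1] + 1,     # insertion
                prev_diag + cost,   # substitution or match
            )
    return row[-1] / max(len(ref), 1)


def entity_preservation(expected_entities, output):
    """Fraction of expected entities that appear verbatim in the output."""
    if not expected_entities:
        return 1.0
    kept = sum(1 for entity in expected_entities if entity in output)
    return kept / len(expected_entities)


spoken = "change the fetch profile function"
written = "change the fetchProfile function"
print(word_error_rate(spoken, written))               # 0.4
print(entity_preservation(["fetchProfile"], written)) # 1.0
```

Scored against the literal spoken words, the output that correctly writes `fetchProfile` is charged a 40% word error rate, while the entity metric records it as fully preserved. That gap is why a single ASR number is not enough for this kind of stack.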

What does a commit boundary mean in a streaming voice stack?

A commit boundary is the point at which the text renderer promises not to revise the text it has already emitted. Loqua places boundaries at natural pauses, sentence terminators, and structural transitions. Before a boundary the text renderer can revise freely; once committed, the text is final from the user's point of view, which is what makes streaming feel honest.

Why not personalize the audio encoder per user?

We tried it and dropped it. Per-user adapters improved cold-start accuracy by a few points on internal sets but doubled launch memory footprint and weakened our privacy story. Loqua keeps the encoder generic and pushes user-specific behavior up the stack into the instruction planner, the text renderer, and local dictionaries that never leave the device.
