Multimodal voice recognition: building a listener that sees what you see

Why audio-only ASR still fails in real workflows, and how Loqua uses local screen context to disambiguate intent.

TL;DR

Multimodal voice recognition is the missing layer between transcript and useful dictation. Loqua is a Mac-native voice typing tool that combines audio with local screen context, active app metadata, and cursor surroundings. That lets the same sound become the right identifier, instruction, or formatted text in the destination app.

Audio-only speech recognition has become good enough that its remaining failures are easy to underestimate. Clean speech benchmarks hide the real product problem: users dictate inside apps, around visible code, in mixed languages, and with partial references such as "this function" or "the above bullet."

Where ASR still fails

The classic example is homophones. Spoken aloud, the prose phrase "from foo import bar" and the Python statement from foo import bar are identical, but they belong to different worlds. The same goes for "cache the auth client" and "cash the auth client": a model that does not know the cursor is in a TypeScript file has no principled way to choose between them. Audio alone cannot reliably infer the destination.

Code identifiers make this sharper. A user may say "fetch profile," but the visible function is fetchProfile. A transcript model hears words; a dictation model should preserve the identifier. Multimodal voice recognition treats the visible text as evidence, not decoration.
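To make "visible text as evidence" concrete, here is a minimal Python sketch of identifier preservation. It is an illustration under stated assumptions, not Loqua's implementation: the function names split_identifier and preserve_identifiers are hypothetical, and the sketch assumes the visible identifiers have already been extracted as a list of strings.

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split camelCase, PascalCase, or snake_case into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]

def preserve_identifiers(spoken: str, visible_identifiers: list[str]) -> str:
    """Replace spoken word runs with the visible identifier they spell out."""
    by_words = {tuple(split_identifier(v)): v for v in visible_identifiers}
    longest = max((len(key) for key in by_words), default=0)
    words = spoken.split()
    out, i = [], 0
    while i < len(words):
        # Try the longest run first so "fetch profile" wins over "fetch".
        for n in range(min(longest, len(words) - i), 0, -1):
            key = tuple(w.lower() for w in words[i:i + n])
            if key in by_words:
                out.append(by_words[key])
                i += n
                break
        else:
            out.append(words[i])  # no identifier matched; keep the spoken word
            i += 1
    return " ".join(out)

print(preserve_identifiers("call fetch profile", ["fetchProfile"]))
# prints: call fetchProfile
```

A real system would weigh this match against audio confidence rather than substituting unconditionally; the sketch only shows why the identifier has to come from the screen and not from the sound.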

Deixis is the third sharp edge. When a user says "replace this with a guard clause," the spoken text is technically a complete request, but its meaning depends entirely on what "this" points to. Without selection awareness or a stable cursor reference, the system has to guess, and a wrong guess can cost more time than typing the edit by hand. Audio-only ASR cannot resolve deixis at all; it can only transcribe the demonstrative and hope a downstream tool figures it out.

  • Homophones: plain English vs code syntax.
  • Entities: package names, class names, file paths, and command flags.
  • Deixis: "this," "that," "above," "the selected part."
  • Format: prose, bullet, code comment, commit message, or prompt.

The multimodal listener architecture

Loqua's listener has three local inputs: streaming audio features, screen-derived context, and app metadata. The audio path proposes what was said. The context path summarizes where the text will land: app, field type, selected text, nearby tokens, and visible structural hints. The app path adds constraints such as whether line breaks, Markdown, or code syntax are appropriate.

The listener does not need to understand the entire screen as a human would. It needs the minimum useful evidence for dictation. In VS Code, that may be visible identifiers, language mode, and selected code. In Slack, it may be the thread topic and recent tone. In Notes, it may be heading level and list context.
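The "minimum useful evidence" idea can be sketched as a per-app allowlist. This is a hypothetical illustration: the app keys, field names, and the minimum_evidence function are assumptions chosen to mirror the examples above, not Loqua's actual schema.

```python
# Hypothetical allowlist: which evidence fields matter in which app.
EVIDENCE_BY_APP = {
    "vscode": ("visible_identifiers", "language_mode", "selected_code"),
    "slack": ("thread_topic", "recent_tone"),
    "notes": ("heading_level", "list_context"),
}

def minimum_evidence(app: str, raw_context: dict) -> dict:
    """Keep only the fields this app's dictation needs; drop everything else."""
    wanted = EVIDENCE_BY_APP.get(app, ())
    return {key: raw_context[key] for key in wanted if key in raw_context}

raw = {
    "thread_topic": "retry logic",
    "recent_tone": "casual",
    "visible_identifiers": ["fetchProfile"],  # irrelevant in a chat app
}
print(minimum_evidence("slack", raw))
# prints: {'thread_topic': 'retry logic', 'recent_tone': 'casual'}
```

An allowlist, as opposed to a blocklist, means an unknown app yields no evidence at all, which fails toward plain transcription.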

What we deliberately do not try to do

Several capabilities are out of scope on purpose. The listener does not perform OCR on screenshots of remote content, does not summarize windows the user is not actively typing in, and does not build a persistent visual history. It also does not try to infer fine-grained intent from images: a graph, a video frame, or a design canvas is not interpreted, only the surrounding text is. Each exclusion is a deliberate product choice that trades capability for predictability and a cleaner privacy boundary.

This is why we call it audio-visual dictation only in the narrow product sense: audio plus visual context for writing. The goal is not general visual reasoning. The goal is fewer wrong words at the cursor.

How screen context resolves ambiguity

Screen context dictation changes output by constraining possibilities. If the cursor is inside a Python file and the visible line already contains from fastapi import, the spoken word "router" is more likely to be a symbol than a generic noun. If the cursor is in a Gmail draft, the same word is more likely an ordinary noun and should land as part of a plain sentence.

You say:
    "add a guard before fetch profile if auth client is missing"
Loqua writes (in VS Code):
    if (!authClient) return null;
    const profile = await fetchProfile(authClient);

You say:
    "can you take a look at the PR I just pushed and let me know if the retry logic looks right"
Loqua writes (in Slack):
    Could you take a look at the PR I just pushed? Want to make sure the retry logic looks right.
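One way to picture "constraining possibilities" is as rescoring: the audio path proposes candidate transcripts with scores, and visible tokens nudge the ranking. The Python sketch below is a simplified illustration, not the production scoring. The rescore function, the additive boost, and the example scores are all assumptions.

```python
def rescore(
    candidates: list[tuple[str, float]],
    visible_tokens: set[str],
    boost: float = 0.1,
) -> str:
    """Pick the candidate whose audio score plus context overlap is highest."""
    visible = {token.lower() for token in visible_tokens}

    def score(item: tuple[str, float]) -> float:
        text, audio_score = item
        # Each spoken word that also appears on screen adds a small boost.
        overlap = sum(1 for word in text.lower().split() if word in visible)
        return audio_score + boost * overlap

    return max(candidates, key=score)[0]

# Audio slightly prefers "cash", but "cache" is visible near the cursor.
candidates = [("cash the auth client", 0.52), ("cache the auth client", 0.48)]
print(rescore(candidates, {"cache", "authClient"}))
# prints: cache the auth client
```

With no visible tokens the audio ranking is left untouched, so context can only help when there is evidence to use.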

The listener also handles selection-aware editing. If text is selected, dictation is interpreted as an instruction over that text unless the user explicitly asks to insert new prose. That one distinction removes an entire class of accidental duplicate text.
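The selection rule above can be sketched as a small decision function. This is a hypothetical Python illustration: the DictationAction type, the interpret function, and the list of phrases that count as an explicit insert request are assumptions, since the trigger wording is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed phrases that signal "insert new prose"; purely illustrative.
INSERT_PREFIXES = ("insert ", "type ", "write ")

@dataclass(frozen=True)
class DictationAction:
    mode: str                      # "edit_selection" or "insert_text"
    text: str                      # the spoken instruction or prose
    target: Optional[str] = None   # selected text the instruction applies to

def interpret(utterance: str, selection: Optional[str]) -> DictationAction:
    """Treat speech as an instruction over the selection unless told otherwise."""
    spoken = utterance.strip()
    explicit_insert = spoken.lower().startswith(INSERT_PREFIXES)
    if selection and not explicit_insert:
        return DictationAction("edit_selection", spoken, selection)
    return DictationAction("insert_text", spoken)

action = interpret("replace this with a guard clause", "const p = load();")
print(action.mode)
# prints: edit_selection
```

Because the selected text travels with the action as its target, "this" is resolved by the selection and not guessed from the transcript.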

Context conflicts are handled by trusting the strongest evidence first. The active app is the most reliable signal because it is reported directly by the operating system. Selected text comes next. Visible nearby tokens are the softest signal because they may be stale or accidental. When two signals disagree, the listener follows the harder one but lowers its confidence in the result, so a disagreement never produces a fully committed rewrite.
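As a rough sketch of that ordering, the snippet below ranks signals by source and discounts confidence when they disagree. The priority numbers, the Signal type, and the 0.5 penalty are illustrative assumptions; only the ordering (active app, then selection, then nearby tokens) comes from the description above.

```python
from dataclasses import dataclass

# Higher number means harder evidence; the values themselves are arbitrary.
PRIORITY = {"active_app": 3, "selected_text": 2, "nearby_tokens": 1}

@dataclass(frozen=True)
class Signal:
    source: str        # one of the PRIORITY keys
    suggestion: str    # e.g. "code" or "prose"
    confidence: float  # 0.0 to 1.0

def resolve(signals: list[Signal], conflict_penalty: float = 0.5) -> tuple[str, float]:
    """Follow the hardest signal, but lower confidence if any signal disagrees."""
    strongest = max(signals, key=lambda s: PRIORITY[s.source])
    disagreement = any(s.suggestion != strongest.suggestion for s in signals)
    factor = conflict_penalty if disagreement else 1.0
    return strongest.suggestion, strongest.confidence * factor

suggestion, confidence = resolve([
    Signal("active_app", "code", 0.9),
    Signal("nearby_tokens", "prose", 0.6),
])
print(suggestion)
# prints: code
```

The lowered confidence is what a downstream step could use to fall back to raw speech and avoid a confident rewrite.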

Privacy: screen context stays local

Context-aware speech recognition has a privacy cost if implemented carelessly. Loqua's rule is that screen context needed by the listener stays local by default. The context summary is computed on device; it is used to shape the current utterance; it is not retained as a general screen log.

Concretely, what reaches the on-device listener is a short, ephemeral context bundle: active app identifier, language and field type, selection range, and a few hundred characters of nearby visible text. What never leaves the device by default is the broader window content, other tabs, other apps, or any persistent history of any of the above. Optional cloud features, when enabled by the user, receive the dictated audio or text under the boundaries described in our hybrid privacy note; they never receive the raw context bundle.
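The shape of that bundle, and the boundary around it, can be sketched as follows. This is a hypothetical Python model, not Loqua's real data structure: the field names, the 300-character cap standing in for "a few hundred characters", and the build_bundle and cloud_payload helpers are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

MAX_NEARBY_CHARS = 300  # assumed cap for "a few hundred characters"

@dataclass(frozen=True)
class ContextBundle:
    app_id: str
    language: Optional[str]
    field_type: str
    selection: Optional[tuple[int, int]]  # (start, end) offsets, if any
    nearby_text: str

@dataclass(frozen=True)
class DictationResult:
    text: str
    bundle: ContextBundle  # used locally to shape the text, never uploaded

def build_bundle(app_id: str, language: Optional[str], field_type: str,
                 selection: Optional[tuple[int, int]],
                 visible_text: str, cursor: int) -> ContextBundle:
    """Keep only a small window of text around the cursor."""
    half = MAX_NEARBY_CHARS // 2
    start = max(0, cursor - half)
    nearby = visible_text[start:cursor + half]
    return ContextBundle(app_id, language, field_type, selection, nearby)

def cloud_payload(result: DictationResult) -> dict:
    """Build what an optional cloud feature would receive: text only."""
    return {"text": result.text}

bundle = build_bundle("com.example.editor", "typescript", "code_editor",
                      None, "x" * 1000, cursor=500)
print(len(bundle.nearby_text))
# prints: 300
```

Keeping the bundle out of cloud_payload by construction, as opposed to filtering it later, makes the boundary easy to audit in one place.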

This boundary matters because a listener that sees what you see may observe code, messages, or drafts. We treat that as sensitive data. The privacy architecture is covered in more detail in our hybrid privacy note, but the short version is clear: the screen context path is local-first, and optional cloud features do not receive raw surrounding screen content.

Open research context

The research background includes audio-language modeling, visual-language projection, and multimodal instruction tuning. Useful starting points include Whisper for robust ASR, LLaVA for visual instruction tuning patterns, and ImageBind for alignment across modalities.

Those papers are literature context. Loqua's multimodal voice recognition stack is original work tuned for the Mac dictation surface: local context, low-latency streaming, and app-aware output. We borrow the field's vocabulary, not a dependency chain.

Roadmap

The next step is better uncertainty reporting. If context suggests two possible identifiers, the system should preserve ambiguity instead of inventing confidence. We also want finer app adapters for terminals, spreadsheets, IDE chat panels, and design tools, where the shape of useful output differs dramatically.
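A minimal sketch of what "preserve ambiguity" could look like: when the top two identifier candidates score too close together, return both and leave the choice open. The pick_identifier function, the result shape, and the 0.15 margin are assumptions for illustration; this describes roadmap work, not shipped behavior.

```python
def pick_identifier(scores: dict[str, float], margin: float = 0.15) -> dict:
    """Return one identifier only when it clearly beats the runner-up."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return {"status": "none", "options": []}
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        # Too close to call: surface both candidates without guessing.
        return {"status": "ambiguous", "options": [ranked[0][0], ranked[1][0]]}
    return {"status": "resolved", "options": [ranked[0][0]]}

print(pick_identifier({"fetchProfile": 0.51, "fetchProfiles": 0.49}))
# prints: {'status': 'ambiguous', 'options': ['fetchProfile', 'fetchProfiles']}
```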

The terminal adapter is the most concrete near-term work. A terminal is structurally a single line at the cursor, but contextually it is a long history of previous commands and outputs that should inform what the user is about to type. A spreadsheet adapter is the opposite shape: a tiny visible context window with rigid column meaning. Both adapters reuse the same listener architecture; the difference is in what counts as evidence and where the text renderer draws its formatting cues.
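The "same architecture, different evidence" point can be sketched with a shared adapter interface. Everything here is hypothetical: the AppAdapter protocol, the two adapter classes, and the state dictionary keys are assumptions meant to show the contrast between a long command history and a narrow column context.

```python
from typing import Protocol

class AppAdapter(Protocol):
    def evidence(self, state: dict) -> list[str]:
        """Return the text snippets that count as evidence in this app."""
        ...

class TerminalAdapter:
    """Evidence is the tail of a long history of commands and outputs."""
    def __init__(self, history_lines: int = 20) -> None:
        self.history_lines = history_lines

    def evidence(self, state: dict) -> list[str]:
        return state.get("history", [])[-self.history_lines:]

class SpreadsheetAdapter:
    """Evidence is tiny but rigid: the meaning of the current column."""
    def evidence(self, state: dict) -> list[str]:
        header = state.get("column_header")
        return [f"column:{header}"] if header else []

def collect_evidence(adapter: AppAdapter, state: dict) -> list[str]:
    # The listener calls every adapter the same way.
    return adapter.evidence(state)

history = {"history": ["cd app", "git status", "git add ."]}
print(collect_evidence(TerminalAdapter(history_lines=2), history))
# prints: ['git status', 'git add .']
```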

The long-term direction is not "the model sees everything." It is narrower and safer: the listener sees enough local context to write what you meant, where you meant it, with less cleanup. That is the product promise of multimodal voice recognition.

Frequently asked questions

What is multimodal voice recognition?
Multimodal voice recognition combines audio with another signal, such as screen context or app metadata, to infer the intended written output. In Loqua, it means the system does not only transcribe speech; it also considers where the cursor is and what text is visible nearby.
Why does audio-only ASR fail on code?
Code contains identifiers, package names, casing, punctuation, and syntax that may not be obvious from sound alone. A model can hear "fetch profile" correctly and still miss that the visible identifier is fetchProfile. Screen context gives the recognizer evidence that audio lacks.
Does Loqua record my screen?
No, not in the sense of continuous recording. Loqua reads the local context needed for the current dictation event, such as the active app, selected text, and nearby visible text. It is not designed as a continuous screen recorder, and the context path stays local by default.
How is this different from a personal dictionary?
A personal dictionary maps known phrases to preferred spellings. Multimodal context can resolve phrases the user never pre-registered by looking at visible evidence. If an identifier appears next to the cursor, Loqua can preserve it without requiring a manual dictionary entry.
Can screen context make mistakes?
Yes. If the visible context is stale, ambiguous, or irrelevant, the listener can overfit to it. The product challenge is calibration: use context when it is strong, preserve raw speech when uncertain, and avoid making a confident rewrite from weak evidence.
Is multimodal voice recognition only for developers?
No. Developers feel the pain first because code is dense with identifiers. The same idea helps in email, notes, spreadsheets, project tools, and chat. The destination app changes what the spoken phrase should become, even when the words are ordinary.
What exactly is in the context bundle the listener receives?
An ephemeral payload: active app identifier, field type and language mode, current selection range, and a small window of nearby visible text, usually a few hundred characters. It is built per utterance, used during dictation, and not persisted as a general screen log.
