Voice meets vision: how omni-modal models unlock multimodal voice typing

From audio-only ASR to audio + vision + text — the paradigm shift that turned voice typing from "transcribe what I said" into "write what I meant, where I meant it."

TL;DR

Multimodal voice typing means the system uses speech plus local context to decide what the words should become. Loqua is a context-aware voice typing tool for Mac: it listens to your voice, reads local destination context, and writes app-aware text. This intro explains why screen-aware voice typing matters without diving into the full architecture.

The important shift is from transcript to destination-aware writing: the same spoken phrase should become different text in Slack, Cursor, GitHub, and Apple Notes.

This is the introductory version of our voice + vision AI thinking. Open research across audio, language, and multimodal systems gives the field useful vocabulary, but Loqua's production stack is original work trained and optimized in-house for Mac dictation.

The shift from transcription to context

Audio-only ASR answers one question: what words did the user say? Dictation asks a second question: what should those words become at the cursor? That second question is why multimodal voice typing exists. A transcript can be accurate and still be wrong for the destination.

When you dictate into a code editor, punctuation, identifiers, comments, and selected text matter. When you dictate into email, tone and paragraph shape matter. When you dictate into a task app, owner and due date matter. Screen-aware voice typing turns those visible cues into constraints for writing.

Why screen context changes dictation

The same phrase can mean different things depending on the app. "Add a guard before fetch profile" should become code-adjacent text in an IDE, a task in Linear, and a plain request in Slack. Audio alone cannot reliably choose between those forms.
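
As a rough illustration of that choice, the sketch below maps the active app to an output shape. The `OutputShape` names, the `APP_SHAPES` table, and `choose_output_shape` are hypothetical and are not Loqua's implementation; a real system would weigh more evidence than the app name alone.

```python
from enum import Enum


class OutputShape(Enum):
    CODE = "code-adjacent text"
    TASK = "task"
    MESSAGE = "plain request"
    PROSE = "prose"


# Hypothetical lookup table, for illustration only.
APP_SHAPES = {
    "VS Code": OutputShape.CODE,
    "Cursor": OutputShape.CODE,
    "Linear": OutputShape.TASK,
    "Slack": OutputShape.MESSAGE,
}


def choose_output_shape(active_app: str) -> OutputShape:
    """Pick an output shape for the active app, defaulting to prose."""
    return APP_SHAPES.get(active_app, OutputShape.PROSE)
```

With this table, the same utterance would be routed to a task formatter in Linear and to a message formatter in Slack, which is the decision audio alone cannot make.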

Loqua's context layer reads local signals such as active app, selected text, visible adjacent text, and destination field type. It does not need a full screenshot narrative. It needs enough local evidence to preserve identifiers, decide whether you are inserting or editing, and choose the right output shape.
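
To make the idea concrete, here is a minimal sketch of how such signals might be used, assuming a simple word-matching heuristic. The `LocalContext` type and the helper functions are hypothetical names invented for this example; Loqua's actual approach is model-based and is not shown here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalContext:
    active_app: str
    selected_text: str
    adjacent_text: str
    field_type: str


def is_edit(context: LocalContext) -> bool:
    """Treat a non-empty selection as a request to edit rather than insert."""
    return bool(context.selected_text.strip())


def split_identifier(identifier: str) -> list[str]:
    """Split camelCase or snake_case into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


def preserve_identifiers(transcript: str, context: LocalContext) -> str:
    """Replace spoken word runs with matching identifiers visible near the cursor."""
    result = transcript
    candidates = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", context.adjacent_text)
    for identifier in candidates:
        words = split_identifier(identifier)
        if len(words) < 2:
            continue  # Single words are too ambiguous to rewrite.
        pattern = r"\b" + r"\s+".join(re.escape(word) for word in words) + r"\b"
        result = re.sub(
            pattern, lambda _match: identifier, result, flags=re.IGNORECASE
        )
    return result
```

For example, if the adjacent text contains `fetchProfile(userId)`, the spoken phrase "add a guard before fetch profile" would come back with `fetchProfile` preserved as written on screen.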

What changes at the cursor

You say:
"add a check that the user is logged in before we fetch the profile if not just redirect to sign in"

Loqua writes (in VS Code):
if (!user.isLoggedIn) {
  return redirect('/signin');
}

You say (same words):
"add a check that the user is logged in before we fetch the profile if not just redirect to sign in"

Loqua writes (in Linear):
Add auth guard before profile fetch. If user is not logged in, redirect to sign-in instead of fetching profile.

The output changes because the destination changes. That is the practical value of omni-modal dictation as a product category: context informs writing decisions that a transcript alone cannot.

The privacy boundary

Screen context is powerful enough that it needs a clear boundary. Loqua's context path is local-first by default. The active app, selected text, and nearby visible content are used to shape the current utterance, not to create a general screen log.

For the full boundary, see our article on privacy by design with a hybrid architecture. The short version: audio and screen context are treated as sensitive local signals, and optional cloud features do not receive raw surrounding screen content.
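
One way to picture that boundary is an allowlist applied before anything leaves the device. The sketch below is illustrative only: `UtteranceContext`, `CLOUD_ALLOWED_FIELDS`, and `cloud_safe_view` are hypothetical names, and which fields (if any) Loqua's optional cloud features receive is an assumption here, not a description of its actual payloads.

```python
from dataclasses import asdict, dataclass

# Hypothetical allowlist: only coarse, non-content signals may leave the device.
CLOUD_ALLOWED_FIELDS = frozenset({"field_type"})


@dataclass(frozen=True)
class UtteranceContext:
    active_app: str
    selected_text: str  # Raw screen content, stays local.
    adjacent_text: str  # Raw screen content, stays local.
    field_type: str


def cloud_safe_view(context: UtteranceContext) -> dict:
    """Return only allowlisted fields, dropping raw screen content."""
    return {
        name: value
        for name, value in asdict(context).items()
        if name in CLOUD_ALLOWED_FIELDS
    }
```

An allowlist is used rather than a denylist so that any field added to the context later stays local unless it is explicitly approved.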

Want to go deeper?

Further reading

For literature context, start with Whisper for robust speech recognition, LLaVA for visual instruction tuning, and ImageBind for cross-modal alignment. Those references explain the field; they are not a provenance claim about Loqua.

Frequently asked questions

What counts as screen context for Loqua?
Screen context means local signals around the current dictation target: active app, selected text, nearby visible text, file type, cursor position, and field shape. Loqua uses these cues to decide whether your spoken phrase should become prose, a task, a prompt, or code-adjacent text.
Does Loqua send screenshots anywhere?
The context path is local-first by default. Loqua uses screen-derived signals to shape the current utterance and does not need to send raw surrounding screen content to optional cloud features. See the privacy article for the full boundary.
How does context impact latency?
Context is gathered in parallel with speech recognition, so the destination evidence is usually ready by the time the final text needs to be rendered. The architecture is designed around 200 ms-class interaction rather than a slow post-processing call.
Why does voice plus vision matter for code?
Code is full of identifiers, casing, syntax, and selected regions that are not recoverable from sound alone. If the model can see a visible identifier near the cursor, it can preserve that name instead of writing a generic transcript.
Is this an agent that acts on my screen?
No. This article is about dictation, not autonomous screen control. Loqua uses local context to write better text at the cursor. It does not browse around your apps or take actions unless you explicitly use another tool for that purpose.
Where should I read the deeper architecture?
Start with "Inside our omni-modal voice stack" for the multimodal instruction pipeline, then read "Building a listener that sees what you see" for disambiguation, and "Sounds with meaning" for the prototype-stage non-word audio direction.
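
The latency answer above describes context being gathered in parallel with speech recognition. As a rough sketch of that pattern, assuming an `asyncio`-style pipeline, the two steps can run concurrently so the slower one sets the total wait. The function names, sleep durations, and returned values are stand-ins invented for this example, not Loqua's code.

```python
import asyncio


async def recognize_speech(audio: bytes) -> str:
    """Stand-in for a speech recognizer; sleeps to simulate decoding time."""
    await asyncio.sleep(0.05)
    return "add a guard before fetch profile"


async def gather_context() -> dict:
    """Stand-in for local context collection; sleeps to simulate lookups."""
    await asyncio.sleep(0.02)
    return {"active_app": "Linear", "field_type": "task title"}


async def dictate(audio: bytes) -> tuple[str, dict]:
    """Run recognition and context gathering concurrently, then return both."""
    transcript, context = await asyncio.gather(
        recognize_speech(audio), gather_context()
    )
    return transcript, context


if __name__ == "__main__":
    print(asyncio.run(dictate(b"")))
```

Because both coroutines start together, the context lookup finishes while recognition is still running, so it adds no time of its own to the wait.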

