Voice meets vision: how omni-modal models unlock multimodal voice typing
From audio-only ASR to audio + vision + text — the paradigm shift that turned voice typing from "transcribe what I said" into "write what I meant, where I meant it."
TL;DR
Multimodal voice typing means the system uses speech plus local context to decide what the words should become. Loqua is a context-aware voice typing tool for Mac: it listens to your voice, reads local destination context, and writes app-aware text. This intro explains why screen-aware voice typing matters without diving into the full architecture.
The important shift is from transcript to destination-aware writing: the same spoken phrase should become different text in Slack, GitHub, Apple Notes, and a code editor such as Cursor.
This is the introductory version of our voice + vision AI thinking. Open research across audio, language, and multimodal systems gives the field useful vocabulary, but Loqua's production stack is original work trained and optimized in-house for Mac dictation.
The shift from transcription to context
Audio-only ASR answers one question: what words did the user say? Dictation asks a second question: what should those words become at the cursor? That second question is why multimodal voice typing exists. A transcript can be accurate and still be wrong for the destination.
When you dictate into a code editor, punctuation, identifiers, comments, and selected text matter. When you dictate into email, tone and paragraph shape matter. When you dictate into a task app, owner and due date matter. Screen-aware voice typing turns those visible cues into constraints for writing.
Why screen context changes dictation
The same phrase can mean different things depending on the app. "Add a guard before fetch profile" should become code-adjacent text in an IDE, a task in Linear, and a plain request in Slack. Audio alone cannot reliably choose between those forms.
Loqua's context layer reads local signals such as active app, selected text, visible adjacent text, and destination field type. It does not need a full screenshot narrative. It needs enough local evidence to preserve identifiers, decide whether you are inserting or editing, and choose the right output shape.
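As a rough illustration of the signals listed above, the sketch below models a context snapshot, guesses whether the user is inserting or editing, picks an output shape, and restores an identifier that is visible near the cursor. Every name here (`DestinationContext`, `chooseEditMode`, `chooseOutputShape`, `preserveIdentifiers`, and the field and shape categories) is hypothetical and is not Loqua's actual API. A production system would presumably rely on learned models rather than lookup tables and string matching.

```typescript
// Hypothetical sketch: none of these names come from Loqua's codebase.

type FieldType = "code" | "chat" | "task" | "note";
type OutputShape = "code-adjacent" | "task" | "plain-message" | "note";
type EditMode = "insert" | "edit";

interface DestinationContext {
  activeApp: string;    // e.g. "Cursor", "Linear", "Slack"
  selectedText: string; // empty string when nothing is selected
  adjacentText: string; // visible text near the cursor
  fieldType: FieldType; // kind of destination field
}

// Assumed mapping from destination field type to output shape.
const SHAPE_BY_FIELD: Record<FieldType, OutputShape> = {
  code: "code-adjacent",
  chat: "plain-message",
  task: "task",
  note: "note",
};

// A non-empty selection is taken as a hint that the user is editing.
function chooseEditMode(ctx: DestinationContext): EditMode {
  return ctx.selectedText.trim().length > 0 ? "edit" : "insert";
}

function chooseOutputShape(ctx: DestinationContext): OutputShape {
  return SHAPE_BY_FIELD[ctx.fieldType];
}

// Turn an identifier into the words a speaker would say,
// e.g. "fetchProfile" -> "fetch profile".
function spokenForm(identifier: string): string {
  return identifier
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .replace(/_/g, " ")
    .toLowerCase();
}

// Replace a spoken phrase with the matching identifier found near the cursor.
// Only the first occurrence of each identifier's spoken form is replaced.
function preserveIdentifiers(transcript: string, adjacentText: string): string {
  const candidates: string[] = adjacentText.match(/[A-Za-z_][A-Za-z0-9_]*/g) || [];
  let result = transcript;
  for (const id of candidates) {
    const spoken = spokenForm(id);
    // Single-word identifiers are too ambiguous to substitute safely.
    if (spoken.indexOf(" ") === -1) continue;
    const index = result.toLowerCase().indexOf(spoken);
    if (index !== -1) {
      result = result.slice(0, index) + id + result.slice(index + spoken.length);
    }
  }
  return result;
}
```

Under these assumptions, calling `preserveIdentifiers("Add a guard before fetch profile", "const user = await fetchProfile(id);")` yields "Add a guard before fetchProfile", while a context with `fieldType: "task"` maps to the `"task"` shape. The substring match is deliberately naive: it ignores word boundaries, so it is a demonstration of the idea rather than a robust matcher.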
What changes at the cursor
    if (!user.isLoggedIn) {
      return redirect('/signin');
    }

The output changes because the destination changes. That is the practical value of omni model dictation as a product category: context makes writing decisions that a transcript cannot.
The privacy boundary
Screen context is powerful enough that it needs a clear boundary. Loqua's context path is local-first by default. The active app, selected text, and nearby visible content are used to shape the current utterance, not to create a general screen log.
For the full boundary, see "Privacy by design with a hybrid architecture." The short version: audio and screen context are treated as sensitive local signals, and optional cloud features do not receive raw surrounding screen content.
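One way to picture that boundary is an allow-list: the cloud payload is built by copying only approved fields, so raw screen content cannot leak by accident. The sketch below is an assumption-laden illustration, not Loqua's implementation. The types `LocalSignals` and `CloudRequest`, the function `toCloudRequest`, and the choice of which fields a cloud feature would receive are all hypothetical.

```typescript
// Hypothetical sketch of a local/cloud boundary; not Loqua's actual types.

interface LocalSignals {
  activeApp: string;
  selectedText: string; // raw screen content: stays on device
  adjacentText: string; // raw screen content: stays on device
  draftText: string;    // text already produced locally for this utterance
}

// Assumed shape of what an optional cloud feature might receive.
interface CloudRequest {
  draftText: string;
  destinationKind: string; // coarse label such as "chat" or "code"
}

// Build the payload from an explicit allow-list instead of deleting
// sensitive fields from a copy, so new local fields are excluded by default.
function toCloudRequest(signals: LocalSignals, destinationKind: string): CloudRequest {
  return {
    draftText: signals.draftText,
    destinationKind: destinationKind,
  };
}
```

Constructing the request field by field, rather than stripping fields from the full signal object, means that any signal added later stays local unless someone deliberately adds it to `CloudRequest`.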
Want to go deeper?
- Inside our omni-modal voice stack — the multimodal instruction pipeline, MoE, and streaming.
- Building a listener that sees what you see — how multimodal context resolves ASR ambiguity.
- Sounds with meaning — AED, audio captioning, and the next frontier.
Further reading
For literature context, start with Whisper for robust speech recognition, LLaVA for visual instruction tuning, and ImageBind for cross-modal alignment. Those references explain the field; they are not a provenance claim about Loqua.
Try Loqua today
Free to start. Mac native. Built by algorithm researchers who use it every day.