Audio event detection dictation: sounds with meaning beyond words
A prototype-stage note on how non-word audio could enrich dictation without breaking privacy or flow.
TL;DR
Audio event detection dictation is still prototype-stage at Loqua. Loqua is a Mac-native voice typing tool, and our shipped focus is words, context, and app-aware output. We are researching whether non-word audio such as laughter, pauses, doorbells, or sighs can become optional structured context without making dictation noisy or invasive.
This post is deliberately more tentative than our other engineering notes. "Sounds with meaning" is not a shipped feature. It is an early research direction: can voice typing that understands sound capture useful non-word signals while preserving the calm flow of dictation?
The non-word audio gap
Voice typing systems usually throw away everything that is not a word. That is sensible for clean transcription, but it loses information. In a meeting, laughter can mark agreement or tension. In a journal, a long pause may matter. In accessibility workflows, a doorbell, timer, or baby crying can be useful context.
Think about what a typical dictation transcript looks like after a one-hour meeting. The words are there, but the rhythm is flattened: the long pause before someone disagrees, the chuckle that softened a hard piece of feedback, the moment of silence after a difficult question. A human reviewing the transcript fills those in from memory. A teammate who could not attend has no signal at all. Audio event detection dictation is one way to put a small amount of that texture back into the written record, without asking the user to narrate it.
The risk is obvious: not every sound should become text. Most background audio is irrelevant. Some of it is private. Some of it is ambiguous. Audio event detection dictation only makes sense if it is optional, local-first, and conservative about when a sound changes the written output.
AED vs audio captioning
Audio event detection (AED) answers a compact question: what event happened, and roughly when? Audio captioning writes a natural-language description of a sound scene. For dictation, AED is often enough. A tag such as "laughter" or "doorbell" can be a marker; a full caption may be too verbose.
| Technique | Output | Dictation fit |
|---|---|---|
| AED | Event label + timestamp | Meeting markers, reminders, accessibility cues |
| Audio captioning | Sentence describing scene | Journaling, media notes, review workflows |
| Emotion/prosody cues | Tentative affect signal | Only useful with strong user control |
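To make the AED row concrete, here is a minimal sketch of what "event label + timestamp" could look like as a record. The `AudioEvent` type, its field names, and the `as_marker` helper are hypothetical and chosen for illustration; they do not describe Loqua's prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioEvent:
    """One detected non-word sound (hypothetical shape, for illustration only)."""
    label: str         # e.g. "laughter", "doorbell"
    start_s: float     # approximate onset, in seconds from the start of the recording
    end_s: float       # approximate offset, in seconds
    confidence: float  # detector score in [0, 1]

    def as_marker(self) -> str:
        # A compact bracketed tag, the form used in the examples in this post.
        return f"[{self.label}]"


event = AudioEvent(label="laughter", start_s=12.4, end_s=13.9, confidence=0.91)
print(event.as_marker())  # [laughter]
```

An audio caption, by contrast, would be a free-form sentence with no fixed fields, which is part of why it is harder to review or filter.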
Why we lean toward AED first
An AED tag fails quietly. If the model labels something as "applause" and it was not, the user sees a single bracketed marker that is easy to delete. A wrong audio caption is harder to undo: it shapes the surrounding paragraph, biases a reader, and lingers in summaries. For a dictation product where trust is built one sentence at a time, the cost of a small wrong tag is much lower than the cost of a confidently wrong sentence. Our early bias is toward small structured markers, not automatic prose. A marker is easier to review, delete, or ignore.
What this could mean for dictation
In meetings, non-word audio could create optional markers: "[laughter]" after a joke, "[long pause]" before a decision, or "[doorbell]" when the speaker is interrupted. In journaling, it could help preserve emotional texture without forcing the user to narrate it. In accessibility workflows, it could turn environmental sound into a short note or reminder.
A concrete sketch. Imagine a meeting note where the user has opted into meeting markers. The transcript would read like ordinary prose with rare, compact tags: "We agreed to ship the migration this week. [laughter] Then we walked through the rollback plan. [long pause] Someone asked whether we should defer the index changes." The reader gets a richer sense of what happened without a paragraph of stage direction.
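One way such a transcript could be assembled is by merging timed text segments with timed markers. The sketch below assumes the recognizer yields segments with start and end times and the detector yields a label with a single timestamp; `Segment`, `Marker`, and `interleave` are illustrative names, not part of any shipped API, and the timestamps are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    text: str
    start_s: float
    end_s: float


@dataclass(frozen=True)
class Marker:
    label: str
    at_s: float


def interleave(segments: list[Segment], markers: list[Marker]) -> str:
    """Place each marker after the last segment that ended at or before it."""
    # Segments sort by end time; on a tie, the segment (0) precedes the marker (1).
    items = [(seg.end_s, 0, seg.text) for seg in segments]
    items += [(m.at_s, 1, f"[{m.label}]") for m in markers]
    items.sort(key=lambda item: (item[0], item[1]))
    return " ".join(text for _, _, text in items)


segments = [
    Segment("We agreed to ship the migration this week.", 0.0, 3.2),
    Segment("Then we walked through the rollback plan.", 5.0, 8.1),
    Segment("Someone asked whether we should defer the index changes.", 11.5, 15.0),
]
markers = [Marker("laughter", 3.6), Marker("long pause", 8.1)]
print(interleave(segments, markers))
```

A marker that fires in the middle of a segment lands before that segment in this sketch. Whether it should go before, after, or be dropped is an open design question.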
A journaling sketch is even narrower. The user dictates a quick end-of-day note; an audible long pause might surface as a "[reflection]" tag the user can keep, edit, or delete on review. Nothing is committed automatically into the body of the journal entry without a chance to look.
We are not trying to make dictation theatrical. The goal is not to write every cough or keyboard click. The goal is to detect a narrow set of high-signal events and let the user decide whether those events become text, tags, or nothing.
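A narrow, conservative filter could be as simple as an allowlist plus a confidence floor. The sketch below is illustrative only: the `Detection` type, the label set, and the 0.8 threshold are assumptions, and real values would need tuning against real audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    confidence: float


# Hypothetical values for illustration; real thresholds would need tuning.
ALLOWED_LABELS = {"laughter", "long pause", "doorbell"}
MIN_CONFIDENCE = 0.8


def keep_high_signal(detections, allowed=ALLOWED_LABELS, min_confidence=MIN_CONFIDENCE):
    """Drop anything outside the allowlist or below the confidence floor."""
    return [d for d in detections if d.label in allowed and d.confidence >= min_confidence]


raw = [
    Detection("laughter", 0.93),
    Detection("cough", 0.97),           # confident, but not an allowed class
    Detection("keyboard click", 0.99),  # confident, but not an allowed class
    Detection("doorbell", 0.55),        # allowed, but too uncertain
]
print([d.label for d in keep_high_signal(raw)])  # ['laughter']
```

Everything that survives the filter is still only a candidate; the user decides whether it becomes text, a tag, or nothing.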
Research foundations
Several public research lines are relevant. CLAP explores contrastive language-audio pretraining. BEATs studies audio pretraining for acoustic understanding. AudioSet is a large-scale dataset for audio events, and AudioCaps is a reference point for audio captioning.
These are research foundations, not a product dependency statement. Loqua's prototype work is focused on the Mac dictation question: which sound cues are useful at the cursor, which should stay invisible, and how can the user control the boundary?
What we're prototyping
We're prototyping three narrow behaviors. First, meeting markers: optional tags for laughter, silence, applause, and interruptions. Second, journaling cues: user-approved tags for long pauses or audible exasperation. Third, accessibility alerts: a local sound cue that can become a reminder or note when the user asks for it.
The user experience we are sketching internally is deliberately quiet. Detected events appear as chips in a small review surface next to the dictated text, not in the text itself. The user can drag a chip into the document, dismiss it, or convert it into a destination-specific tag. The default behavior is "never insert without consent," and the feature itself stays off until the user opts in for a given workflow.
The prototype is local-first and opt-in. Nothing in this direction should silently annotate private background sound. We are also testing a "marker only" mode where detected sounds never enter prose automatically; they appear as reviewable chips before insertion.
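The review flow described above can be sketched as a small state model. This is a sketch under assumptions, not the prototype's code: `ReviewSurface`, `Chip`, and `ChipState` are hypothetical names, and the drag gesture is reduced to an `accept` call.

```python
from dataclasses import dataclass, field
from enum import Enum


class ChipState(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    DISMISSED = "dismissed"


@dataclass
class Chip:
    label: str
    state: ChipState = ChipState.PENDING


@dataclass
class ReviewSurface:
    enabled: bool = False  # off until the user opts in for this workflow
    chips: list[Chip] = field(default_factory=list)

    def offer(self, label: str) -> None:
        # Detections are ignored entirely while the feature is off.
        if self.enabled:
            self.chips.append(Chip(label))

    def accept(self, index: int) -> str:
        # Only an explicit user action produces text for the document.
        chip = self.chips[index]
        chip.state = ChipState.ACCEPTED
        return f"[{chip.label}]"

    def dismiss(self, index: int) -> None:
        self.chips[index].state = ChipState.DISMISSED


surface = ReviewSurface()
surface.offer("laughter")   # ignored: the user has not opted in yet
surface.enabled = True      # explicit opt-in
surface.offer("laughter")
surface.offer("doorbell")
surface.dismiss(1)
print(surface.accept(0))    # [laughter]
```

The point of the shape is that no code path writes into the document on its own: `offer` only queues a chip, and text is produced only when `accept` is called.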
Hard problems we have not solved
The hardest problem is meaning. Laughter can mean agreement, discomfort, sarcasm, or nothing. A sigh can mean fatigue, relief, or microphone noise. We do not want a model inventing emotional interpretation from weak evidence. The second hard problem is privacy: environmental sound can reveal more than users expect.
The third hard problem is shared spaces. Even with strict opt-in, a microphone in a meeting room hears people who never opted into anything. A non-word audio feature that captures laughter in that room is still capturing information about people who are not the user. We do not think this is unsolvable, but it shapes the constraint set heavily: the detector should run locally on the user's device, the markers should never be shared without explicit action, and the default for ambient classes should lean toward silence over inference.
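"Never shared without explicit action" can be expressed as an export step that strips markers by default. The sketch below assumes markers are short lowercase labels in square brackets, as in the examples in this post; `export_text` and its `include_markers` flag are hypothetical. A real implementation would more likely track markers as structured data than pattern-match them, since a regex would also strip bracketed text the user dictated on purpose.

```python
import re

# Matches a bracketed tag such as "[laughter]" plus one following space, if any.
# Assumes markers are short lowercase labels in square brackets.
MARKER = re.compile(r"\[[a-z][a-z ]*\] ?")


def export_text(text: str, include_markers: bool = False) -> str:
    """Return text for sharing; markers are stripped unless explicitly included."""
    if include_markers:
        return text
    return MARKER.sub("", text).strip()


note = "We agreed to ship this week. [laughter] Then we reviewed the rollback plan."
print(export_text(note))
print(export_text(note, include_markers=True))
```

Making the stripped form the default means a shared note carries information about other people in the room only when the user has chosen to include it.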
So the current standard is conservative. Any non-word audio in dictation, whether event markers or full captions, should require clear user control, visible markers, and easy deletion. The bar to move audio event detection dictation from prototype to shipped is concrete: an opt-in flow that a careful user would describe as honest, default-off behavior in any environment we have not explicitly tested, and a UX that lets the user dismiss a wrong tag with a single keystroke. Until those pieces feel right, this stays research-frontier work, not a core shipped promise.