Audio event detection dictation: sounds with meaning beyond words

A prototype-stage note on how non-word audio could enrich dictation without breaking privacy or flow.

Shuran Zhou, Founder · 2026-05-12 · 6 min · Updated 2026-05-12

TL;DR

Audio event detection dictation is still prototype-stage at Loqua. Loqua is a Mac-native voice typing tool, and our shipped focus is words, context, and app-aware output. We are researching whether non-word audio such as laughter, pauses, doorbells, or sighs can become optional structured context without making dictation noisy or invasive.

This post is deliberately more tentative than our other engineering notes. "Sounds with meaning" is not a shipped feature. It is an early research direction: can sound-aware voice typing capture useful non-word signals while preserving the calm flow of dictation?

The non-word audio gap

Voice typing systems usually throw away everything that is not a word. That is sensible for clean transcription, but it loses information. In a meeting, laughter can mark agreement or tension. In a journal, a long pause may matter. In accessibility workflows, a doorbell, timer, or baby crying can be useful context.

Think about what a typical dictation transcript looks like after a one-hour meeting. The words are there, but the rhythm is flattened: the long pause before someone disagrees, the chuckle that softened a hard piece of feedback, the moment of silence after a difficult question. A human reviewing the transcript fills those in from memory. A teammate who could not attend has no signal at all. Audio event detection dictation is one way to put a small amount of that texture back into the written record, without asking the user to narrate it.

The risk is obvious: not every sound should become text. Most background audio is irrelevant. Some of it is private. Some of it is ambiguous. Audio event detection dictation only makes sense if it is optional, local-first, and conservative about when a sound changes the written output.

AED vs audio captioning

Audio event detection (AED) answers a compact question: what event happened, and roughly when? Audio captioning writes a natural-language description of a sound scene. For dictation, AED is often enough. A tag such as "laughter" or "doorbell" can be a marker; a full caption may be too verbose.

Technique	Output	Dictation fit
AED	Event label + timestamp	Meeting markers, reminders, accessibility cues
Audio captioning	Sentence describing scene	Journaling, media notes, review workflows
Emotion/prosody cues	Tentative affect signal	Only useful with strong user control
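
To make the difference in output shape concrete, here is a minimal Python sketch. The type names (AudioEvent, AudioCaption), their fields, and the as_marker helper are hypothetical illustrations, not Loqua's data model or any library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioEvent:
    """Hypothetical AED result: a compact label plus rough timing."""
    label: str
    start_s: float
    end_s: float
    confidence: float


@dataclass(frozen=True)
class AudioCaption:
    """Hypothetical captioning result: a free-form sentence about the scene."""
    text: str
    start_s: float
    end_s: float


def as_marker(event: AudioEvent) -> str:
    """Render an event as a short bracketed tag that is easy to delete."""
    return f"[{event.label}]"


# Illustrative values only; these are not measurements.
event = AudioEvent(label="laughter", start_s=12.4, end_s=13.9, confidence=0.91)
caption = AudioCaption(
    text="Several people laugh while a chair scrapes.",
    start_s=12.0,
    end_s=15.0,
)

print(as_marker(event))
print(caption.text)
```

The event carries a fixed vocabulary label that maps to a single token in the text, while the caption is open-ended prose that a reader has to interpret.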

Why we lean toward AED first

An AED tag fails quietly. If the model labels something as "applause" and it was not, the user sees a single bracketed marker that is easy to delete. A wrong audio caption is harder to undo: it shapes the surrounding paragraph, biases a reader, and lingers in summaries. For a dictation product where trust is built one sentence at a time, the cost of a small wrong tag is much lower than the cost of a confidently wrong sentence. Our early bias is toward small structured markers, not automatic prose. A marker is easier to review, delete, or ignore.

What this could mean for dictation

In meetings, non-word audio could create optional markers: "[laughter]" after a joke, "[long pause]" before a decision, or "[doorbell]" when the speaker is interrupted. In journaling, it could help preserve emotional texture without forcing the user to narrate it. In accessibility workflows, it could turn environmental sound into a short note or reminder.

A concrete sketch. Imagine a meeting note where the user has opted into meeting markers. The transcript would read like ordinary prose with rare, compact tags: "We agreed to ship the migration this week. [laughter] Then we walked through the rollback plan. [long pause] Someone asked whether we should defer the index changes." The reader gets a richer sense of what happened without a paragraph of stage direction.
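
The meeting example above can be sketched as a simple merge of timestamped sentences and user-accepted markers. This is an illustration under assumptions: Segment, Marker, render, and all timestamps are hypothetical, and the placement rule (a marker goes before the first sentence that starts after it) is one simple choice among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A dictated sentence and the time it started (hypothetical shape)."""
    text: str
    start_s: float


@dataclass(frozen=True)
class Marker:
    """A marker the user has already accepted (hypothetical shape)."""
    label: str
    time_s: float


def render(segments: list[Segment], markers: list[Marker]) -> str:
    """Interleave accepted markers with sentences in time order.

    A marker is placed after the current sentence if it occurred before
    the next sentence started. With no segments, nothing is rendered.
    """
    pending = sorted(markers, key=lambda m: m.time_s)
    parts: list[str] = []
    i = 0
    for idx, seg in enumerate(segments):
        parts.append(seg.text)
        has_next = idx + 1 < len(segments)
        next_start = segments[idx + 1].start_s if has_next else float("inf")
        while i < len(pending) and pending[i].time_s < next_start:
            parts.append(f"[{pending[i].label}]")
            i += 1
    return " ".join(parts)


segments = [
    Segment("We agreed to ship the migration this week.", 0.0),
    Segment("Then we walked through the rollback plan.", 6.0),
    Segment("Someone asked whether we should defer the index changes.", 15.0),
]
markers = [Marker("long pause", 12.5), Marker("laughter", 4.8)]

print(render(segments, markers))
```

Only markers the user has accepted are passed in, so the body text never changes as a side effect of detection alone.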

A journaling sketch is even narrower. The user dictates a quick end-of-day note; an audible long pause might surface as a "[reflection]" tag the user can keep, edit, or delete on review. Nothing is committed automatically into the body of the journal entry without a chance to look.
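
One simple way to find a candidate long pause is to look at gaps between word timestamps. The sketch below assumes the recognizer exposes per-word start and end times; the function name find_long_pauses and the 2-second threshold are hypothetical choices, not values from the prototype.

```python
def find_long_pauses(
    words: list[tuple[str, float, float]],
    min_gap_s: float = 2.0,
) -> list[tuple[float, float]]:
    """Return (pause_start_s, gap_s) for each silence of at least min_gap_s.

    words is a list of (text, start_s, end_s) tuples in spoken order.
    The result is a list of candidates for review, not text to insert.
    """
    pauses = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt[1] - prev[2]  # next word's start minus previous word's end
        if gap >= min_gap_s:
            pauses.append((prev[2], gap))
    return pauses


# Illustrative timestamps only.
words = [
    ("Today", 0.0, 0.5),
    ("was", 0.5, 0.75),
    ("hard", 0.75, 1.25),
    ("but", 4.75, 5.0),
    ("fine", 5.0, 5.5),
]

for start_s, gap_s in find_long_pauses(words):
    print(f"candidate pause at {start_s}s lasting {gap_s}s")
```

The function only measures silence. Whether a pause deserves a tag, and what that tag says, stays a decision for the user on review.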

We are not trying to make dictation theatrical. The goal is not to write every cough or keyboard click. The goal is to detect a narrow set of high-signal events and let the user decide whether those events become text, tags, or nothing.

Research foundations

Several public research lines are relevant. CLAP explores contrastive language-audio pretraining. BEATs studies audio pretraining for acoustic understanding. AudioSet is a large-scale dataset for audio events, and AudioCaps is a reference point for audio captioning.

These are research foundations, not a product dependency statement. Loqua's prototype work is focused on the Mac dictation question: which sound cues are useful at the cursor, which should stay invisible, and how can the user control the boundary?

What we're prototyping

We're prototyping three narrow behaviors. First, meeting markers: optional tags for laughter, silence, applause, and interruptions. Second, journaling cues: user-approved tags for long pauses or audible exasperation. Third, accessibility alerts: a local sound cue that can become a reminder or note when the user asks for it.

The user experience we are sketching internally is deliberately quiet. Detected events appear as chips in a small review surface next to the dictated text, not in the text itself. The user can drag a chip into the document, dismiss it, or convert it into a destination-specific tag. The default behavior is "never insert without consent," and the feature stays off until the user opts in for a given workflow.
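
The review flow described above can be sketched as a small state model. Everything here is a hypothetical illustration: ReviewSurface, Chip, ChipState, and the workflow names are invented for the example and do not describe Loqua's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class ChipState(Enum):
    PENDING = "pending"
    INSERTED = "inserted"
    DISMISSED = "dismissed"


@dataclass
class Chip:
    label: str
    state: ChipState = ChipState.PENDING


@dataclass
class ReviewSurface:
    """Holds detected events beside the document, never inside it."""
    enabled_workflows: set = field(default_factory=set)  # empty = off everywhere
    chips: list = field(default_factory=list)
    document: list = field(default_factory=list)

    def opt_in(self, workflow: str) -> None:
        self.enabled_workflows.add(workflow)

    def detect(self, workflow: str, label: str) -> Optional[Chip]:
        """Create a pending chip only if the user opted in for this workflow."""
        if workflow not in self.enabled_workflows:
            return None
        chip = Chip(label)
        self.chips.append(chip)
        return chip

    def accept(self, chip: Chip) -> None:
        """The user's explicit action is the only path into the document."""
        if chip.state is ChipState.PENDING:
            chip.state = ChipState.INSERTED
            self.document.append(f"[{chip.label}]")

    def dismiss(self, chip: Chip) -> None:
        if chip.state is ChipState.PENDING:
            chip.state = ChipState.DISMISSED


surface = ReviewSurface()
ignored = surface.detect("meeting", "laughter")  # feature is off by default
surface.opt_in("meeting")
chip = surface.detect("meeting", "laughter")     # chip exists, document untouched
surface.accept(chip)
print(surface.document)
```

Detection and insertion are separate methods on purpose, so there is no code path where a detected sound reaches the document without an explicit user action.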

The prototype is local-first and opt-in. Nothing in this direction should silently annotate private background sound. We are also testing a "marker only" mode where detected sounds never enter prose automatically; they appear as reviewable chips before insertion.

Hard problems we have not solved

The hardest problem is meaning. Laughter can mean agreement, discomfort, sarcasm, or nothing. A sigh can mean fatigue, relief, or microphone noise. We do not want a model inventing emotional interpretation from weak evidence. The second hard problem is privacy: environmental sound can reveal more than users expect.

The third hard problem is shared spaces. Even with strict opt-in, a microphone in a meeting room hears people who never opted into anything. A non-word audio feature that captures laughter in that room is still capturing information about people who are not the user. We do not think this is unsolvable, but it shapes the constraint set heavily: the detector should run locally on the user's device, the markers should never be shared without explicit action, and the default for ambient classes should lean toward silence over inference.
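
"Silence over inference" could be expressed as a gating rule that demands more evidence for ambient sounds than for sounds the user opted into directly. The sketch below is an assumption-heavy illustration: the AMBIENT_CLASSES set, both threshold values, and the should_surface function are hypothetical and are not tuned or tested numbers.

```python
# Hypothetical set of classes treated as ambient for this example.
AMBIENT_CLASSES = frozenset({"doorbell", "timer", "baby crying"})


def should_surface(
    label: str,
    confidence: float,
    opted_in: set,
    *,
    default_threshold: float = 0.80,   # assumed value for illustration
    ambient_threshold: float = 0.95,   # assumed stricter bar for ambient sound
) -> bool:
    """Decide whether a detected event may appear as a reviewable marker.

    Classes the user has not opted into are always dropped. Ambient
    classes need higher confidence, so uncertain cases become silence.
    """
    if label not in opted_in:
        return False
    is_ambient = label in AMBIENT_CLASSES
    threshold = ambient_threshold if is_ambient else default_threshold
    return confidence >= threshold


opted_in = {"laughter", "doorbell"}
print(should_surface("laughter", 0.85, opted_in))
print(should_surface("doorbell", 0.85, opted_in))
print(should_surface("applause", 0.99, opted_in))
```

A rule like this only controls what is shown to the user for review. It does not address the consent of other people in the room, which remains a design constraint rather than a thresholding problem.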

So the current standard is conservative. Any sound-understanding feature in dictation, whether event tags or full captions, should require clear user control, visible markers, and easy deletion. The bar to move audio event detection dictation from prototype to shipped is concrete: an opt-in flow that a careful user would describe as honest, default-off behavior in any environment we have not explicitly tested, and a UX that makes a wrong tag a single keystroke to dismiss. Until those pieces feel right, this stays research-frontier work, not a core shipped promise.

Frequently asked questions

What is audio event detection dictation?

It is a research direction where a dictation tool can detect selected non-word sounds, such as laughter or a doorbell, and optionally turn them into structured markers. In Loqua, this is prototype-stage work, not a shipped core feature.

How is AED different from audio captioning?

AED usually returns compact event labels and timestamps. Audio captioning writes a fuller sentence about the sound scene. Dictation often needs the smaller signal because users want clean writing, not a transcript of every background sound.

Would Loqua automatically write background sounds into my text?

That is not the product direction. Any sound-understanding feature should be opt-in, local-first, and reviewable. Our prototype bias is toward markers that the user can accept, ignore, or delete rather than automatic prose insertion.

Why would non-word audio help meetings?

Meetings contain useful cues that are not words: laughter after agreement, a long pause before a decision, or an interruption. A compact marker can help reconstruct context later, especially when notes are used to generate tasks or follow-up summaries.

What are the privacy risks?

Environmental audio can reveal people, places, and situations that the user did not intend to document. That is why the feature must be narrow, optional, local-first, and visibly controlled. A useful marker is not worth surprising the user.

When will sounds with meaning ship?

There is no committed ship date. The shipped Loqua focus remains words, screen context, app-aware output, and low latency. Sounds with meaning will only move forward if the prototype can be useful without adding noise or privacy ambiguity.

What about shared spaces where others have not opted in?

It is a real constraint on the design. The prototype's working rules are that the detector runs locally on the user's device, markers are never shared without explicit action, and the default for ambient sound classes leans toward silence over inference. A useful marker is not worth recording information about people who never agreed to be recorded.
