Reinforcement learning voice typing: GRPO, DPO, and on-policy distillation in our voice stack

How Loqua thinks about preference optimization after supervised speech and text training hit the long tail.

TL;DR

Reinforcement learning voice typing is how we improve the long tail after supervised training stops buying quality. Loqua is a Mac-native voice typing tool that uses preference-style training signals for rare technical terms, app-aware structure, latency, and natural final text. We treat RL as a calibration layer, not as a magic replacement for data quality.

For a voice product, the painful errors are not average errors. They are the few moments when the model changes a package name, writes a stiff Slack reply, or waits too long before committing text. Reinforcement learning voice typing is our umbrella term for the post-training loop that targets those moments.

Why supervised loss stops paying off

Supervised learning is necessary. It teaches the model the task: audio in, context in, text out. But eventually, the loss keeps improving on easy examples while the product stops getting noticeably better. The remaining issues are preference-shaped, not simply label-shaped.

Consider technical dictation. A supervised pair can teach that "react query" sometimes means @tanstack/react-query. But the product question is conditional: should the model preserve the spoken phrase, rewrite it as an import path, or leave it as plain English because the cursor is in a customer email? The right answer depends on context and user tolerance for correction.

A concrete pattern: our internal benchmark for clean read-aloud speech improved by less than a percentage point across three consecutive supervised iterations, while dogfood edit rate on real workflows changed by more than four points. That gap is the signature of preference-shaped failure: the model is technically closer to the gold transcript yet less aligned with what the user actually wanted to commit to the document.

That is where RL for speech recognition and text rendering becomes useful. We can reward outputs that preserve entities, arrive quickly, and match destination format, while penalizing overconfident rewrites. The reward is not "more clever." The reward is "less editing after dictation."

GRPO vs DPO vs PPO

We separate three families of post-training tools. PPO (Proximal Policy Optimization) is flexible and historically important, with a long lineage in policy-gradient work. DPO is attractive when you have pairwise preference data and want a simpler objective; see the Direct Preference Optimization paper for the clean formulation.
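To make the pairwise objective concrete, here is a minimal sketch of the DPO loss for a single preference pair. It follows the published form of the loss and is not Loqua's training code; the function name, the default `beta`, and the idea of passing precomputed sequence log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy favors each output than the frozen reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); computed stably below.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy assigns relatively more probability to the preferred output than the reference model does, which is why no separate reward model is needed for purely pairwise data.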

GRPO-style training is useful for grouped candidates: generate several outputs for the same utterance and context, rank them with a reward function, then update toward the better group behavior. For Loqua, grouped comparison fits many voice errors. We do not only ask "is this transcript correct?" We ask which output is best for the current app, latency budget, and edit intent.
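The grouped comparison can be sketched as a group-relative advantage: each candidate's reward is normalized against the other candidates for the same utterance and context. This is a generic GRPO-style normalization under assumed inputs, not the production reward pipeline; the reward values are made up.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each candidate's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every candidate scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four candidate renderings of one utterance.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates above the group mean get a positive advantage and are reinforced; those below are pushed down. A group where every candidate scores the same contributes no update.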

| Method | Where it helps | Where we are careful |
| --- | --- | --- |
| DPO | Pairwise style and formatting preferences | Can overfit to preference data wording |
| GRPO-style grouping | Multiple candidates for same voice context | Reward design must avoid verbosity bias |
| PPO-style loops | Interactive objectives with explicit reward | More moving parts and tuning burden |

How we matched methods to layers

In practice each layer of the stack gets a different post-training tool. The text renderer is the natural home for DPO and grouped optimization because its decisions are local and easy to compare side by side. The instruction planner uses lighter pairwise updates to nudge intent classification and format planning. The acoustic front end mostly stays out of RL; preference signals are too distant from frame-level audio to be useful, and we get more from data curation and supervised refinement there. The practical choice is not ideological. We pick the smallest loop that exposes the failure mode clearly.

On-policy distillation for text rendering

On-policy distillation matters because the text renderer must learn from states it actually visits. Traditional offline distillation can train on clean teacher outputs that the smaller student never reaches at inference time. In a streaming dictation product, that mismatch is visible: once the student takes a slightly wrong partial path, later tokens become awkward.

Our text-rendering training uses on-policy distillation ideas: let the student produce candidate continuations, score those continuations with a stronger evaluator and task reward, then train on the student's own trajectory rather than on a disconnected gold path. Recent literature on on-policy distillation and related memory-policy optimization work gives useful language for this problem.

Concretely, a training step looks like this. We take a real dogfood utterance and screen context. The student produces three to five candidate continuations under streaming constraints. A larger evaluator scores each candidate along entity preservation, latency, destination fit, and naturalness. The student is then updated to prefer the higher-scoring trajectory, weighted by how far it currently is from the evaluator's choice. The student never sees an offline gold path; it only ever sees its own behavior, ranked.
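The selection part of that step can be sketched roughly as follows. The `Candidate` fields and the way "how far the student is from the evaluator's choice" is measured (one minus the student's normalized probability of the chosen candidate) are assumptions made for illustration; the text above does not specify the exact weighting.

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    student_logp: float      # log-probability the student gave its own continuation
    evaluator_score: float   # combined evaluator score (hypothetical scale, higher is better)

def distillation_target(candidates):
    """Return the evaluator's preferred candidate and an update weight.

    Assumption: distance from the evaluator's choice is 1 minus the student's
    probability of that candidate, normalized within the group.
    """
    best = max(candidates, key=lambda c: c.evaluator_score)
    # Softmax over the student's log-probabilities, restricted to this group.
    top = max(c.student_logp for c in candidates)
    exps = [math.exp(c.student_logp - top) for c in candidates]
    p_best = exps[candidates.index(best)] / sum(exps)
    return best, 1.0 - p_best

# Hypothetical candidates sampled from the student for one utterance.
candidates = [
    Candidate("import react query", -1.0, 0.3),
    Candidate("import @tanstack/react-query", -2.0, 0.9),
    Candidate("import React Query", -1.5, 0.5),
]
best, weight = distillation_target(candidates)
```

When the student already puts most of its probability on the evaluator's choice, the weight approaches zero and the update is small. Every candidate here comes from the student itself, which is what makes the step on-policy.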

The lesson we care about is simple: train the model where it will live. For voice typing, it lives in partial utterances, visible context, uncertain identifiers, and user edits. A beautiful offline transcript is not enough.

Reward shaping: latency, accuracy, naturalness

The reward has four parts. Accuracy rewards entity preservation, low WER in supported conditions, and correct edit intent. Latency rewards early useful text, not just early tokens. Naturalness rewards text that reads like the user, including concise Slack replies and clean technical prose. Safety rewards conservative behavior when uncertainty is high.

Reward shaping for voice systems is easy to get wrong. If you overweight latency, the model commits too early. If you overweight formatting, it turns casual notes into templates. If you overweight entity preservation, it may keep raw dictated fragments that should have been cleaned. We tune reward weights by comparing real dogfooding edits before and after each training run.

  • Latency reward: time to first usable text and time to stable commit.
  • Entity reward: technical names, file paths, commands, and mixed-language spans.
  • Destination reward: correct shape for Slack, GitHub, Cursor, VS Code, email, or notes.
  • Correction reward: fewer user backspaces and fewer manual rewrites after insertion.
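As a rough illustration of why the weights matter, the four parts named above (accuracy, latency, naturalness, safety) can be combined as a weighted sum. The weights, the score values, and the assumption that each component is scaled to [0, 1] are all hypothetical; real reward functions are unlikely to be this simple.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical default weights, chosen only for the example.
    accuracy: float = 0.4
    latency: float = 0.2
    naturalness: float = 0.2
    safety: float = 0.2

def combined_reward(scores, weights=RewardWeights()):
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return (weights.accuracy * scores["accuracy"]
            + weights.latency * scores["latency"]
            + weights.naturalness * scores["naturalness"]
            + weights.safety * scores["safety"])

# Two made-up candidates: one commits early, one waits for context.
early_commit = {"accuracy": 0.6, "latency": 1.0, "naturalness": 0.7, "safety": 0.4}
patient = {"accuracy": 0.9, "latency": 0.6, "naturalness": 0.8, "safety": 0.9}

latency_heavy = RewardWeights(accuracy=0.1, latency=0.7, naturalness=0.1, safety=0.1)
```

With the default weights the patient candidate scores higher; with `latency_heavy` the early commit wins. That flip is the "commits too early" failure in miniature.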

Counterfactual pairs are the most useful preference data we collect. For every accepted edit a user makes after dictation, we can construct a pair where the dictated text is the rejected candidate and the edited text is the preferred one. That data is dense, naturally aligned with real use, and free of synthetic-preference artifacts. We treat it as a slow, high-signal feedback loop rather than a real-time online RL signal.
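Building such a pair from an accepted edit might look like the sketch below. The `PreferencePair` type, its field names, and the rule for skipping no-op edits are assumptions for illustration, not a description of the actual data pipeline.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferencePair:
    context: str    # destination app or surrounding context (hypothetical field)
    chosen: str     # text the user kept after editing
    rejected: str   # text the model originally inserted

def pair_from_edit(context: str, dictated: str, edited: str) -> Optional[PreferencePair]:
    """Build a preference pair from an accepted edit; skip no-op edits."""
    if dictated.strip() == edited.strip():
        return None  # nothing changed, so there is no preference to learn from
    return PreferencePair(context=context, chosen=edited, rejected=dictated)

# Hypothetical edit: the user corrected a package name after dictation.
pair = pair_from_edit("vscode", "import react query", "import @tanstack/react-query")
```

Pairs in this shape can feed a pairwise objective such as DPO, with the kept text as the preferred output.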

What it looks like in production

In production, RL does not appear as a visible feature. It shows up as fewer annoying moments. A Git commit message gets a concise imperative form. A customer email keeps a warmer tone. A Python comment preserves the exact identifier that is visible near the cursor. A long utterance starts streaming quickly but delays risky entity spans until context is available.

A small concrete example: dictating "fix the bug where retry exhausts the queue" in a terminal window with a recent git diff visible produces the commit subject "fix: drain retry queue before exhausting backoff window". The same utterance with the cursor in a Slack thread produces "Fixing the bug where retry exhausts the queue — should land this afternoon." Same speech, same speaker, two different destination-appropriate outputs. The instruction planner chose the destination plan; the text renderer, post-trained with destination reward, produced the right shape.

We also keep the post-training boundary narrow. The core recognizer, instruction planner, and text renderer are trained in-house for Loqua's dictation surface. Public research on RLHF, DPO, GRPO-like grouping, and on-policy distillation informs our evaluation vocabulary, but the production stack is tuned against our own data, runtime constraints, and privacy boundary.

Failure modes and debugging

RL makes bad reward functions more obvious. The common failure modes are verbosity bias, premature commitment, style drift, and reward hacking around easy formatting cues. We debug them with ablations: remove latency reward, freeze the entity reward, compare no-screen-context candidates, and replay real dogfooding utterances through old and new checkpoints.

Our pre-merge checklist for an RL run is short and deliberate. Did the correction rate go down on real dogfood data, not only on a held-out preference set? Did p95 time to first usable text stay inside budget? Did entity preservation hold or improve across English, Chinese, and code-identifier slices? Did the text renderer stop adding unsolicited bullet points or trailing politeness? If any of those answers is no, the checkpoint goes back for tuning rather than shipping.
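The checklist lends itself to an automated gate. The sketch below assumes each checkpoint is summarized as a dictionary of metrics; the metric names, the slice names, and the zero-tolerance default for unsolicited formatting are hypothetical.

```python
def passes_premerge_gate(old, new, latency_budget_ms, max_unsolicited_rate=0.0):
    """Return (ok, failures) for a candidate checkpoint versus the current one."""
    failures = []
    if new["dogfood_correction_rate"] >= old["dogfood_correction_rate"]:
        failures.append("correction rate did not go down on dogfood data")
    if new["p95_first_usable_text_ms"] > latency_budget_ms:
        failures.append("p95 time to first usable text is over budget")
    # Entity preservation must hold or improve on every slice.
    for slice_name, old_score in old["entity_preservation"].items():
        if new["entity_preservation"][slice_name] < old_score:
            failures.append(f"entity preservation regressed on {slice_name}")
    if new["unsolicited_formatting_rate"] > max_unsolicited_rate:
        failures.append("unsolicited bullets or trailing politeness still present")
    return (not failures), failures
```

Returning the list of failures, rather than a bare boolean, keeps the reason a checkpoint went back for tuning visible to whoever reviews the run.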

The most important discipline is to preserve a human-readable error taxonomy. A bad output should be labeled as hearing, entity, intent, destination, tone, latency, or privacy-boundary failure. Without that taxonomy, reinforcement learning voice typing becomes a pile of numbers that can improve while the product feels worse.
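One lightweight way to keep the taxonomy explicit is to encode it as an enumeration and report per-category counts for every evaluation run. The category names come from the paragraph above; the `FailureKind` type and the helper are an illustrative sketch.

```python
from collections import Counter
from enum import Enum

class FailureKind(Enum):
    HEARING = "hearing"
    ENTITY = "entity"
    INTENT = "intent"
    DESTINATION = "destination"
    TONE = "tone"
    LATENCY = "latency"
    PRIVACY_BOUNDARY = "privacy-boundary"

def failure_breakdown(labels):
    """Count labeled bad outputs per category, including categories with zero hits."""
    counts = Counter(labels)
    return {kind.value: counts.get(kind, 0) for kind in FailureKind}

# Hypothetical labels assigned during human review of one checkpoint.
labels = [FailureKind.ENTITY, FailureKind.ENTITY, FailureKind.TONE]
breakdown = failure_breakdown(labels)
```

Comparing these breakdowns between checkpoints shows which kind of failure moved, not only whether an aggregate number did.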

Frequently asked questions

What does reinforcement learning voice typing mean in Loqua?
It means post-training the voice stack with rewards tied to dictation quality: entity preservation, destination-aware formatting, latency, naturalness, and fewer manual edits. It does not mean replacing supervised training. It is the layer we use after supervised data stops improving the long tail.

Why is DPO useful for voice typing?
DPO is useful when the difference between two outputs is a preference rather than a hard label. For example, both a formal email sentence and a concise Slack sentence may be valid English, but only one matches the destination context. Pairwise preference data captures that distinction cleanly.

Where does GRPO-style grouping help?
Grouped optimization helps when we can generate several candidate outputs for the same utterance and context. The reward can rank candidates by latency, entity accuracy, and destination fit. That maps well to dictation because one spoken phrase can have several plausible written forms.

What is on-policy distillation in this setting?
On-policy distillation means training the student on trajectories it actually produces, not only on clean teacher outputs. In streaming voice typing, the model often operates on partial context and uncertain prefixes. Training on those visited states makes the text renderer more robust at inference time.

Can reward shaping make output worse?
Yes. Overweight latency and the model commits too early. Overweight style and it over-formats simple notes. Overweight entity preservation and it refuses to clean spoken fragments. We treat reward weights as product decisions and test them against real dogfooding edits.

How do you know RL improved the product?
We look beyond aggregate loss. We compare correction rate, accepted first-pass output, time to stable text, entity preservation, and human review on real workflows. If a checkpoint improves a reward metric but increases user edits, it is not a product improvement.

Where does user data come from for preference training?
Mostly from our own team and opt-in dogfooders. The richest signal is the diff between what was dictated and what the user kept after editing, treated as a counterfactual preference pair. We deliberately keep online RL out of the product loop; user trust matters more than a small extra signal.
