Audio / Voice Notes — 2026-01-17
What works
- Media understanding (audio): If audio understanding is enabled (or auto-detected), OpenClaw:
  - Locates the first audio attachment (local path or URL) and downloads it if needed.
  - Enforces maxBytes before sending to each model entry.
  - Runs the first eligible model entry in order (provider or CLI).
  - If an entry fails or is skipped (size/timeout), the next entry is tried.
  - On success, it replaces Body with an [Audio] block and sets {{Transcript}}.
- Command parsing: When transcription succeeds, CommandBody/RawBody are set to the transcript so slash commands still work.
- Verbose logging: With --verbose, we log when transcription runs and when it replaces the body.
Auto-detection (default)
If you don’t configure models and tools.media.audio.enabled is not set to false, OpenClaw auto-detects in this order and stops at the first working option:
- Local CLIs (if installed)
  - sherpa-onnx-offline (requires SHERPA_ONNX_MODEL_DIR with encoder/decoder/joiner/tokens)
  - whisper-cli (from whisper-cpp; uses WHISPER_CPP_MODEL or the bundled tiny model)
  - whisper (Python CLI; downloads models automatically)
- Gemini CLI (gemini) using read_many_files
- Provider keys (OpenAI → Groq → Deepgram → Google)
To disable auto-detection, set tools.media.audio.enabled: false. To customize, set tools.media.audio.models. Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on PATH (we expand ~), or set an explicit CLI model with a full command path.
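For example, disabling detection entirely takes a single flag:

{
  tools: {
    media: {
      audio: {
        enabled: false
      }
    }
  }
}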
Config examples
Provider + CLI fallback (OpenAI + Whisper CLI)
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45
          }
        ]
      }
    }
  }
}
Provider-only with scope gating
{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [
            { action: "deny", match: { chatType: "group" } }
          ]
        },
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" }
        ]
      }
    }
  }
}
Provider-only (Deepgram)
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "deepgram", model: "nova-3" }]
      }
    }
  }
}
Notes & limits
- Provider auth follows the standard model auth order (auth profiles, env vars, models.providers.*.apiKey).
- Deepgram picks up DEEPGRAM_API_KEY when provider: "deepgram" is used.
- Deepgram setup details: see Deepgram (audio transcription).
- Audio providers can override baseUrl, headers, and providerOptions via tools.media.audio.
- Default size cap is 20MB (tools.media.audio.maxBytes). Oversize audio is skipped for that model and the next entry is tried.
- Default maxChars for audio is unset (full transcript). Set tools.media.audio.maxChars or per-entry maxChars to trim output.
- OpenAI auto default is gpt-4o-mini-transcribe; set model: "gpt-4o-transcribe" for higher accuracy.
- Use tools.media.audio.attachments to process multiple voice notes (mode: "all" + maxAttachments); a combined sketch follows this list.
- Transcript is available to templates as {{Transcript}}.
- CLI stdout is capped (5MB); keep CLI output concise.
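A sketch combining several of the knobs above (multi-attachment fan-out, transcript trimming, and the higher-accuracy OpenAI model); the maxChars and maxAttachments values are illustrative, not defaults:

{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        maxChars: 4000,  // illustrative: trim transcripts; unset means full transcript
        attachments: { mode: "all", maxAttachments: 3 },  // illustrative cap on voice notes per message
        models: [
          { provider: "openai", model: "gpt-4o-transcribe" }
        ]
      }
    }
  }
}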
Gotchas
- Scope rules are first-match-wins. chatType is normalized to direct, group, or room.
- Ensure your CLI exits 0 and prints plain text; if it emits JSON, extract the text with something like jq -r .text (see the sketch after this list).
- Keep timeouts reasonable (timeoutSeconds, default 60s) to avoid blocking the reply queue.
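One way to handle a JSON-emitting tool is a shell wrapper; a minimal sketch, assuming a hypothetical transcribe-json command that prints {"text": "..."} on stdout:

{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            command: "sh",
            // transcribe-json is a placeholder for your own tool; quoting
            // {{MediaPath}} guards against spaces in the file path.
            args: ["-c", "transcribe-json \"{{MediaPath}}\" | jq -r .text"],
            timeoutSeconds: 45
          }
        ]
      }
    }
  }
}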