Local STT (Parakeet / Whisper)
Unified local speech-to-text using ONNX Runtime with int8 quantization. Choose your backend:
- Parakeet (default): Best accuracy for English, correctly captures names and filler words
- Whisper: Fastest inference, supports 99 languages
Usage
# Default: Parakeet v2 (best English accuracy)
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg
# Explicit backend selection
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg -b whisper
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg -b parakeet -m v3
# Quiet mode (suppress progress)
~/.openclaw/skills/local-stt/scripts/local-stt.py audio.ogg --quiet
Options
-b/--backend: parakeet (default), whisper
-m/--model: Model variant (see below)
--no-int8: Disable int8 quantization
-q/--quiet: Suppress progress
--room-id: Matrix room ID for direct message
Models
Parakeet (default backend)
| Model |
Description |
| v2 (default) |
English only, best accuracy |
| v3 |
Multilingual |
Whisper
| Model |
Description |
| tiny |
Fastest, lower accuracy |
| base (default) |
Good balance |
| small |
Better accuracy |
| large-v3-turbo |
Best quality, slower |
Benchmark (24s audio)
| Backend/Model |
Time |
RTF |
Notes |
| Whisper Base int8 |
0.43s |
0.018x |
Fastest |
| Parakeet v2 int8 |
0.60s |
0.025x |
Best accuracy |
| Parakeet v3 int8 |
0.63s |
0.026x |
Multilingual |
openclaw.json
{
"tools": {
"media": {
"audio": {
"enabled": true,
"models": [
{
"type": "cli",
"command": "~/.openclaw/skills/local-stt/scripts/local-stt.py",
"args": ["--quiet", "{{MediaPath}}"],
"timeoutSeconds": 30
}
]
}
}
}
}