Commit f43a35a
Author: Alex J Lennon (committed)
Realtime viseme MQTT harness: normalised payloads, running peak, phoneme test, docs
Realtime viseme MQTT harness: normalised payloads, running peak, phoneme test, docs

- tools/realtime_viseme_mqtt.py: mic -> ONNX -> MQTT; visemes and visemes_peak normalised to sum=1 (incl. silence); running peak over ~1s; phoneme-test mode with play-back clips and peak-based pass; --speak plays pre-generated phoneme files from data/phoneme_prompts/
- tools/generate_phoneme_prompts.py: pre-generate segment_01.wav (silence) and segment_02..15.mp3 (TTS phonemes, 'thin' for TH, repeated 5x) for phoneme test
- tools/test_viseme_inference.py: automated ONNX viseme inference test
- QUICKSTART: MQTT publication usage (command, topic, payload shape, subscribe example), phoneme test, normalisation and visemes_peak docs
- TODO.md: phoneme test audio improvement (human-recorded clips)
- pyproject.toml: realtime extra adds edge-tts
- .gitignore: data/phoneme_prompts/ (generated audio)

Made-with: Cursor
1 parent deba953 commit f43a35a

8 files changed

Lines changed: 1323 additions & 2 deletions

.gitignore

Lines changed: 4 additions & 1 deletion
```diff
@@ -226,4 +226,7 @@ bin/
 VisemesAtHome/
 OpenLipSync.Inference.Test/
-OpenLipSync.Inference.Standalone/
+OpenLipSync.Inference.Standalone/
+
+# Generated phoneme test audio (run tools/generate_phoneme_prompts.py to create)
+data/phoneme_prompts/
```

QUICKSTART.md

Lines changed: 41 additions & 0 deletions
@@ -175,6 +175,47 @@ The repo does **not** ship a golden WAV with committed “expected” visemes (t

If the app says **"Model not found"**, ensure there is an ONNX export under `export/` (e.g. `export/quick_laptop_uk_15ep_*/model.onnx` and `config.json`). The app uses the **newest** `model.onnx` under `export/`.

## 8. Real-time mic → visemes → MQTT (test harness)

A Python test harness streams live microphone audio through the ONNX model and publishes viseme activations as JSON to an MQTT broker.

**Install extra deps (from project root):**

```bash
uv sync --extra realtime
```

**Run (default: newest model under `export/`, broker `mqtt.dynamicdevices.co.uk:1883`, topic `openlipsync/visemes`):**

```bash
uv run python tools/realtime_viseme_mqtt.py
```

**MQTT usage:** Messages are published to `openlipsync/visemes/<client_id>` (the client ID is stable per device; override with `--client-id`). Each message is JSON with `t` (Unix time), `frame`, `client_id`, `visemes` (per-frame activations, normalised to sum to 1), and `visemes_peak` (running max over ~1 s, also normalised to sum to 1). Subscribe to all clients with `mosquitto_sub -h mqtt.dynamicdevices.co.uk -t 'openlipsync/visemes/#'`, or to one client with `openlipsync/visemes/<client_id>`.
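Beyond `mosquitto_sub`, the payload can be consumed from Python with paho-mqtt (installed by the realtime extra). A minimal subscriber sketch, assuming paho-mqtt 2.x; the `summarise()` helper is illustrative, not part of the harness:

```python
# Subscriber sketch for the harness payload (assumes paho-mqtt 2.x).
import json


def summarise(payload: bytes) -> tuple[int, str, str]:
    """Parse one harness message and return (frame, client_id, strongest viseme)."""
    msg = json.loads(payload)
    top = max(msg["visemes"], key=msg["visemes"].get)
    return msg["frame"], msg["client_id"], top


def run(broker: str = "mqtt.dynamicdevices.co.uk",
        topic: str = "openlipsync/visemes/#") -> None:
    """Connect, subscribe, and print a one-line summary per message."""
    import paho.mqtt.client as mqtt  # installed by `uv sync --extra realtime`

    def on_message(client, userdata, m):
        frame, cid, top = summarise(m.payload)
        print(f"frame={frame} client={cid} top={top}")

    client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
    client.on_message = on_message
    client.connect(broker, 1883)
    client.subscribe(topic)  # '#' matches all client IDs; narrow to one ID if preferred
    client.loop_forever()  # Ctrl+C to stop
```

Call `run()` to stream; `summarise()` can be exercised offline against a captured payload.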
**Options:**

- `--broker HOST` — MQTT broker host (default: mqtt.dynamicdevices.co.uk)
- `--port PORT` — MQTT port (default: 1883)
- `--topic TOPIC` — MQTT topic prefix; the connection’s client ID is appended so each client has a unique topic (default: openlipsync/visemes → openlipsync/visemes/<client_id>)
- `--client-id ID` — MQTT client ID (default: MAC-derived, e.g. olips-a1b2c3d4e5f6)
- `--warmup SECS` — Seconds of audio used to compute mel normalisation stats (default: 1.0)
- `--publish-every N` — Publish every N frames (1 = every ~10 ms; 5 = every ~50 ms)
- `--model-dir PATH` — Use a specific export dir instead of the newest under `export/`
- `--device NAME` — Sounddevice input device (list devices with `python -c "import sounddevice; print(sounddevice.query_devices())"`)

**Phoneme check:** Run with `--phoneme-test`. Segment 1 is silence; segments 2–15 each play a test phoneme from pre-generated files. The mic captures that audio, and the harness reports **peak** viseme activations and ✓/✗ (pass = expected viseme in the top 2). Add `--speak` to play the clips; generate them once with `uv run --extra realtime python tools/generate_phoneme_prompts.py` (writes `data/phoneme_prompts/`: segment_01.wav = silence, segment_02..15.mp3 = TTS of e, ah, eh, oh, oo, p, f, thin, t, k, sh, s, n, r, each repeated 5×). Playback runs in the background and requires **ffplay**, **afplay**, or **mpv**.
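The top-2 pass rule above can be sketched in a few lines (the function name is illustrative, not the harness's actual API):

```python
def phoneme_pass(peaks: dict[str, float], expected: str, top_n: int = 2) -> bool:
    """True if the expected viseme is among the top_n peak activations."""
    ranked = sorted(peaks, key=peaks.get, reverse=True)
    return expected in ranked[:top_n]


# Example: 'aa' peaked second-highest, so a segment expecting 'aa' passes
# even though silence dominated.
print(phoneme_pass({"silence": 0.5, "aa": 0.3, "PP": 0.1, "SS": 0.1}, "aa"))  # True
```

This is why a segment can pass even when silence or SS peaks highest: only the top two ranks matter.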
**JSON payload shape:** Each message carries per-frame viseme activations **normalised so each set sums to 1** (including silence). Fields: `t`, `frame`, `client_id`, `visemes` (name → 0–1, sum = 1), and `visemes_peak` (running max over the last ~1 s, then normalised to sum = 1). Example:

```json
{"t": 1739123456.78, "frame": 42, "client_id": "openlipsync-a1b2c3d4", "visemes": {"silence": 0.9, "PP": 0.01, "aa": 0.02, ...}, "visemes_peak": {"silence": 0.95, "PP": 0.02, "aa": 0.45, ...}}
```

Separately, mel-input normalisation uses a short warmup (default 1 s) to estimate mean/std over the mic signal; those stats are then fixed for the rest of the session. Stop with Ctrl+C.
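The sum-to-1 normalisation and the ~1 s running peak can be sketched as follows (assumes NumPy; names and the frame-rate constant are illustrative, not the harness's actual code):

```python
# Sketch of the payload maths: sum-to-1 normalisation plus a running per-viseme
# peak over roughly the last second of ~10 ms frames.
from collections import deque

import numpy as np

FRAMES_PER_SEC = 100  # ~10 ms frames -> ~1 s of history


def normalise(acts: np.ndarray) -> np.ndarray:
    """Scale activations so they sum to 1 (silence included)."""
    total = acts.sum()
    return acts / total if total > 0 else acts


class RunningPeak:
    """Per-viseme max over the last ~1 s of frames, renormalised to sum to 1."""

    def __init__(self, history: int = FRAMES_PER_SEC):
        self.frames = deque(maxlen=history)  # old frames fall off automatically

    def update(self, acts: np.ndarray) -> np.ndarray:
        self.frames.append(acts)
        return normalise(np.max(np.stack(self.frames), axis=0))
```

Note that the peak vector is taken element-wise across frames first and only then renormalised, so its entries can come from different frames.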
**Improving the phoneme test:** The default clips are TTS (Edge TTS). In practice, silence and SS often peak high (an artifact of playback plus mic setup), so the pass rate can be low even when the correct viseme fires. Using **human-recorded** phoneme clips (same filenames in `data/phoneme_prompts/`: segment_01.wav … segment_15.mp3) would improve realism and the pass rate. See TODO.md (phoneme test audio).

## Troubleshooting

| Issue | What to do |
TODO.md

Lines changed: 9 additions & 0 deletions
```markdown
# TODO

## Phoneme test audio

**Improve phoneme test pass rate and realism** by replacing TTS-generated clips with human-recorded phoneme clips.

- **Current:** `tools/generate_phoneme_prompts.py` creates segment_01.wav (silence) and segment_02..15.mp3 via Edge TTS. Playback + mic often causes silence/SS to dominate, so the pass rate is low.
- **Proposed:** Have someone record the 15 segments (silence + e, ah, eh, oh, oo, p, f, thin, t, k, sh, s, n, r, each repeated several times). Save under the same filenames in `data/phoneme_prompts/` (segment_01.wav, segment_02.mp3 … segment_15.mp3). No harness changes needed.
- **Refs:** QUICKSTART “Improving the phoneme test”; `tools/generate_phoneme_prompts.py` docstring TODO.
```

pyproject.toml

Lines changed: 6 additions & 0 deletions
```diff
@@ -25,3 +25,9 @@ gui = [
 export = [
     "onnx",
 ]
+realtime = [
+    "onnxruntime",
+    "sounddevice",
+    "paho-mqtt",
+    "edge-tts",
+]
```

tools/generate_phoneme_prompts.py

Lines changed: 82 additions & 0 deletions
```python
#!/usr/bin/env python3
"""
Pre-generate phoneme test audio for the realtime harness.

Creates one file per segment in data/phoneme_prompts/:
- segment_01: 1 s silence (WAV)
- segment_02..15: TTS of the target sound repeated REPEAT_COUNT times
  (e.g. "e e e e e", "thin thin ..." for /θ/) as MP3.

At run time, --speak plays the clip for each segment in the background while the mic
captures; the harness checks that the expected viseme is detected (peak activation, top-2).

TODO: Replace TTS clips with human-recorded phoneme clips (same filenames) for a better
pass rate and a more natural test; see TODO.md (phoneme test audio).

Usage (from project root):
    uv run --extra realtime python tools/generate_phoneme_prompts.py
"""

from __future__ import annotations

import asyncio
import sys
import wave
from pathlib import Path

PROJECT_ROOT = Path(__file__).resolve().parents[1]
OUT_DIR = PROJECT_ROOT / "data" / "phoneme_prompts"

# Segment 1 = silence (we write a WAV). Segments 2-15 = TTS text for the sound only.
# Use "thin" for TH so TTS produces the /θ/ sound, not "tee aitch".
PHONEME_TEXTS = [
    None,  # 1: silence, no TTS
    "e", "ah", "eh", "oh", "oo", "p", "f", "thin", "t", "k", "sh", "s", "n", "r",
]
REPEAT_COUNT = 5  # Say each sound this many times per segment for a stronger test
SAMPLE_RATE = 16000

VOICE = "en-GB-SoniaNeural"


def write_silence_wav(path: Path, duration_sec: float = 1.0) -> None:
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        n = int(SAMPLE_RATE * duration_sec)
        w.writeframes(b"\x00\x00" * n)


async def main() -> None:
    try:
        import edge_tts
    except ImportError:
        print(
            "edge-tts is not installed. From the project root, run:\n"
            "  uv run --extra realtime python tools/generate_phoneme_prompts.py\n"
            "(--extra realtime installs edge-tts into the environment, then runs this script.)",
            file=sys.stderr,
        )
        sys.exit(1)

    OUT_DIR.mkdir(parents=True, exist_ok=True)
    print(f"Writing {len(PHONEME_TEXTS)} files to {OUT_DIR}")

    # Segment 1: silence
    seg1 = OUT_DIR / "segment_01.wav"
    write_silence_wav(seg1)
    print(f"  {seg1.name} (silence)")

    for i, text in enumerate(PHONEME_TEXTS[1:], start=2):
        path = OUT_DIR / f"segment_{i:02d}.mp3"
        # Repeat the sound REPEAT_COUNT times so each segment is a solid test
        repeated = " ".join([text] * REPEAT_COUNT)
        communicate = edge_tts.Communicate(repeated, VOICE)
        await communicate.save(str(path))
        print(f"  {path.name} ({text} x{REPEAT_COUNT})")

    print("Done. Run phoneme-test with --speak: each segment plays this sound while the mic captures and checks the viseme.")


if __name__ == "__main__":
    asyncio.run(main())
```
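The generated silence clip can be sanity-checked offline by reading the WAV header back with the standard library. A small sketch (the helper name is illustrative, not part of the tools):

```python
# Read back a WAV's duration and sample rate, e.g. to confirm that
# segment_01.wav is ~1 s of 16 kHz audio after running the generator.
import wave
from pathlib import Path


def wav_duration_and_rate(path: Path) -> tuple[float, int]:
    """Return (duration in seconds, sample rate) of a WAV file."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate(), w.getframerate()


# Example (after running the generator from the project root):
#   dur, rate = wav_duration_and_rate(Path("data/phoneme_prompts/segment_01.wav"))
#   assert rate == 16000 and abs(dur - 1.0) < 0.01
```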
