feat: integrate Qwen2.5-Omni models with native audio input support#1581

Draft
ramkrishna2910 wants to merge 2 commits into main from claude/integrate-qwen-omni-models-LXY26

Conversation


@ramkrishna2910 ramkrishna2910 commented Apr 8, 2026

Add Qwen2.5-Omni-7B and Qwen2.5-Omni-3B to the model registry as
llamacpp models with a new "omni" label. Omni models activate
experience mode in the UI (unified chat with image upload and mic)
but are treated as single models rather than composites.

For omni models, the experience mic records raw audio via
useAudioCapture, builds a WAV, and sends it as a multimodal
content part (image_url with data:audio/wav;base64 URI) in the
chat completions request. This lets the omni model process audio
natively rather than transcribing it first with Whisper.

Changes:
- server_models.json: Add Qwen2.5-Omni-7B/3B with omni+vision labels
- experienceModels.ts: isOmniModel() helper; isExperienceModel()
  returns true for omni; download/load checks handle single-model omni
- ChatWindow.tsx: Compute and pass isOmni prop to LLMChatPanel
- LLMChatPanel.tsx: Raw audio recording, WAV encoding, audio preview,
  audio file upload, and audio content parts in chat messages
- ModelManager.tsx: Show 'omni' label; skip composite-only paths
- backendInstaller.ts: Skip composite experience path for omni models
- styles.css: Audio preview, recording indicator, message audio styles
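The audio path described above can be sketched end to end. This is not the PR's exact code: the helper names (encodeWav, buildWavDataUrl) and the 16 kHz mono PCM16 format are assumptions based on the description.

```typescript
// Sketch: wrap raw PCM16 samples in a minimal WAV container, base64-encode
// it, and attach it to the chat request as an image_url content part.
function encodeWav(pcm: Int16Array, sampleRate = 16000): ArrayBuffer {
  const dataSize = pcm.length * 2;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);
  const writeStr = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(off + i, s.charCodeAt(i));
  };
  writeStr(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true);
  writeStr(8, 'WAVE');
  writeStr(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size
  view.setUint16(20, 1, true);              // audio format = PCM
  view.setUint16(22, 1, true);              // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeStr(36, 'data');
  view.setUint32(40, dataSize, true);
  new Int16Array(buffer, 44).set(pcm);
  return buffer;
}

function buildWavDataUrl(buffer: ArrayBuffer): string {
  const bytes = new Uint8Array(buffer);
  let binary = '';
  const CHUNK = 8192; // chunked to avoid per-byte string concatenation
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return `data:audio/wav;base64,${btoa(binary)}`;
}

// The audio rides in the vision slot of the multimodal content array:
const contentArray: Array<{ type: string; image_url?: { url: string } }> = [];
const url = buildWavDataUrl(encodeWav(new Int16Array(16000))); // 1 s of silence
contentArray.push({ type: 'image_url', image_url: { url } });
```

The content part deliberately reuses the existing vision structure so no server-side schema changes are needed; whether the model actually consumes audio delivered this way is discussed in the review below.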

https://claude.ai/code/session_01Pwbj1PmPJCpGMTSoba4ML4
Comment thread src/app/src/renderer/components/panels/LLMChatPanel.tsx Fixed
@ramkrishna2910 (Contributor, Author)

@claude review


github-actions bot commented Apr 8, 2026

Claude finished @ramkrishna2910's task in 2m 44s


PR Review: feat: integrate Qwen2.5-Omni models with native audio input support

  • Read changed files and gather context
  • Analyze security vulnerabilities flagged by CodeQL
  • Review model registry changes
  • Review frontend/UI changes
  • Review audio capture and multimodal content implementation
  • Post comprehensive review

Summary

Good overall approach — native audio-to-model without Whisper is the right architecture for an omni model. The implementation is mostly clean, but there are two security issues that need to be addressed before merge, plus several design concerns worth discussing.


🔴 Security Issues (Must Fix)

1. XSS + URL redirect via unvalidated src attribute (LLMChatPanel.tsx:801)

CodeQL is correct here. The url variable comes from item.image_url.url, which is populated from both user uploads and assistant/server responses. A compromised or malicious server response could inject javascript:alert(1) or a redirect URI.

// Current — line 801
return <img key={index} src={url} className="message-image" />;

The data:audio/ prefix check on line 798–799 already shows the right idea. Apply the same pattern to the image case:

if (item.type === 'image_url') {
  const url = item.image_url.url;
  if (url.startsWith('data:audio/')) {
    return <audio key={index} src={url} controls preload="metadata" className="message-audio" />;
  }
  // Only render images from safe sources
  if (url.startsWith('data:image/') || url.startsWith('https://') || url.startsWith('http://')) {
    return <img key={index} src={url} className="message-image" />;
  }
  return null;
}


🟡 Design Concerns

2. isExperienceModel returning true for omni models is fragile

In experienceModels.ts, omni models return true from isExperienceModel, but then 3 different call sites need && !isOmniModel() guards to opt out of the composite-model behavior:

  • ModelManager.tsx:862
  • ModelManager.tsx:935
  • backendInstaller.ts:530

There are also unguarded call sites that happen to work coincidentally (e.g. ModelManager.tsx:476 getModelSize — it works only because getExperienceComponents returns [] for omni, triggering the components.length === 0 fallback). This is an accident waiting to happen.

Suggested approach: Don't conflate the two. Keep isExperienceModel for true composite experiences, and handle isOmni separately where experience-mode UI behaviour is needed (e.g. ChatWindow.tsx:81 for isExperienceSelected).

3. Audio sent as type: 'image_url' in the chat completions payload

contentArray.push({ type: 'image_url', image_url: { url: omniAudioDataUrl } });

This reuses the vision multimodal structure for audio. The Qwen2.5-Omni model's expected multimodal format via llama.cpp may be different — for example, a top-level audio_url content type. This may silently fail or be ignored by the model. It's worth confirming whether llama.cpp's Qwen2.5-Omni implementation specifically handles data:audio/wav;base64,... via image_url before shipping.
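For comparison, if llama.cpp's server follows OpenAI's input_audio content-part convention (an assumption — verify against the server version actually in use), the audio part would look like this rather than riding in image_url:

```typescript
// Hypothetical alternative shape (OpenAI-style input_audio content part).
// wavBase64 is the base64 WAV payload WITHOUT the data: URI prefix.
const wavBase64 = '<base64-encoded WAV bytes>'; // placeholder
const audioPart = {
  type: 'input_audio' as const,
  input_audio: {
    data: wavBase64,
    format: 'wav' as const,
  },
};
```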

4. No recording duration limit

omniAudioChunksRef accumulates raw PCM16 indefinitely. At 16 kHz mono × 2 bytes = ~32 KB/s. A 10-minute accidental recording would allocate ~19 MB in the React tree, and the btoa conversion would be proportionally slow. Consider capping recording at e.g. 60 seconds and showing a timer in the audio-recording-indicator.
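A minimal sketch of such a cap — the 60-second limit and the 16 kHz mono PCM16 figures come from the math above; the helper names are hypothetical, not from the PR:

```typescript
// Assumed recording format: 16 kHz mono PCM16 (2 bytes per sample).
const MAX_RECORDING_MS = 60_000;
const SAMPLE_RATE = 16000;

// Pure helper: stop once the accumulated samples reach the cap.
function shouldStopRecording(totalSamples: number): boolean {
  return (totalSamples / SAMPLE_RATE) * 1000 >= MAX_RECORDING_MS;
}

// Bytes allocated so far — handy for a live timer / size indicator.
function recordedBytes(totalSamples: number): number {
  return totalSamples * 2;
}
```

In the recording callback, a `shouldStopRecording` check after each appended chunk could auto-stop the capture and attach the audio, keeping the worst case at roughly 60 s × 32 KB/s ≈ 1.9 MB.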

5. buildWavDataUrl — slow btoa encoding for large buffers

let binary = '';
const bytes = new Uint8Array(buffer);
for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
return `data:audio/wav;base64,${btoa(binary)}`;

Per-byte string concatenation over large arrays can degrade to O(n²) due to repeated reallocation, and a multi-megabyte recording makes that cost visible. For non-trivial recordings, prefer:

// Process in chunks to avoid stack overflow and improve perf
const CHUNK_SIZE = 8192;
const bytes = new Uint8Array(buffer);
let binary = '';
for (let i = 0; i < bytes.length; i += CHUNK_SIZE) {
  binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK_SIZE));
}
return `data:audio/wav;base64,${btoa(binary)}`;

🟢 Minor Notes

6. Audio upload button uses the microphone icon

LLMChatPanel.tsx:1098 — The audio file upload button uses <MicrophoneIcon active={false} />, making it visually identical to the mic-record button. Users have no way to distinguish "upload audio file" from "record audio". An upload/attachment icon would be clearer.

7. omni not in getCategoryLabel map (ModelManager.tsx:512)

The label map doesn't include 'omni', so it falls back to the capitalization logic and shows "Omni". This is fine as-is, but if you want consistency with other labels (e.g. "Vision", "Experience"), it's worth adding explicitly.

8. Model registry — mmproj path resolves correctly

The legacy_mmproj_to_checkpoint function in model_manager.cpp derives the mmproj HuggingFace path as {repo_from_checkpoint}:{mmproj_filename} — so for Qwen2.5-Omni-7B-GGUF, it resolves to ggml-org/Qwen2.5-Omni-7B-GGUF:mmproj-Qwen2.5-Omni-7B-f16.gguf. Please verify that file actually exists in that HF repo at the expected path before merge.


Bottom Line

The security issue (#1) must be fixed. Issue #2 (the fragile isExperienceModel expansion) is a meaningful design risk that will likely cause a follow-up bug. Issue #3 (audio content type) needs verification against actual llama.cpp behavior. The rest are polish items.

@ramkrishna2910 ramkrishna2910 marked this pull request as draft April 8, 2026 04:44
1. Fix XSS + URL redirect (CodeQL): validate URL schemes before
   rendering <img>/<audio> src attributes — only allow data:image/,
   data:audio/, https://, and http:// schemes.

2. Refactor experience/omni separation: revert isExperienceModel to
   its original pure form (composite experience models only). Add new
   isExperienceOrOmni() helper used exclusively in the UI layer
   (ChatWindow, ModelSelector) where experience-mode behavior is
   needed. This eliminates the fragile !isOmniModel() guards that
   were scattered across ModelManager and backendInstaller.

3. Add 60-second recording duration limit with a live timer display
   in the recording indicator. Auto-stops and attaches audio when
   the limit is reached.

4. Fix O(n²) btoa encoding: use chunk-based String.fromCharCode
   with subarray slices (8KB chunks) instead of per-byte string
   concatenation.
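The refactor in item 2 could take roughly this shape — the set contents are illustrative and the real predicates live in experienceModels.ts:

```typescript
// Hypothetical sketch of the experience/omni split described above.
const OMNI_MODELS = new Set(['Qwen2.5-Omni-7B', 'Qwen2.5-Omni-3B']);
const COMPOSITE_EXPERIENCE_MODELS = new Set<string>([/* composite ids */]);

function isOmniModel(id: string): boolean {
  return OMNI_MODELS.has(id);
}

// Back to its original pure form: true composite experiences only.
function isExperienceModel(id: string): boolean {
  return COMPOSITE_EXPERIENCE_MODELS.has(id);
}

// UI-layer helper: experience-mode chrome for both kinds of model,
// so ModelManager/backendInstaller never need !isOmniModel() guards.
function isExperienceOrOmni(id: string): boolean {
  return isExperienceModel(id) || isOmniModel(id);
}
```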

https://claude.ai/code/session_01Pwbj1PmPJCpGMTSoba4ML4
@ramkrishna2910 ramkrishna2910 self-assigned this Apr 12, 2026