Request: HiDream-O1-Image support (unified Qwen3-VL, pixel-level, no VAE) #30

### Summary
HiDream-O1-Image is a new 8.8B unified transformer that generates images at the pixel level, with no VAE and no separate diffusion model. The entire model is a Qwen3-VL backbone with a pixel head.

I've analyzed the released weights, and the model doesn't fit any existing `version` in DrawThings. This issue documents the architecture so we can track future support. _(Disclosure: the analysis was done with Meta Spark and GLM5.1, which suggested filing an issue rather than a pull request.)_

### What I found in the weights

From `model.safetensors.index.json`:
- **Language backbone:** 36 layers (`model.language_model.layers.0-35`)
- **Vision tower:** 27 blocks (`model.visual.blocks.0-26`)
- **Pixel head:** `model.x_embedder` + `model.final_layer2.linear` (outputs 3072 = 32×32×3)
- **QK-Norm:** present in every layer (`q_norm`, `k_norm` weights)
- **RoPE:** `rope_theta=5,000,000`, `max_position_embeddings=262,144`
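
The layer and block counts above can be re-derived from the index file with a short script. This is a minimal sketch, assuming the usual sharded-safetensors index layout (a top-level `weight_map` object that maps tensor names to shard files). `count_indexed_blocks` and `summarize_index` are hypothetical helper names, and the exact `q_norm`/`k_norm` key suffixes are an assumption.

```python
import json
import re


def count_indexed_blocks(weight_map, prefix):
    """Count distinct integer indices that directly follow `prefix.` in tensor names."""
    pattern = re.compile(re.escape(prefix) + r"\.(\d+)\.")
    indices = {int(m.group(1)) for name in weight_map if (m := pattern.match(name))}
    return len(indices)


def summarize_index(path):
    """Load a safetensors index file and report block counts and QK-Norm presence."""
    with open(path, encoding="utf-8") as f:
        weight_map = json.load(f)["weight_map"]
    return {
        "language_layers": count_indexed_blocks(weight_map, "model.language_model.layers"),
        "vision_blocks": count_indexed_blocks(weight_map, "model.visual.blocks"),
        # Assumed naming: any tensor whose name contains `.q_norm.` or `.k_norm.`
        "has_qk_norm": any(".q_norm." in n or ".k_norm." in n for n in weight_map),
    }
```

If the findings listed above hold, `summarize_index("model.safetensors.index.json")` should report 36 language layers and 27 vision blocks.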

From `preprocessor_config.json`:
- `patch_size: 16`, `merge_size: 2` → confirms 32×32 patch output
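
The patch arithmetic can be checked directly. This is a small sketch, assuming the effective output patch edge is `patch_size * merge_size` and that each pixel carries three RGB channels, as the numbers above imply. The function names are illustrative only, and the token count per image is an inference from the patch size, not something confirmed by the model config.

```python
def pixel_head_width(patch_size, merge_size, channels=3):
    """Number of values the pixel head emits for one merged patch."""
    edge = patch_size * merge_size  # 16 * 2 = 32 pixels per side
    return edge * edge * channels


def tokens_for_image(width, height, patch_size=16, merge_size=2):
    """Number of merged-patch tokens needed to tile an image (assumes exact divisibility)."""
    edge = patch_size * merge_size
    if width % edge or height % edge:
        raise ValueError(f"image size must be a multiple of {edge}")
    return (width // edge) * (height // edge)


print(pixel_head_width(16, 2))       # 3072
print(tokens_for_image(1024, 1024))  # 1024
```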

This matches the Z Image pattern (which already loads `qwen_3_vl_4b_instruct` with `qk_norm: true`), but O1 merges the text encoder and generator into one file and removes the VAE entirely.

### Why existing versions don't work
- `flux2_9b` / `flux1` expect a separate VAE and MMDiT blocks — O1 has neither
- `z_image` expects `qwen_3_vl` as *text_encoder* only — O1 uses it as the generator
- `hunyuan_video` is closest (LLM as encoder) but still needs a VAE

### Proposed metadata (for reference)

```json
{
  "name": "HiDream-O1 8B (Pixel)",
  "version": "hidream_o1",
  "file": "hidream_o1_8b_f16.ckpt",
  "autoencoder": null,
  "text_encoder": null,
  "prefix": "",
  "default_scale": 5,
  "hires_fix_scale": 10,
  "upcast_attention": false,
  "high_precision_autoencoder": false,
  "mmdit": {
    "qk_norm": true,
    "dual_attention_layers": [],
    "activation_qk_scaling": {
      "0": 1, "1": 1, "2": 32, "3": 32, "4": 32, "5": 32, "6": 32, "7": 32, "8": 32, "9": 32,
      "10": 32, "11": 32, "12": 32, "13": 32, "14": 32, "15": 32, "16": 32, "17": 32, "18": 32, "19": 32,
      "20": 32, "21": 32, "22": 32, "23": 32, "24": 32, "25": 32, "26": 32, "27": 32, "28": 32, "29": 32,
      "30": 32, "31": 32, "32": 32, "33": 32, "34": 32, "35": 32
    },
    "activation_proj_scaling": {
      "0": 2, "1": 2, "2": 1, "3": 1, "4": 1, "5": 1, "6": 1, "7": 1, "8": 1, "9": 1,
      "10": 1, "11": 1, "12": 1, "13": 1, "14": 1, "15": 1, "16": 1, "17": 1, "18": 1, "19": 1,
      "20": 1, "21": 1, "22": 1, "23": 1, "24": 1, "25": 1, "26": 1, "27": 1, "28": 1, "29": 1,
      "30": 1, "31": 1, "32": 1, "33": 1, "34": 1, "35": 1
    },
    "activation_ffn_proj_up_scaling": {
      "0": 1, "1": 1, "2": 32, "3": 32, "4": 32, "5": 32, "6": 32, "7": 32, "8": 32, "9": 32,
      "10": 32, "11": 32, "12": 32, "13": 32, "14": 32, "15": 32, "16": 32, "17": 32, "18": 32, "19": 32,
      "20": 32, "21": 32, "22": 32, "23": 32, "24": 32, "25": 32, "26": 32, "27": 32, "28": 32, "29": 32,
      "30": 32, "31": 32, "32": 32, "33": 32, "34": 32, "35": 32
    },
    "activation_ffn_scaling": {
      "0": 2, "1": 2, "2": 1, "3": 1, "4": 1, "5": 1, "6": 1, "7": 1, "8": 1, "9": 1,
      "10": 1, "11": 1, "12": 1, "13": 1, "14": 1, "15": 1, "16": 1, "17": 1, "18": 1, "19": 1,
      "20": 1, "21": 1, "22": 1, "23": 1, "24": 1, "25": 1, "26": 1, "27": 1, "28": 1, "29": 1,
      "30": 1, "31": 1, "32": 1, "33": 1, "34": 1, "35": 1
    }
  },
  "note": "Pixel-level unified transformer, 36-layer Qwen3-VL + 27 vision blocks, no VAE"
}
```
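
The four scaling maps in the proposed metadata follow one pattern: layers 0 and 1 take one value, and layers 2 through 35 take another. Below is a sketch that generates them instead of hand-typing 36 entries per map. `layer_scaling` and `build_mmdit_block` are hypothetical helpers, and the values are copied from the proposal above, not taken from any DrawThings source.

```python
import json


def layer_scaling(num_layers, early_value, late_value, early_layers=2):
    """Map each layer index (as a string key) to a scale.

    The first `early_layers` layers get `early_value`; the rest get `late_value`.
    """
    return {
        str(i): early_value if i < early_layers else late_value
        for i in range(num_layers)
    }


def build_mmdit_block(num_layers=36):
    """Assemble the `mmdit` section of the proposed metadata (values from the proposal)."""
    return {
        "qk_norm": True,
        "dual_attention_layers": [],
        "activation_qk_scaling": layer_scaling(num_layers, 1, 32),
        "activation_proj_scaling": layer_scaling(num_layers, 2, 1),
        "activation_ffn_proj_up_scaling": layer_scaling(num_layers, 1, 32),
        "activation_ffn_scaling": layer_scaling(num_layers, 2, 1),
    }


mmdit = build_mmdit_block()
print(json.dumps(mmdit["activation_qk_scaling"]))
```

Generating the maps this way makes it harder to end up with a map that has a missing or duplicated layer key, and changing the layer count or the early/late split becomes a one-argument change.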

### Source
- Hugging Face: https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev
- Config confirms `architectures: ["HiDreamImageTransformer2DModel"]`

I don’t know what I’m doing; I vibe-coded this, lol.

Note: as of the latest commit, `vocab_qwen3_generated.h` is included, and `gemma3_spiece_model` and `vocab_qwen3_json` have landed in `BinaryResources.swift`, which suggests the tokenizer work for LLM-backed models is already merged.

