Request: HiDream-O1-Image support (unified Qwen3-VL, pixel-level, no VAE) #30

@ajwastaken

Summary

HiDream-O1-Image is a new 8.8B unified transformer that generates images at the pixel level — no VAE, no separate diffusion model. The entire model is a Qwen3-VL backbone with a pixel head.

I've analyzed the released weights, and the model doesn't fit any existing version in DrawThings. This issue documents the architecture so we can track future support. (Full disclosure: I didn't do any of this myself. The analysis is all from Meta Spark and GLM5.1, which told me to file this issue instead of a pull request.)

What I found in the weights

From model.safetensors.index.json:

  • Language backbone: 36 layers (model.language_model.layers.0-35)
  • Vision tower: 27 blocks (model.visual.blocks.0-26)
  • Pixel head: model.x_embedder + model.final_layer2.linear (outputs 3072 = 32×32×3)
  • QK-Norm: present in every layer (q_norm, k_norm weights)
  • RoPE: rope_theta=5,000,000, max_position_embeddings=262,144
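The layer counts above can be re-checked from the index file with a short script. This is a minimal sketch, assuming the usual sharded-safetensors index layout (a top-level `weight_map` dict of tensor name to shard file). The tensor-name suffixes in the demo data (`self_attn.q_norm.weight`, `attn.qkv.weight`) and the shard filename are illustrative placeholders, not names read from the actual checkpoint.

```python
import re
from collections import defaultdict


def count_blocks(weight_map, prefixes):
    """Count distinct numeric block indices found under each tensor-name prefix."""
    found = defaultdict(set)
    for name in weight_map:
        for prefix in prefixes:
            match = re.match(re.escape(prefix) + r"\.(\d+)\.", name)
            if match:
                found[prefix].add(int(match.group(1)))
    return {prefix: len(found[prefix]) for prefix in prefixes}


def has_qk_norm(weight_map):
    """True if both q_norm and k_norm tensors appear somewhere in the map."""
    names = list(weight_map)
    return any(".q_norm." in n for n in names) and any(".k_norm." in n for n in names)


# Synthetic stand-in for the "weight_map" section of model.safetensors.index.json.
# With the real file you would instead do:
#   weight_map = json.load(open("model.safetensors.index.json"))["weight_map"]
demo_weight_map = {}
for i in range(36):
    # Hypothetical suffixes, for illustration only.
    demo_weight_map[f"model.language_model.layers.{i}.self_attn.q_norm.weight"] = "shard.safetensors"
    demo_weight_map[f"model.language_model.layers.{i}.self_attn.k_norm.weight"] = "shard.safetensors"
for i in range(27):
    demo_weight_map[f"model.visual.blocks.{i}.attn.qkv.weight"] = "shard.safetensors"

counts = count_blocks(
    demo_weight_map,
    ["model.language_model.layers", "model.visual.blocks"],
)
print(counts)
print("qk_norm present:", has_qk_norm(demo_weight_map))
```

Running the same two functions against the real `weight_map` should reproduce the 36 / 27 split if the analysis above is right.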

From preprocessor_config.json:

  • patch_size: 16, merge_size: 2 → confirms 32×32 patch output
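The patch arithmetic can be spelled out as a quick check. This is a sketch that assumes the effective patch side is `patch_size * merge_size` and that the pixel head emits one flattened RGB patch per token. The token-count helper and the 1024×1024 example size are my own illustration, not something read from the config.

```python
def pixel_head_dim(patch_size, merge_size, channels=3):
    """Return (effective patch side, flattened output size per token)."""
    side = patch_size * merge_size
    return side, side * side * channels


def token_count(height, width, side):
    """Number of image tokens, assuming one token per side x side patch."""
    if height % side or width % side:
        raise ValueError("image dimensions must be multiples of the patch side")
    return (height // side) * (width // side)


side, out_dim = pixel_head_dim(patch_size=16, merge_size=2)
print(side, out_dim)                   # 32 3072
print(token_count(1024, 1024, side))   # 1024
```

The 3072 here matches the output width of `model.final_layer2.linear` listed above.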

This matches the Z Image pattern (which already loads qwen_3_vl_4b_instruct with qk_norm: true), but O1 merges the text encoder and generator into one file and removes the VAE entirely.

Why existing versions don't work

  • flux2_9b / flux1 expect a separate VAE and MMDiT blocks — O1 has neither
  • z_image expects qwen_3_vl as text_encoder only — O1 uses it as the generator
  • hunyuan_video is closest (LLM as encoder) but still needs a VAE

Proposed metadata (for reference)

{
  "name": "HiDream-O1 8B (Pixel)",
  "version": "hidream_o1",
  "file": "hidream_o1_8b_f16.ckpt",
  "autoencoder": null,
  "text_encoder": null,
  "prefix": "",
  "default_scale": 5,
  "hires_fix_scale": 10,
  "upcast_attention": false,
  "high_precision_autoencoder": false,
  "mmdit": {
    "qk_norm": true,
    "dual_attention_layers": [],
    "activation_qk_scaling": {
      "0": 1, "1": 1, "2": 32, "3": 32, "4": 32, "5": 32, "6": 32, "7": 32, "8": 32, "9": 32,
      "10": 32, "11": 32, "12": 32, "13": 32, "14": 32, "15": 32, "16": 32, "17": 32, "18": 32, "19": 32,
      "20": 32, "21": 32, "22": 32, "23": 32, "24": 32, "25": 32, "26": 32, "27": 32, "28": 32, "29": 32,
      "30": 32, "31": 32, "32": 32, "33": 32, "34": 32, "35": 32
    },
    "activation_proj_scaling": {
      "0": 2, "1": 2, "2": 1, "3": 1, "4": 1, "5": 1, "6": 1, "7": 1, "8": 1, "9": 1,
      "10": 1, "11": 1, "12": 1, "13": 1, "14": 1, "15": 1, "16": 1, "17": 1, "18": 1, "19": 1,
      "20": 1, "21": 1, "22": 1, "23": 1, "24": 1, "25": 1, "26": 1, "27": 1, "28": 1, "29": 1,
      "30": 1, "31": 1, "32": 1, "33": 1, "34": 1, "35": 1
    },
    "activation_ffn_proj_up_scaling": {
      "0": 1, "1": 1, "2": 32, "3": 32, "4": 32, "5": 32, "6": 32, "7": 32, "8": 32, "9": 32,
      "10": 32, "11": 32, "12": 32, "13": 32, "14": 32, "15": 32, "16": 32, "17": 32, "18": 32, "19": 32,
      "20": 32, "21": 32, "22": 32, "23": 32, "24": 32, "25": 32, "26": 32, "27": 32, "28": 32, "29": 32,
      "30": 32, "31": 32, "32": 32, "33": 32, "34": 32, "35": 32
    },
    "activation_ffn_scaling": {
      "0": 2, "1": 2, "2": 1, "3": 1, "4": 1, "5": 1, "6": 1, "7": 1, "8": 1, "9": 1,
      "10": 1, "11": 1, "12": 1, "13": 1, "14": 1, "15": 1, "16": 1, "17": 1, "18": 1, "19": 1,
      "20": 1, "21": 1, "22": 1, "23": 1, "24": 1, "25": 1, "26": 1, "27": 1, "28": 1, "29": 1,
      "30": 1, "31": 1, "32": 1, "33": 1, "34": 1, "35": 1
    }
  },
  "note": "Pixel-level unified transformer, 36-layer Qwen3-VL + 27 vision blocks, no VAE"
}
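The four per-layer scaling maps in the proposed metadata follow one pattern: layers 0 and 1 get one value and layers 2 through 35 get another. A small sketch like the one below could generate them rather than typing 36 entries by hand. The helper name `scaling_map` is hypothetical, and the values are simply copied from the JSON above. Whether they are the right values for this model is untested.

```python
import json

NUM_LAYERS = 36  # language backbone depth reported in the weights


def scaling_map(early, rest, num_layers=NUM_LAYERS, early_layers=2):
    """Build a {"layer index": scale} map: `early` for the first layers, `rest` after."""
    return {str(i): (early if i < early_layers else rest) for i in range(num_layers)}


# Mirrors the "mmdit" section of the proposed metadata.
mmdit = {
    "qk_norm": True,
    "dual_attention_layers": [],
    "activation_qk_scaling": scaling_map(1, 32),
    "activation_proj_scaling": scaling_map(2, 1),
    "activation_ffn_proj_up_scaling": scaling_map(1, 32),
    "activation_ffn_scaling": scaling_map(2, 1),
}

# Sanity check: every scaling map covers all 36 layers.
for key, value in mmdit.items():
    if key.startswith("activation_"):
        assert len(value) == NUM_LAYERS, key

print(json.dumps(mmdit, indent=2))
```

Generating the maps this way also makes it harder for a layer index to be dropped or duplicated if the depth changes in a later checkpoint.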

Source

I don't know what I'm doing; I vibe coded this, lol.

Note: as of the latest commit, vocab_qwen3_generated.h is now included, suggesting Qwen3 tokenizer support is landing.

gemma3_spiece_model and vocab_qwen3_json have also landed in BinaryResources.swift, so it looks like the tokenizer work for LLM-backed models is already merged.
