Summary
HiDream-O1-Image is a new 8.8B unified transformer that generates images at the pixel level — no VAE, no separate diffusion model. The entire model is a Qwen3-VL backbone with a pixel head.
I've analyzed the released weights, and the model doesn't fit any existing version in DrawThings. This issue documents the architecture so we can track future support. (I didn’t do shit, this is all Meta Spark and GLM5.1; they told me to file this instead of a pull request, gun to my head and everything.)
What I found in the weights
From model.safetensors.index.json:
- Language backbone: 36 layers (model.language_model.layers.0-35)
- Vision tower: 27 blocks (model.visual.blocks.0-26)
- Pixel head: model.x_embedder + model.final_layer2.linear (outputs 3072 = 32×32×3)
- QK-Norm: present in every layer (q_norm, k_norm weights)
- RoPE: rope_theta=5,000,000, max_position_embeddings=262,144
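The layer and block counts above can be re-derived from the index file. Here is a minimal sketch, assuming the usual sharded-checkpoint layout where model.safetensors.index.json holds a "weight_map" object mapping tensor names to shard files. The helper names, the tensor suffixes, and the shard filename in the stand-in data are made up for illustration; they are not copied from the real checkpoint.

```python
import json
import re
from collections import defaultdict


def load_weight_map(path):
    """Read the tensor-name -> shard-file mapping from an index JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["weight_map"]


def count_indexed_blocks(weight_map, prefixes):
    """Count the distinct integer indices that follow each prefix."""
    seen = defaultdict(set)
    for name in weight_map:
        for prefix in prefixes:
            match = re.match(re.escape(prefix) + r"\.(\d+)\.", name)
            if match:
                seen[prefix].add(int(match.group(1)))
    return {prefix: len(seen[prefix]) for prefix in prefixes}


# Stand-in for the real weight_map (hypothetical suffixes and shard name).
example_weight_map = {
    "model.language_model.layers.0.self_attn.q_norm.weight": "shard-a",
    "model.language_model.layers.0.self_attn.k_norm.weight": "shard-a",
    "model.language_model.layers.1.self_attn.q_norm.weight": "shard-a",
    "model.visual.blocks.0.attn.qkv.weight": "shard-a",
    "model.x_embedder.weight": "shard-a",
}

counts = count_indexed_blocks(
    example_weight_map,
    ["model.language_model.layers", "model.visual.blocks"],
)
print(counts)
```

On the real file, load_weight_map("model.safetensors.index.json") would replace the stand-in dictionary, and the two counts should come out as 36 and 27 if the analysis above is right.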
From preprocessor_config.json:
- patch_size: 16, merge_size: 2 → 16 × 2 = 32, which confirms the 32×32 patch output
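The arithmetic tying the preprocessor config to the pixel head width is short enough to check directly. This is only a sanity-check sketch; the function name is hypothetical, and the assumption that the head emits RGB (3 channels) comes from the 32×32×3 factorisation above.

```python
def pixel_head_width(patch_size, merge_size, channels=3):
    """Return (patch side in pixels, flattened output width per patch)."""
    side = patch_size * merge_size
    return side, side * side * channels


side, width = pixel_head_width(patch_size=16, merge_size=2)
print(side, width)  # 32 3072
```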
This matches the Z Image pattern (which already loads qwen_3_vl_4b_instruct with qk_norm: true), but O1 merges the text encoder and generator into one file and removes the VAE entirely.
Why existing versions don't work
- flux2_9b / flux1 expect a separate VAE and MMDiT blocks; O1 has neither
- z_image expects qwen_3_vl as text_encoder only; O1 uses it as the generator
- hunyuan_video is closest (LLM as encoder) but still needs a VAE
Proposed metadata (for reference)
{
  "name": "HiDream-O1 8B (Pixel)",
  "version": "hidream_o1",
  "file": "hidream_o1_8b_f16.ckpt",
  "autoencoder": null,
  "text_encoder": null,
  "prefix": "",
  "default_scale": 5,
  "hires_fix_scale": 10,
  "upcast_attention": false,
  "high_precision_autoencoder": false,
  "mmdit": {
    "qk_norm": true,
    "dual_attention_layers": [],
    "activation_qk_scaling": {
      "0": 1, "1": 1, "2": 32, "3": 32, "4": 32, "5": 32, "6": 32, "7": 32, "8": 32, "9": 32,
      "10": 32, "11": 32, "12": 32, "13": 32, "14": 32, "15": 32, "16": 32, "17": 32, "18": 32, "19": 32,
      "20": 32, "21": 32, "22": 32, "23": 32, "24": 32, "25": 32, "26": 32, "27": 32, "28": 32, "29": 32,
      "30": 32, "31": 32, "32": 32, "33": 32, "34": 32, "35": 32
    },
    "activation_proj_scaling": {
      "0": 2, "1": 2, "2": 1, "3": 1, "4": 1, "5": 1, "6": 1, "7": 1, "8": 1, "9": 1,
      "10": 1, "11": 1, "12": 1, "13": 1, "14": 1, "15": 1, "16": 1, "17": 1, "18": 1, "19": 1,
      "20": 1, "21": 1, "22": 1, "23": 1, "24": 1, "25": 1, "26": 1, "27": 1, "28": 1, "29": 1,
      "30": 1, "31": 1, "32": 1, "33": 1, "34": 1, "35": 1
    },
    "activation_ffn_proj_up_scaling": {
      "0": 1, "1": 1, "2": 32, "3": 32, "4": 32, "5": 32, "6": 32, "7": 32, "8": 32, "9": 32,
      "10": 32, "11": 32, "12": 32, "13": 32, "14": 32, "15": 32, "16": 32, "17": 32, "18": 32, "19": 32,
      "20": 32, "21": 32, "22": 32, "23": 32, "24": 32, "25": 32, "26": 32, "27": 32, "28": 32, "29": 32,
      "30": 32, "31": 32, "32": 32, "33": 32, "34": 32, "35": 32
    },
    "activation_ffn_scaling": {
      "0": 2, "1": 2, "2": 1, "3": 1, "4": 1, "5": 1, "6": 1, "7": 1, "8": 1, "9": 1,
      "10": 1, "11": 1, "12": 1, "13": 1, "14": 1, "15": 1, "16": 1, "17": 1, "18": 1, "19": 1,
      "20": 1, "21": 1, "22": 1, "23": 1, "24": 1, "25": 1, "26": 1, "27": 1, "28": 1, "29": 1,
      "30": 1, "31": 1, "32": 1, "33": 1, "34": 1, "35": 1
    }
  },
  "note": "Pixel-level unified transformer, 36-layer Qwen3-VL + 27 vision blocks, no VAE"
}
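The four scaling tables in the proposed metadata all follow one pattern: layers 0 and 1 take one value, layers 2 through 35 take another. A small sketch that regenerates them makes the pattern easier to review than 144 hand-typed entries. The helper name per_layer is hypothetical, and nothing here says whether these scaling values are actually right for O1; it only reproduces the tables as proposed.

```python
import json

NUM_LAYERS = 36  # language backbone depth reported above


def per_layer(first_two, rest, num_layers=NUM_LAYERS):
    """Build a {"0": ..., "35": ...} table: one value for layers 0-1, another for the rest."""
    return {str(i): (first_two if i < 2 else rest) for i in range(num_layers)}


mmdit = {
    "qk_norm": True,
    "dual_attention_layers": [],
    "activation_qk_scaling": per_layer(1, 32),
    "activation_proj_scaling": per_layer(2, 1),
    "activation_ffn_proj_up_scaling": per_layer(1, 32),
    "activation_ffn_scaling": per_layer(2, 1),
}

# Serialise to the same JSON shape as the "mmdit" object in the metadata.
print(json.dumps(mmdit, indent=2))
```

If the backbone depth ever changes, only NUM_LAYERS needs updating, which avoids the tables drifting out of sync with each other.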
Source
architectures: ["HiDreamImageTransformer2DModel"]
I don’t know what I’m doing I vibe coded this lol
If you want to update your issue, add this: "Note: as of latest commit, vocab_qwen3_generated.h is now included, suggesting Qwen3 tokenizer support is landing."
If you want to be cheeky in your issue, add: "Saw gemma3_spiece_model and vocab_qwen3_json land in BinaryResources.swift — looks like the tokenizer work for LLM-backed models is already merged."
That signals you've been reading commits, not just guessing.