wanted to bounce something off you. Have you tried router pre-bias as an ablation mechanism on MoE models?
The idea: instead of (or in addition to) the usual residual-direction orthogonalization or expert zero-out, we bias the gate logits before the top-k routing decision. For Qwen3.5-MoE / Qwen3.6-MoE (256 experts,
top-8 per token), the gate produces per-expert logits, and softmax + top-k over those picks the experts that run. We add a per-expert bias b[L,e] = -α × log_ratio[L,e], where L is the layer, e is the expert, and
log_ratio[L,e] = log(p_refused_routes_to_e / p_complied_routes_to_e), estimated from an activation pass over refused and complied prompts.
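To make the bias formula concrete, here's a minimal sketch for a single layer, in plain Python. The function names, the count-based way of estimating the routing probabilities, and the eps smoothing term are all my assumptions for illustration, not something pulled from our actual pipeline.

```python
import math

def routing_log_ratio(refused_counts, complied_counts, n_refused, n_complied, eps=1e-6):
    """Per-expert log(p_refused / p_complied) for one layer.

    refused_counts[e] / complied_counts[e]: how many tokens had expert e in
    their top-k on refused / complied prompts (hypothetical inputs).
    eps is an assumed smoothing term so unused experts do not produce log(0).
    """
    ratios = []
    for r, c in zip(refused_counts, complied_counts):
        p_ref = r / n_refused + eps
        p_com = c / n_complied + eps
        ratios.append(math.log(p_ref / p_com))
    return ratios

def router_bias(log_ratio, alpha):
    """b[e] = -alpha * log_ratio[e], as in the formula above."""
    return [-alpha * lr for lr in log_ratio]

# Toy layer with 4 experts and 100 routed tokens per prompt class.
refused = [80, 10, 50, 0]
complied = [20, 10, 50, 40]
log_ratio = routing_log_ratio(refused, complied, 100, 100)
bias = router_bias(log_ratio, alpha=0.5)
```

Expert 0 (refusal-heavy) ends up with a negative bias, expert 3 (compliance-heavy) with a positive one, and experts 1 and 2, which are routed to equally often in both classes, get a bias of zero.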
Effect: experts that the model preferentially routes to on refused prompts get pushed DOWN, experts that fire on complied prompts get pushed UP. The router still picks the top-8, but the population of "top-8"
gets shifted away from the refusal-correlated subset. We're not deleting experts, not editing weights — just re-allocating routing.
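The re-allocation effect can be sketched on a toy router, again in plain Python. The helper name `top_k_experts` and the example numbers are made up for illustration; a real gate would do this on tensors, with 256 experts and k=8.

```python
def top_k_experts(gate_logits, k, bias=None):
    """Return the indices of the k experts with the largest (biased) logits.

    Softmax is monotonic, so selecting on biased logits picks the same
    experts as selecting on the softmax of the biased logits.
    """
    if bias is None:
        bias = [0.0] * len(gate_logits)
    shifted = [g + b for g, b in zip(gate_logits, bias)]
    order = sorted(range(len(shifted)), key=lambda e: shifted[e], reverse=True)
    return sorted(order[:k])

# Toy example: 4 experts, top-2 routing.
logits = [2.0, 1.5, 1.0, 0.9]
bias = [-1.5, 0.0, 0.0, 0.5]   # expert 0 pushed down, expert 3 pushed up

baseline = top_k_experts(logits, k=2)        # experts 0 and 1
rerouted = top_k_experts(logits, k=2, bias=bias)  # experts 1 and 3
```

No expert is removed and no weight is touched; only the selected set changes. Whether the mixing weights for the chosen experts are then computed from the biased or the unbiased logits is a separate design choice that this sketch doesn't cover.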
Empirical results (Qwen3.5-35B-A3B and Qwen3.6-35B-A3B, abliterating safety refusals):
- Sharp non-monotonic optimum at α=0.5: +0.093 composite-score gain over baseline (the biggest single-step jump we'd seen since rank-3 subspace ortho)
- α=1.0 actually does WORSE than baseline; over-rerouting collapses something
- Stacks cleanly with residual-orthogonalization (router-bias affects WHICH experts run; ortho affects WHAT they compute — orthogonal axes)
On Qwen3.6-35B-A3B with NSGA-II combination search (40 trials), our cleanest ablation came from a pure router-bias plan: 4 layers (L5_α=2.0, L8_α=0.5, L14_α=0.5, L20_α=1.0), nothing else. Result: flip rate
0.9375, MMLU/GSM8K/PPL damage 0.000, composite score 0.9375. No collateral, no util drop — refusal just reroutes around itself.
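For reference, a per-layer plan like that one can be written as a small mapping from layer to α. This is only a sketch: I'm reading "L5" as layer index 5, and the `build_layer_biases` helper and the toy log-ratio values are hypothetical (two experts per layer, just to keep it readable).

```python
# Layer -> alpha, taken from the plan above (layer indexing is an assumption).
PLAN = {5: 2.0, 8: 0.5, 14: 0.5, 20: 1.0}

def build_layer_biases(log_ratio_by_layer, plan):
    """Return {layer: bias vector} for the layers named in the plan.

    Layers that are not in the plan get no entry, i.e. their routers
    are left untouched.
    """
    return {
        layer: [-alpha * lr for lr in log_ratio_by_layer[layer]]
        for layer, alpha in plan.items()
    }

# Toy log-ratios, made-up numbers.
toy_log_ratios = {
    5: [1.0, -1.0],
    8: [0.4, 0.0],
    14: [0.0, 0.2],
    20: [-0.5, 0.5],
    21: [0.3, -0.3],   # not in the plan, so it is ignored
}
biases = build_layer_biases(toy_log_ratios, PLAN)
```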