Ablation Technique - pre-bias #322

@janfeddersen-wq

Description

Wanted to bounce something off you. Have you tried router pre-bias as an ablation mechanism on MoE models?

The idea: instead of (or in addition to) the usual residual-direction orthogonalization or expert zero-out, we bias the gate logits before the top-k routing decision. For Qwen3.5-MoE / Qwen3.6-MoE (256 experts, top-8 per token), the gate produces logits that go through softmax + top-k to pick the experts. We add a per-layer, per-expert bias b[L,e] = -α × log_ratio[L,e], where log_ratio[L,e] = log(p_refused_routes_to_e / p_complied_routes_to_e) is measured in an activation pass over refused and complied prompts.
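
To make the formula concrete, here is a rough sketch of how the bias could be computed from routing counts. The function names, the count arrays, and the `eps` smoothing term are illustrative assumptions, not something taken from our actual pipeline:

```python
import numpy as np

def routing_log_ratio(refused_counts, complied_counts, eps=1e-6):
    """Return log(p_refused / p_complied) per (layer, expert).

    Both inputs have shape (num_layers, num_experts) and hold how often
    each expert was selected in each layer for that prompt set.
    """
    refused = np.asarray(refused_counts, dtype=np.float64) + eps
    complied = np.asarray(complied_counts, dtype=np.float64) + eps
    # Normalize counts into per-layer routing probabilities.
    # eps is an assumed smoothing term that avoids log(0) for unused experts.
    p_refused = refused / refused.sum(axis=-1, keepdims=True)
    p_complied = complied / complied.sum(axis=-1, keepdims=True)
    return np.log(p_refused / p_complied)

def router_bias(log_ratio, alpha):
    """b = -alpha * log_ratio: refusal-heavy experts get a negative bias."""
    return -alpha * np.asarray(log_ratio)

# Toy example: 1 layer, 4 experts (numbers are made up for illustration).
toy_refused = [[40, 20, 20, 20]]
toy_complied = [[20, 20, 20, 40]]
toy_bias = router_bias(routing_log_ratio(toy_refused, toy_complied), alpha=0.5)
```

In the toy example, expert 0 (favored on refused prompts) gets a negative bias, expert 3 (favored on complied prompts) gets a positive one, and the two neutral experts stay near zero.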

Effect: experts that the model preferentially routes to on refused prompts get pushed DOWN, and experts that fire on complied prompts get pushed UP. The router still picks a top-8, but the membership of that top-8 shifts away from the refusal-correlated subset. We're not deleting experts and not editing weights, just re-allocating routing.
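
A minimal sketch of the routing step with the bias applied, on toy arrays rather than a real model. `route_top_k` is a hypothetical helper; it assumes the routing weights are a softmax renormalized over the selected experts and that the bias feeds into both selection and weights, which may differ from how a given implementation wires it:

```python
import numpy as np

def route_top_k(gate_logits, bias=None, k=8):
    """Pick the top-k experts per token after adding an optional bias.

    gate_logits: (num_tokens, num_experts); bias: (num_experts,) or None.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    logits = np.asarray(gate_logits, dtype=np.float64)
    if bias is not None:
        logits = logits + np.asarray(bias, dtype=np.float64)
    # Indices of the k largest logits per token, in descending order.
    top_idx = np.argsort(-logits, axis=-1)[:, :k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected experts only (assumed renormalization).
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    weights = np.exp(top_logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return top_idx, weights

# Toy example: 1 token, 4 experts, top-2 routing.
toy_logits = np.array([[2.0, 1.5, 1.4, 0.0]])
toy_expert_bias = np.array([-1.0, 0.0, 0.0, 0.0])  # push expert 0 down
base_idx, _ = route_top_k(toy_logits, k=2)
biased_idx, biased_w = route_top_k(toy_logits, bias=toy_expert_bias, k=2)
```

Without the bias the token goes to experts 0 and 1; with it, expert 0 drops out and the token goes to experts 1 and 2. No expert is removed, only the selection changes.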

Empirical results (Qwen3.5-35B-A3B and Qwen3.6-35B-A3B, abliterating safety refusals):

  • Sharp non-monotonic optimum at α=0.5: +0.093 composite-score gain over baseline (the biggest single-step jump we'd seen since rank-3 subspace ortho)
  • α=1.0 actually does WORSE than baseline: over-rerouting collapses something
  • Stacks cleanly with residual-orthogonalization (router-bias affects WHICH experts run; ortho affects WHAT they compute, so the two act on orthogonal axes)

On Qwen3.6-35B-A3B with NSGA-II combination search (40 trials), our cleanest ablation came from a pure router-bias plan: 4 layers (L5_α=2.0, L8_α=0.5, L14_α=0.5, L20_α=1.0), nothing else. Result: flip rate 0.9375, MMLU/GSM8K/PPL damage 0.000, composite score 0.9375. No collateral, no utility drop: refusal just reroutes around itself.
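
For reference, a per-layer plan like the one above can be written as a layer-to-α mapping and expanded into a bias table. This is only a sketch: `build_bias_table`, the assumption that L5 means layer index 5, and the toy shapes and values are all illustrative:

```python
import numpy as np

def build_bias_table(log_ratio, plan):
    """Expand a {layer_index: alpha} plan into a (num_layers, num_experts) bias.

    Layers not listed in the plan get a zero bias, i.e. routing is untouched.
    """
    log_ratio = np.asarray(log_ratio, dtype=np.float64)
    table = np.zeros_like(log_ratio)
    for layer, alpha in plan.items():
        table[layer] = -alpha * log_ratio[layer]
    return table

# The 4-layer plan from the text, keyed by layer index (indexing is assumed).
plan = {5: 2.0, 8: 0.5, 14: 0.5, 20: 1.0}

# Toy log-ratio table: 24 layers x 4 experts, all ones (placeholder values).
toy_log_ratio = np.ones((24, 4))
toy_table = build_bias_table(toy_log_ratio, plan)
```

Keeping the plan as a small mapping makes it easy for a search procedure to propose or drop individual layers without touching the rest.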
