Weight-space LoRA attenuation + standard abliteration breaks DeepRefusal (89% ASR) #2

@wuwangzhang1216

Hi @YuanBoXie and the DeepRefusal team — first, thank you for the paper and the public model release. The probabilistic ablation training objective is a genuinely interesting contribution, and releasing the defended Llama-3 checkpoint made it possible to study it end-to-end. I want to share a follow-up result in the same spirit.

TL;DR

Running standard abliteration against your published model with abliterix (a heretic derivative) reaches 89% ASR on AdvBench harmful_behaviors under a Gemini LLM-judge, and 14/15 compliance on a hardcore prompt set (pipe-bomb instructions, methamphetamine synthesis, credential-stealing malware, phishing templates, ID forgery, WiFi hacking — both English and Chinese). No fine-tuning, two script invocations.

This is the same defended model referenced in your 2026/04/09 update against heretic, which is why I thought it was worth writing up rather than staying quiet.

Numbers side-by-side

| Attack | Source | ASR on skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal |
| --- | --- | --- |
| heretic | your 2026/04/09 README note | ~0% ("unable to circumvent") |
| Refusal Ablation | paper Table 1 | 0.4% |
| Refusal-Transfer | paper Table 1 | 0.4% |
| Prefilling | paper Table 1 | 0.4% |
| GCG | paper Table 1 | 2.0% |
| CodeAttack | paper Table 1 | 0.2% |
| abliterix (this work) | issue wuwangzhang1216/abliterix#11 | 89% |

The attack (friendly reading of your method)

The defense is a merged rank-16 LoRA adapter on top of Meta-Llama-3-8B-Instruct. SVD of W_defended − W_base confirms this cleanly — v_proj, o_proj, and gate_proj all show a sharp singular-value cliff at rank 16, exactly matching the lora_rank = 16 hyperparameter from the paper. (Nice empirical confirmation of your setup, by the way.)
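
The rank check described above can be illustrated on a toy matrix. This is a minimal sketch, not the analysis script actually used: the matrix sizes, the random weights, and the `1e-8` relative threshold are all assumptions chosen for illustration, and a synthetic rank-16 product stands in for the real merged adapter.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 48, 16  # toy sizes; real projection matrices are far larger

W_base = rng.standard_normal((d_out, d_in))

# A merged LoRA update has the form B @ A with inner dimension `rank`.
B = rng.standard_normal((d_out, rank))
A = rng.standard_normal((rank, d_in))
W_defended = W_base + B @ A

# Singular values of the weight difference, sorted in descending order.
singular_values = np.linalg.svd(W_defended - W_base, compute_uv=False)

# Count singular values that are non-negligible relative to the largest one.
effective_rank = int(np.sum(singular_values > 1e-8 * singular_values[0]))

print("effective rank:", effective_rank)
print("value at the cliff edge:", singular_values[rank - 1])
print("first value past the cliff:", singular_values[rank])
```

On a real checkpoint the values past the cliff would only be exactly zero up to floating-point noise, so the threshold would need to be chosen by inspecting the spectrum rather than hard-coded.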

Given that, the full attack is three steps:

  1. Attenuate the LoRA delta in weight space.
    W' = W_base + λ · (W_defended − W_base) with λ = 0.3.
    This weakens the defense's ability to reconstruct the refusal direction from its trained backup pathways, while stopping short of reverting fully to the base model (which would be λ = 0).
  2. Standard single-direction abliteration on the attenuated weights.
    Settings: mean-difference direction vector, direct steering mode, projected abliteration, 60 Optuna trials.
  3. Merge + export.
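
The interpolation formula in step 1 can be checked numerically on toy matrices. This is only a sketch of the arithmetic, not the abliterix implementation: the helper name `attenuate`, the 8×8 shapes, and the rank-2 stand-in delta are hypothetical.

```python
import numpy as np

def attenuate(W_base, W_defended, lam):
    """Return W_base + lam * (W_defended - W_base)."""
    return W_base + lam * (W_defended - W_base)

rng = np.random.default_rng(1)
W_base = rng.standard_normal((8, 8))
# Toy low-rank delta standing in for a merged adapter (rank 2 here).
W_defended = W_base + rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))

lam = 0.3
W_att = attenuate(W_base, W_defended, lam)

# Singular values of the delta before and after attenuation.
full = np.linalg.svd(W_defended - W_base, compute_uv=False)
kept = np.linalg.svd(W_att - W_base, compute_uv=False)

print("delta singular values, original:  ", full[:2])
print("delta singular values, attenuated:", kept[:2])
```

Every singular value of the delta shrinks by the same factor λ, so the delta keeps its rank and its directions and only loses magnitude. The endpoints behave as expected: λ = 0 returns the base weights and λ = 1 returns the defended weights.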

No iterative subspace, no gradient-based optimization, no fine-tuning. ~2 hours end-to-end on a single RTX 6000 Ada.

Why I think the defense doesn't cover this case

As I read §4.2–4.3 and Equation 7, the training loss penalizes the model for failing to output the safe response y when the inference-time hidden state h is replaced by h − r̂ r̂ᵀ h. The model learns to keep the refusal direction in the residual stream robustly under that specific intervention, which is why direct abliteration (and the Refusal-Transfer baseline you tested) drops to 0.4% ASR.
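
For reference, the intervention h − r̂ r̂ᵀ h is an orthogonal projection that removes the component of h along the unit vector r̂. A minimal sketch on random vectors follows; the function name `ablate_direction` and the toy dimensions are assumptions, and the code is not taken from the paper or from abliterix.

```python
import numpy as np

def ablate_direction(h, r):
    """Remove the component of h along r: h - r_hat r_hat^T h.

    h may be a single vector of shape (d,) or a batch of shape (n, d).
    """
    r_hat = r / np.linalg.norm(r)
    # Scalar projection of each hidden state onto r_hat.
    coeff = np.asarray(h @ r_hat)[..., None]
    return h - coeff * r_hat

rng = np.random.default_rng(2)
h = rng.standard_normal((5, 8))  # toy batch of hidden states
r = rng.standard_normal(8)       # toy direction, not a real refusal direction

ablated = ablate_direction(h, r)
r_hat = r / np.linalg.norm(r)

print("largest remaining component along r_hat:", np.abs(ablated @ r_hat).max())
```

The remaining component along r̂ is zero up to rounding, and applying the projection a second time changes nothing.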

What the loss does not see during training is the case where the weights themselves are mechanically attenuated before the forward pass starts. Under λ = 0.3, the rebuilt-refusal machinery in the LoRA delta is running at 30% strength from the first layer onward, before any r̂ r̂ᵀ h intervention is applied. This weight-space perturbation lives outside the intervention distribution the probabilistic ablation loss samples from, and standard linear abliteration can then find and subtract what's left.

I don't think this contradicts the paper's core claim: probabilistic direction ablation genuinely does make the model more robust against representation-engineering attacks applied at the same weights the model was trained with. But it does mean that once the defended weights can be mechanically interpolated with the base weights, the defense can be undone.

A few constructive thoughts for future work

Take these or leave them as you see fit; you've obviously thought about this space much more than I have.

  1. Train with weight-space perturbations in addition to representation-space perturbations. Something like sampling a random LoRA-shape weight perturbation at each step and including it in the forward pass, so the training distribution covers the "scaled LoRA delta" manifold.
  2. Release the defender as a LoRA adapter rather than a merged model, so an attacker has to know where the defense lives. Merging it into the base weights makes W_defended − W_base a trivial computation for anyone with the two checkpoints.
  3. Evaluate under a stricter judge for the headline numbers. When I re-ran the AdvBench baseline with Gemini 3.1 Flash Lite vs. a keyword classifier, the defended model's own compliance/refusal counts shifted noticeably — a stronger judge closes the "soft compliance" loophole and gives a more honest lower bound on ASR.
  4. A stronger version of the weight-space attack might combine λ-attenuation with the iterative multi-pass subspace abliteration that abliterix now also ships. I didn't need it here, but it could help if a future version of DeepRefusal patches the simple attenuation path.
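
To make suggestion 1 a bit more concrete, here is one way a per-step weight perturbation could be sampled. This is a hypothetical sketch, not something from the paper: the function `sample_perturbed_weight`, the uniform distribution over λ, the extra random low-rank noise term, and `noise_scale` are all assumptions, and a real training loop would apply this to the adapted layers before each forward pass.

```python
import numpy as np

def sample_perturbed_weight(W_base, delta, rng, rank=4, noise_scale=0.01):
    """Sample W_base + lam * delta + low-rank noise (hypothetical scheme).

    lam is drawn uniformly from [0, 1] so training covers scaled versions of
    the learned delta; the noise term has the same low-rank shape as a LoRA
    update with inner dimension `rank`.
    """
    lam = rng.uniform(0.0, 1.0)
    B = rng.standard_normal((W_base.shape[0], rank))
    A = rng.standard_normal((rank, W_base.shape[1]))
    noise = noise_scale * (B @ A) / np.sqrt(rank)
    return W_base + lam * delta + noise, lam

rng = np.random.default_rng(3)
W_base = rng.standard_normal((32, 32))
# Toy rank-4 delta standing in for the trained adapter.
delta = rng.standard_normal((32, 4)) @ rng.standard_normal((4, 32))

W_step, lam = sample_perturbed_weight(W_base, delta, rng)

# The residual after removing the scaled delta is the low-rank noise term.
residual_rank = np.linalg.matrix_rank(W_step - W_base - lam * delta, tol=1e-8)
print("sampled lam:", lam, "noise rank:", residual_rank)
```

Whether uniform sampling of λ is the right choice, and whether the added noise helps at all, would have to be settled empirically.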

Happy to discuss any of this further, and happy to co-write a follow-up or supplementary note if you'd like to include the result in a revision. The entire pipeline is reproducible from the abliterix repo — see the linked commit and issue for the config file and Optuna checkpoint format used to pick the winning trial.

Thanks again for the paper and for making the defended checkpoint public — it's exactly the kind of release that makes this sort of red-teaming possible in the first place.

— Wangzhang Wu (https://github.com/wuwangzhang1216)
