Weight-space LoRA attenuation + standard abliteration breaks DeepRefusal (89% ASR)

Hi @YuanBoXie and the DeepRefusal team — first, thank you for the paper and the public model release. The probabilistic ablation training objective is a genuinely interesting contribution, and releasing the defended Llama-3 checkpoint made it possible to study it end-to-end. I want to share a follow-up result in the same spirit.

## TL;DR

Running standard abliteration against your published model with [**abliterix**](https://github.com/wuwangzhang1216/abliterix) (a heretic derivative) reaches **89% ASR on AdvBench harmful_behaviors under a Gemini LLM-judge**, and **14/15 compliance on a hardcore prompt set** (pipe-bomb instructions, methamphetamine synthesis, credential-stealing malware, phishing templates, ID forgery, WiFi hacking — both English and Chinese). **No fine-tuning, two script invocations.**

This is the same defended model referenced in your `2026/04/09` update against heretic, which is why I thought it was worth writing up rather than staying quiet.

## Numbers side-by-side

| Attack | Source | ASR on `skysys00/Meta-Llama-3-8B-Instruct-DeepRefusal` |
| --- | --- | --- |
| heretic | your 2026/04/09 README note | ~0% ("unable to circumvent") |
| Refusal Ablation | paper Table 1 | 0.4% |
| Refusal-Transfer | paper Table 1 | 0.4% |
| Prefilling | paper Table 1 | 0.4% |
| GCG | paper Table 1 | 2.0% |
| CodeAttack | paper Table 1 | 0.2% |
| **abliterix (this work)** | [issue wuwangzhang1216/abliterix#11](https://github.com/wuwangzhang1216/abliterix/issues/11) | **89%** |

- Released broken model: https://huggingface.co/wangzhang/Llama-3-8B-Instruct-DeepRefusal-Broken
- Attack code: https://github.com/wuwangzhang1216/abliterix/commit/ac2197c
- Evaluation: 100 AdvBench harmful_behaviors + 15 hardcore red-team prompts, Gemini 3.1 Flash Lite LLM judge, `min_new_tokens=100`, `max_new_tokens=256`, greedy decoding. KL = 0.053 vs the defended model.

## The attack (friendly reading of your method)

The defense is a merged rank-16 LoRA adapter on top of Meta-Llama-3-8B-Instruct. SVD of `W_defended − W_base` confirms this cleanly — `v_proj`, `o_proj`, and `gate_proj` all show a sharp singular-value cliff at rank 16, exactly matching the `lora_rank = 16` hyperparameter from the paper. (Nice empirical confirmation of your setup, by the way.)
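The rank check described above can be illustrated on toy matrices. This is a minimal sketch, not the script used for the result: the matrix sizes, the random LoRA factors `lora_a` / `lora_b`, the helper name `effective_rank`, and the tolerance are all assumptions chosen for illustration. On the real checkpoints the same computation would be applied per projection matrix.

```python
import numpy as np

def effective_rank(delta, rel_tol=1e-6):
    """Count singular values larger than rel_tol times the largest one."""
    s = np.linalg.svd(delta, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 16  # toy sizes, not the Llama-3 dimensions

w_base = rng.standard_normal((d_out, d_in))
# Hypothetical LoRA factors: a merged adapter adds B @ A to the base weight.
lora_b = rng.standard_normal((d_out, r))
lora_a = rng.standard_normal((r, d_in))
w_defended = w_base + lora_b @ lora_a

# The weight difference has at most r non-negligible singular values.
delta = w_defended - w_base
print(effective_rank(delta))
```

A sharp drop in the singular-value spectrum after index `r` is the "cliff" referred to above; the relative tolerance only separates it from floating-point noise.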

Given that, the full attack is three steps:

1. **Attenuate the LoRA delta in weight space.**
   `W' = W_base + λ · (W_defended − W_base)` with `λ = 0.3`.
   This weakens the defense's ability to reconstruct the refusal direction from its trained backup pathways, without reverting fully to the base model (`λ = 0` would recover `W_base`; `λ = 1` leaves the defense intact).
2. **Standard single-direction abliteration on the attenuated weights.**
   Mean-difference vector, direct steering mode, projected abliteration, 60 Optuna trials.
3. **Merge and export.**

No iterative subspace, no gradient-based optimization, no fine-tuning. ~2 hours end-to-end on a single RTX 6000 Ada.
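The interpolation in step 1 is plain linear algebra, and a toy sketch makes the role of `λ` concrete. The function name `attenuate_delta` and the random 8×8 matrices are hypothetical stand-ins; nothing here loads a real checkpoint or reflects the abliterix implementation.

```python
import numpy as np

def attenuate_delta(w_base, w_defended, lam):
    """Return W_base + lam * (W_defended - W_base)."""
    return w_base + lam * (w_defended - w_base)

rng = np.random.default_rng(0)
w_base = rng.standard_normal((8, 8))                     # toy base weight
w_defended = w_base + 0.1 * rng.standard_normal((8, 8))  # toy defended weight

w_att = attenuate_delta(w_base, w_defended, lam=0.3)

# The remaining delta is the original delta scaled by lam.
ratio = np.linalg.norm(w_att - w_base) / np.linalg.norm(w_defended - w_base)
print(round(float(ratio), 6))
```

Because the map is linear in `λ`, the attenuated weights lie on the straight line between the two checkpoints, which is why having both checkpoints is all that the step requires.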

## Why I think the defense doesn't cover this case

Reading §4.2–4.3 and Equation 7 carefully: the training loss penalizes the model for failing to output the safe response `y` when the **inference-time hidden state** is modified by `h − r̂ r̂ᵀ h`. The model learns to keep the refusal direction in the residual stream robustly **under that specific intervention**, which is why direct abliteration (and the Refusal-Transfer baseline you tested) drops to 0.4% ASR.
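For readers less familiar with the notation, the intervention `h − r̂ r̂ᵀ h` is an orthogonal projection that removes the component of `h` along `r̂`. The sketch below uses random vectors and a hypothetical helper `ablate_direction`; it only illustrates the projection and makes no claim about how the paper or any tool implements it.

```python
import numpy as np

def ablate_direction(h, r):
    """Remove the component of h along r: h - r_hat (r_hat . h)."""
    r_hat = r / np.linalg.norm(r)
    return h - r_hat * (r_hat @ h)

rng = np.random.default_rng(0)
h = rng.standard_normal(16)  # toy hidden state
r = rng.standard_normal(16)  # toy direction, normalized inside the helper

h_abl = ablate_direction(h, r)

# After ablation the hidden state has no component along r,
# and applying the projection a second time changes nothing.
residual = float(h_abl @ r)
h_twice = ablate_direction(h_abl, r)
```

The projection acts on activations for a fixed set of weights, which is the distinction the next paragraph relies on.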

What the loss does **not** see during training is the case where the weights themselves are mechanically attenuated before the forward pass starts. Under `λ = 0.3`, the rebuilt-refusal machinery in the LoRA delta is running at 30% strength from the first layer onward, before any `r̂ r̂ᵀ h` intervention is applied. This weight-space perturbation lives outside the intervention distribution the probabilistic ablation loss samples from, and standard linear abliteration can then find and subtract what's left.

I don't think this contradicts the paper's core claim — probabilistic direction ablation genuinely does make the model more robust against representation-engineering attacks applied **at the same weights the model was trained with**. But it does mean that once the defended weights can be mechanically interpolated with the base weights, the defense unlocks.

## A few constructive thoughts for future work

Take or leave these as you see fit; you've obviously thought about this space much more than I have.

1. **Train with weight-space perturbations in addition to representation-space perturbations.** Something like sampling a random LoRA-shape weight perturbation at each step and including it in the forward pass, so the training distribution covers the "scaled LoRA delta" manifold.
2. **Release the defender as a LoRA adapter rather than a merged model**, so an attacker has to know where the defense lives. Merging it into the base weights makes `W_defended − W_base` a trivial computation for anyone with the two checkpoints.
3. **Evaluate under a stricter judge for the headline numbers.** When I re-ran the AdvBench baseline with Gemini 3.1 Flash Lite vs. a keyword classifier, the defended model's own compliance/refusal counts shifted noticeably. A stronger judge closes the "soft compliance" loophole and gives a more honest lower bound on ASR.
4. **A stronger version of the weight-space attack** might combine `λ`-attenuation with the iterative multi-pass subspace abliteration that abliterix now also ships. I didn't need it here, but it could help if a future version of DeepRefusal patches the simple attenuation path.
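As one possible reading of suggestion 1, a random LoRA-shape perturbation could be sampled as a low-rank product and added to a weight matrix before the forward pass. Everything in this sketch is an assumption made for illustration: the helper name `sample_lora_perturbation`, the Gaussian factors, the Frobenius-norm scaling, and the toy sizes. The suggestion above does not fix a distribution or a scale.

```python
import numpy as np

def sample_lora_perturbation(shape, rank, scale, rng):
    """Sample a random rank-`rank` matrix with Frobenius norm `scale`."""
    d_out, d_in = shape
    b = rng.standard_normal((d_out, rank))
    a = rng.standard_normal((rank, d_in))
    delta = b @ a
    # Rescale so the perturbation size is controlled by `scale` alone.
    return scale * delta / np.linalg.norm(delta)

rng = np.random.default_rng(0)
w = rng.standard_normal((32, 24))  # toy weight matrix

# A fresh perturbation would be drawn at each training step.
delta = sample_lora_perturbation(w.shape, rank=4, scale=0.05, rng=rng)
w_perturbed = w + delta
```

Scaling an existing adapter delta by a random factor would be a narrower variant of the same idea; which family of perturbations gives useful robustness is an open question here.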

Happy to discuss any of this further, and happy to co-write a follow-up or supplementary note if you'd like to include the result in a revision. The entire pipeline is reproducible from the abliterix repo — see the linked commit and issue for the config file and Optuna checkpoint format used to pick the winning trial.

Thanks again for the paper and for making the defended checkpoint public — it's exactly the kind of release that makes this sort of red-teaming possible in the first place.

— Wangzhang Wu (https://github.com/wuwangzhang1216)
