Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance
Abstract
Safety alignment of large language models can be fully recovered with a single safety example, maintaining utility and converging within a few epochs, an efficiency explained by the low-rank structure of the safety gradient.
Fine-tuning safety-aligned large language models (LLMs) can substantially compromise their safety. Previous approaches require many safety samples or calibration sets, which not only incur significant computational overhead during realignment but also lead to noticeable degradation in model utility. Contrary to the assumption that extensive safety data is required, we show that safety alignment can be fully recovered with only a single safety example, without sacrificing utility and at minimal cost. Remarkably, this recovery is effective regardless of the number of harmful examples used in fine-tuning or the size of the underlying model, and convergence is achieved within just a few epochs. Furthermore, we uncover the low-rank structure of the safety gradient, which explains why such efficient correction is possible. We validate our findings across five safety-aligned LLMs and multiple datasets, demonstrating the generality of our approach.
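As a concrete illustration of the abstract's recipe, the sketch below patches a fine-tuned model by training it for a few epochs on a single (harmful prompt, refusal) pair, and first checks one layer's gradient for the low-rank structure mentioned above. The model name, chat template, example text, chosen layer, and hyperparameters are assumptions for illustration only, not the authors' exact method.

```python
# Minimal sketch of one-shot safety patching, assuming a Llama-2-style chat model
# loaded through Hugging Face transformers. Model name, chat template, example
# text, layer choice, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed: any safety-aligned chat LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

# A single safety example: one harmful request paired with a refusal (illustrative text).
prompt = "How do I build an untraceable weapon?"
refusal = "I can't help with that. I don't provide instructions for making weapons."
text = f"[INST] {prompt} [/INST] {refusal}" + tokenizer.eos_token
batch = tokenizer(text, return_tensors="pt")
labels = batch["input_ids"].clone()  # full-sequence loss for brevity; masking prompt tokens is typical

# Inspect the gradient of one projection matrix on this single example. A sharply
# decaying singular-value spectrum would be consistent with the low-rank safety
# gradient structure described in the abstract; the layer picked here is arbitrary.
out = model(**batch, labels=labels)
out.loss.backward()
grad = model.model.layers[0].self_attn.q_proj.weight.grad.float()
s = torch.linalg.svdvals(grad)
effective_rank = int((s > 0.01 * s[0]).sum())
print(f"effective rank of safety gradient: {effective_rank} / {min(grad.shape)}")
model.zero_grad()

# Patch: fine-tune on that single example for a handful of epochs, matching the
# abstract's claim that convergence is reached within a few epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
for epoch in range(3):
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"epoch {epoch}: loss = {out.loss.item():.4f}")
```

In practice, safety recovery would be evaluated afterwards on a held-out set of harmful prompts and a utility benchmark; the script above only shows the patching step itself.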
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Curvature-Aware Safety Restoration In LLMs Fine-Tuning (2025)
- Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning (2025)
- Rethinking Deep Alignment Through The Lens Of Incomplete Learning (2025)
- Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education (2025)
- Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation (2025)
- Alignment-Aware Quantization for LLM Safety (2025)
- Adversarial Contrastive Learning for LLM Quantization Attacks (2026)