DS1 spectrogram: LLM Watermark Evasion via Bias Inversion

LLM Watermark Evasion via Bias Inversion

September 27, 20252509.23019

Authors

Jeongyeon Hwang,Sangdon Park,Jungseul Ok

Abstract

Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested.

To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme.

Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.

Resources

Stay in the loop

Every AI paper that matters, free in your inbox daily.

Details

  • © 2026 takara.ai Ltd
  • Content is sourced from third-party publications.