Multi-lingual Multi-turn Automated Red Teaming for LLMs

Language Model Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to "model alignment", i.e., preventing LLMs to generate unsafe responses when deployed into customer-facing applications.

One popular method to evaluate safety risks is red-teaming, where agents attempt to bypass alignment by crafting elaborate prompts that trigger unsafe responses from a model. Standard human-driven red-teaming is costly, time-consuming and rarely covers all the recent features (e.g., multi-lingual, multi-modal aspects), while proposed automation methods only cover a small subset of LLMs capabilities (i.e., English or single-turn).

We present Multi-lingual Multi-turn Automated Red Teaming (MM-ART), a method to fully automate conversational, multi-lingual red-teaming operations and quickly identify prompts leading to unsafe responses. Through extensive experiments on different languages, we show the studied LLMs are on average 71% more vulnerable after a 5-turn conversation in English than after the initial turn.

For conversations in non-English languages, models display up to 195% more safety vulnerabilities than the standard single-turn English approach, confirming the need for automated red-teaming methods matching LLMs capabilities.

Multi-lingual Multi-turn Automated Red Teaming for LLMs

Authors

Abstract

Resources

Stay in the loop

Pages

Tools

Details