The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar

2606.26015

Authors

Vsevolod Karimov, Vitaliy Egorov, Bulat Khakimov, Alexander Panchenko, Ilseyar Alimova

Abstract

Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users. However, low-resource languages such as Tatar have received little research attention.

In this paper, we present Tatoxa, a novel state-of-the-art system for text detoxification in the Tatar language. Comparative experiments show that the proposed approach outperforms existing open-source and proprietary commercial LLMs on key quality metrics.

We also introduce a new dataset for text detoxification in Tatar, designed for fine-tuning and evaluation in low-resource settings. Finally, cross-lingual transfer experiments indicate that transfer from other languages, including the culturally close Russian, performs significantly worse than training on native Tatar data, even when a large Russian corpus is available.
