CANDLE: Character-level Arabic Noise Deduplication using Lightweight Encoder

2606.24758

Authors

Taif Nono, Orjuwan Zaafarani, Kholood Al Tabash, Ahmad Ghannam, Anas Salamah

Abstract

Handling repeated characters in text can be tricky, since they can represent either the correct spelling of a word or informal character elongation often seen in social media posts. We present CANDLE, a lightweight system for character-level Arabic noise deduplication that addresses this challenge without relying on handcrafted rules, dictionaries, or morphological analyzers.

At the heart of CANDLE is a novel application of Connectionist Temporal Classification (CTC) to this task, a formulation not previously explored for character deduplication, which frames normalization as a sequence alignment problem over a character-based encoder. Evaluated on three benchmarks spanning clean newspaper text, manually curated ambiguous cases, and real-world social media text, the CTC model achieves a Sentence Error Rate (SER) as low as $5.37\%$ and consistently outperforms a classification-based baseline by a large margin.
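To illustrate why CTC is a natural fit for this task, the following is a minimal sketch of the standard CTC collapse rule (merge runs of identical labels, then drop blanks). It is not the authors' implementation: the blank symbol `_`, the function name `ctc_collapse`, and the hand-written frame labels are assumptions for illustration, and the abstract does not state how CANDLE's encoder produces per-character labels.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real models use a reserved class index


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the standard CTC collapse: merge repeated labels, then remove blanks."""
    # Step 1: merge each run of identical consecutive labels into a single label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop blank labels; a blank between two identical labels keeps both.
    return "".join(label for label in merged if label != blank)


# Elongation: repeated labels with no blank between them collapse to one character.
print(ctc_collapse(list("soooo")))                      # so
print(ctc_collapse(list("جمييييل")))                    # جميل

# Genuine double letter: a blank between the two labels preserves both.
print(ctc_collapse(["c", "o", "o", "_", "o", "l"]))     # cool
```

Under this rule, a model that wants to keep a legitimately doubled character only has to emit a blank between the two copies, while informal elongation collapses without any dictionary lookup.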

To reduce inference overhead, we distill the 6-layer CTC model into a 2-layer student, achieving a $3\times$ depth reduction with minimal performance degradation. Beyond deduplication accuracy, normalization yields a practical downstream benefit: a relative reduction in tokenizer fertility of up to $12.8\%$ across a diverse set of Arabic LLM tokenizers, directly lowering inference costs and improving context window utilization.
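The fertility figure can be made concrete with a small sketch. It assumes the common definition of fertility as the average number of tokens per whitespace-delimited word; the abstract does not spell out the exact computation. The `toy_tokenize` function is a hypothetical stand-in for a real subword tokenizer that simply splits each word into fixed-size character chunks.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    return total_tokens / total_words


def relative_reduction(before, after):
    """Relative drop in fertility after normalization, as a fraction of `before`."""
    return (before - after) / before


def toy_tokenize(text, size=3):
    """Hypothetical tokenizer: split each word into chunks of at most `size` characters."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + size] for i in range(0, len(word), size))
    return tokens


noisy = ["sooooo goooood"]   # elongated input
clean = ["so good"]          # the same text after deduplication

f_noisy = fertility(noisy, toy_tokenize)   # 5 tokens / 2 words = 2.5
f_clean = fertility(clean, toy_tokenize)   # 3 tokens / 2 words = 1.5
print(f_noisy, f_clean, relative_reduction(f_noisy, f_clean))
```

The toy numbers are far larger than the reported reduction of up to $12.8\%$ because every word in the example is elongated; in real text only a fraction of words are affected.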

We release all code and models publicly to support reproducibility and advance future research: https://github.com/abjadai/candle
