KOMBO: Korean Character Representations Based on the Combination Rules of Subcharacters

April 27, 2026

2604.23948

Authors

SangKeun Lee, SungHo Kim, Juhyeong Park, Yeachan Kim

Abstract

The Korean writing system, Hangeul, has a unique character representation that rigidly follows the invention principles recorded in Hunminjeongeum. (Hunminjeongeum is a book published in 1446 that describes the principles of invention and usage of Hangeul, devised by King Sejong [Hunminjeongeum_Guide].) However, existing pre-trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which is the first to apply the invention principles of Hangeul to character representation. KOMBO performs strongly across diverse NLP tasks.

In particular, our method outperforms the state-of-the-art Korean PLM by an average of 2.11% on five Korean natural language understanding tasks. Furthermore, extensive experiments demonstrate that our proposed method is well suited to capturing the linguistic features of the Korean language.

Consequently, we shed light on the superiority of using subcharacters over the typical subword-based approach for Korean PLMs. Our code is available at: https://github.com/SungHo3268/KOMBO.
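
To make the notion of subcharacters concrete, the sketch below splits a precomposed Hangeul syllable into its initial consonant, vowel, and optional final consonant using the standard Unicode code-point arithmetic for the block U+AC00 to U+D7A3. It is not taken from the KOMBO code base and does not reproduce the paper's combination rules; the names `decompose` and `compose` are hypothetical and chosen only for illustration.

```python
# Illustrative only: standard Unicode arithmetic for Hangul syllables,
# not the KOMBO implementation.
S_BASE = 0xAC00   # code point of the first precomposed syllable, "가"
S_LAST = 0xD7A3   # code point of the last precomposed syllable
N_MEDIAL = 21     # number of vowels
N_FINAL = 28      # number of final consonants, including "none"

INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose(syllable):
    """Return (initial, medial, final) for one Hangul syllable.

    Returns None when the character is not a precomposed syllable.
    """
    code = ord(syllable)
    if not S_BASE <= code <= S_LAST:
        return None
    offset = code - S_BASE
    initial = offset // (N_MEDIAL * N_FINAL)
    medial = (offset // N_FINAL) % N_MEDIAL
    final = offset % N_FINAL
    return INITIALS[initial], MEDIALS[medial], FINALS[final]


def compose(initial, medial, final=""):
    """Inverse of decompose: build a syllable from its subcharacters."""
    i = INITIALS.index(initial)
    m = MEDIALS.index(medial)
    f = FINALS.index(final)
    return chr(S_BASE + (i * N_MEDIAL + m) * N_FINAL + f)


print(decompose("한"))  # ('ㅎ', 'ㅏ', 'ㄴ')
print(decompose("글"))  # ('ㄱ', 'ㅡ', 'ㄹ')
```

A subword tokenizer would treat "한" as one opaque symbol (or part of a larger one), whereas a subcharacter view exposes the three components that the script combines to form it.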
