A SEMI-SUPERVISED FRAMEWORK NAMED AUGSBERT-UZ FOR HIGH-PERFORMANCE SEMANTIC TEXTUAL SIMILARITY IN UZBEK
Authors/Creators
Description
Semantic Textual Similarity (STS) is one of the fundamental tasks of Natural Language Processing (NLP). Because Uzbek is a morphologically rich language that lacks large-scale annotated datasets, STS remains a significant challenge for researchers. Standard Transformer-based cross-encoders offer high accuracy but are computationally prohibitive for large-scale applications, whereas bi-encoders are fast but require substantial training data to perform well. In this paper, we introduce AugSBERT-Uz, a novel semi-supervised framework that produces a state-of-the-art sentence embedding model for the Uzbek language. The framework employs a “teacher-student” knowledge distillation approach. First, a high-accuracy cross-encoder (the “teacher”), based on the monolingual BERTbek model, is fine-tuned on a small, human-annotated “gold” dataset. This teacher model is then used to automatically label millions of sentence pairs from a large unlabeled corpus, producing a large “silver-standard” dataset. Finally, a bi-encoder (the “student”) with a Siamese architecture is trained on this augmented dataset using Multiple Negatives Ranking Loss. The proposed framework enables the bi-encoder to reach a Spearman correlation of 83.2, close to that of the high-accuracy cross-encoder, while retaining its computational efficiency (inference response time: 5 seconds), making it suitable for large-scale semantic search and clustering tasks. The method thus narrows the performance gap caused by data scarcity and yields a model that is both accurate and scalable, offering a practical route to high-quality semantic representations for low-resource, agglutinative languages. This work provides the first high-performance, publicly available sentence embedding model for Uzbek, paving the way for advancements in regional NLP applications.
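The final training step described above can be illustrated with a minimal, dependency-free sketch of Multiple Negatives Ranking Loss. It is not the authors' implementation. The abstract does not say how teacher-scored pairs are turned into positive pairs, so the `threshold` cut-off, the `scale` factor of 20, the function names, and the toy vectors are all assumptions made for illustration. For each anchor, the loss treats its own paired sentence as the correct class and every other sentence in the batch as a negative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_silver_positives(scored_pairs, threshold=0.8):
    """Keep pairs whose teacher score reaches the threshold.

    The threshold is a hypothetical choice; the abstract does not state one.
    """
    return [(a, b) for a, b, score in scored_pairs if score >= threshold]


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over in-batch similarities.

    positives[i] is the match for anchors[i]; every other positives[j]
    acts as an in-batch negative. The scale value is an assumption.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy teacher output: (sentence_a, sentence_b, teacher_score).
scored_pairs = [("s1", "s2", 0.93), ("s3", "s4", 0.41), ("s5", "s6", 0.85)]
silver_positives = select_silver_positives(scored_pairs)

# Toy student embeddings for a batch of two pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_positives = [[0.9, 0.1], [0.1, 0.9]]
shuffled_positives = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned_positives)
loss_shuffled = multiple_negatives_ranking_loss(anchors, shuffled_positives)
```

In this toy batch the loss is lower when each anchor is closest to its own pair than when the pairs are swapped, which is the signal that pulls matching sentences together in the embedding space during training.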
Files
| Name | Size |
|---|---|
| A.T.-6.pdf (md5:2fcae2bcb3c49404d046eb737d9e4b1b) | 738.2 kB |