There is a newer version of the record available.

Published May 5, 2025 | Version v4
Dataset Open

AMSunda: A Novel Dataset for Sundanese Information Retrieval

Description

The AMSunda dataset was introduced as the first resource designed explicitly for fine-tuning and evaluating embedding models in the Sundanese language. It consists of two types of data: (1) triplet data, in which each record contains a query passage, a positive response, and a negative response, intended for fine-tuning embedding models; and (2) BEIR-compatible data structured for evaluating embedding models on retrieval tasks.
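To make the two data types concrete, the following is a minimal sketch of how they might be represented in Python. The field names `query`, `positive`, and `negative`, the identifiers `d1`, `d2`, `q1`, and all record contents are placeholders chosen for illustration, not the actual column names or contents of the AMSunda files. The corpus/queries/qrels layout follows the general BEIR convention; the helper `dangling_qrels` is a hypothetical sanity check, not part of the dataset or of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One fine-tuning record (field names are assumed, not from the dataset)."""
    query: str
    positive: str
    negative: str


# Placeholder triplet; real records would contain Sundanese text.
example_triplet = Triplet(
    query="placeholder query",
    positive="placeholder relevant passage",
    negative="placeholder irrelevant passage",
)

# BEIR-style evaluation data (general convention):
#   corpus:  doc_id   -> {"title": ..., "text": ...}
#   queries: query_id -> query text
#   qrels:   query_id -> {doc_id: relevance score}
corpus = {
    "d1": {"title": "", "text": "placeholder relevant passage"},
    "d2": {"title": "", "text": "placeholder irrelevant passage"},
}
queries = {"q1": "placeholder query"}
qrels = {"q1": {"d1": 1}}


def dangling_qrels(corpus, queries, qrels):
    """Return (query_id, doc_id) pairs in qrels that reference unknown ids."""
    missing = []
    for query_id, judged_docs in qrels.items():
        for doc_id in judged_docs:
            if query_id not in queries or doc_id not in corpus:
                missing.append((query_id, doc_id))
    return missing
```

A check like this is worth running once after loading, since a relevance judgment that points at a missing document or query would silently distort retrieval metrics.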

Files

corpus.csv

Files (8.8 MB)

Checksum Size
md5:abc3a4b378f3b26e0320c2359739438c 3.7 MB
md5:b730c66b01cdf856ab4bfc0b51ecf0ff 576.8 kB
md5:8e08dcb03c18c7d5e0e2a748f8ce4fa2 894.2 kB
md5:dc506a69f1688d7e9bbd5da4fdfb5ad0 3.6 MB
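The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The sketch below uses only Python's standard `hashlib` module; the function name `md5_of_file` is illustrative, and the local path in the usage note is an assumption. The listing does not show which checksum belongs to which file name, so the comparison has to be made against the checksum shown next to the file on the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, after downloading a file to a hypothetical local path such as `corpus.csv`, compare `"md5:" + md5_of_file("corpus.csv")` with the corresponding entry in the list; a mismatch suggests an incomplete or corrupted download.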