There is a newer version of the record available.

Published May 23, 2025 | Version v5
Dataset Open

AMSunda: A Novel Dataset for Sundanese Information Retrieval

Description

The AMSunda dataset was introduced as the first resource designed explicitly for fine-tuning and evaluating embedding models for the Sundanese language. It consists of two parts: (1) triplet data, in which each example contains a query passage, a positive response, and a negative response, intended for fine-tuning embedding models; and (2) BEIR-compatible data, structured for evaluating embedding models on retrieval tasks.
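The two parts described above can be pictured with a small in-memory sketch. It is only an illustration: the record does not state the actual column names or file layout, so the `Triplet` field names, the placeholder strings, and the `recall_at_k` helper are assumptions. The nested dictionaries follow the general BEIR convention (a corpus keyed by document id, queries keyed by query id, and relevance judgments keyed by query id and then document id).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One fine-tuning example (field names are hypothetical)."""
    query: str
    positive: str
    negative: str


# Placeholder example; real entries would hold Sundanese text.
example = Triplet(
    query="placeholder query passage",
    positive="placeholder relevant response",
    negative="placeholder irrelevant response",
)

# BEIR-style evaluation structures with placeholder content.
corpus = {
    "d1": {"title": "", "text": "placeholder passage one"},
    "d2": {"title": "", "text": "placeholder passage two"},
    "d3": {"title": "", "text": "placeholder passage three"},
}
queries = {"q1": "placeholder query one", "q2": "placeholder query two"}
qrels = {"q1": {"d1": 1}, "q2": {"d2": 1, "d3": 1}}


def recall_at_k(rankings, qrels, k):
    """Mean over queries of (relevant docs found in the top k) / (relevant docs)."""
    scores = []
    for qid, judged in qrels.items():
        relevant = {doc_id for doc_id, rel in judged.items() if rel > 0}
        if not relevant:
            continue  # skip queries without any relevant document
        top_k = set(rankings.get(qid, [])[:k])
        scores.append(len(relevant & top_k) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


# A ranking an embedding model might return for each query (made up here).
rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d2", "d3"]}
print(recall_at_k(rankings, qrels, k=2))
```

In practice the rankings would come from comparing query and passage embeddings; the fixed lists here only stand in for that step so the metric can be shown end to end.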

Files

corpus.csv

Files (8.9 MB)

Checksum                                Size
md5:124d4585273e15f020d95a03c13e95ee    3.7 MB
md5:aad03ba1462ac3a2d323562bdfcd9482    576.8 kB
md5:3cac24dc4357a85889af5fa6915ea5b4    899.7 kB
md5:9a58f3bd058024b0fee608307e90c6dc    3.7 MB
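The md5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_record` are hypothetical, and because the listing does not say which checksum belongs to which file, the caller has to supply the matching pair.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, record_checksum):
    """Compare a local file against a checksum written as 'md5:<hex digest>'."""
    algorithm, _, expected = record_checksum.partition(":")
    if algorithm != "md5" or not expected:
        raise ValueError(f"unsupported checksum format: {record_checksum!r}")
    return md5_of_file(path) == expected.strip().lower()
```

For example, `matches_record("corpus.csv", "md5:<digest from the listing>")` returns `True` only if the local copy hashes to the given digest. MD5 is adequate here for detecting corrupted downloads, not for security purposes.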