AMSunda: A Novel Dataset for Sundanese Information Retrieval

Maesya, Aries; Arifin, Yulyani; Budiharto, Widodo; Amalia, Zahra

doi:10.5281/zenodo.15341319

There is a newer version of the record available.

Published May 5, 2025 | Version v4

Dataset Open


1. Pakuan University
2. Binus University

The AMSunda dataset is introduced as the first resource designed explicitly for fine-tuning and evaluating embedding models in the Sundanese language. It consists of two dataset types: (1) triplet data, in which each record contains a query passage, a positive response, and a negative response, intended for fine-tuning embedding models; and (2) BEIR-compatible data structured for evaluating embedding models on retrieval tasks.
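As an illustration of how the two dataset types might be read, here is a minimal Python sketch using only the standard library. The column names (`query-id`, `corpus-id`, `score`, `query`, `positive`, `negative`) are assumptions borrowed from common BEIR and triplet conventions and are not confirmed by this record. The sample rows are invented stand-ins for the real files, so check the actual CSV headers before relying on any of it.

```python
import csv
from collections import defaultdict


def load_qrels(lines, query_col="query-id", doc_col="corpus-id", score_col="score"):
    """Build {query_id: {doc_id: relevance}} from CSV lines.

    The default column names are ASSUMPTIONS based on the usual BEIR layout;
    pass the real header names if qrels.csv uses different ones.
    """
    qrels = defaultdict(dict)
    for row in csv.DictReader(lines):
        qrels[row[query_col]][row[doc_col]] = int(row[score_col])
    return dict(qrels)


def load_triplets(lines, cols=("query", "positive", "negative")):
    """Return a list of (query, positive, negative) tuples.

    The column names in `cols` are hypothetical placeholders.
    """
    return [tuple(row[c] for c in cols) for row in csv.DictReader(lines)]


# Invented in-memory rows standing in for qrels.csv and triplet.csv.
sample_qrels = ["query-id,corpus-id,score", "q1,d1,1", "q1,d7,1", "q2,d3,1"]
sample_triplets = ["query,positive,negative", "q text,good passage,bad passage"]

qrels = load_qrels(sample_qrels)
triplets = load_triplets(sample_triplets)
```

With the downloaded files, the same functions accept an open file handle, for example `with open("qrels.csv", newline="", encoding="utf-8") as f: qrels = load_qrels(f)`.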


Files (8.8 MB)

Name	MD5	Size
corpus.csv	abc3a4b378f3b26e0320c2359739438c	3.7 MB
qrels.csv	b730c66b01cdf856ab4bfc0b51ecf0ff	576.8 kB
queries.csv	8e08dcb03c18c7d5e0e2a748f8ce4fa2	894.2 kB
triplet.csv	dc506a69f1688d7e9bbd5da4fdfb5ad0	3.6 MB
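The MD5 checksums listed for each file can be used to confirm that a download is intact. The following is a minimal sketch, assuming the four files have been saved under their original names in a single directory. The helper names `md5_of` and `verify_files` are illustrative and not part of any Zenodo tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record version.
EXPECTED_MD5 = {
    "corpus.csv": "abc3a4b378f3b26e0320c2359739438c",
    "qrels.csv": "b730c66b01cdf856ab4bfc0b51ecf0ff",
    "queries.csv": "8e08dcb03c18c7d5e0e2a748f8ce4fa2",
    "triplet.csv": "dc506a69f1688d7e9bbd5da4fdfb5ad0",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of(path) == checksum
    return results
```

Calling `verify_files(".")` in the download directory returns a dictionary such as `{"corpus.csv": True, ...}`; a `False` entry indicates a missing or altered file, or a file from a different record version.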


	All versions	This version
Views	377	51
Downloads	8,522	5,864
Data volume	32.0 GB	23.0 GB


Resource type: Dataset
Publisher: Zenodo
Languages: Sundanese, English

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: May 5, 2025
Modified: May 5, 2025
