AMSunda: A Novel Dataset for Sundanese Information Retrieval

Maesya, Aries; Arifin, Yulyani; Budiharto, Widodo; Amalia, Zahra

doi:10.5281/zenodo.15493551

There is a newer version of the record available.

Published May 23, 2025 | Version v5

Dataset Open

1. Pakuan University
2. Binus University

The AMSunda dataset is introduced as the first resource designed explicitly for fine-tuning and evaluating embedding models for the Sundanese language. It consists of two types of data: (1) triplet data, in which each record contains a query, a positive response, and a negative response, intended for fine-tuning embedding models; and (2) BEIR-compatible data, structured for evaluating embedding models on retrieval tasks.
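As a rough illustration of how the BEIR-compatible part could be read, the sketch below builds the usual BEIR-style mappings (documents, queries, and relevance judgments) from CSV input. The column names `_id`, `text`, `query-id`, `corpus-id`, and `score` are assumptions borrowed from common BEIR conventions, not confirmed headers of the AMSunda files, and the helper names are hypothetical. The sample rows are made up so that the block runs on its own.

```python
import csv
import io

# ASSUMED column names (common BEIR conventions); check the real CSV headers.
ID_COL, TEXT_COL = "_id", "text"
QID_COL, DID_COL, SCORE_COL = "query-id", "corpus-id", "score"


def load_id_text(handle, id_col=ID_COL, text_col=TEXT_COL):
    """Read a CSV with an id column and a text column into {id: text}."""
    return {row[id_col]: row[text_col] for row in csv.DictReader(handle)}


def load_qrels(handle, qid_col=QID_COL, did_col=DID_COL, score_col=SCORE_COL):
    """Read relevance judgments into {query_id: {doc_id: score}}."""
    qrels = {}
    for row in csv.DictReader(handle):
        qrels.setdefault(row[qid_col], {})[row[did_col]] = int(row[score_col])
    return qrels


# Made-up sample rows standing in for corpus.csv, queries.csv and qrels.csv.
corpus = load_id_text(io.StringIO("_id,text\nd1,first passage\nd2,second passage\n"))
queries = load_id_text(io.StringIO("_id,text\nq1,example query\n"))
qrels = load_qrels(io.StringIO("query-id,corpus-id,score\nq1,d2,1\n"))

# Sanity check: every judged document id should exist in the corpus.
missing = [doc_id for docs in qrels.values() for doc_id in docs if doc_id not in corpus]
```

With the downloaded files, the same helpers could be given real file handles, for example `open("corpus.csv", newline="", encoding="utf-8")`, after adjusting the column names to whatever the files actually use.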

Files

Files (8.9 MB)

Name	MD5	Size
corpus.csv	124d4585273e15f020d95a03c13e95ee	3.7 MB
qrels.csv	aad03ba1462ac3a2d323562bdfcd9482	576.8 kB
queries.csv	3cac24dc4357a85889af5fa6915ea5b4	899.7 kB
triplet.csv	9a58f3bd058024b0fee608307e90c6dc	3.7 MB
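The MD5 digests listed for each file can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the digests are copied from the file list above, while the helper names `md5_of` and `verify` and the assumption that the files sit in the current directory are illustrative only.

```python
import hashlib

# MD5 digests as listed in the file table of this record.
EXPECTED_MD5 = {
    "corpus.csv": "124d4585273e15f020d95a03c13e95ee",
    "qrels.csv": "aad03ba1462ac3a2d323562bdfcd9482",
    "queries.csv": "3cac24dc4357a85889af5fa6915ea5b4",
    "triplet.csv": "9a58f3bd058024b0fee608307e90c6dc",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file's MD5 digest matches the expected hex string."""
    return md5_of(path) == expected_md5.lower()
```

For example, `verify("corpus.csv", EXPECTED_MD5["corpus.csv"])` should return `True` for an uncorrupted copy of this version's `corpus.csv`. Since the record notes that a newer version exists, files from other versions would be expected to have different digests.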

	All versions	This version
Views	369	26
Downloads	8,522	323
Data volume	32.0 GB	1.2 GB

Resource type: Dataset
Publisher: Zenodo
Languages: Sundanese, English

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: May 23, 2025
Modified: May 23, 2025
