AMSunda: A Novel Dataset for Sundanese Information Retrieval

Maesya, Aries; Arifin, Yulyani; Budiharto, Widodo; Amalia, Zahra

doi:10.5281/zenodo.14890507

There is a newer version of the record available.

Published April 27, 2025 | Version v1

Dataset Open

1. Pakuan University
2. Binus University

The AMSunda dataset is introduced as the first resource designed explicitly for fine-tuning and evaluating embedding models for the Sundanese language. It consists of two kinds of data: (1) triplet data, in which each record contains a query passage, a positive response, and a negative response, intended for fine-tuning embedding models; and (2) BEIR-compatible data, structured for evaluating embedding models on retrieval tasks.
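As a rough illustration of how the two kinds of data might be read, the sketch below parses JSON Lines records and a BEIR-style relevance file using only the Python standard library. The field names are assumptions, not confirmed by this record: `_id`, `title`, and `text` follow the usual BEIR convention, the three-column qrels layout (query id, document id, score) is likewise the BEIR convention, and the triplet keys `query`, `positive`, and `negative` are hypothetical. The in-memory strings are placeholders, not actual dataset content.

```python
import csv
import io
import json
from collections import defaultdict


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def read_qrels(stream):
    """Parse a BEIR-style qrels TSV into {query_id: {doc_id: score}}.

    Assumes three tab-separated columns (query id, document id, score).
    Rows whose third field is not an integer (e.g. a header row) are skipped.
    """
    qrels = defaultdict(dict)
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) < 3:
            continue
        try:
            score = int(row[2])
        except ValueError:
            continue  # most likely a header such as "query-id  corpus-id  score"
        qrels[row[0]][row[1]] = score
    return dict(qrels)


# Placeholder records standing in for corpus.jsonl, queries.jsonl,
# qrels.tsv, and triplet.jsonl. All field names here are assumptions.
corpus_stream = io.StringIO('{"_id": "d1", "title": "", "text": "placeholder passage"}\n')
queries_stream = io.StringIO('{"_id": "q1", "text": "placeholder query"}\n')
qrels_stream = io.StringIO("query-id\tcorpus-id\tscore\nq1\td1\t1\n")
triplet_stream = io.StringIO(
    '{"query": "placeholder query", "positive": "relevant text", "negative": "unrelated text"}\n'
)

corpus = {doc["_id"]: doc for doc in read_jsonl(corpus_stream)}
queries = {q["_id"]: q["text"] for q in read_jsonl(queries_stream)}
qrels = read_qrels(qrels_stream)
triplets = list(read_jsonl(triplet_stream))
```

To read the real files, replace each `io.StringIO(...)` with `open(path, encoding="utf-8")` and adjust the keys after inspecting the first record of each file.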

Files (5.6 MB)

Name	MD5	Size
corpus.jsonl	e52f6f2fd7a24be7a6865aeea2383bca	2.1 MB
qrels.tsv	1b0229f379d1da04d7f88697da72dd08	569.3 kB
queries.jsonl	40545d7d5e757a43d3c7478b829b5c91	736.6 kB
triplet.jsonl	9477161c8bab6424ec72f46ddf6ffa5d	2.2 MB
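The MD5 checksums listed for each file can be used to confirm that a download is intact. A minimal sketch, assuming the four files have been saved under their listed names in a single local directory (the `directory` argument and the helper names `md5_of` and `verify` are illustrative):

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "corpus.jsonl": "e52f6f2fd7a24be7a6865aeea2383bca",
    "qrels.tsv": "1b0229f379d1da04d7f88697da72dd08",
    "queries.jsonl": "40545d7d5e757a43d3c7478b829b5c91",
    "triplet.jsonl": "9477161c8bab6424ec72f46ddf6ffa5d",
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory="."):
    """Map each expected file name to True (match), False (mismatch),
    or None (file not found in the directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.exists() else None
    return results
```

Calling `verify("path/to/download")` returns a dictionary keyed by file name; any `False` entry suggests a corrupted or different-version file.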

	All versions	This version
Views	324	114
Downloads	6,856	84
Data volume	26.1 GB	126.5 MB

Resource type: Dataset
Publisher: Zenodo
Languages: Sundanese

License: Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: April 27, 2025
Modified: April 27, 2025
