TIGQA: An Expert-Annotated Question-Answering Dataset in Tigrinya

Teklehaymanot, Hailay Kidu

doi:10.5281/zenodo.11423987

Published June 2, 2024 | Version v1

Dataset Open

Teklehaymanot, Hailay Kidu^{1, 2, 3, 4}

1. Leibniz University Hannover
2. L3S Research Center
3. Mekelle University
4. Ethiopian Institute of Technology (EiT-M)

What is TIGQA?

TIGQA is an expert-annotated question-answering dataset in Tigrinya, a low-resource language spoken by approximately 10 million people in Eritrea and the Tigray region of Ethiopia. Our SQuAD-like dataset contains 2,685 question-answer pairs covering 122 diverse topics such as climate, water, and traffic. The pairs are drawn from 537 context paragraphs in publicly accessible Tigrinya and Biology books, and the answers were provided by teachers from the region.
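As a rough illustration of what "SQuAD-like" usually implies, the sketch below models one extractive QA record and checks that the answer span lines up with its context. The field names follow the common SQuAD v1.1 convention and are an assumption here; the distributed file is a .docx whose internal layout is not described on this page, and the English example text is a made-up placeholder, not taken from TIGQA.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One extractive QA record (field names assumed, SQuAD v1.1 style)."""
    context: str       # paragraph the question is asked about
    question: str
    answer_text: str   # answer as a literal span of the context
    answer_start: int  # character offset of the span inside the context


def answer_is_aligned(ex: QAExample) -> bool:
    """Return True if the stored offset really points at the answer text."""
    end = ex.answer_start + len(ex.answer_text)
    return ex.context[ex.answer_start:end] == ex.answer_text


# Placeholder record for illustration only (not from the dataset).
_context = "Water boils at 100 degrees Celsius at sea level."
_answer = "100 degrees Celsius"
example = QAExample(
    context=_context,
    question="At what temperature does water boil at sea level?",
    answer_text=_answer,
    answer_start=_context.find(_answer),
)

# Figures stated in the description: 2,685 pairs over 537 paragraphs.
QA_PAIRS = 2685
CONTEXT_PARAGRAPHS = 537
avg_pairs_per_paragraph = QA_PAIRS / CONTEXT_PARAGRAPHS
```

The stated totals work out to an average of five question-answer pairs per context paragraph; the description does not say whether every paragraph has exactly five.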

Files (369.1 kB)

Name	MD5	Size
TIGQA Tigrinya Question Answering dataset.docx	83d447ffac5273ccad19a4f85e6094d3	369.1 kB
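The MD5 checksum in the file listing can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names and the local file path are assumptions chosen for illustration.

```python
import hashlib

# Checksum as shown in the file listing for the .docx file.
EXPECTED_MD5 = "83d447ffac5273ccad19a4f85e6094d3"


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path) -> bool:
    """Return True if the file at `path` has the listed checksum."""
    return md5_of_file(path) == EXPECTED_MD5
```

For example, `matches_listing("TIGQA Tigrinya Question Answering dataset.docx")` would be called on the downloaded copy, assuming it was saved under that name in the working directory.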

Additional details

Is published in: Dataset: ELRA and ICCL (Other)

@inproceedings{teklehaymanot-etal-2024-tigqa-expert,
    title = "{TIGQA}: An Expert-Annotated Question-Answering Dataset in {T}igrinya",
    author = "Teklehaymanot, Hailay Kidu and Fazlija, Dren and Ganguly, Niloy and Patro, Gourab Kumar and Nejdl, Wolfgang",
    editor = "Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro and Sakti, Sakriani and Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1404",
    pages = "16142--16161",
    abstract = "The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated dataset containing 2,685 question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art MRC methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and juxtapose it with the results obtained from pre-trained models. The notable disparities between human performance and the best model performance underscore the potential for future enhancements to TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in the Tigrinya MRC. Keywords: Tigrinya QA dataset, Low resource QA dataset, domain specific QA",
}

