ALMAZ-Corpus: The Azerbaijani Text Data Landscape for Large Language Models

Ibrahimzade, Orhan

doi:10.5281/zenodo.19262583

Published March 27, 2026 | Version 1.0.0

Journal article Open

Large language models require large-scale text corpora, yet the availability and structure of textual resources for many languages remain poorly documented. This paper analyzes the Azerbaijani text data landscape relevant to language model training. We examine sources across web-crawled corpora, national digital libraries, government document repositories, news archives, academic publications, speech datasets, and parallel translation corpora.

Our analysis shows that the Azerbaijani text available across these sources potentially exceeds 3 billion tokens, yet fewer than 500 million tokens have been systematically used for language model training. We identify a critical data utilization gap of over 2.5 billion tokens and propose a standardized open pipeline, the ALMAZ Data Pipeline, for converting raw Azerbaijani sources into LLM-ready corpora. We also release ALMAZ Resource Roster v0.2.0, which closes three of the four coverage gaps identified in v0.1.0 by adding speech datasets, translation corpora, and government legal text.
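
The gap figure follows directly from the two estimates above. As a minimal sketch, assuming the round bounds quoted in the abstract (3 billion tokens available, 500 million used) stand in for the paper's exact counts, and using a hypothetical helper name, the arithmetic can be checked as follows:

```python
# Round bounds quoted in the abstract; the paper's exact counts may differ.
AVAILABLE_TOKENS = 3_000_000_000  # lower bound ("exceeds 3 billion")
USED_TOKENS = 500_000_000         # upper bound ("fewer than 500 million")


def utilization_gap(available: int, used: int) -> tuple[int, float]:
    """Return the unused token count and the fraction of available text used.

    Hypothetical helper for illustration; it is not part of the ALMAZ pipeline.
    """
    if available <= 0:
        raise ValueError("available must be positive")
    gap = available - used
    return gap, used / available


gap, used_fraction = utilization_gap(AVAILABLE_TOKENS, USED_TOKENS)
print(f"gap: {gap:,} tokens; used: {used_fraction:.1%}")
```

Because the available figure is a lower bound and the used figure is an upper bound, 2.5 billion tokens is itself a lower bound on the gap, which matches the abstract's "over 2.5 billion".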

Files

Name	Size
almaz_paper2.pdf (md5:36897460cd8a3119b5ade147aed02b66)	151.8 kB

Additional details

DOI: 10.5281/zenodo.19023843
