Published March 27, 2026 | Version 1.0.0

ALMAZ-Corpus: The Azerbaijani Text Data Landscape for Large Language Models

Authors/Creators

Description

Large language models require large-scale text corpora, yet the availability and structure of textual resources for many languages remain poorly documented. This paper analyzes the Azerbaijani text data landscape relevant to language model training. We examine sources across web-crawled corpora, national digital libraries, government document repositories, news archives, academic publications, speech datasets, and parallel translation corpora.

Our analysis shows that the total available Azerbaijani text corpus potentially exceeds 3 billion tokens, yet fewer than 500 million tokens have been systematically used for language model training. This leaves a critical data utilization gap of more than 2.5 billion tokens. To close it, we propose a standardized open pipeline, the ALMAZ Data Pipeline, for converting raw Azerbaijani sources into LLM-ready corpora. We also release ALMAZ Resource Roster v0.2.0, which closes three of the four coverage gaps identified in v0.1.0 by adding speech datasets, translation corpora, and government legal text.

Files

almaz_paper2.pdf (151.8 kB)
md5:36897460cd8a3119b5ade147aed02b66

Additional details