ALMAZ-Corpus: The Azerbaijani Text Data Landscape for Large Language Models
Authors/Creators
Description
Large language models require large-scale text corpora, yet the availability and structure of textual resources for many languages remain poorly documented. This paper analyzes the Azerbaijani text data landscape relevant to language model training. We examine sources across web-crawled corpora, national digital libraries, government document repositories, news archives, academic publications, speech datasets, and parallel translation corpora.
Our analysis shows that the Azerbaijani text available for training potentially exceeds 3 billion tokens, yet fewer than 500 million tokens have been systematically used for language model training. We identify a data utilization gap of over 2.5 billion tokens and propose a standardized open pipeline, the ALMAZ Data Pipeline, for converting raw Azerbaijani sources into LLM-ready corpora. We also release ALMAZ Resource Roster v0.2.0, which closes three of the four coverage gaps identified in v0.1.0 by adding speech datasets, translation corpora, and government legal text.
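The gap figure in the description follows from the two bounds it states. A minimal sketch of that arithmetic is below; the constant and function names are illustrative and do not come from the paper, and the inputs are only the rounded bounds quoted above (more than 3 billion tokens available, fewer than 500 million used), so the result is a lower bound on the gap rather than a measured value.

```python
# Rounded bounds taken from the description; names are hypothetical.
TOTAL_AVAILABLE_TOKENS = 3_000_000_000  # lower bound on available text
USED_TOKENS = 500_000_000               # upper bound on text already used


def utilization_gap(total: int, used: int) -> tuple[int, float]:
    """Return the unused token count and the unused share of the total."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= used <= total:
        raise ValueError("used must be between 0 and total")
    gap = total - used
    return gap, gap / total


gap, unused_share = utilization_gap(TOTAL_AVAILABLE_TOKENS, USED_TOKENS)
print(f"gap: at least {gap:,} tokens ({unused_share:.1%} of the available text)")
```

Because the total is a lower bound and the used amount is an upper bound, the true gap can only be larger than the value computed here.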
Files
almaz_paper2.pdf (151.8 kB)
md5:36897460cd8a3119b5ade147aed02b66