Published February 25, 2023 | Version 0.0.1
Conference paper Open

Text classification dataset for Uzbek language

  • 1. Universidade da Coruña, CITIC, Grupo LYS, Depto. de Computación y Tecnologías de la Información, Facultade de Informática
  • 2. Urgench State University
  • 3. National University of Uzbekistan named after Mirzo Ulugbek


The dataset consists of text collected from 9 Uzbek news websites and press portals, covering both news articles and press releases. The websites were selected to span a range of categories, including politics, sports, entertainment, and technology. In total, we collected 512,750 articles with over 120 million words across 15 distinct categories, providing a large and diverse corpus for text classification. All text in the corpus is written in the Latin script.

Categories (with the name in Uzbek): 

  • Local (Mahalliy)
  • World (Dunyo)
  • Sport (Sport)
  • Society (Jamiyat)
  • Law (Qonunchilik)
  • Tech (Texnologiya)
  • Culture (Madaniyat)
  • Politics (Siyosat)
  • Economics (Iqtisodiyot)
  • Auto (Avto)
  • Health (Salomatlik)
  • Crime (Jinoyat)
  • Photo (Foto)
  • Women (Ayollar)
  • Culinary (Pazandachilik)
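
For classification experiments, the 15 categories above usually need to be mapped to integer labels. The following is a minimal sketch of such a mapping. The integer ids (assigned in listing order) and the helper names `encode` and `decode` are assumptions for illustration only; they do not come from the dataset files, which may store labels differently.

```python
# Category names as listed above: (English, Uzbek).
CATEGORIES = [
    ("Local", "Mahalliy"),
    ("World", "Dunyo"),
    ("Sport", "Sport"),
    ("Society", "Jamiyat"),
    ("Law", "Qonunchilik"),
    ("Tech", "Texnologiya"),
    ("Culture", "Madaniyat"),
    ("Politics", "Siyosat"),
    ("Economics", "Iqtisodiyot"),
    ("Auto", "Avto"),
    ("Health", "Salomatlik"),
    ("Crime", "Jinoyat"),
    ("Photo", "Foto"),
    ("Women", "Ayollar"),
    ("Culinary", "Pazandachilik"),
]

# Hypothetical integer ids, assigned in listing order (an assumption,
# not the encoding used by the dataset itself).
LABEL_TO_ID = {uz: i for i, (_, uz) in enumerate(CATEGORIES)}
ID_TO_LABEL = {i: uz for uz, i in LABEL_TO_ID.items()}

# Lookup from the Uzbek category name to its English gloss.
UZ_TO_EN = {uz: en for en, uz in CATEGORIES}


def encode(uzbek_name: str) -> int:
    """Return the illustrative integer id for an Uzbek category name."""
    return LABEL_TO_ID[uzbek_name]


def decode(label_id: int) -> str:
    """Return the Uzbek category name for an illustrative integer id."""
    return ID_TO_LABEL[label_id]
```

With this mapping, `encode("Jinoyat")` gives the id of the Crime category and `UZ_TO_EN[decode(0)]` recovers the English name of the first category in the list.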


When referencing this work, please cite it using the following BibTeX entry:


@inproceedings{Kuriyozov2023TextCD,
  title={Text classification dataset and analysis for Uzbek language},
  author={Elmurod Kuriyozov and Ulugbek Salaev and Sanatbek Matlatipov and Gayrat Matlatipov},
  year={2023}
}


Kuriyozov, E., Salaev, U., Matlatipov, S., & Matlatipov, G. (2023). Text classification dataset and analysis for Uzbek language.


Files (564.3 MB)
