Text classification dataset for Uzbek language

Elmurod Kuriyozov; Ulugbek Salaev; Sanatbek Matlatipov; Gayrat Matlatipov

doi:10.5281/zenodo.7677431

Published February 25, 2023 | Version 0.0.1

Conference paper | Open access

1. Universidade da Coruña, CITIC, Grupo LYS, Depto. de Computación y Tecnologías de la Información, Facultade de Informática
2. Urgench State University
3. National University of Uzbekistan named after Mirzo Ulugbek

The dataset consists of text collected from 9 Uzbek news websites and press portals, covering news articles and press releases. The websites were selected to span a range of categories such as politics, sports, entertainment, and technology. In total, we collected 512,750 articles containing over 120 million words across 15 distinct categories, which provides a large and diverse corpus for text classification. All text in the corpus is written in the Latin script.

Categories (with the name in Uzbek):

Local (Mahalliy)
World (Dunyo)
Sport (Sport)
Society (Jamiyat)
Law (Qonunchilik)
Tech (Texnologiya)
Culture (Madaniyat)
Politics (Siyosat)
Economics (Iqtisodiyot)
Auto (Avto)
Health (Salomatlik)
Crime (Jinoyat)
Photo (Foto)
Women (Ayollar)
Culinary (Pazandachilik)
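
For classification experiments it is often convenient to have the 15 categories as a lookup table. The following is a minimal sketch in Python that only encodes the list above; the integer ids are a hypothetical convention chosen for illustration (simply the listing order), and how labels are actually stored inside the archive is not described here, so adapt the keys to the files you find after unpacking.

```python
# English label -> Uzbek name, in the order listed in the description.
CATEGORIES = {
    "Local": "Mahalliy",
    "World": "Dunyo",
    "Sport": "Sport",
    "Society": "Jamiyat",
    "Law": "Qonunchilik",
    "Tech": "Texnologiya",
    "Culture": "Madaniyat",
    "Politics": "Siyosat",
    "Economics": "Iqtisodiyot",
    "Auto": "Avto",
    "Health": "Salomatlik",
    "Crime": "Jinoyat",
    "Photo": "Foto",
    "Women": "Ayollar",
    "Culinary": "Pazandachilik",
}

# Hypothetical integer ids (assumption): position in the list above.
LABEL_TO_ID = {label: idx for idx, label in enumerate(CATEGORIES)}
ID_TO_LABEL = {idx: label for label, idx in LABEL_TO_ID.items()}

# Reverse lookup from the Uzbek name to the English label.
UZBEK_TO_ENGLISH = {uz: en for en, uz in CATEGORIES.items()}
```

With this table, `UZBEK_TO_ENGLISH["Jamiyat"]` gives `"Society"`, and `LABEL_TO_ID` can serve as the target encoding for a classifier.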

When referencing this work, please cite it as follows:

BibTeX:

@inproceedings{Kuriyozov2023TextCD,
  title={Text classification dataset and analysis for Uzbek language},
  author={Elmurod Kuriyozov and Ulugbek Salaev and Sanatbek Matlatipov and Gayrat Matlatipov},
  year={2023}
}

APA:

Kuriyozov, E., Salaev, U., Matlatipov, S., & Matlatipov, G. (2023). Text classification dataset and analysis for Uzbek language.

Files

Name	MD5	Size
Uzbek_News_Dataset.zip	fd50d6f0a0cd1bb24d44b1e1607b0320	564.3 MB
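
The MD5 checksum listed for the archive can be used to confirm that a download is complete and uncorrupted. The following is a minimal sketch using only the Python standard library; the function names `md5_of_file` and `verify_archive` are hypothetical helpers introduced for illustration, and the local file path is an assumption about where the archive was saved.

```python
import hashlib

# Checksum published alongside Uzbek_News_Dataset.zip in the file listing.
EXPECTED_MD5 = "fd50d6f0a0cd1bb24d44b1e1607b0320"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that the 564.3 MB archive is never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_archive("Uzbek_News_Dataset.zip")` after downloading returns `True` only when the local copy matches the published checksum; a `False` result suggests the download should be repeated.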

Additional details

https://arxiv.org/ftp/arxiv/papers/2302/2302.14494.pdf

	All versions	This version
Views	1,476	1,466
Downloads	359	357
Data volume	283.3 GB	282.1 GB
