Published May 29, 2025 | Version V 1.1
Dataset Open

ARPD: The Academic Arabic Research Papers Dataset (corpus).

  • 1. King Abdulaziz University
  • 2. Department of Information Technology
  • 3. Faculty of Computers and Information Sciences, Mansoura University

Description

ARPD: The Academic Arabic Research Papers Dataset.

This corpus contributes to Arabic language research by providing a novel academic dataset that can be used for purposes such as training NLP models and conducting text analysis. It contains 2,011 research papers written in Arabic, drawn from seven fields: Arabic, religion, art, law, business, education, and agriculture. Each field forms one class, so the dataset has seven classes. To benefit Arabic research, the dataset is published in several formats: PDF files, text files, and CSV files.

 

Arabic Article class    # papers    # words
Art                     284         1,934,477
Law                     292         2,968,015
Business                296         3,536,048
Religion                288         4,098,553
Agricultural            308         1,288,720
Arabic                  242         2,849,792
Education               301         3,342,715
Total                   2,011       20,018,320
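The totals can be cross-checked by summing the per-class figures. The sketch below copies the seven per-class rows from the table and recomputes both totals, along with each class's share of the papers and its average paper length. The names `CLASS_STATS` and `summarize` are illustrative and are not part of the dataset.

```python
# Per-class (papers, words) figures copied from the table above.
CLASS_STATS = {
    "Art": (284, 1_934_477),
    "Law": (292, 2_968_015),
    "Business": (296, 3_536_048),
    "Religion": (288, 4_098_553),
    "Agricultural": (308, 1_288_720),
    "Arabic": (242, 2_849_792),
    "Education": (301, 3_342_715),
}


def summarize(stats):
    """Return (total papers, total words) across all classes."""
    total_papers = sum(papers for papers, _ in stats.values())
    total_words = sum(words for _, words in stats.values())
    return total_papers, total_words


total_papers, total_words = summarize(CLASS_STATS)
print(total_papers, total_words)  # 2011 20018320

# Share of papers and average words per paper, per class.
for name, (papers, words) in CLASS_STATS.items():
    share = papers / total_papers
    print(f"{name:<12} {share:6.1%} {words // papers:>8,} words/paper")
```

The classes are roughly balanced by paper count (242 to 308 papers each), while average paper length varies considerably between fields.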

The dataset contains:

1. All the PDF files are categorized into their respective classes. (PDF.zip)

2. The text files obtained by converting the PDFs to text and processing them. (TXT.zip)

3. A revised version of the text files after applying specific preprocessing steps: Arabic normalization of alef, teh, and ligatures; removal of tashkeel, harakat, tatweel, and shadda; and stop-word removal. (TXTWithPreprocess.zip)

4. The CSV files, each containing two columns: the first column represents the paper, and the second column gives its class. (ARPD-withoutPreprocess.csv, preprocessed_arpd.csv)
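The preprocessing steps listed in item 3 could look roughly like the sketch below. It is not the authors' pipeline: the exact normalization rules and stop-word list used for ARPD are not stated here, so the character mappings (for example teh marbuta to heh), the handling of only two lam-alef ligature code points, and the five-word `STOP_WORDS` set are all assumptions chosen for illustration.

```python
import re

# Tashkeel/harakat, including shadda (U+0651) and sukun (U+0652),
# plus the superscript (dagger) alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character (kashida)

# Assumed normalization rules; the dataset's exact rules are not documented here.
CHAR_MAP = str.maketrans({
    "\u0623": "\u0627",        # alef with hamza above -> bare alef
    "\u0625": "\u0627",        # alef with hamza below -> bare alef
    "\u0622": "\u0627",        # alef with madda       -> bare alef
    "\u0629": "\u0647",        # teh marbuta           -> heh
    "\uFEFB": "\u0644\u0627",  # lam-alef ligature (isolated) -> lam + alef
    "\uFEFC": "\u0644\u0627",  # lam-alef ligature (final)    -> lam + alef
})

# Tiny hypothetical stop-word list, for illustration only.
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}


def normalize(text):
    """Strip diacritics and tatweel, then unify character variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(CHAR_MAP)


# Stop words go through the same normalization so they match normalized tokens.
NORMALIZED_STOP_WORDS = {normalize(word) for word in STOP_WORDS}


def preprocess(text):
    """Normalize the text and drop stop words (whitespace tokenization)."""
    tokens = normalize(text).split()
    return " ".join(t for t in tokens if t not in NORMALIZED_STOP_WORDS)
```

Normalizing the stop words with the same function matters: once hamza forms of alef are unified, a stop word such as "إلى" only matches its token if it is rewritten the same way.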

Files (5.1 GB)

Checksum                                Size
md5:b6c9589579bb45b94f97876e446ecbb4    234.2 MB
md5:db6d403ff0460d7351deca13cdd9f1b9    4.6 GB
md5:86d5d7a285297e39ff1f6fa7d4a2caa4    165.9 MB
md5:06f7dd282f50a5428c59cf2de2f4675e    59.7 MB
md5:70c78c59b728b622a2f0322374def37a    35.3 MB
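A downloaded archive can be checked against the MD5 values listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are illustrative. The listing does not show which file name belongs to which checksum, so the path passed in is whatever file was actually downloaded.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a listed value such as 'md5:b6c9...'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Reading in chunks keeps memory use flat, which matters for the 4.6 GB archive. For example, `matches_checksum("TXT.zip", "md5:...")` would return `True` only if the local file is byte-identical to the published one, with the appropriate listed value substituted for the placeholder.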

Additional details

Dates

Updated
2025-05-29