ARPD: The Academic Arabic Research Papers Dataset (corpus).
Authors/Creators
Contributors
Supervisors:
- 1. King Abdulaziz University
- 2. Department of Information Technology
- 3. Faculty of Computers and Information Sciences, Mansoura University
Description
ARPD is a new academic dataset for the Arabic language that can be used for tasks such as training NLP models and text analysis. It contains 2,011 research papers written in Arabic, drawn from seven fields: Arabic, religion, art, law, business, education, and agriculture. Each field forms one class, so the dataset has seven classes. The dataset is published in several formats (PDF files, text files, and CSV files) to support Arabic-language research.
| Arabic article class | # papers | # words |
|---|---|---|
| Art | 284 | 1,934,477 |
| Law | 292 | 2,968,015 |
| Business | 296 | 3,536,048 |
| Religion | 288 | 4,098,553 |
| Agricultural | 308 | 1,288,720 |
| Arabic | 242 | 2,849,792 |
| Education | 301 | 3,342,715 |
| Total | 2,011 | 20,018,320 |
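As a quick sanity check on the class balance, the per-class rows can be re-added and turned into shares and average paper lengths. This is a minimal sketch: the figures are copied from the per-class rows of the table, and the name `CLASS_STATS` is only illustrative.

```python
# Per-class figures copied from the table: class -> (papers, words).
CLASS_STATS = {
    "Art": (284, 1934477),
    "Law": (292, 2968015),
    "Business": (296, 3536048),
    "Religion": (288, 4098553),
    "Agricultural": (308, 1288720),
    "Arabic": (242, 2849792),
    "Education": (301, 3342715),
}

# Totals recomputed from the rows rather than taken from the "Total" line.
total_papers = sum(papers for papers, _ in CLASS_STATS.values())
total_words = sum(words for _, words in CLASS_STATS.values())

for name, (papers, words) in CLASS_STATS.items():
    share = papers / total_papers      # fraction of all papers in this class
    avg_words = words / papers         # mean paper length in words
    print(f"{name:<13} {share:6.1%} of papers, {avg_words:10.0f} words/paper")

print(f"papers: {total_papers}, words: {total_words}")
```

The classes are close to balanced by paper count, while the average paper length differs noticeably between classes, which may matter when truncating documents for a classifier.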
The dataset contains:
1. The PDF files, organized by class. (PDF.zip)
2. The text files obtained by converting the PDFs to text and processing them. (TXT.zip)
3. The text files after further preprocessing: Arabic normalization of alef, teh, and ligatures; removal of tashkeel, harakat, tatweel, and shadda; and stop-word removal. (TXTWithPreprocess.zip)
4. CSV files with two columns: the first column contains the paper and the second column contains its class. (ARPD-withoutPreprocess.csv, preprocessed_arpd.csv)
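The preprocessing steps listed in item 3 could look roughly like the sketch below. It is not the authors' pipeline: the exact normalization targets (hamza forms of alef to bare alef, teh marbuta to heh, lam-alef ligatures to lam + alef) and the four-word stop list are assumptions chosen for illustration, and the names `preprocess`, `CHAR_MAP`, and `STOP_WORDS` are hypothetical.

```python
import re

# Tashkeel/harakat U+064B..U+0652 (this range includes shadda, U+0651)
# plus the superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character (kashida)

# Assumed normalization targets; the dataset's own rules may differ.
CHAR_MAP = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0629": "\u0647",  # teh marbuta -> heh
    # Lam-alef presentation-form ligatures (U+FEF5..U+FEFC) -> lam + alef.
    **{chr(cp): "\u0644\u0627" for cp in range(0xFEF5, 0xFEFD)},
})

# Tiny illustrative stop list, written in already-normalized form.
STOP_WORDS = frozenset({
    "\u0641\u064A",        # fi ("in")
    "\u0645\u0646",        # min ("from")
    "\u0639\u0644\u0649",  # ala ("on")
    "\u0639\u0646",        # an ("about")
})


def preprocess(text: str, stop_words=STOP_WORDS) -> str:
    """Strip diacritics and tatweel, normalize letters, drop stop words."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = text.translate(CHAR_MAP)
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return " ".join(tokens)


if __name__ == "__main__":
    # Vocalized "fi al-madrasa" ("in the school").
    sample = "\u0641\u0650\u064A \u0627\u0644\u0645\u064E\u062F\u0652\u0631\u064E\u0633\u064E\u0629"
    print(preprocess(sample))
```

Stop words are matched after normalization, so a real stop list would need to be normalized the same way before use. A function like this could be applied to the paper column of ARPD-withoutPreprocess.csv to approximate the preprocessed variant, though the results are unlikely to match preprocessed_arpd.csv exactly.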
Files
(5.1 GB)
| Checksum | Size |
|---|---|
| md5:b6c9589579bb45b94f97876e446ecbb4 | 234.2 MB |
| md5:db6d403ff0460d7351deca13cdd9f1b9 | 4.6 GB |
| md5:86d5d7a285297e39ff1f6fa7d4a2caa4 | 165.9 MB |
| md5:06f7dd282f50a5428c59cf2de2f4675e | 59.7 MB |
| md5:70c78c59b728b622a2f0322374def37a | 35.3 MB |
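Because the archives are large, it can be worth checking a download against the md5 values listed above before unpacking it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and which checksum belongs to which file has to be taken from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed: str) -> bool:
    """Compare a file against a value written as 'md5:<hex>' or plain '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("TXT.zip", "md5:...")` with the checksum shown for that file on the record page should return `True` for an intact download. Reading in chunks keeps memory use flat even for the 4.6 GB archive.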
Additional details
Dates
- Updated: 2025-05-29