A Comprehensive Dataset for Webpage Classification (Part 3: Benign 2)
Description
This dataset, split across three parts due to Zenodo's size constraints, serves as a fundamental resource for enhancing webpage classification techniques. It encompasses 1,069,715 URLs, each annotated with labels to signify their categorization into Malicious, Benign, or Adult content, and further into 20 detailed sublabels for granular analysis. The dataset is designed to facilitate the evaluation and benchmarking of machine learning models, notably Stochastic Gradient Descent (SGD) and Support Vector Classifier (SVC), across a variety of tokenization methods and input types, including URLs, raw HTML, and parsed HTML content.
The primary objective of assembling this dataset is to support research into effective webpage classification, thereby improving content prioritization and filtering in web crawling applications. It has been meticulously curated to provide a robust framework for studying the impact of different feature representation techniques on classification accuracy.
The dataset is stored as JSON Lines (.jsonl) files, with each entry giving a URL's label, sublabel, source, status code, and HTML content. Each of the three parts covers specific content categories:
- Part 1: Adult & Malicious encompasses URLs classified under Adult and Malicious categories, offering insights into content that requires stringent filtering.
- Part 2: Benign 1 and Part 3: Benign 2 cover benign URLs, facilitating the study of safe web content and its classification nuances.
We also provide a .csv file without the HTML content, which is easier to work with when only the URLs are needed. This .csv file contains the following columns: `['uid', 'url', 'label', 'sublabel']`
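As a minimal sketch of working with the URL-only file, the snippet below counts rows per label and sublabel using Python's standard `csv` module. It assumes the file has a header row with the four columns listed above; the function name `count_labels`, the sample rows, and the sublabel value `News` are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Count rows per (label, sublabel) pair from an open CSV file object."""
    # Assumption: the first row is a header naming uid, url, label, sublabel.
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        counts[(row["label"], row["sublabel"])] += 1
    return counts


# Tiny in-memory stand-in for the real file; these rows are invented.
sample = io.StringIO(
    "uid,url,label,sublabel\n"
    "1,http://example.com/a,Benign,News\n"
    "2,http://example.com/b,Benign,News\n"
    "3,http://example.org/c,Adult,Adult\n"
)
label_counts = count_labels(sample)
```

For the real file, the same function could be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to wherever the .csv was downloaded.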
By providing this dataset, we aim to contribute significantly to the field of webpage classification, offering a valuable asset for researchers and practitioners looking to advance the state of web crawling technology and its applications.
Format of each JSON line:
{"url": "<URL>", "label": "<Label>", "sublabel": "<Sublabel>", "source": "<Source of this URL>", "status_code": <Status Code>, "html": "<HTML Textual Content>"}
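Because each line is an independent JSON object, the files can be streamed one record at a time instead of being loaded whole, which matters when the HTML content is large. The following is a minimal sketch under that assumption; the helper name `iter_records` and the sample records are hypothetical, and only the field names come from the format above.

```python
import io
import json


def iter_records(lines, wanted_label=None):
    """Yield one dict per non-empty JSON line, optionally filtered by label."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if wanted_label is None or record.get("label") == wanted_label:
            yield record


# In-memory stand-in for a .jsonl file; both records are invented.
sample = io.StringIO(
    '{"url": "http://example.com", "label": "Benign", "sublabel": "News", '
    '"source": "example", "status_code": 200, "html": "<html></html>"}\n'
    '{"url": "http://example.org", "label": "Adult", "sublabel": "Adult", '
    '"source": "example", "status_code": 200, "html": "<html></html>"}\n'
)
benign = list(iter_records(sample, wanted_label="Benign"))
```

With a downloaded part, the generator could be fed an open file handle directly, for example `iter_records(open(path, encoding="utf-8"))`, so that only one record is held in memory at a time.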
Other parts of this dataset:
- A Comprehensive Dataset for Webpage Classification (Part 1: Adult & Malicious)
- A Comprehensive Dataset for Webpage Classification (Part 2: Benign 1)
Citation
If you use this dataset, please cite us:
Al-Maamari, M., Istaiti, M., Zerhoudi, S., Dinzinger, M., Granitzer, M. and Mitrovic, J., A Comprehensive Dataset for Webpage Classification.
https://ca-roll.github.io/downloads/A_Comprehensive_Dataset_for_Webpage_Classification.pdf
Granitzer, M., Voigt, S., Fathima, N.A., Golasowski, M., Guetl, C., Hecking, T., Hendriksen, G., Hiemstra, D., Martinovič, J., Mitrović, J. and Mlakar, I., 2023. Impact and development of an Open Web Index for open web search. Journal of the Association for Information Science and Technology.
https://doi.org/10.1002/asi.24818
Files
A_Comprehensive_Dataset_for_Webpage_Classification.pdf
Additional details
Dates
- Available: 2023-03-05