A Comprehensive Dataset for Webpage Classification (Part 3: Benign 2)

Al-Maamari, Mohammed; Istaiti, Mahmoud; Zerhoudi, Saber; Dinzinger, Michael; Granitzer, Michael; Mitrovic, Jelena

doi:10.5281/zenodo.10795437

Published March 7, 2024 | Version v1

Dataset Open

This dataset, split across three parts due to Zenodo's size constraints, serves as a fundamental resource for enhancing webpage classification techniques. It encompasses 1,069,715 URLs, each annotated with labels to signify their categorization into Malicious, Benign, or Adult content, and further into 20 detailed sublabels for granular analysis. The dataset is designed to facilitate the evaluation and benchmarking of machine learning models, notably Stochastic Gradient Descent (SGD) and Support Vector Classifier (SVC), across a variety of tokenization methods and input types, including URLs, raw HTML, and parsed HTML content.

The primary objective of assembling this dataset is to support research into effective webpage classification, thereby improving content prioritization and filtering in web crawling applications. It has been meticulously curated to provide a robust framework for studying the impact of different feature representation techniques on classification accuracy.

The dataset is structured as JSON Lines (.jsonl) files, with each entry giving a URL's label, sublabel, source, status code, and HTML content. Each of the three parts targets specific content categories to keep the files manageable for researchers:

Part 1: Adult & Malicious encompasses URLs classified under Adult and Malicious categories, offering insights into content that requires stringent filtering.
Part 2: Benign 1 and Part 3: Benign 2 cover benign URLs, facilitating the study of safe web content and its classification nuances.

We also provide a .csv file without the HTML content, which makes it easier to work with the URLs only. This file contains the following columns: `['uid', 'url', 'label', 'sublabel']`.
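As a rough illustration of working with the URL-only file, the sketch below counts how many URLs fall under each (label, sublabel) pair. It assumes the .csv file starts with a header row that uses the column names listed above and that it is UTF-8 encoded; neither is confirmed by this page. The function names are hypothetical.

```python
import csv
from collections import Counter
from typing import Iterable


def count_labels(rows: Iterable[dict]) -> Counter:
    """Count rows per (label, sublabel) pair."""
    counts: Counter = Counter()
    for row in rows:
        counts[(row["label"], row["sublabel"])] += 1
    return counts


def count_labels_in_csv(path: str) -> Counter:
    """Stream the CSV row by row (assumes a header row and UTF-8 text)."""
    with open(path, newline="", encoding="utf-8") as f:
        return count_labels(csv.DictReader(f))
```

Calling `count_labels_in_csv("OWS_URL_DS.csv").most_common(5)` would then list the five most frequent label and sublabel combinations, provided the assumptions above hold.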

By providing this dataset, we aim to contribute significantly to the field of webpage classification, offering a valuable asset for researchers and practitioners looking to advance the state of web crawling technology and its applications.

JSON line format for each line:

{"url": "<URL>", "label": "<Label>", "sublabel": "<Sublabel>", "source": "<Source of this URL>", "status_code": <Status Code>, "html": "<HTML Textual Content>"}
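Because the .jsonl file in this part is tens of gigabytes, reading it one line at a time is likely more practical than loading it whole. The following is a minimal sketch of that approach, not the authors' tooling. It assumes each line is one complete JSON object with the keys shown above, that `status_code` is stored as an integer, and that the file is UTF-8 encoded. The function names are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        yield json.loads(line)


def iter_ok_pages(lines: Iterable[str], ok_status: int = 200) -> Iterator[dict]:
    """Yield only records whose status_code equals ok_status."""
    for record in iter_records(lines):
        if record.get("status_code") == ok_status:
            yield record
```

A file object can be passed directly, since iterating over it yields lines lazily, for example `with open("benign_OWS_URL_HTML_DS_part_2.jsonl", encoding="utf-8") as f:` followed by `for record in iter_ok_pages(f): ...`.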

Other parts of this dataset:

Citation

If you use this dataset, please cite us:

Al-Maamari, M., Istaiti, M., Zerhoudi, S., Dinzinger, M., Granitzer, M. and Mitrovic, J., A COMPREHENSIVE DATASET FOR WEBPAGE CLASSIFICATION.

https://ca-roll.github.io/downloads/A_Comprehensive_Dataset_for_Webpage_Classification.pdf

Granitzer, M., Voigt, S., Fathima, N.A., Golasowski, M., Guetl, C., Hecking, T., Hendriksen, G., Hiemstra, D., Martinovič, J., Mitrović, J. and Mlakar, I., 2023. Impact and development of an Open Web Index for open web search. Journal of the Association for Information Science and Technology.

https://doi.org/10.1002/asi.24818

Files

Files (38.5 GB)

Name	MD5	Size
A_Comprehensive_Dataset_for_Webpage_Classification.pdf	9d6202d2cb9726e2711d2f82c9bd470d	235.7 kB
benign_OWS_URL_HTML_DS_part_2.jsonl	2e34bd71def78006e29247edb639bcfa	38.4 GB
OWS_URL_DS.csv	9464c466a5b3aa20b9dc7fe4edb3643d	94.1 MB
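The MD5 values listed with the files can be used to check that a download completed intact. The sketch below is one way to do this with Python's standard `hashlib`, reading in chunks so that the large .jsonl file never has to fit in memory. The function names and the 1 MiB chunk size are illustrative choices, not part of the dataset.

```python
import hashlib
from typing import Iterable


def md5_of_chunks(chunks: Iterable[bytes]) -> str:
    """Return the hex MD5 digest of a sequence of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks (1 MiB by default)."""
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes
        return md5_of_chunks(iter(lambda: f.read(chunk_size), b""))
```

For example, `md5_of_file("OWS_URL_DS.csv")` should equal the checksum listed for that file if the download is intact.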

Additional details

European Commission
OpenWebSearch.EU - Piloting a Cooperative Open Web Search Infrastructure to Support Europe's Digital Sovereignty 101070014

Available: 2023-03-05

	All versions	This version
Views	193	193
Downloads	276	276
Data volume	4.4 TB	4.4 TB
