Published March 7, 2024 | Version v1
Dataset Open

A Comprehensive Dataset for Webpage Classification (Part 3: Benign 2)

Description

This dataset is a resource for webpage classification research, published in three parts because of Zenodo's size limits. It contains 1,069,715 URLs, each labeled as Malicious, Benign, or Adult and further categorized into 20 sublabels for finer-grained analysis. It is designed for evaluating and benchmarking machine learning models, notably Stochastic Gradient Descent (SGD) and Support Vector Classifier (SVC), across a variety of tokenization methods and input types, including URLs, raw HTML, and parsed HTML content.

The primary objective of assembling this dataset is to support research into effective webpage classification, thereby improving content prioritization and filtering in web crawling applications. It has been meticulously curated to provide a robust framework for studying the impact of different feature representation techniques on classification accuracy.

The dataset is structured as JSON Lines (.jsonl) files, with each entry giving a URL's label, sublabel, source, status code, and HTML content. Each of the three parts covers specific content categories:

  • Part 1: Adult & Malicious encompasses URLs classified under Adult and Malicious categories, offering insights into content that requires stringent filtering.
  • Part 2: Benign 1 and Part 3: Benign 2 cover benign URLs, facilitating the study of safe web content and its classification nuances.

We also provide a .csv file without the HTML content, which makes it easier to work with the URLs alone. This .csv file contains the following columns: `['uid', 'url', 'label', 'sublabel']`
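As a rough illustration of working with the URL-only file, the sketch below counts rows per label using Python's standard `csv` module. It assumes the file starts with a header row naming the four columns listed above. The sample rows, including the sublabel values, are made up for the example and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Count rows per label in a CSV with columns uid, url, label, sublabel."""
    reader = csv.DictReader(csv_file)  # assumes the first row is a header
    counts = Counter()
    for row in reader:
        counts[row["label"]] += 1
    return counts


# Tiny in-memory stand-in for the real file; all values are illustrative.
sample = io.StringIO(
    "uid,url,label,sublabel\n"
    "1,http://example.com/,Benign,News\n"
    "2,http://example.org/,Benign,Sports\n"
)
label_counts = count_labels(sample)
print(label_counts)
```

With the downloaded file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the local copy of the .csv file.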

By providing this dataset, we aim to contribute significantly to the field of webpage classification, offering a valuable asset for researchers and practitioners looking to advance the state of web crawling technology and its applications.

 

Format of each JSON line:

{"url": "<URL>", "label": "<Label>", "sublabel": "<Sublabel>", "source": "<Source of this URL>", "status_code": <Status Code>, "html": "<HTML Textual Content>"}
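Because the HTML makes these files large, it is usually preferable to stream them one line at a time instead of loading them whole. The following is a minimal sketch under a few assumptions: the data is available as plain, uncompressed JSON Lines text, `status_code` is stored as an integer (as the unquoted placeholder above suggests), and filtering on status 200 is only an example criterion. The sample records and the `example-source` value are invented for the demonstration.

```python
import io
import json


def iter_records(lines):
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.loads(line)


def successful_pages(lines):
    """Yield (url, label, sublabel) for records with status 200 and non-empty HTML."""
    for rec in iter_records(lines):
        if rec.get("status_code") == 200 and rec.get("html"):
            yield rec["url"], rec["label"], rec["sublabel"]


# Two made-up records in the documented format, used in place of a real file.
sample = io.StringIO(
    '{"url": "http://example.com/", "label": "Benign", "sublabel": "News", '
    '"source": "example-source", "status_code": 200, "html": "<html></html>"}\n'
    '{"url": "http://example.org/", "label": "Benign", "sublabel": "News", '
    '"source": "example-source", "status_code": 404, "html": ""}\n'
)
pages = list(successful_pages(sample))
print(pages)
```

Any iterable of text lines works as input, so an open file handle can be passed directly and only one record is held in memory at a time.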


Other parts of this dataset:

 

Citation

If you use this dataset, please cite:

Al-Maamari, M., Istaiti, M., Zerhoudi, S., Dinzinger, M., Granitzer, M. and Mitrovic, J. A Comprehensive Dataset for Webpage Classification.

https://ca-roll.github.io/downloads/A_Comprehensive_Dataset_for_Webpage_Classification.pdf

 

Granitzer, M., Voigt, S., Fathima, N.A., Golasowski, M., Guetl, C., Hecking, T., Hendriksen, G., Hiemstra, D., Martinovič, J., Mitrović, J. and Mlakar, I., 2023. Impact and development of an Open Web Index for open web search. Journal of the Association for Information Science and Technology.

https://doi.org/10.1002/asi.24818

Files

A_Comprehensive_Dataset_for_Webpage_Classification.pdf

Files (38.5 GB)

Checksum Size
md5:9d6202d2cb9726e2711d2f82c9bd470d 235.7 kB
md5:2e34bd71def78006e29247edb639bcfa 38.4 GB
md5:9464c466a5b3aa20b9dc7fe4edb3643d 94.1 MB
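
The MD5 checksums listed above can be used to confirm that a download completed intact, which is worth doing for a file of this size. The sketch below uses Python's standard `hashlib` and reads the file in chunks so that it does not have to fit in memory. The function names and the chunk size are illustrative choices, not part of the dataset's tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing(path, "md5:2e34bd71def78006e29247edb639bcfa")` would check a local copy of the 38.4 GB file, where `path` is wherever that file was saved.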

Additional details

Funding

European Commission
OpenWebSearch.EU - Piloting a Cooperative Open Web Search Infrastructure to Support Europe's Digital Sovereignty (grant 101070014)

Dates

Available
2023-03-05