Published April 27, 2024 | Version v1
Dataset Open

Extracted external domains from Wikipedia dump - 20/03/2024

Description

This dataset contains 6,459,779 distinct domains extracted from the external links sections of Wikipedia pages in the 20 March 2024 dump.

For example, the external links section of the OpenWeb page contains only one link.

The dataset was assembled primarily to improve content prioritization and filtering in web crawling.

The dataset is structured as a text file, with each line representing a distinct domain.
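
Because the file holds one domain per line, it can be loaded into a set and used as a simple crawl filter. The following is a minimal sketch, not the authors' tooling: the function names `load_domains` and `is_known_domain` are hypothetical, and the sketch assumes entries are plain host names (no scheme or path) that should be compared case-insensitively. Whether the file lists registered domains or full host names is not stated, so the lookup also tries parent domains of the URL's host.

```python
from urllib.parse import urlsplit


def load_domains(path):
    """Read one domain per line into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def is_known_domain(url, domains):
    """Return True if the URL's host, or a parent domain of it, is in the set.

    Only candidates with at least two labels are tried, so a bare
    top-level domain or a single-label host never matches.
    """
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    return any(".".join(parts[i:]) in domains for i in range(len(parts) - 1))
```

A crawler could call `load_domains("20240320_external_links_wiki_domains.txt")` once at start-up and then pass each candidate URL to `is_known_domain` to decide whether to keep it or raise its priority.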


Files (135.2 MB)

Name: 20240320_external_links_wiki_domains.txt
Size: 135.2 MB
md5:18d336d352a1db217f99c603e54947cd
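
The MD5 checksum listed for the file can be used to confirm that a download is complete and uncorrupted. The sketch below is one way to do that with Python's standard library; the helper names `md5_of_file` and `matches_published_checksum` are hypothetical, and the file is read in chunks so that the 135.2 MB download is not held in memory at once.

```python
import hashlib

# Checksum as published in the Files section of this record.
EXPECTED_MD5 = "18d336d352a1db217f99c603e54947cd"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, expected=EXPECTED_MD5):
    """Compare a local file's MD5 digest with the published value."""
    return md5_of_file(path) == expected.lower()
```

Calling `matches_published_checksum("20240320_external_links_wiki_domains.txt")` on the downloaded file should return `True` if the copy is intact. MD5 is suitable here as an integrity check against transfer errors, not as protection against deliberate tampering.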

Additional details

Funding

European Commission
OpenWebSearch.EU - Piloting a Cooperative Open Web Search Infrastructure to Support Europe's Digital Sovereignty (grant 101070014)