Published March 1, 2021 | Version v1
Dataset Open

Sources for a reproducible IT blog corpus

Description

The dataset comprises the homepages of several hundred IT blogs and websites, hand-picked to represent discourse on questions at the intersection of technology and society in Germany and the United States.

The corresponding text collection can be reproduced by re-downloading the pages to the user’s local machine, which also brings the contents up to date: see https://zenodo.org/record/4552529 and https://github.com/adbar/trafilatura.
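As a rough illustration of that workflow, the sketch below reads a homepage list and extracts the main text of each page with trafilatura. It assumes the list file holds one URL per line, which this record does not state. The helper names `read_homepages` and `extract_texts` are hypothetical. Only `trafilatura.fetch_url` and `trafilatura.extract` come from the library. The sketch processes the listed homepages only; discovering and downloading individual posts is a separate step not shown here.

```python
from typing import Callable, Dict, Iterable, List, Optional


def read_homepages(path: str) -> List[str]:
    """Return unique, non-empty URLs in file order.

    Assumption: the list file contains one URL per line.
    """
    seen = set()
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def extract_texts(
    urls: Iterable[str],
    fetch: Optional[Callable[[str], Optional[str]]] = None,
    extract: Optional[Callable[[str], Optional[str]]] = None,
) -> Dict[str, str]:
    """Download each URL and keep the extracted main text.

    `fetch` and `extract` default to trafilatura's functions; they can be
    replaced (for example in tests) to avoid network access.
    """
    if fetch is None or extract is None:
        import trafilatura  # imported lazily so the helpers work without it

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract

    texts = {}
    for url in urls:
        downloaded = fetch(url)
        if downloaded is None:  # download failed, skip this page
            continue
        text = extract(downloaded)
        if text:  # skip pages with no extractable text
            texts[url] = text
    return texts


if __name__ == "__main__":
    # File name taken from this record; its location on disk is an assumption.
    homepages = read_homepages("IT-Blogs-DE-Homepages.txt")
    corpus = extract_texts(homepages[:5])  # small sample to stay polite
    for url, text in corpus.items():
        print(url, len(text))
```

Because the pages are downloaded at run time, the resulting texts reflect the current state of each site and not a frozen snapshot.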

The text corpus can also be searched online: https://www.dwds.de/d/korpora/it_blogs

Paper "A Reproducible IT-Blog Corpus": doi.org/10.5334/johd.35

Files

IT-Blogs-DE-Homepages.txt

Files (17.8 kB)

Checksum                              Size
md5:4295d0768dfda17d1e3d239f00d17d04  14.0 kB
md5:12f1c9032580e5d653e1b31c4b407ff6  3.8 kB
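
The MD5 checksums listed for the files can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib`. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the caller has to supply the local path and the checksum that belongs to that file.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_checksum(local_path, "md5:...")` returns `True` only when the file on disk matches the published digest.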