Report Open Access

BlogForever D2.6: Data Extraction Methodology

Stepanyan, K.; Gkotsis, G.; Pincent, E.; Banos, V.; Davis, R.

This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.


