BlogForever D2.6: Data Extraction Methodology

doi:10.5281/zenodo.7490

Published October 25, 2013 | Version v1

Report Open

This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
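
The idea of combining a blog's RSS feed with its HTML pages can be illustrated with a small sketch. The code below is not the algorithm from the report. It is a minimal illustration that assumes the feed's field values (title, description) also appear as text in the post's HTML page, and it uses plain string similarity rather than machine learning to find the element that most likely holds each field. The sample feed, the sample page, and the names `TextByPath`, `rss_items`, `best_path`, and `learn_paths` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TextByPath(HTMLParser):
    """Collect the text found inside each element, keyed by a simple tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, e.g. ["html", "body", "div.post"]
        self.texts = {}   # path -> concatenated text seen directly at that path

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Naive assumption: well-formed markup without void elements such as <br>.
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def rss_items(rss_xml):
    """Return title and description for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        {"title": item.findtext("title", ""),
         "description": item.findtext("description", "")}
        for item in root.iter("item")
    ]


def best_path(html, value):
    """Return the element path whose text is most similar to the feed value."""
    parser = TextByPath()
    parser.feed(html)
    return max(
        parser.texts,
        key=lambda p: SequenceMatcher(None, parser.texts[p], value).ratio(),
    )


def learn_paths(rss_xml, html):
    """Map each feed field of the first item to its most likely HTML path."""
    first = rss_items(rss_xml)[0]
    return {field: best_path(html, value) for field, value in first.items()}


# Hypothetical sample data, invented for this illustration.
SAMPLE_RSS = (
    "<rss><channel><item>"
    "<title>Hello world</title>"
    "<description>First post body.</description>"
    "</item></channel></rss>"
)
SAMPLE_HTML = (
    '<html><body><div class="post">'
    '<h2 class="entry-title">Hello world</h2>'
    '<div class="entry-content">First post body.</div>'
    "</div></body></html>"
)

if __name__ == "__main__":
    print(learn_paths(SAMPLE_RSS, SAMPLE_HTML))
```

Paths learned this way from one post could then be tried on other pages of the same blog, on the assumption that a blog uses the same template for all its posts. Real pages would need a more tolerant parser and a similarity threshold to reject poor matches.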

Files (5.0 MB)

Name	Size	Checksum
BlogForever_D2_6_Data_Extraction.pdf	5.0 MB	md5:d371e0e12fd0844ec104519eeae2541c

	All versions	This version
Views	236	236
Downloads	14,243	14,243
Data volume	72.3 GB	72.3 GB

More info on how stats are collected....

Resource type: Report
Publisher: Zenodo

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: March 10, 2014
Modified: August 6, 2024