Published November 14, 2021 | Version v1
Conference paper Open

A Prescriptive Approach For Structured Information Extraction From Web Forums And Social Media

  • 1. CENTRIC, Sheffield Hallam University

Description

In this paper we present ongoing research into extracting highly structured data - such as authors, posts, the links between them, and the metadata about them - from social media and forums using a prescriptive approach, building upon simple observations and generalised rules. This method uses techniques designed around identifying content based on text features, such as text density, and combines them with simple rules derived from studying the common structures of the target web pages to infer and extract structure from structured data.
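As a rough illustration of the kind of text feature mentioned above, the following sketch computes a simple text density for an HTML fragment, defined here as visible characters per opening tag. It is not the method used in the paper: the density definition, the names `DensityParser`, `text_density` and `looks_like_post`, and the threshold of 20 are all assumptions chosen for this example.

```python
from html.parser import HTMLParser


class DensityParser(HTMLParser):
    """Counts visible text characters and opening tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        # Every opening tag adds markup "weight" to the fragment.
        self.tag_count += 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not inflate the count.
        self.text_chars += len(data.strip())


def text_density(fragment: str) -> float:
    """Visible characters per opening tag (assumed definition, not the paper's)."""
    parser = DensityParser()
    parser.feed(fragment)
    parser.close()
    return parser.text_chars / max(parser.tag_count, 1)


def looks_like_post(fragment: str, threshold: float = 20.0) -> bool:
    """Hypothetical rule: dense text suggests post content rather than navigation."""
    return text_density(fragment) >= threshold


nav = '<ul><li><a href="/">Home</a></li><li><a href="/faq">FAQ</a></li></ul>'
post = ('<div><p>I had the same problem last week and fixed it '
        'by clearing the cache.</p></div>')

print(looks_like_post(nav), looks_like_post(post))
```

A navigation list is mostly markup and scores low, whereas a post body is mostly text and scores high. A real system would combine a feature like this with structural rules about the page, as the description indicates.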

We discuss observations made from studying a number of social media web sites and forums and present the simple rules for post, content and attribute identification developed from these observations. We also present the structured format used to store the extracted data and some of the benefits of this structure. Next, we give initial experimental results, showing that the proposed approach can achieve accuracies above 90% for identifying posts, 70% for extracting content from these posts, and 50-70% for extracting additional attributes about the posts and their authors. We highlight factors influencing these results, before finally detailing the next steps for this research.

Our research shows that it is possible to achieve reasonable levels of accuracy for extracting structured data using an approach that requires no training and is transferable between different social media and web forums with no additional input necessary. This approach thus promises considerable efficiency gains compared to the training involved with current machine learning-based approaches, whilst maintaining reasonable performance.

Files

A Prescriptive Approach For Structured Information Extraction From Web Forums And Social Media.pdf