Published July 13, 2022 | Version v1
Presentation Open

The bibliography jungle: using random forests to predict reference sections in dissertations

  • 1. University of Leipzig
  • 2. Helmholtz-Centre for Environmental Research

Description

Lists of works cited in Humanities dissertations are typically the result of five years of work. However, despite the long-standing tradition of reference mining, no research has systematically tapped the bibliographic data of existing electronic thesis collections. One of the main reasons for this is the difficulty of creating a tagged gold standard for theses that run to around 300 pages. In this short paper, we propose a page-based random forest (RF) prediction approach that uses a new corpus of Literary Studies dissertations from Germany. We also explain the handcrafted but computationally informed feature-selection process. The evaluation shows that this method achieves an F1 score of 0.88 on this new dataset. In addition, the method rests on an interpretable model, in which the relevance of each feature to the prediction is clear, and it incorporates a simplified annotation process.
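
As a rough illustration of what a page-based setup could look like, the sketch below computes a few handcrafted features for one page of text and an F1 score over page-level labels. The feature names (`year_line_ratio`, `comma_per_line`, `initial_caps_ratio`) and the sample pages are assumptions made for this example only; the abstract does not list the features actually used. The classifier itself is left out: feature dictionaries like these could be passed to any random forest implementation, for instance scikit-learn's `RandomForestClassifier`.

```python
import re

# Hypothetical features for illustration; not the feature set from the presentation.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def page_features(page_text):
    """Return simple per-page features that tend to differ on reference pages."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    n = len(lines) or 1  # avoid division by zero on empty pages
    return {
        # share of lines that mention a plausible publication year
        "year_line_ratio": sum(1 for ln in lines if YEAR.search(ln)) / n,
        # average number of commas per non-empty line
        "comma_per_line": page_text.count(",") / n,
        # share of lines starting with an uppercase letter (e.g. a surname)
        "initial_caps_ratio": sum(1 for ln in lines if ln.lstrip()[0].isupper()) / n,
    }


def f1_score(y_true, y_pred):
    """F1 for binary page labels (1 = reference page, 0 = other page)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up sample pages, one bibliography-like and one body-text-like.
reference_page = (
    "Adorno, Theodor W.: Ästhetische Theorie. Frankfurt 1970.\n"
    "Benjamin, Walter: Illuminationen. Frankfurt 1977."
)
body_page = "the argument of this chapter is that\nnarrative form shapes memory."

ref_feats = page_features(reference_page)
body_feats = page_features(body_page)
```

Labelling whole pages rather than individual reference strings is what keeps annotation cheap in a setup like this: an annotator only marks which pages belong to the reference section.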

Files

Ulite_Workshop.pdf

Files (1.8 MB)

md5:0f2db39224a1a68fe37d2a26c5ac4f24