Published July 25, 2017 | Version v1
Journal article Open

Crowdsourcing Data Enhancements to Improve Named Entity Recognition in the Biodiversity Heritage Library

Creators

  • 1. Museum of Comparative Zoology, Harvard University, Cambridge, MA, United States of America

Description

The Biodiversity Heritage Library's (BHL) holdings include dozens of manuscript collections that are largely hidden due to minimal descriptive metadata and the absence of machine-readable facsimiles. Transcription projects for collections are time-consuming, intellectually intensive, and expensive for an organization to facilitate. Crowdsourcing has been identified as a sustainable model for generating transcriptions for large collections and institutions with diverse holdings, and may improve data collection from a diverse range of users to enhance descriptive metadata. Manuscript transcriptions must also fit into the larger BHL objective of producing and making available large-scale datasets for users and researchers to study and manipulate. While generating only full-text transcriptions will improve discoverability, items remain isolated from other literature and collections. The Global Names Architecture (GNA) machine learning and named entity recognition algorithms allow BHL to extract, index, and attach page records to scientific names in order to create additional access points to biodiversity literature. These tools depend on imperfect optical character recognition (OCR) output, which contains spelling and layout errors and outdated naming conventions; this output can be added to BHL's larger crowdsourcing platform to solicit corrections. Transcriptions, similarly, introduce antiquated taxonomic data and common names and must be optimized for use with GNA's name-finding tools. Improving named entity recognition by correcting OCR output and enhancing transcribed manuscript items will allow BHL to better connect content across collections and ultimately provide a broader and more complete picture of biodiversity.

Files

BISS_article_17354.pdf

Files (58.8 kB)

md5:c9ac0c17caf4bd5bd2e2f8e2c27b4572 (51.9 kB)
md5:e8d2fb5d960d82d87d7cfbf3092bcef4 (6.9 kB)

Linked records