Published September 26, 2023 | Version v1
Conference paper Open

Large Synthetic Data from the arχiv for OCR Post Correction of Historic Scientific Articles

  • 1. School of Information Sciences, University of Illinois, Urbana-Champaign
  • 2. Harvard-Smithsonian Center for Astrophysics

Description

Scientific articles published prior to the “age of digitization” (∼1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (ADS). By mining the arχiv we create what is, to the authors’ knowledge, the largest scientific synthetic ground truth/OCR post correction dataset, comprising 203,354,393 character pairs. We provide baseline models trained with this dataset and find mean improvements in character and word error rates of 7.71% and 18.82%, respectively, for historical OCR text. When used to classify parts of sentences as inline math, the models reach a classification F1 score of 77.82%. Interactive dashboards to explore the dataset are available online: https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and data and code, within the limitations of our agreement with the arχiv, are hosted on GitHub: https://github.com/ReadingTimeMachine/ocr_post_correction.
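For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how character and word error rates are commonly computed from a ground truth/OCR pair, using Levenshtein edit distance normalized by the length of the ground truth. The function names (`edit_distance`, `cer`, `wer`) are illustrative and not taken from the released code, and the exact normalization and averaging used to obtain the reported mean improvements may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(truth, ocr):
    """Character error rate: character edits per ground-truth character."""
    return edit_distance(truth, ocr) / max(len(truth), 1)


def wer(truth, ocr):
    """Word error rate: word edits per ground-truth word (whitespace tokens)."""
    truth_words = truth.split()
    return edit_distance(truth_words, ocr.split()) / max(len(truth_words), 1)


if __name__ == "__main__":
    # Toy pair, not drawn from the dataset.
    truth = "the stellar mass function"
    ocr = "the stel1ar mass functlon"
    print(cer(truth, ocr), wer(truth, ocr))
```

A single wrong character corrupts a whole word, so word error rates are typically larger than character error rates for the same text.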

NOTE: as of 01/01/2024, all new versions of the data and model weights will be hosted on the Reading Time Machine Hugging Face page with the tag "sgt-ocr": https://huggingface.co/ReadingTimeMachine

Files

historical_groundtruths.zip

Files (7.2 GB)

MD5 checksum Size
md5:f97065a6d0b8485ff424f8eb5ee4bb8f 3.7 MB
md5:c843f6699da82610ab6dd2c7cc4130a2 6.6 GB
md5:87dcb56fd0b13d8cf85649fccfe7df19 12.1 MB
md5:0e36163489e328a5bf877a1df56dd636 608.0 MB
md5:0b8fc64081a6166320f262020ae341e9 12.1 MB
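The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are illustrative, and the file path in the usage comment is a placeholder, since this listing does not show which file name corresponds to which checksum.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or plain '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (placeholder path; substitute the downloaded file and its listed checksum):
# matches_checksum("path/to/downloaded_file", "md5:<checksum from the list above>")
```

Reading in chunks matters here because the largest file in the record is several gigabytes.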

Additional details

Related works

Is cited by
arXiv:2309.11549 (arXiv)