Published May 26, 2025 | Version v1
Poster Open

Page images classification for content-specific data processing

  • 1. Charles University
  • 2. Czech Academy of Sciences, Institute of Archaeology, Prague

Description

Digitization projects in archaeology often generate vast quantities of page images from historical documents, which makes manual sorting and analysis a significant challenge. These archives contain diverse content, including various text types (handwritten, typed, printed), graphical elements (drawings, maps, photos), and layouts (plain text, tables, forms). Processing this heterogeneous data efficiently requires automated methods that categorize pages by their content, so that each page can be passed to a tailored downstream analysis pipeline. This project addresses that need by developing and evaluating an image classification system designed specifically for historical document pages. We fine-tuned a Vision Transformer (ViT) model, starting from the google/vit-base-patch16-224 checkpoint, on a custom dataset relevant to archival and potentially archaeological materials. The dataset consists of 8950 manually annotated page images sourced from historical archival documents, categorized into 11 distinct classes based on the presence and type of text (handwritten, printed, typed), graphical elements (drawings, photos), and the presence of tabular layouts or forms. These categories were chosen to facilitate content-specific processing workflows by separating pages that require different analysis techniques (e.g., OCR for text, image analysis for graphics). The model was trained using standard deep learning practices, including image augmentations such as color jittering and Gaussian blur, and evaluated on a held-out test set. The fine-tuned model (ufal/vit-historical-page, available on Hugging Face) achieves high classification accuracy on that test set, which demonstrates that ViT models can accurately sort complex historical page images.
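As an illustration of the setup described above, here is a minimal Python sketch of how the published model might be queried and how the named augmentations could be composed. It is not taken from the project repository. The model name ufal/vit-historical-page comes from the description; the helper names, the augmentation strengths, and the choice of the Hugging Face `pipeline` API and torchvision transforms are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

# Model name as given in the description; everything else here is illustrative.
MODEL_ID = "ufal/vit-historical-page"

Prediction = Dict[str, object]
Classifier = Callable[[str], List[Prediction]]


def load_classifier(model_id: str = MODEL_ID) -> Classifier:
    """Build an image-classification pipeline.

    Requires the transformers, torch, and Pillow packages and downloads
    the model weights on first use, so the import is done lazily.
    """
    from transformers import pipeline

    return pipeline("image-classification", model=model_id)


def build_train_augmentations():
    """Compose the two augmentations named in the description.

    The strengths (0.2, kernel size 3) are assumed values, not the
    settings used to train the published model.
    """
    from torchvision import transforms

    return transforms.Compose([
        transforms.Resize((224, 224)),  # input size of vit-base-patch16-224
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.GaussianBlur(kernel_size=3),
        transforms.ToTensor(),
    ])


def top_prediction(classifier: Classifier, image_path: str) -> Tuple[str, float]:
    """Return the highest-scoring (label, score) pair for one page image."""
    predictions = classifier(image_path)
    best = max(predictions, key=lambda p: p["score"])
    return str(best["label"]), float(best["score"])
```

With the dependencies installed, `top_prediction(load_classifier(), "page_0001.png")` would return the predicted category and its score for a hypothetical file `page_0001.png`; the label strings are whatever the model configuration defines.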
This automated classification system gives researchers and archivists a tool for streamlining the initial processing stages of large digital archives, enabling the more efficient, content-aware analysis that digital archaeology methods rely on.
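The content-aware processing mentioned here can be pictured as a routing step that sends each classified page to a matching downstream queue. The Python sketch below is only an illustration under stated assumptions: the category names, the pipeline names, and the 0.5 confidence threshold are hypothetical and do not reproduce the 11 classes or the workflow used in the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical category-to-pipeline mapping; the real class names differ.
ROUTES: Dict[str, str] = {
    "printed_text": "ocr",
    "typed_text": "ocr",
    "handwritten_text": "htr",
    "drawing": "image_analysis",
    "photo": "image_analysis",
    "table": "table_extraction",
}
FALLBACK = "manual_review"  # assumed queue for uncertain or unknown pages


def route_pages(
    predictions: Iterable[Tuple[str, str, float]],
    min_score: float = 0.5,
) -> Dict[str, List[str]]:
    """Group page paths by downstream pipeline.

    Each prediction is a (path, label, score) tuple. Pages with a score
    below min_score, or with a label missing from ROUTES, go to FALLBACK.
    """
    queues: Dict[str, List[str]] = defaultdict(list)
    for path, label, score in predictions:
        if score < min_score:
            target = FALLBACK
        else:
            target = ROUTES.get(label, FALLBACK)
        queues[target].append(path)
    return dict(queues)
```

Sending low-confidence pages to a review queue is one possible design choice; it keeps uncertain pages out of automated pipelines at the cost of some manual work.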

Files

poster.pdf

Files (1.1 MB)

md5:a5722836d4d8333f8589a16d359aad23

Additional details

Funding

European Commission
ATRIUM - Advancing FronTier Research In the Arts and hUManities 101132163

Software

Repository URL
https://github.com/ARUP-CAS/atrium-page-classification
Programming language
Python, Shell
Development Status
Active