Page images classification for content-specific data processing

Lutsai, Kateryna; Křivánková, Dana

doi:10.5281/zenodo.18390587

Published May 26, 2025 | Version v1

Poster Open

1. Charles University
2. Czech Academy of Sciences, Institute of Archaeology, Prague

Digitization projects in archaeology often generate vast quantities of page images from historical documents, which poses significant challenges for manual sorting and analysis. These archives contain diverse content, including various text types (handwritten, typed, printed), graphical elements (drawings, maps, photos), and layouts (plain text, tables, forms). Processing this heterogeneous data efficiently requires automated methods that categorize pages by content, so that each page can be sent to a tailored downstream analysis pipeline. This project addresses that need by developing and evaluating an image classification system designed specifically for historical document pages. We fine-tuned a Vision Transformer (ViT) model, google/vit-base-patch16-224, on a custom dataset relevant to archival and potentially archaeological materials. The dataset consists of 8950 manually annotated page images sourced from historical archival documents, categorized into 11 distinct classes based on the presence and type of text (handwritten, printed, typed), graphical elements (drawings, photos), and the presence of tabular layouts or forms. These categories were chosen to facilitate content-specific processing workflows by separating pages that require different analysis techniques (e.g., OCR for text, image analysis for graphics). The model was trained using standard deep learning practices, including image augmentations such as color jittering and Gaussian blur, and evaluated on a held-out test set. The fine-tuned model (ufal/vit-historical-page, available on Hugging Face) achieves high classification accuracy on the evaluation set, demonstrating the effectiveness of ViT models for sorting complex historical page images.
This automated classification system offers a powerful tool for researchers and archivists, streamlining the initial processing stages of large digital archives and enabling more efficient, content-aware analysis crucial for digital archaeology methods.
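
The content-specific routing described in the abstract can be sketched as below. This is a minimal illustration under stated assumptions, not the project's code. The label names, the pipeline names, the confidence threshold, and the helpers `route_page` and `group_by_pipeline` are all hypothetical, because the abstract does not list the 11 class names or say how predictions are consumed. Only "OCR for text" and "image analysis for graphics" come from the abstract; the other pipelines are placeholders. Each prediction is assumed to arrive from the classifier as a (page id, label, score) triple.

```python
from collections import defaultdict

# Hypothetical label names. The abstract describes the kinds of content
# (handwritten/printed/typed text, drawings, photos, tables, forms) but
# does not give the actual names of the 11 classes.
PIPELINE_BY_LABEL = {
    "text_printed": "ocr",
    "text_typed": "ocr",
    "text_handwritten": "handwriting_recognition",  # assumed pipeline
    "drawing": "image_analysis",
    "photo": "image_analysis",
    "table": "table_extraction",  # assumed pipeline
    "form": "table_extraction",   # assumed pipeline
}

FALLBACK_PIPELINE = "manual_review"  # assumed handling of uncertain pages
MIN_CONFIDENCE = 0.5                 # assumed threshold, not from the source


def route_page(label, score):
    """Return the pipeline name for one predicted label and its score."""
    if score < MIN_CONFIDENCE:
        return FALLBACK_PIPELINE
    return PIPELINE_BY_LABEL.get(label, FALLBACK_PIPELINE)


def group_by_pipeline(predictions):
    """Group page ids by pipeline.

    `predictions` is an iterable of (page_id, label, score) triples.
    """
    batches = defaultdict(list)
    for page_id, label, score in predictions:
        batches[route_page(label, score)].append(page_id)
    return dict(batches)


if __name__ == "__main__":
    demo = [
        ("page_001", "text_typed", 0.97),
        ("page_002", "photo", 0.88),
        ("page_003", "table", 0.31),  # below threshold, goes to review
    ]
    for pipeline, pages in group_by_pipeline(demo).items():
        print(pipeline, pages)
```

Sending low-confidence or unknown labels to a manual review batch is one possible design choice. It keeps misclassified pages out of pipelines that would process them incorrectly.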

Files (1.1 MB)

Name	Size
poster.pdf (md5:a5722836d4d8333f8589a16d359aad23)	1.1 MB

Additional details

European Commission
ATRIUM - Advancing FronTier Research In the Arts and hUManities 101132163

Repository URL: https://github.com/ARUP-CAS/atrium-page-classification
Programming language: Python, Shell
Development Status: Active