Page images classification for content-specific data processing
Authors/Creators
Description
Digitization projects in archaeology often generate vast quantities of page images from historical documents, which makes manual sorting and analysis impractical. These archives contain diverse content, including various text types (handwritten, typed, printed), graphical elements (drawings, maps, photos), and layouts (plain text, tables, forms). Processing this heterogeneous data efficiently requires automated methods that categorize pages by content, so that each category can be sent to a tailored downstream analysis pipeline. This project addresses this need by developing and evaluating an image classification system designed specifically for historical document pages. We fine-tuned a Vision Transformer (ViT) model, google/vit-base-patch16-224, on a custom dataset of archival and potentially archaeological materials. The dataset consists of 8950 manually annotated page images sourced from historical archival documents, categorized into 11 distinct classes based on the presence and type of text (handwritten, printed, typed), graphical elements (drawings, photos), and the presence of tabular layouts or forms. These categories were chosen to facilitate content-specific processing workflows by separating pages that require different analysis techniques (e.g., OCR for text, image analysis for graphics). The model was trained using standard deep learning practices, including image augmentations such as color jittering and Gaussian blur, and evaluated on a held-out test set. The fine-tuned model (ufal/vit-historical-page, available on Hugging Face) achieves high classification accuracy on the evaluation set, demonstrating the effectiveness of ViT models for sorting complex historical page images.
This automated classification system offers a powerful tool for researchers and archivists, streamlining the initial processing stages of large digital archives and enabling more efficient, content-aware analysis crucial for digital archaeology methods.
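The content-specific routing described above can be illustrated with a minimal sketch. It assumes the classifier's top-1 predictions are already available as (filename, label, score) tuples. The label names, pipeline names, the `route_pages` helper, and the 0.5 confidence threshold are all hypothetical placeholders chosen for illustration; they are not the model's actual 11 class labels or the project's own code.

```python
from collections import defaultdict

# Hypothetical label and pipeline names, for illustration only.
# The real class labels are defined by the published model, not here.
ROUTES = {
    "text_printed": "ocr",
    "text_typed": "ocr",
    "text_handwritten": "handwriting_recognition",
    "table": "table_extraction",
    "form": "table_extraction",
    "drawing": "image_analysis",
    "photo": "image_analysis",
}


def route_pages(predictions, min_score=0.5, fallback="manual_review"):
    """Group page files by the downstream pipeline that should handle them.

    predictions: iterable of (filename, label, score) tuples, for example
    the top-1 output of a page classifier. Pages with an unrouted label or
    a score below min_score (an assumed threshold) go to the fallback queue.
    """
    queues = defaultdict(list)
    for filename, label, score in predictions:
        pipeline = ROUTES.get(label) if score >= min_score else None
        queues[pipeline or fallback].append(filename)
    return dict(queues)


if __name__ == "__main__":
    sample = [
        ("page_001.png", "text_printed", 0.97),
        ("page_002.png", "photo", 0.88),
        ("page_003.png", "text_handwritten", 0.41),  # low confidence
        ("page_004.png", "map", 0.93),  # label with no route defined
    ]
    for pipeline, pages in route_pages(sample).items():
        print(pipeline, pages)
```

Sending low-confidence and unrouted pages to a manual review queue is one possible design choice; it keeps uncertain predictions out of the automated pipelines rather than forcing every page into a category.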
Files
| Name | Size | Checksum |
|---|---|---|
| poster.pdf | 1.1 MB | md5:a5722836d4d8333f8589a16d359aad23 |
Additional details
Funding
Software
- Repository URL: https://github.com/ARUP-CAS/atrium-page-classification
- Programming language: Python, Shell
- Development Status: Active