Published December 20, 2025 | Version 1.1.0
Software Open

Batch OCR Pipeline for Archival Materials using Mistral AI

  • Research Library for the History of Education

Description

A Jupyter-based batch OCR pipeline that processes documents (PDF, DOCX, PPTX) and images (PNG, JPEG, etc.) with the Mistral OCR API. This release updates the documentation.

The pipeline handles printed text, forms, tables, and handwriting. Handwriting recognition works for modern scripts; historical German scripts such as Kurrent remain challenging.

Features

  • Structured text extraction (headings, paragraphs, footnotes) in Markdown format
  • Processing of forms, invoices, and documents with mixed layouts
  • Handwriting recognition for annotations and cursive text
  • Automatic PDF splitting for large documents (>50 MB or >1000 pages)
  • SQLite checkpoint system for resumable processing
  • Zero Data Retention by default for Mistral OCR
  • Always uses the latest Mistral OCR model (currently mistral-ocr-2512)
  • Multilingual OCR support (scripts and languages across all continents, including German, Arabic, Chinese, and many others)
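
The automatic splitting rule can be illustrated with a small planning function. This is a minimal sketch rather than the pipeline's actual code: the function name plan_chunks is hypothetical, "50 MB" is assumed to mean 50 × 1024 × 1024 bytes, and pages are assumed to be roughly equal in size. Counting pages and writing the chunks would need a PDF library, which is left out here.

```python
import math

MAX_BYTES = 50 * 1024 * 1024  # 50 MB limit from the feature list (assumed binary MB)
MAX_PAGES = 1000              # page limit from the feature list


def plan_chunks(size_bytes: int, page_count: int) -> list[tuple[int, int]]:
    """Return 1-based inclusive (first_page, last_page) ranges, one per chunk.

    Hypothetical helper: assumes pages are roughly uniform in size, so a file
    that is too large is cut into equally long page ranges.
    """
    if page_count <= 0:
        return []
    # Chunks needed to satisfy each limit on its own.
    by_pages = math.ceil(page_count / MAX_PAGES)
    by_size = math.ceil(size_bytes / MAX_BYTES)
    # Take the stricter requirement, but never more chunks than pages.
    n_chunks = min(max(by_pages, by_size, 1), page_count)
    pages_per_chunk = math.ceil(page_count / n_chunks)
    return [
        (start, min(start + pages_per_chunk - 1, page_count))
        for start in range(1, page_count + 1, pages_per_chunk)
    ]


if __name__ == "__main__":
    # A 120 MB scan with 300 pages is planned as three 100-page chunks.
    print(plan_chunks(120 * 1024 * 1024, 300))
```

A file within both limits yields a single range covering the whole document, so the same code path can serve small and large inputs.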

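Resumable processing with a SQLite checkpoint can be sketched as below. The table name, column names, and function names are assumptions made for illustration and are not taken from the repository. The idea is only that each finished file is recorded, so a restarted run skips it.

```python
import sqlite3


def open_checkpoint(db_path: str) -> sqlite3.Connection:
    """Open (or create) the checkpoint database. Schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "path TEXT PRIMARY KEY, "
        "status TEXT NOT NULL, "
        "updated_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()
    return conn


def mark(conn: sqlite3.Connection, path: str, status: str) -> None:
    """Record the latest status ('done' or 'failed') for one input file."""
    conn.execute(
        "INSERT OR REPLACE INTO processed (path, status) VALUES (?, ?)",
        (path, status),
    )
    conn.commit()


def pending(conn: sqlite3.Connection, paths: list[str]) -> list[str]:
    """Return the inputs that have not yet been completed successfully."""
    done = {
        row[0]
        for row in conn.execute("SELECT path FROM processed WHERE status = 'done'")
    }
    return [p for p in paths if p not in done]


if __name__ == "__main__":
    conn = open_checkpoint(":memory:")  # use a file path to persist between runs
    mark(conn, "a.pdf", "done")
    mark(conn, "b.pdf", "failed")
    print(pending(conn, ["a.pdf", "b.pdf", "c.pdf"]))  # failed and new files remain
```

Committing after every file keeps the checkpoint consistent if the notebook kernel is interrupted mid-batch, at the cost of a little extra disk I/O.
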
Quick Setup

  1. Open the repository in a Jupyter-compatible environment (e.g., VSCode with Jupyter extension)
  2. Get a Mistral API key from the Mistral console (console.mistral.ai)
  3. Rename .env.template to .env and add your API key
  4. Place your documents (PDF, DOCX, PPTX) or images (PNG, JPEG) in data/input/
  5. Run notebooks/ocr_pipeline.ipynb
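
Steps 3 and 4 can be checked before running the notebook with a short standard-library script. This is a sketch under assumptions: the variable name MISTRAL_API_KEY is a guess at what .env.template contains, the helper names are hypothetical, and the notebook itself may load the file differently (for example with a dotenv package).

```python
from pathlib import Path

# Extensions for the formats named in the setup steps.
SUPPORTED = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".jpeg"}


def load_env(env_path=".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines; comments and blank lines are skipped."""
    values: dict[str, str] = {}
    path = Path(env_path)
    if not path.exists():
        return values
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def collect_inputs(input_dir="data/input") -> list[Path]:
    """List supported files in the input folder, sorted by path."""
    root = Path(input_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


if __name__ == "__main__":
    api_key = load_env().get("MISTRAL_API_KEY")  # assumed variable name
    print("API key found:", bool(api_key))
    print("Input files:", [p.name for p in collect_inputs()])
```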

Output

  • Markdown files with preserved document structure
  • Plain text files for downstream processing
  • JSON metadata (model version, page count, processing time)
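
The three output files could be written along the following lines. This is an illustrative sketch only: the function names, the JSON field names, and the very rough Markdown-to-text conversion are assumptions, not the pipeline's actual implementation.

```python
import json
import re
from pathlib import Path


def markdown_to_text(markdown: str) -> str:
    """Crude Markdown stripping: heading markers, asterisk emphasis, backticks."""
    text = re.sub(r"^#{1,6}[ \t]+", "", markdown, flags=re.MULTILINE)
    return re.sub(r"\*\*|\*|`", "", text)


def write_outputs(out_dir, stem: str, pages: list[str], model: str, seconds: float) -> dict:
    """Write <stem>.md, <stem>.txt and <stem>.json; return the metadata dict."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    markdown = "\n\n".join(pages)  # one Markdown string per OCR'd page
    (out / f"{stem}.md").write_text(markdown, encoding="utf-8")
    (out / f"{stem}.txt").write_text(markdown_to_text(markdown), encoding="utf-8")
    meta = {
        "model": model,                           # model version
        "page_count": len(pages),                 # page count
        "processing_seconds": round(seconds, 2),  # processing time
    }
    (out / f"{stem}.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return meta


if __name__ == "__main__":
    pages = ["# Title\n\nSome **bold** text.", "Page two"]
    print(write_outputs("data/output", "example", pages, "mistral-ocr-2512", 1.234))
```

Keeping the metadata in a separate JSON file means the Markdown and plain-text outputs stay free of pipeline bookkeeping.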

Cost

  • Experiment accounts available without cost via console.mistral.ai
  • Approximately $1-2 USD per 1000 pages (Mistral OCR API)
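
For budgeting a batch, the quoted range translates into a simple estimate. The function below is a hypothetical helper that only applies the approximate 1 to 2 USD per 1000 pages figure stated above; actual billing may differ.

```python
def estimate_cost_usd(
    pages: int, low_per_1000: float = 1.0, high_per_1000: float = 2.0
) -> tuple[float, float]:
    """Return a (low, high) cost estimate in USD for a number of pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return (pages / 1000 * low_per_1000, pages / 1000 * high_per_1000)


if __name__ == "__main__":
    low, high = estimate_cost_usd(25_000)
    print(f"25,000 pages: roughly {low:.2f} to {high:.2f} USD")
```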

Note: Documentation has been updated to reflect the current pipeline implementation. Previous documentation contained outdated workflows that are no longer part of this pipeline.

Files

Files (39.8 kB)

Name: ma-wi-lo/pubs-1.1.0.zip
Size: 39.8 kB
Checksum: md5:0730b5aa0405563daf8d5fc732075136

Additional details

Related works

Is supplement to
Software: https://github.com/ma-wi-lo/pubs/tree/1.1.0 (URL)

Software

Repository URL
https://github.com/ma-wi-lo/pubs
Programming language
Python, Jupyter Notebook
Development Status
Active