Published December 20, 2025 | Version 1.1.0 | Software | Open
Batch OCR Pipeline for Archival Materials using Mistral AI
Description
A Jupyter-based batch OCR pipeline for processing documents (PDF, DOCX, PPTX) and images (PNG, JPEG, etc.) using the Mistral OCR API. This version ships with updated documentation.
The pipeline supports various document types, including printed text, forms, tables, and handwritten content. Handwriting recognition works best on modern scripts; historical German scripts such as Kurrent remain challenging.
Features
- Structured text extraction (headings, paragraphs, footnotes) in Markdown format
- Processing of forms, invoices, and documents with mixed layouts
- Handwriting recognition for annotations and cursive text
- Automatic PDF splitting for large documents (>50 MB or >1000 pages)
- SQLite checkpoint system for resumable processing
- Zero Data Retention by default for Mistral OCR
- Always uses the latest Mistral OCR model (currently mistral-ocr-2512)
- Multilingual OCR support (scripts and languages across all continents, including German, Arabic, Chinese, and many others)
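The automatic PDF splitting listed above can be illustrated with a small planning helper. This is a minimal sketch, not the pipeline's actual code: the function names are hypothetical, 50 MB is assumed to mean binary megabytes, and pages are assumed to be roughly equal in size so that page ranges can stand in for a byte budget.

```python
from math import ceil

# Limits quoted in the feature list; the names are hypothetical.
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, assuming binary megabytes
MAX_PAGES = 1000


def needs_split(size_bytes: int, page_count: int) -> bool:
    """Return True when a PDF exceeds either limit."""
    return size_bytes > MAX_BYTES or page_count > MAX_PAGES


def plan_chunks(size_bytes: int, page_count: int) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges.

    Assumes pages are roughly equal in size, so the number of chunks is
    driven by whichever limit is exceeded by the larger factor.
    """
    if page_count <= 0:
        return []
    n_chunks = max(ceil(size_bytes / MAX_BYTES), ceil(page_count / MAX_PAGES), 1)
    n_chunks = min(n_chunks, page_count)  # never more chunks than pages
    pages_per_chunk = ceil(page_count / n_chunks)
    return [
        (start, min(start + pages_per_chunk - 1, page_count))
        for start in range(1, page_count + 1, pages_per_chunk)
    ]
```

For a 2500-page PDF of 10 MB, `plan_chunks(10 * 1024 * 1024, 2500)` yields three ranges: pages 1-834, 835-1668, and 1669-2500. Writing the chunks to disk would need a PDF library, which is left out here.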
Quick Setup
- Open the repository in a Jupyter-compatible environment (e.g., VSCode with Jupyter extension)
- Get a Mistral API key from the Mistral Console (console.mistral.ai)
- Rename .env.template to .env and add your API key
- Place your documents (PDF, DOCX, PPTX) or images (PNG, JPEG) in data/input/
- Run notebooks/ocr_pipeline.ipynb
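To make the input step and the resumable run more concrete, the sketch below collects supported files from an input folder and tracks finished files in SQLite. It is an illustration under stated assumptions: the table layout, the function names, and the non-recursive folder scan are hypothetical, and the notebook may organize its checkpoint differently.

```python
import sqlite3
from pathlib import Path

# Extensions for the formats named in the setup steps (hypothetical constant).
SUPPORTED = {".pdf", ".docx", ".pptx", ".png", ".jpg", ".jpeg"}


def find_inputs(input_dir: Path) -> list[Path]:
    """Return supported files directly inside input_dir, in a stable order."""
    return sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def open_checkpoint(db_path: str) -> sqlite3.Connection:
    """Open (or create) the checkpoint database; the schema is an assumption."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "path TEXT PRIMARY KEY, "
        "finished_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    conn.commit()
    return conn


def pending(conn: sqlite3.Connection, files: list[Path]) -> list[Path]:
    """Return the files that have no checkpoint entry yet."""
    done = {row[0] for row in conn.execute("SELECT path FROM processed")}
    return [f for f in files if str(f) not in done]


def mark_done(conn: sqlite3.Connection, file: Path) -> None:
    """Record a file as finished so a restarted run skips it."""
    conn.execute("INSERT OR REPLACE INTO processed (path) VALUES (?)", (str(file),))
    conn.commit()
```

A batch loop would call `pending(...)` once at start-up, run OCR on each returned file, and call `mark_done(...)` only after the outputs for that file have been written, so an interrupted run repeats at most one document.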
Output
- Markdown files with preserved document structure
- Plain text files for downstream processing
- JSON metadata (model version, page count, processing time)
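The three output types could be written along the following lines. This is a rough sketch only: the file naming, the JSON key names, and the very simple Markdown-to-text conversion are assumptions and not taken from the pipeline.

```python
import json
import re
from pathlib import Path


def markdown_to_text(markdown: str) -> str:
    """Crude Markdown stripping: drops heading markers, asterisks, backticks."""
    text = re.sub(r"^#{1,6}\s*", "", markdown, flags=re.MULTILINE)
    return re.sub(r"[*`]", "", text)


def write_outputs(out_dir: Path, stem: str, markdown: str,
                  model: str, page_count: int, seconds: float) -> dict[str, Path]:
    """Write <stem>.md, <stem>.txt and <stem>.json; return their paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {
        "markdown": out_dir / f"{stem}.md",
        "text": out_dir / f"{stem}.txt",
        "metadata": out_dir / f"{stem}.json",
    }
    paths["markdown"].write_text(markdown, encoding="utf-8")
    paths["text"].write_text(markdown_to_text(markdown), encoding="utf-8")
    # Hypothetical key names for the metadata fields listed above.
    metadata = {
        "model": model,
        "page_count": page_count,
        "processing_seconds": round(seconds, 3),
    }
    paths["metadata"].write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return paths
```

Keeping the metadata in a separate JSON file lets downstream tools filter by model version or page count without parsing the text itself.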
Cost
- Free Experiment accounts are available via console.mistral.ai
- Approximately $1-2 USD per 1000 pages (Mistral OCR API)
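As a quick budgeting aid, the approximate rate above translates into a cost range per batch. The helper below is a hypothetical illustration that uses only the $1-2 per 1000 pages figure; actual billing depends on current Mistral pricing.

```python
def estimate_cost_usd(pages: int,
                      low_per_1000: float = 1.0,
                      high_per_1000: float = 2.0) -> tuple[float, float]:
    """Return a (low, high) cost estimate in USD for the given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    thousands = pages / 1000
    return (thousands * low_per_1000, thousands * high_per_1000)


low, high = estimate_cost_usd(25_000)
print(f"25,000 pages: ${low:.2f} to ${high:.2f}")  # 25,000 pages: $25.00 to $50.00
```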
Note: Documentation has been updated to reflect the current pipeline implementation. Previous documentation contained outdated workflows that are no longer part of this pipeline.
Files
- ma-wi-lo/pubs-1.1.0.zip (39.8 kB), md5:0730b5aa0405563daf8d5fc732075136
Additional details
Related works
- Is supplement to: Software: https://github.com/ma-wi-lo/pubs/tree/1.1.0 (URL)
Software
- Repository URL: https://github.com/ma-wi-lo/pubs
- Programming language: Python, Jupyter Notebook
- Development Status: Active