There is a newer version of the record available.

Published October 6, 2020 | Version 1.1
Software Open

pdf2blocks

  • 1. TSCF - INRAE

Description

This Python script converts a PDF file written in French into an HTML file.

The conversion organizes the textual content of the PDF file into separate blocks. Each block is then turned into an HTML section of one of these types: H1, H2, P, FigCaption, Footer, Header.
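As an illustration of that last step, here is a minimal sketch that renders already-labelled blocks as HTML. The `TAG_FOR_LABEL` table, the `blocks_to_html` function and the `(label, text)` representation are assumptions made for this example; the record does not say how the script stores blocks or which markup it emits.

```python
from html import escape

# Hypothetical mapping from the block types named in the description
# to HTML element names; the real script may emit different markup.
TAG_FOR_LABEL = {
    "H1": "h1",
    "H2": "h2",
    "P": "p",
    "FigCaption": "figcaption",
    "Footer": "footer",
    "Header": "header",
}


def blocks_to_html(blocks):
    """Render (label, text) pairs as one HTML element per block."""
    lines = []
    for label, text in blocks:
        tag = TAG_FOR_LABEL.get(label, "p")  # unknown labels fall back to <p>
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)


# Made-up blocks, only to show the call.
example = [("H1", "Résumé"), ("P", "Blé & orge"), ("FigCaption", "Figure 1")]
print(blocks_to_html(example))
```

Escaping the text keeps characters such as `&` or `<` found in the PDF from breaking the generated markup.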


This program uses pdftohtml and pdftotext, two tools from the poppler library (https://poppler.freedesktop.org/). Table extraction is done with the camelot library (https://pypi.org/project/camelot-py/).
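The sketch below shows one way these tools can be driven from Python. It is not taken from pdf2blocks.py: the helper names are hypothetical, and the options the script actually passes to pdftotext and pdftohtml are not stated in this record. The flags used here (`-layout`, `-xml`, `-i`, `-stdout`) are standard poppler options, and `camelot.read_pdf` is camelot's documented entry point.

```python
import subprocess


def pdftotext_cmd(pdf_path, layout=True):
    """Build a pdftotext command that writes plain text to standard output."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd += [str(pdf_path), "-"]  # "-" sends the text to stdout
    return cmd


def pdftohtml_xml_cmd(pdf_path):
    """Build a pdftohtml command that emits XML on stdout, ignoring images."""
    return ["pdftohtml", "-xml", "-i", "-stdout", str(pdf_path)]


def run_tool(cmd):
    """Run a poppler command and return its output as text."""
    done = subprocess.run(cmd, capture_output=True, check=True)
    return done.stdout.decode("utf-8", errors="replace")


def extract_tables(pdf_path, pages="1"):
    """Return the tables camelot finds on the given pages as DataFrames."""
    import camelot  # imported lazily: only needed for table extraction

    return [table.df for table in camelot.read_pdf(str(pdf_path), pages=pages)]
```

With poppler installed, `run_tool(pdftotext_cmd("file.pdf"))` returns the text of the document, and `extract_tables("file.pdf")` needs the camelot-py package.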

It's run from the command line:

python pdf2blocks.py /link/to/file.pdf

The result is written to standard output.
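Because the HTML goes to standard output, a caller has to capture or redirect it to keep it. The following sketch, which assumes the script sits in the current directory under the name pdf2blocks.py, runs it on one file and saves the output; `convert_to_file` is a hypothetical helper and not part of the archive.

```python
import subprocess
import sys
from pathlib import Path


def convert_to_file(pdf_path, html_path, script="pdf2blocks.py"):
    """Run the script on pdf_path and save its standard output to html_path."""
    result = subprocess.run(
        [sys.executable, str(script), str(pdf_path)],
        stdout=subprocess.PIPE,  # capture what the script prints
        check=True,              # raise if the script exits with an error
    )
    out = Path(html_path)
    out.write_bytes(result.stdout)  # store the bytes exactly as produced
    return out
```

For example, `convert_to_file("/link/to/file.pdf", "file.html")` has the same effect as redirecting the command's output to `file.html` in a shell.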

The algorithm is described, in French, in the README.md file of the archive.

Notes

A Git repository for this code is available at https://gitlab.irstea.fr/copain/pdf2blocs

Files

pdf2blocs-master-src-2020-10-06.zip

Files (1.1 MB)

md5:469c33ae69f680b4af8b9c193bbf091c (170.1 kB)
md5:e61dfd36765ffadc114bb02b2d217f9d (931.8 kB)