pdf2blocks
Description
This Python script converts a PDF file written in French into an HTML file.
The conversion organizes the textual content of the PDF file into separate blocks. Each block is then rendered as an HTML section: H1, H2, P, FigCaption, Footer, or Header.
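As a rough illustration of this block-to-section mapping, the sketch below renders a list of typed text blocks as HTML. The `Block` class, the `TAGS` table, and `render_blocks` are hypothetical names chosen for this example; they are not taken from pdf2blocks.py, and the markup the script actually emits may differ.

```python
from dataclasses import dataclass
from html import escape

# Assumed mapping from block kind to HTML element (illustrative only).
TAGS = {
    "h1": "h1",
    "h2": "h2",
    "p": "p",
    "figcaption": "figcaption",
    "header": "header",
    "footer": "footer",
}


@dataclass
class Block:
    kind: str  # one of the keys in TAGS
    text: str  # raw text extracted from the PDF


def render_blocks(blocks):
    """Return one HTML element per block, one element per line."""
    lines = []
    for block in blocks:
        # Unknown kinds fall back to a plain paragraph.
        tag = TAGS.get(block.kind, "p")
        # Escape the text so characters like "<" do not break the markup.
        lines.append(f"<{tag}>{escape(block.text)}</{tag}>")
    return "\n".join(lines)
```

For example, `render_blocks([Block("h1", "Titre"), Block("p", "Texte")])` yields an `<h1>` line followed by a `<p>` line.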
This program uses pdftohtml and pdftotext, two tools from the Poppler library (https://poppler.freedesktop.org/). Table extraction is done with the camelot library (https://pypi.org/project/camelot-py/).
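The sketch below shows one way these tools could be driven from Python. The helper names (`pdftotext_cmd`, `pdftohtml_cmd`, `run_tool`, `extract_tables`) are hypothetical, and the command-line options are an assumption: the description does not say which options pdf2blocks.py actually passes to Poppler or to camelot.

```python
import subprocess


def pdftotext_cmd(pdf_path):
    # "-layout" keeps the physical layout; "-" sends the text to stdout.
    return ["pdftotext", "-layout", pdf_path, "-"]


def pdftohtml_cmd(pdf_path):
    # "-xml" produces XML output, "-i" ignores images,
    # "-stdout" writes the result to standard output.
    return ["pdftohtml", "-xml", "-i", "-stdout", pdf_path]


def run_tool(cmd):
    """Run a Poppler command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def extract_tables(pdf_path, pages="1"):
    """Return the tables camelot finds on the given pages as DataFrames."""
    import camelot  # imported lazily so the Poppler helpers work without it

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [table.df for table in tables]
```

With Poppler installed, `run_tool(pdftotext_cmd("file.pdf"))` would return the extracted text; `extract_tables` additionally requires the camelot-py package.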
It's run from the command line:
python pdf2blocks.py /link/to/file.pdf
The result is written to standard output.
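Because the HTML goes to standard output, it has to be redirected to keep it. A minimal sketch of doing that from Python follows; `convert_to_file` and the output file name are hypothetical and not part of pdf2blocks.

```python
import subprocess
import sys


def convert_to_file(script, pdf_path, out_path, python=sys.executable):
    """Run `python script pdf_path` and save its standard output to out_path."""
    with open(out_path, "wb") as out:
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run([python, script, pdf_path], stdout=out, check=True)
    return out_path
```

Calling `convert_to_file("pdf2blocks.py", "/link/to/file.pdf", "file.html")` is equivalent to the shell redirection `python pdf2blocks.py /link/to/file.pdf > file.html`.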
The algorithm is described, in French, in the README.md file of the archive.
Files
pdf2blocs-master-src-2020-10-06.zip
Total size: 1.1 MB

| Checksum | Size |
|---|---|
| md5:469c33ae69f680b4af8b9c193bbf091c | 170.1 kB |
| md5:e61dfd36765ffadc114bb02b2d217f9d | 931.8 kB |