Converting PDFs to Text with R

Schweinberger, Martin

doi:10.5281/zenodo.19332925

Published March 28, 2026 | Version 2026.03.28

Other Open

Schweinberger, Martin¹

1. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia

This how-to tutorial covers the extraction of text from PDF files in R, including digital PDF text extraction, optical character recognition (OCR) for scanned documents, and the batch processing of PDF collections for use in downstream text analysis. It is aimed at researchers in corpus linguistics and digital humanities who need to convert PDF documents into plain text for computational analysis. This tutorial is part of the Language Technology and Data Analysis Laboratory (LADAL), a free, open-access research infrastructure at the University of Queensland. LADAL provides tutorials, tools, and courses for researchers working with language data. All materials are freely available at https://ladal.edu.au and are part of the Language Data Commons of Australia (LDaCA), funded by ARDC and NCRIS.

Files

Files (338.7 kB)

Name	Size
pdf2txt.html (md5:d11543ad9f017f7e6d26e2a0e4d3d2be)	338.7 kB

Additional details

Is new version of: Other: https://slcladal.github.io/pdf2txt.html (URL)
Is part of: Other: https://ladal.edu.au (URL); Other: https://www.ldaca.edu.au (URL)
Is supplement to: Other: https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html (URL); Software: https://github.com/SLCLADAL/ladal (URL)
