Published March 28, 2026 | Version 2026.03.28
Resource type: Other | Access: Open

Converting PDFs to Text with R

Authors/Creators

  • 1. The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia

Description

This how-to tutorial covers extracting text from PDF files in R: text extraction from digital PDFs, optical character recognition (OCR) for scanned documents, and batch processing of PDF collections for downstream text analysis. It is aimed at researchers in corpus linguistics and digital humanities who need to convert PDF documents into plain text for computational analysis. The tutorial is part of the Language Technology and Data Analysis Laboratory (LADAL), a free, open-access research infrastructure at the University of Queensland that provides tutorials, tools, and courses for researchers working with language data. All materials are freely available at https://ladal.edu.au and are part of the Language Data Commons of Australia (LDaCA), funded by ARDC and NCRIS.

Files (338.7 kB)

md5:d11543ad9f017f7e6d26e2a0e4d3d2be
