Published March 28, 2026 | Version 2026.03.28
Other
Open
Converting PDFs to Text with R
Authors/Creators
- The Language Technology and Data Analysis Laboratory (LADAL), The University of Queensland, Australia
Description
This how-to tutorial covers extracting text from PDF files in R: text extraction from digital PDFs, optical character recognition (OCR) for scanned documents, and batch processing of PDF collections for downstream text analysis. It is aimed at researchers in corpus linguistics and digital humanities who need to convert PDF documents into plain text for computational analysis.
This tutorial is part of the Language Technology and Data Analysis Laboratory (LADAL), a free, open-access research infrastructure at the University of Queensland. LADAL provides tutorials, tools, and courses for researchers working with language data. All materials are freely available at https://ladal.edu.au and are part of the Language Data Commons of Australia (LDaCA), funded by ARDC and NCRIS.
Files
| Name | Size |
|---|---|
| md5:d11543ad9f017f7e6d26e2a0e4d3d2be | 338.7 kB |
Additional details
Related works
- Is new version of
  - Other: https://slcladal.github.io/pdf2txt.html (URL)
- Is part of
  - Other: https://ladal.edu.au (URL)
  - Other: https://www.ldaca.edu.au (URL)
- Is supplement to
  - Other: https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html (URL)
  - Software: https://github.com/SLCLADAL/ladal (URL)