A dataset for temporal analysis of files related to the JFK case
Description
This dataset contains the text content of those files in the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release) that carry a correct publication date. The content was extracted from the source PDF files using the R packages tesseract (for OCR) and pdftools (for PDF handling).
The dataset was derived with the following R script:
### BEGIN R DATA PROCESSING SCRIPT
library(tesseract)
library(pdftools)
pdfs <- list.files("[path to your output directory containing all PDF files]")
# Metadata file with one row per PDF file (e.g. publication date)
meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv", header = TRUE, sep = ",")
meta
meta.clean <- meta[-which(meta
for(i in 1:nrow(meta.clean)){
meta.clean
if(nchar(meta.clean$Doc.Date[i])<10){
meta.clean
}
}
meta.clean
meta.clean <- meta.clean[order(meta.clean$Doc.Date),]
docs <- data.frame(content=character(0),dpub=character(0),stringsAsFactors = F)
for(i in 1:nrow(meta.clean)){
#for(i in 1:3){
pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/",tolower(meta.clean
tmp_files <- c(tmp_files,paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/",k))
}
img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/",tolower(meta.clean
}
write.table(docs,"[path to your output directory]/documents.csv", row.names = F)
### END R DATA PROCESSING SCRIPT
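Because the script calls write.table with its default settings, documents.csv is space-separated with quoted strings, despite its .csv extension. The following is a minimal sketch of reading such a file back and counting documents per year. The sample rows, the temporary file path, and the ISO date format of dpub are assumptions made for illustration; the real file may store dates differently.

```r
# Hypothetical stand-in for the real documents.csv: two rows with the same
# columns the script creates (content, dpub). All values are invented.
docs <- data.frame(
  content = c("first sample text", "second sample text"),
  dpub = c("1963-11-22", "1964-09-24"),  # assumed ISO format, not confirmed
  stringsAsFactors = FALSE
)

# Same call as in the script: the default separator is a single space,
# character values are quoted, and a header row is written.
path <- file.path(tempdir(), "documents.csv")
write.table(docs, path, row.names = FALSE)

# read.table splits on whitespace and honours the quotes, so it reads the
# file back; read.csv would look for commas and fail to split the columns.
loaded <- read.table(path, header = TRUE, stringsAsFactors = FALSE)

# Simple temporal summary: number of documents per publication year.
years <- format(as.Date(loaded$dpub), "%Y")
per_year <- table(years)
print(per_year)
```

To use this on the published file, replace path with the location of the downloaded documents.csv and skip the write.table step. If dpub is not in ISO format, pass the matching format string to as.Date.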
Files
documents.csv (34.0 MB)
md5:56eca210fa5aa109b1e10d0859ac9ca8