A dataset for temporal analysis of files related to the JFK case

Luczak-Roesch, Markus

doi:10.5281/zenodo.1042154

Published November 5, 2017 | Version 1.0

Dataset Open

Luczak-Roesch, Markus¹

1. Victoria University of Wellington

This dataset contains the text content of those files from the 2017 release of files related to the JFK case (retrieved from https://www.archives.gov/research/jfk/2017-release) that have a valid publication date. The content was extracted from the source PDF files using the R packages pdftools (PDF handling) and tesseract (OCR).

The code to derive the dataset is given as follows:

### BEGIN R DATA PROCESSING SCRIPT

library(tesseract)
library(pdftools)

pdfs <- list.files("[path to your output directory containing all PDF files]")

meta <- read.csv2("[path to your input directory]/jfkrelease-2017-dce65d0ec70a54d5744de17d280f3ad2.csv",header = T,sep = ',') #the meta file containing all metadata for the PDF files (e.g. publication date)

meta$Doc.Date <- as.character(meta$Doc.Date)

meta.clean <- meta[-which(meta$Doc.Date == "" | grepl("/0000", meta$Doc.Date)),]
for(i in 1:nrow(meta.clean)){
meta.clean$Doc.Date[i] <- gsub("00", "01", meta.clean$Doc.Date[i])

if(nchar(meta.clean$Doc.Date[i])<10){
meta.clean$Doc.Date[i] <- format(strptime(meta.clean$Doc.Date[i],format = "%d/%m/%y"),"%m/%d/%Y")
}

}

meta.clean$Doc.Date <- strptime(meta.clean$Doc.Date,format = "%m/%d/%Y")

meta.clean <- meta.clean[order(meta.clean$Doc.Date),]

docs <- data.frame(content=character(0),dpub=character(0),stringsAsFactors = F)
for(i in 1:nrow(meta.clean)){
#for(i in 1:3){
pdf_prop <- pdftools::pdf_info(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i])))
tmp_files <- c()
for(k in 1:pdf_prop$pages){
tmp_files <- c(tmp_files,paste0("/home/STAFF/luczakma/RProjects/JFK/data/tmp/",k))
}

img_file <- pdftools::pdf_convert(paste0("[path to your output directory]/",tolower(meta.clean$File.Name[i])) [...]
[...]$Doc.Date[i],"%Y/%m/%d"),stringsAsFactors = F),stringsAsFactors = F)
}

write.table(docs,"[path to your output directory]/documents.csv", row.names = F)

### END R DATA PROCESSING SCRIPT
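
The date clean-up performed by the script can be illustrated in isolation. The following is a minimal sketch, not the author's code: `normalise_doc_date` is a hypothetical helper, the input values are made up, and it assumes dates in the `mm/dd/yyyy` form with `00` standing for an unknown month or day. Unlike the script, it anchors the `00` replacement to the month and day positions so that a year such as 2000 would not be altered, and it does not handle the short two-digit-year dates that the script converts separately.

```r
# Hypothetical helper illustrating the date clean-up; not part of the original script.
normalise_doc_date <- function(x) {
  x <- as.character(x)
  # Drop empty dates and dates with an unknown year ("/0000")
  keep <- !(is.na(x) | x == "" | grepl("/0000", x, fixed = TRUE))
  x <- x[keep]
  # Replace an unknown month or day ("00") with "01", leaving the year untouched
  x <- sub("^00/", "01/", x)
  x <- sub("/00/", "/01/", x, fixed = TRUE)
  as.Date(x, format = "%m/%d/%Y")
}

# Made-up example values
dates <- normalise_doc_date(c("11/22/1963", "00/00/1964", "", "05/00/0000", "03/00/1975"))
print(sort(dates))
```

Sorting the resulting `Date` vector corresponds to the ordering step the script applies to `meta.clean` before processing the PDF files.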

Files

Name	Size
documents.csv md5:56eca210fa5aa109b1e10d0859ac9ca8	34.0 MB
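
The processing script writes documents.csv with `write.table` and sets only `row.names`, so R's defaults apply: fields are separated by a single space and strings are double-quoted, which means the file is not comma-separated despite its extension. The sketch below shows one way to read such a file back and count documents per year. It assumes the published file was written exactly as in the script; the three rows are made-up stand-ins for the real OCR content, and a temporary file is used in place of the actual download.

```r
# Made-up stand-in for the `docs` data frame built by the script
docs <- data.frame(
  content = c("memo text one", "memo text two", "report text"),
  dpub = c("1963/11/22", "1963/12/01", "1964/03/15"),
  stringsAsFactors = FALSE
)

# Same call shape as in the script: space-separated, quoted strings
path <- tempfile(fileext = ".csv")
write.table(docs, path, row.names = FALSE)

# read.table's default separator (whitespace) and quoting match write.table's defaults
loaded <- read.table(path, header = TRUE, stringsAsFactors = FALSE)
loaded$dpub <- as.Date(loaded$dpub, format = "%Y/%m/%d")

# Number of documents per publication year
per_year <- table(format(loaded$dpub, "%Y"))
print(per_year)
```

For the real dataset, `path` would point to the downloaded documents.csv instead of a temporary file.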

	All versions	This version
Views	6,359	292
Downloads	6,458	222
Data volume	304.7 GB	7.7 GB
