EXTRACTING TEXT FROM THE PDF FILE (PNA 3/II)

1. Open the pdf file in Safari (the Firefox and Chrome browers were tried, but they did not maintain text formatting)
2. Copy all text to Word (text formatting was maintained but typeface not)
	- Remove page headers and footers manually
	- With find-and-replace, add '*' to the beginning of lines starting with bold text (i.e. the names of persons)
	- With find-and-replace, add '<<' to the beginning of lines with font size 9 (name of the person who wrote the PNA entry)
	- Remove line breaks manually when a line ends in full stop but the sentence does not end
3. Save the file as rtf
4. Open the file with TextEdit
	- Remove linebreaks (automatically) when a line does not end with full stop
	- Move around, by hand, text snippets that ended up in a wrong place when two columns in the pdf file were transformed to one column
	- make sure that the "name line" is a single line of text and the description of each individual is a single line of text

This process took a lot of manual work, and small corrections to the text file have been made many times!

Cannot be reproduced!

Before using the file to create our final network, we had to handle with some issues that were caused by the conversion from the pdf file to a text file.
First, the š and Š signs consisted of two different elements in the converted text file. We replaced them with the standard Unicode signs for š and Š.
Second, the macrons used to indicate long vowels in Akkadian (ā, ē, ī, ū) could not be retained in the conversion from the pdf file to a text file. For consistency, we also removed circumflexes from personal names (â, ê, î, û -> a, e, i, u). Accordingly, all the vowels in the personal names from PNA 3/II are short in our networks.
We used the following command:
sed 's/š/š/g' names.csv | sed 's/Š/Š/g' | sed 's/î/i/g' | sed 's/â/a/g' | sed 's/ê/e/g' | sed 's/û/u/g'


