Liesbeth Keijser
2020-01-21
<p>The National Archives of the Netherlands and Noord-Hollands Archief conducted a project using the Transkribus HTR (Handwritten Text Recognition) platform. The aim was to semi-automatically transcribe 2 million pages of old Dutch texts.</p>
<p>The transcribed archives are 17<sup>th</sup> and 18<sup>th</sup> century documents from the Dutch East India Company (VOC) and 19<sup>th</sup> century notarial deeds from Noord-Hollands Archief and other archives in the provinces.</p>
<p>To train the HTR software, a team produced transcriptions of approximately 6000 scans, randomly selected from the dataset. These transcriptions were used to train a model that recognizes more than 90% of the characters correctly. Transkribus then transcribed the 2 million scans automatically using the trained model.</p>
<p>The Transkribus HTR+ model trained for the text recognition is named "IJsberg". More information about the model can be found <a href="https://transkribus.eu/wiki/images/d/d6/Public_Models_in_Transkribus.pdf">here</a>, in the chapter "Dutch Handwriting". The Transkribus team later retrained the model with <a href="https://readcoop.eu/try-out-transkribus-new-recognition-software-pylaia/">PyLaia</a> technology, which improved on the HTR+ model. This PyLaia model is not publicly available.</p>
<p>Later on, 1 million extra scans concerning the West India Company (WIC) were transcribed automatically without adding extra ground truth or training. These archives are from the 17<sup>th</sup> and 18<sup>th</sup> century.</p>
<p>The datasets published in Zenodo contain the ground truth (scans in JPG, transcription in PAGE XML) and the HTR results (in PAGE XML and TXT). See the overview below. Scroll to the bottom of the page to download the actual files.</p>
<p><strong>Disclaimer</strong>: because of the languages used in the archive "1.05.21, Dutch series Guyana" (English, French and German) and the poor physical state of the archive itself (water damage, discolouration and ink damage), the HTR results for this archive are often of poor quality and not usable for research.</p>
<p>--------------------------------------------------------------</p>
<p><strong>Dataset HTR</strong><br>
(Dataset, name archive, number archive, inventory numbers, link to inventory)</p>
<p>HTR results 1.05.01.01, Oude WIC, 1.05.01.01, 1-87, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/1.05.01.01/invnr/%40A..?query=1.05.01.01&search-type=inventory">EAD</a><br>
HTR results 1.05.01.02, Tweede WIC, 1.05.01.02, 1-1382, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/1.05.01.02/invnr/%40VII~1324C2?query=1.05.01.01&search-type=inventory">EAD</a> <br>
HTR results 1.05.02, Raad der Koloniën, 1.05.02, 1-192, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/1.05.02/invnr/%40A?query=1.05.02&search-type=inventory">EAD</a><br>
HTR results 1.05.03, Sociëteit van Suriname, 1.05.03, 1-566, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/1.05.03/invnr/%40A?query=1.05.03&search-type=inventory">EAD</a><br>
HTR results 1.05.05, Sociëteit van Berbice, 1.05.05, 1-445, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/1.05.05/invnr/%40I?query=1.05.05&search-type=inventory">EAD</a><br>
HTR results 1.05.06, Verspreide West-Indische stukken, 1.05.06, 1-1413, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/1.05.06/invnr/%401?query=1.05.06&search-type=inventory">EAD</a><br>
HTR results 1.05.21, Dutch series Guyana, 1.05.21, AB.1.1-BB.7.1, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/1.05.21/invnr/%401.?query=1.05.21&search-type=inventory">EAD</a><br>
HTR results 2.01.28.01, West-Indisch comité, 2.01.28.01, 1-254, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/2.01.28.01/invnr/%40I?query=2.01.28.01&search-type=inventory">EAD</a><br>
HTR results 2.01.28.02, Raad der Amerikaanse Bezittingen, 2.01.28.02, 1-264, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/2.01.28.02/invnr/%40I.?query=2.01.28.02&search-type=inventory">EAD</a><br>
</p>
<p><strong>Dataset Ground Truth</strong><br>
(Name archive, number archive, inventory numbers, link to inventory, type of dataset)</p>
<p>Dataset: Notarial deeds Ground Truths of the trainingset</p>
<ul>
<li>Oud notarieel archief Haarlem, 1617, 495 random scans from 5-813, <a href="https://noord-hollandsarchief.nl/bronnen/archieven?mivast=236&mizig=210&miadt=236&micode=1617&milang=nl&miview=inv2">EAD</a>, GT Transcriptions</li>
<li>Nieuw notarieel archief Haarlem, 1972, 952 random scans from 1593-1805, <a href="https://noord-hollandsarchief.nl/bronnen/archieven?mivast=236&mizig=210&miadt=236&micode=1972&milang=nl&miview=inv2">EAD</a>, GT Transcriptions</li>
<li>(And 168 transcriptions from 7 other archives.)<br>
</li>
</ul>
<p>Dataset: Notarial deeds Images of the trainingset,</p>
<ul>
<li>Nieuw notarieel archief Haarlem, 1972, 952 random scans from 1593-1805, <a href="https://noord-hollandsarchief.nl/bronnen/archieven?mivast=236&mizig=210&miadt=236&micode=1972&milang=nl&miview=inv2">EAD</a>, GT Scans</li>
<li>Oud notarieel archief Haarlem, 1617, 495 random scans from 5-813, <a href="https://noord-hollandsarchief.nl/bronnen/archieven?mivast=236&mizig=210&miadt=236&micode=1617&milang=nl&miview=inv2">EAD</a>, GT Scans</li>
<li>(And 168 scans from 7 other archives.)</li>
</ul>
<p><br>
Dataset: VOC Ground Truths of the trainingset,<br>
VOC, 1.04.02, 4735 random scans from 7527-9540, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/1.04.02/invnr/%40Deel%20I?query=1.04.02&search-type=inventory">EAD</a>, GT Transcriptions</p>
<p><br>
Dataset: VOC Images of the trainingset,<br>
VOC, 1.04.02, 4735 random scans from 7527-9540, <a href="https://www.nationaalarchief.nl/onderzoeken/archief/1.04.02/invnr/%40Deel%20I?query=1.04.02&search-type=inventory">EAD</a>, GT Scans</p>
<p>--------------------------------------------------------------</p>
<p>Version 3.0: The first HTR results from the VOC-collection are available in .txt format, Inventory numbers 7527-9540.</p>
<p>Version 3.1: The HTR results from the VOC-collection are also available in PAGE XML format.</p>
<p>Version 4.0: About 30 missing inventory numbers have been added to the VOC transcriptions. The HTR results of the Notarial Deeds from the NHA archives have been added. An example of research based on full-text search can be found here (in Dutch): <a href="https://kia.pleio.nl/groups/view/55812425/htr-en-ocr/blog/view/55814752/reconstructie-van-een-verijdelde-slavenopstand-met-behulp-van-automatische-handschriftherkenning-en-text-mining">https://kia.pleio.nl/groups/view/55812425/htr-en-ocr/blog/view/55814752/reconstructie-van-een-verijdelde-slavenopstand-met-behulp-van-automatische-handschriftherkenning-en-text-mining</a></p>
<p>Version 5.0: Around a million pages of HTR results from the archives listed under "Dataset HTR" above have been added.</p>
https://doi.org/10.5281/zenodo.4159268
oai:zenodo.org:4159268
odt
Zenodo
https://doi.org/10.5281/zenodo.3517776
info:eu-repo/semantics/openAccess
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
Transcriptions
Verenigde Oost-Indische Compagnie
West-Indische Compagnie
Notarial deeds
Nationaal Archief
Noord-Hollands Archief
Transkribus
6000 ground truth scans of VOC and notarial deeds; 3,000,000 HTR scans of VOC, WIC and notarial deeds
info:eu-repo/semantics/other