There is a newer version of the record available.

Published January 21, 2020 | Version 5.0
Dataset Open

6000 ground truth of VOC and notarial deeds 3.000.000 HTR of VOC, WIC and notarial deeds

  • 1. National Archive Netherlands

Description

The National Archives of the Netherlands and Noord-Hollands Archief conducted a project using the Transkribus HTR (Handwritten Text Recognition) platform. The aim was to transcribe 2 million pages of old Dutch texts semi-automatically.

The transcribed archives are 17th- and 18th-century documents from the Dutch East India Company (VOC) and 19th-century notarial deeds from Noord-Hollands Archief and other archives in the provinces.

To train the HTR software, a team produced transcriptions of approximately 6000 scans, randomly selected from the dataset. The model trained on these transcriptions recognizes more than 90% of the characters correctly. Transkribus then transcribed the 2 million scans automatically using the trained model.
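The record does not state how the "more than 90% of the characters" figure was measured. A common way to express such a figure is character accuracy, defined as 1 minus the character error rate (edit distance divided by the length of the ground truth). The sketch below illustrates that calculation under this assumption; the function names are hypothetical and it is not the evaluation procedure used by Transkribus.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, htr_output: str) -> float:
    """Return 1 - CER, where CER = edit distance / length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    cer = levenshtein(ground_truth, htr_output) / len(ground_truth)
    return 1.0 - cer
```

Comparing a ground truth transcription of a line with the HTR output for the same line gives a per-line score. Note that the value can drop below zero when the HTR output is much longer than the ground truth.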

The following Transkribus HTR+ model has been trained for the text recognition: "IJsberg". More information about the model can be found here. See the chapter "Dutch Handwriting". However, the Transkribus team retrained the model with PyLaia technology, which improved the HTR+ model. This PyLaia model is not publicly available.

Later on, 1 million extra scans concerning the West India Company (WIC) were transcribed automatically without adding extra ground truth or training. These archives are from the 17th and 18th century.

The datasets published in Zenodo contain the ground truth (scans in JPG, transcription in PAGE XML) and the HTR results (in PAGE XML and TXT). See the overview below. Scroll to the bottom of the page to download the actual files.
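For readers who want the plain text from the PAGE XML files, the following is a minimal sketch of reading the line-level transcriptions with the Python standard library. The embedded sample document is made up for illustration and only follows the general PAGE layout (TextRegion, TextLine, TextEquiv, Unicode); the namespace version and element details of the actual files in this record are an assumption, which is why the code matches elements by local name rather than by a fixed namespace.

```python
import xml.etree.ElementTree as ET
from typing import List

# Illustrative sample only; not taken from the dataset.
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
      <TextEquiv><Unicode>first line second line</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(page_xml: str) -> List[str]:
    """Return the text of every TextLine, in document order."""
    root = ET.fromstring(page_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only look at TextEquiv elements that are direct children of the line,
        # so region-level and word-level text is not counted twice.
        for child in element:
            if local_name(child.tag) != "TextEquiv":
                continue
            for grandchild in child:
                if local_name(grandchild.tag) == "Unicode":
                    lines.append(grandchild.text or "")
    return lines


print(extract_lines(SAMPLE_PAGE_XML))
```

For a file on disk, the same function can be applied to the file's contents read as text.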

Disclaimer: due to the languages (English, French and German) used in the archive "1.05.21, Dutch series Guyana" and the poor condition of the archive itself (water damage, discolouration and ink damage), the HTR results are often of poor quality and not usable for research.

--------------------------------------------------------------

Dataset HTR
(Dataset, name archive, number archive, inventory numbers, link to inventory)

HTR results 1.05.01.01, Oude WIC, 1.05.01.01, 1-87, EAD
HTR results 1.05.01.02, Tweede WIC, 1.05.01.02, 1-1382, EAD
HTR results 1.05.02, Raad der Koloniën, 1.05.02, 1-192, EAD
HTR results 1.05.03, Sociëteit van Suriname, 1.05.03, 1-566, EAD
HTR results 1.05.05, Sociëteit van Berbice, 1.05.05, 1-445, EAD
HTR results 1.05.06, Verspreide West-Indische stukken, 1.05.06, 1-1413, EAD
HTR results 1.05.21, Dutch series Guyana, 1.05.21, AB.1.1-BB.7.1, EAD
HTR results 2.01.28.01, West-Indisch comité, 2.01.28.01, 1-254, EAD
HTR results 2.01.28.02, Raad der Amerikaanse Bezittingen, 2.01.28.02, 1-264, EAD
 

Dataset Ground Truth
(Name archive, number archive, inventory numbers, link to inventory, type of dataset)

Dataset: Notarial deeds, ground truth transcriptions of the training set

  • Oud notarieel archief Haarlem, 1617, 495 random scans from 5-813, EAD, GT Transcriptions
  • Nieuw notarieel archief Haarlem, 1972, 952 random scans from 1593-1805, EAD, GT Transcriptions
  • (And 168 transcriptions from 7 other archives.)


Dataset: Notarial deeds, images of the training set

  • Nieuw notarieel archief Haarlem, 1972, 952 random scans from 1593-1805, EAD, GT Scans
  • Oud notarieel archief Haarlem, 1617, 495 random scans from 5-813, EAD, GT Scans
  • (And 168 scans from 7 other archives.)


Dataset: VOC, ground truth transcriptions of the training set
VOC, 1.04.02, 4735 random scans from 7527-9540, EAD, GT Transcriptions


Dataset: VOC, images of the training set
VOC, 1.04.02, 4735 random scans from 7527-9540, EAD, GT Scans

--------------------------------------------------------------

Version 3.0: The first HTR results from the VOC collection are available in .txt format (inventory numbers 7527-9540).

Version 3.1: The HTR results from the VOC collection are also available in PAGE XML format.

Version 4.0: About 30 missing inventory numbers have been added to the VOC transcriptions. The HTR results of the notarial deeds from the NHA archives have been added. An example of research based on full-text search can be found here (Dutch): https://kia.pleio.nl/groups/view/55812425/htr-en-ocr/blog/view/55814752/reconstructie-van-een-verijdelde-slavenopstand-met-behulp-van-automatische-handschriftherkenning-en-text-mining

Version 5.0: Around a million pages of HTR results have been added for the archives listed under "Dataset HTR" above.

Files

HTR results 1.05.01.01 PAGE.zip

Files (84.7 GB)

(Checksum, size)

md5:646d1b9dc02990542ccf02d5c2dd76a3, 887.9 MB
md5:7fd6ffdcdfdba8202295ea21e77aef9f, 38.2 MB
md5:e03ed8dd91a29fe58ebbe71d57887de2, 9.6 GB
md5:f483930bcf6aa4e7664a567dc4e48bfb, 369.2 MB
md5:aeb777b6dace9c7d2ebcb0bb30c3c12d, 955.7 MB
md5:51300d9572875b6bfdf2bd871ba1a0c3, 36.5 MB
md5:d782fb43c1eb54f6bb2a25ea39766563, 4.8 GB
md5:565ffe019bd30a75972b23188fdad8ac, 180.0 MB
md5:4cd2f7bd53301ef7e5570f3fda467b09, 2.9 GB
md5:3fd4e387e03a5c51159af0b8a1d1b982, 109.5 MB
md5:512736ba7965b34782b832bd5ebad48f, 432.0 MB
md5:afbbf32b9e7e0118d24b5cae4260087e, 18.4 MB
md5:478aaf4738570c01e3b81103c083d8c4, 2.7 GB
md5:d201d030163070c75cbf3d994eb8edff, 111.5 MB
md5:634475ad53ea76350e17699f5e7a8c11, 1.4 GB
md5:f12d5c2f22743563da543262d6441d56, 53.6 MB
md5:84a01edcc9f2d87a9ac63bcf8673bc7e, 1.5 GB
md5:0ca623f664b2955772d2ea11109a9ba3, 60.5 MB
md5:46cef4b6dd3002fdd1a060edc2d013bc, 3.3 GB
md5:d8ee4b47dd640d6cf9b264b800c914ca, 133.4 MB
md5:98abdaafec31000590390d66d457be2f, 10.2 GB
md5:80716cb025b328e0309d6e2b00f1d289, 412.9 MB
md5:63765847db358429cdf8c12305085eb4, 17.9 GB
md5:23431e2779e9cadce798adba47ff57ea, 710.8 MB
md5:8b8d2fa465c8d1dad71d0fd5817f93da, 54.5 MB
md5:eaf0c2bf0033cb129d5c747a2be039ea, 7.2 GB
md5:ca68bf64bd40af7593090d1425766e94, 144.2 MB
md5:84dcb6e819ad897e81ac26150c428e3e, 18.5 GB
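
Because several of the files are multiple gigabytes in size, it can be worth checking a download against the md5 value listed for it. The following is a minimal sketch using Python's standard hashlib module; the function names and the local file name in the usage comment are hypothetical, and which checksum belongs to which file has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed: str) -> bool:
    """Compare a file against a listed checksum such as 'md5:646d1b...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Hypothetical usage, assuming the archive was saved locally as "download.zip":
# matches_checksum("download.zip", "md5:646d1b9dc02990542ccf02d5c2dd76a3")
```

A result of False usually points to an incomplete or corrupted download, in which case the file should be fetched again.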