Dataset Open Access

6000 ground truth of VOC and notarial deeds 3.000.000 HTR of VOC, WIC and notarial deeds

Liesbeth Keijser

The National Archives of the Netherlands and Noord-Hollands Archief conducted a project using the Transkribus HTR (Handwritten Text Recognition) platform. The aim was to transcribe 2 million pages of old Dutch texts semi-automatically.

The transcribed archives are 17th- and 18th-century documents from the Dutch East India Company (VOC) and 19th-century notarial deeds from Noord-Hollands Archief and other archives in the provinces.

To train the HTR software, a team produced transcriptions of approximately 6000 scans, selected at random from the dataset. These transcriptions were used to train a model that recognizes more than 90% of the characters correctly. Transkribus then transcribed the 2 million scans automatically using the trained model.
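
The description does not say how the figure of more than 90% was measured. A common convention, assumed here, is character accuracy = 1 - CER, where the character error rate (CER) is the edit distance between a ground-truth line and the HTR output divided by the length of the ground-truth line. The function names below are illustrative and not part of Transkribus.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, with CER = edit distance / length of the reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)
```

Under this assumption, a ground-truth line of 10 characters with one misrecognized character scores 0.9. Comparing a ground-truth transcription with the HTR result for the same scan gives a rough, per-line impression of the quality to expect.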

The following Transkribus HTR+ model was trained for the text recognition: "IJsberg". More information about the model can be found here (see the chapter "Dutch Handwriting"). However, the Transkribus team later retrained the model with PyLaia technology, which improved on the HTR+ model. This PyLaia model is not publicly available.

Later on, 1 million extra scans concerning the West India Company (WIC) were transcribed automatically without adding extra ground truth or training. These archives are from the 17th and 18th centuries.

The datasets published in Zenodo contain the ground truth (scans in JPG, transcription in PAGE XML) and the HTR results (in PAGE XML and TXT). See the overview below. Scroll to the bottom of the page to download the actual files.
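
As a minimal sketch of how the PAGE XML files could be read, the function below collects the text of each line using only the Python standard library. It assumes the usual PAGE layout in which each TextLine element carries its text in a TextEquiv/Unicode child; because the PAGE namespace differs between schema versions, tags are matched by local name only. The function names are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List, Union


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def page_xml_lines(xml_data: Union[str, bytes]) -> List[str]:
    """Return the text of every TextLine in a PAGE XML document, in document order."""
    root = ET.fromstring(xml_data)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Use only the line's own TextEquiv, not those of nested Word elements.
        for child in element:
            if local_name(child.tag) == "TextEquiv":
                for grandchild in child:
                    if local_name(grandchild.tag) == "Unicode":
                        lines.append(grandchild.text or "")
                break
    return lines
```

For a file extracted from one of the PAGE archives, the bytes of the file can be passed directly, for example `page_xml_lines(open(path, "rb").read())`, where `path` is a placeholder for the location of an extracted XML file.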

For more information on how the National Archives of the Netherlands innovates on digital accessibility, click here.

For open data access of scans and inventories of the National Archives click here.

Disclaimer: because of the variety of languages used and the poor condition of the documents, the HTR results of "1.05.21, Dutch series Guyana" can be of poor quality.

--------------------------------------------------------------

Dataset HTR
(Dataset, name archive, number archive, inventory numbers, link to inventory)

The National Archives of the Netherlands
HTR results VOC, VOC, 1.04.02, 7527-9540, EAD
HTR results 1.04.02, Oost-Indische Testamenten, 1.04.02, 6847-6897, EAD
HTR results 1.05.01.01, Oude WIC, 1.05.01.01, 1-87, EAD
HTR results 1.05.01.02, Tweede WIC, 1.05.01.02, 1-1382, EAD
HTR results 1.05.02, Raad der Koloniën, 1.05.02, 1-192, EAD
HTR results 1.05.03, Sociëteit van Suriname, 1.05.03, 1-566, EAD
HTR results 1.05.05, Sociëteit van Berbice, 1.05.05, 1-445, EAD
HTR results 1.05.06, Verspreide West-Indische stukken, 1.05.06, 1-1413, EAD
HTR results 1.05.21, Dutch series Guyana, 1.05.21, AB.1.1-BB.7.1, EAD
HTR results 2.01.28.01, West-Indisch comité, 2.01.28.01, 1-254, EAD
HTR results 2.01.28.02, Raad der Amerikaanse Bezittingen, 2.01.28.02, 1-264, EAD

Noord-Hollands archief
HTR results NHA Notarial 1617, Oud notarieel archief Haarlem, 1617, 1593-1805, EAD
HTR results NHA Notarial 1972, Nieuw notarieel archief Haarlem, 1972, 5-813, EAD

Brabants Historisch Informatie Centrum
HTR results BHIC 7048, Notarissen in Boxmeer, 1814-1935, 7048, 1-103, 162, EAD
HTR results BHIC 7128, Notarissen in Grave, 1648-1935, 7128, 140-266, EAD
HTR results BHIC 7637, Notarissen in Sint-Oedenrode, 1642-1935, 7637, 17-78A, EAD

Gelders Archief
HTR results GA 0168, Notariële Archieven 1811-1925, 168, 64-69, 943-960, 1366-1395, 2472-2501, 3481-3485, 3904-3926, EAD

Groninger Archieven
HTR results GRA 85, Notarissen te Appingedam (standplaats 1), 1811-1935, 85, 2-157, EAD
HTR results GRA 86, Notarissen te Appingedam (standplaats 2), 1812-1922, 86, 2-71, EAD

Historisch Centrum Overijssel
HTR results HCO 0122, Notarissen in Overijssel, 122, 5-48, 2044-2073, 3019-3047, 3733-3775, EAD

The Utrecht Archives
HTR results HUA 34-1, Notarissen in de provincie Utrecht, 1617-1895, 34-1, 928-930, 2209-2330, EAD

Regionaal Historisch Centrum Limburg
HTR results RHCL 09.009, Notarissen in de Arrondissementen Maastricht en Roermond, 1896-1905, 09.009, 9147-9279, EAD

Tresoar
HTR results Tresoar 26, Notarieel archief, 26, 1001-9028 (met hiaten), EAD

Zeeuws Archief
HTR results ZA 13.2, Notariële Archieven Zeeland 1906-1915, (1886) 1906-1915 (1925), 13.2, 1152-1163, 1261-1320, EAD

Drents Archief
HTR results DA 114.10, Notaris jhr.mr. J.A.G.van der Wijck te Assen, 114.10, 4-7, EAD
HTR results DA 114.11, Notaris mr. D.A.M.de Fremery te Assen, 114.11, 1, EAD
HTR results DA 114.18, Notaris mr. Warmolt van Roijen te Borger, 114.18, 2-7, EAD
HTR results DA 114.19, Notaris mr. Ernst Sigismund Cornets de Groot te Borger, 114.19, 2-8, EAD
HTR results DA 114.22, Notaris mr. Albertus Slingenberg te Coevorden, 114.22, 1-24, EAD
HTR results DA 114.23, Notaris mr. Gozewienus Weys te Coevorden, 114.23, 4-13, EAD
HTR results DA 114.28, Notaris mr. Johannes Beckeringh van Loenen te Dwingeloo, 114.28, 7-13, EAD
HTR results DA 114.39, Notaris mr. Gerrit ten Raa te Gieten, 114.39, 6-14, EAD
HTR results DA 114.45, Notaris mr. Hendrik Jan Carsten te Hoogeveen, 114.45, 17-36, EAD
HTR results DA 114.54, Notaris mr. Warmold Lunsingh Tonckens te Meppel, 114.54, 18-26, EAD

 

Dataset Ground Truth
(Name archive, number archive, inventory numbers, link to inventory, type of dataset)

Dataset: Notarial deeds Ground Truths of the trainingset

  • Oud notarieel archief Haarlem, 1617, 495 random scans from 1593-1805, EAD, GT Transcriptions
  • Nieuw notarieel archief Haarlem, 1972, 952 random scans from 5-813, EAD, GT Transcriptions
  • (And 168 transcriptions from 7 other archives.)

Dataset: Notarial deeds Images of the trainingset

  • Oud notarieel archief Haarlem, 1617, 495 random scans from 1593-1805, EAD, GT Scans
  • Nieuw notarieel archief Haarlem, 1972, 952 random scans from 5-813, EAD, GT Scans
  • (And 168 scans from 7 other archives.)

Dataset: VOC Ground Truths of the trainingset

  • VOC, 1.04.02, 4735 random scans from 7527-9540, EAD, GT Transcriptions

Dataset: VOC Images of the trainingset

  • VOC, 1.04.02, 4735 random scans from 7527-9540, EAD, GT Scans

--------------------------------------------------------------

Version 3.0: The first HTR results from the VOC collection are available in TXT format (inventory numbers 7527-9540).

Version 3.1: The HTR results from the VOC collection are also available in PAGE XML format.

Version 4.0: About 30 missing inventory numbers have been added to the VOC transcriptions. The HTR results of the notarial deeds from the NHA archives have been added. An example of research based on full-text search can be found here (in Dutch): https://kia.pleio.nl/groups/view/55812425/htr-en-ocr/blog/view/55814752/reconstructie-van-een-verijdelde-slavenopstand-met-behulp-van-automatische-handschriftherkenning-en-text-mining

Version 5.0: Around a million pages of HTR results of the following archives have been added.

Version 6.0: The HTR results of Oost-Indische Testamenten have been added. 

Version 7.0: The HTR results of the Brabants Historisch Informatie Centrum, Gelders Archief, Groninger Archieven, Historisch Centrum Overijssel, The Utrecht Archives, Regionaal Historisch Centrum Limburg, Tresoar, Zeeuws Archief and Drents Archief have been added.

Files (100.4 GB)
(Name, size, MD5 checksum)

HTR results 1.04.02 Oost-Indische Testamenten PAGE.zip, 726.8 MB, md5:4e8e59402d4d07e95219e9714e49eac7
HTR results 1.04.02 Oost-Indische Testamenten TXT.zip, 42.3 MB, md5:75f4cd6e6680ede52e80592e6502c90d
HTR results 1.05.01.01 PAGE.zip, 887.9 MB, md5:646d1b9dc02990542ccf02d5c2dd76a3
HTR results 1.05.01.01 TXT.zip, 38.2 MB, md5:7fd6ffdcdfdba8202295ea21e77aef9f
HTR results 1.05.01.02 PAGE.zip, 9.6 GB, md5:e03ed8dd91a29fe58ebbe71d57887de2
HTR results 1.05.01.02 TXT.zip, 369.2 MB, md5:f483930bcf6aa4e7664a567dc4e48bfb
HTR results 1.05.02 PAGE.zip, 955.7 MB, md5:aeb777b6dace9c7d2ebcb0bb30c3c12d
HTR results 1.05.02 TXT.zip, 36.5 MB, md5:51300d9572875b6bfdf2bd871ba1a0c3
HTR results 1.05.03 PAGE.zip, 4.8 GB, md5:d782fb43c1eb54f6bb2a25ea39766563
HTR results 1.05.03 TXT.zip, 180.0 MB, md5:565ffe019bd30a75972b23188fdad8ac
HTR results 1.05.05 PAGE.zip, 2.9 GB, md5:4cd2f7bd53301ef7e5570f3fda467b09
HTR results 1.05.05 TXT.zip, 109.5 MB, md5:3fd4e387e03a5c51159af0b8a1d1b982
HTR results 1.05.06 PAGE.zip, 432.0 MB, md5:512736ba7965b34782b832bd5ebad48f
HTR results 1.05.06 TXT.zip, 18.4 MB, md5:afbbf32b9e7e0118d24b5cae4260087e
HTR results 1.05.21 PAGE.zip, 2.7 GB, md5:478aaf4738570c01e3b81103c083d8c4
HTR results 1.05.21 TXT.zip, 111.5 MB, md5:d201d030163070c75cbf3d994eb8edff
HTR results 2.01.28.01 PAGE.zip, 1.4 GB, md5:634475ad53ea76350e17699f5e7a8c11
HTR results 2.01.28.01 TXT.zip, 53.6 MB, md5:f12d5c2f22743563da543262d6441d56
HTR results 2.01.28.02 PAGE.zip, 1.5 GB, md5:84a01edcc9f2d87a9ac63bcf8673bc7e
HTR results 2.01.28.02 TXT.zip, 60.5 MB, md5:0ca623f664b2955772d2ea11109a9ba3
HTR results BHIC 7048 PAGE.zip, 743.1 MB, md5:f9d7fb65cf1f8196382724a879701bb0
HTR results BHIC 7048 TXT.zip, 31.1 MB, md5:e3416be039531ce0a30dc29689ef583e
HTR results BHIC 7128 PAGE.zip, 539.9 MB, md5:4514e4a8728f8cbfec53aac06b59c695
HTR results BHIC 7128 TXT.zip, 22.9 MB, md5:fe780be0f0181b7821e9d0d24453c7ff
HTR results BHIC 7637 PAGE.zip, 567.5 MB, md5:a643cc0a38a6a3b75a32d850d6695761
HTR results BHIC 7637 TXT.zip, 23.5 MB, md5:3025357ba6f6fd6a3fcbccf5cbb06621
HTR results DA 0114.10 PAGE.zip, 77.7 MB, md5:5b21411a6476a013b71c38f25b5c9770
HTR results DA 0114.10 TXT.zip, 2.9 MB, md5:00cf23298139c9ca77fd4ac9f86a95f3
HTR results DA 0114.11 PAGE.zip, 20.5 MB, md5:c41e5fb2d5d09f4c0d8acff1b4abc4fc
HTR results DA 0114.11 TXT.zip, 752.4 kB, md5:67c87a4916f2cce2a59ab2e2cfe195a9
HTR results DA 0114.18 PAGE.zip, 128.3 MB, md5:6eb1aee202dad6e05d65968136114bb9
HTR results DA 0114.18 TXT.zip, 5.3 MB, md5:a290afd49bed0eb7fb4f3cedac7db1a3
HTR results DA 0114.19 PAGE.zip, 136.3 MB, md5:2eac56dfd21b301b7ed5cf8094c1d269
HTR results DA 0114.19 TXT.zip, 5.4 MB, md5:62ddee439cde0d0e020fbb1609b3af11
HTR results DA 0114.22 PAGE.zip, 392.8 MB, md5:fc2fe1e53054f61751e84bc5fc79666c
HTR results DA 0114.22 TXT.zip, 10.9 MB, md5:5b721e240e5db5304f7e864b8b6b3042
HTR results DA 0114.23 PAGE.zip, 252.9 MB, md5:0c27cc15981f5e8f53ab356070658f2b
HTR results DA 0114.23 TXT.zip, 10.0 MB, md5:e6562f086f41c9ce26489bf27b24b8f1
HTR results DA 0114.28 PAGE.zip, 136.6 MB, md5:5b5eaf61e86e811ec02e35a2d09dc6b5
HTR results DA 0114.28 TXT.zip, 5.4 MB, md5:cad4c0b87755ed215e830adf4290a105
HTR results DA 0114.39 PAGE.zip, 137.2 MB, md5:b30060a743711f6a258ab29086ad1062
HTR results DA 0114.39 TXT.zip, 5.1 MB, md5:00d4eee34abcfc3d2ee2438adda5aafc
HTR results DA 0114.45 PAGE.zip, 272.3 MB, md5:b38fbfade0ed9af41e7272d1c9be4f34
HTR results DA 0114.45 TXT.zip, 10.8 MB, md5:38e199b46dbf9327d1c0dcff2f8bba8a
HTR results DA 0114.54 PAGE.zip, 129.5 MB, md5:afd2cce342762c7a35c493292c3bfad2
HTR results DA 0114.54 TXT.zip, 4.9 MB, md5:1f94615a8c6d75cc3342e6a3cc6e50d7
HTR results GA 0168 PAGE.zip, 1.8 GB, md5:e715acff3390a74391543a9d7d0108f3
HTR results GA 0168 TXT.zip, 75.1 MB, md5:fe929d60ff49df4011e7cc856933128e
HTR results GRA 85 PAGE.zip, 1.4 GB, md5:6f796d02fd3d1694220dd4e460228ad1
HTR results GRA 85 TXT.zip, 62.6 MB, md5:15d2e4bae23d15d227507c85d9194f98
HTR results GRA 86 PAGE.zip, 513.4 MB, md5:68accb61b6909d096e3baed0e778cde1
HTR results GRA 86 TXT.zip, 23.2 MB, md5:cb957861f965a65965bf28bc27253f8a
HTR results HCO 0122 PAGE.zip, 1.1 GB, md5:b221cc0fce3ea748db479312fb0528ab
HTR results HCO 0122 TXT.zip, 44.8 MB, md5:1ab3f8321880dd28d242c00aae77a134
HTR results HUA 34-1 PAGE.zip, 1.0 GB, md5:447b77e577cb60cbf0eb9845e49b7ba2
HTR results HUA 34-1 TXT.zip, 44.1 MB, md5:0fea68060db3d04532338dcdf7424d47
HTR results NHA Notarial 1617 PAGE.zip, 3.3 GB, md5:46cef4b6dd3002fdd1a060edc2d013bc
HTR results NHA Notarial 1617 TXT.zip, 133.4 MB, md5:d8ee4b47dd640d6cf9b264b800c914ca
HTR results NHA Notarial 1972 PAGE.zip, 10.2 GB, md5:98abdaafec31000590390d66d457be2f
HTR results NHA Notarial 1972 TXT.zip, 412.9 MB, md5:80716cb025b328e0309d6e2b00f1d289
HTR results RHCL 09.009 PAGE.zip, 1.9 GB, md5:1f005a8cb03d0a1599fec75c88e496da
HTR results RHCL 09.009 TXT.zip, 87.2 MB, md5:3d539aeb447e012a251c3cb76ba8873a
HTR results Tresoar 26 PAGE.zip, 1.9 GB, md5:f570c9cfb161f5d53bfe8ac8397d6194
HTR results Tresoar 26 TXT.zip, 82.9 MB, md5:fd236c191a5808d27e5e9cb17a919889
HTR results VOC PAGE.zip, 17.9 GB, md5:63765847db358429cdf8c12305085eb4
HTR results VOC TXT.zip, 710.8 MB, md5:23431e2779e9cadce798adba47ff57ea
HTR results ZA 13.2 PAGE.zip, 1.3 GB, md5:58b243f61a4cacba25801bd816fc2e83
HTR results ZA 13.2 TXT.zip, 51.7 MB, md5:a2ea8e8e3421e9b43a6401a8c19749fa
Notarial deeds Ground Truths of the trainingset in PAGE xml.7z, 54.5 MB, md5:8b8d2fa465c8d1dad71d0fd5817f93da
Notarial deeds Images of the training set.7z, 7.2 GB, md5:eaf0c2bf0033cb129d5c747a2be039ea
VOC Ground truths of the trainingset in PAGE xml.7z, 144.2 MB, md5:ca68bf64bd40af7593090d1425766e94
VOC Images of the trainingset.7z, 18.5 GB, md5:84dcb6e819ad897e81ac26150c428e3e
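
Because several of these archives are multiple gigabytes in size, it can be worth checking a download against the md5 value listed with each file. The following is a minimal sketch using Python's standard hashlib module; the function names are illustrative, and the file is read in chunks so that large archives do not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file with a checksum written as in the file list, e.g. 'md5:4e8e...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing("HTR results VOC TXT.zip", "md5:23431e2779e9cadce798adba47ff57ea")` should return True for an intact download of that file, assuming it was saved under its listed name in the current directory.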