6000 ground truth of VOC and notarial deeds 3.000.000 HTR of VOC, WIC and notarial deeds
Description
The National Archives of the Netherlands and Noord-Hollands Archief conducted a project using the Transkribus HTR (Handwritten Text Recognition) platform. The aim was to semi-automatically transcribe 2 million pages of old Dutch texts.
The transcribed archives are 17th- and 18th-century documents from the Dutch East India Company (VOC) and 19th-century notarial deeds from Noord-Hollands Archief and other archives in the provinces.
To train the HTR software, a team produced transcriptions of approximately 6000 scans, selected at random from the dataset. These transcriptions were used to train a model that recognizes more than 90% of the characters correctly. Transkribus then transcribed the 2 million scans automatically using the trained model.
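The description does not say how the "more than 90% of the characters" figure was measured. A common way to express it is character accuracy, defined as one minus the character error rate (edit distance divided by the length of the ground truth). The following is a minimal sketch of that calculation; the function names and the sample strings are illustrative assumptions, not part of the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, recognized: str) -> float:
    """Return 1 - CER, where CER = edit distance / length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return 1.0 - levenshtein(ground_truth, recognized) / len(ground_truth)


if __name__ == "__main__":
    # Made-up strings for illustration only; not taken from the dataset.
    print(character_accuracy("abcd", "abcf"))
```

Under this definition a model clears the 90% threshold when at most one in ten ground-truth characters needs an edit.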
The Transkribus HTR+ model trained for the text recognition is called "IJsberg". More information about the model can be found here; see the chapter "Dutch Handwriting". The Transkribus team later retrained the model with PyLaia technology, which improved on the HTR+ model. This PyLaia model is not publicly available.
Later on, 1 million extra scans concerning the West India Company (WIC) were transcribed automatically without adding extra ground truth or training. These archives are from the 17th and 18th centuries.
The Loghi Handwritten Text Recognition Toolkit has since been added to the toolset of the National Archives of the Netherlands. The archive 1.05.11.14, Notarissen Suriname tot 1828 [digitaal duplicaat], has been processed with this tooling.
The datasets published in Zenodo contain the ground truth (scans in JPG, transcription in PAGE XML) and the HTR results (in PAGE XML and TXT). See the overview below. Scroll to the bottom of the page to download the actual files.
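For readers who want the plain text out of the PAGE XML files, the sketch below collects the line-level transcriptions with Python's standard library. It assumes the usual PAGE layout (`TextLine` elements with a direct `TextEquiv`/`Unicode` child) and matches on local tag names so that it does not depend on a particular PAGE schema version. The sample document and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def page_text_lines(root: ET.Element) -> list[str]:
    """Return the line-level text of a PAGE XML document, in document order."""
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Only direct TextEquiv children hold the text of the whole line;
        # word-level or region-level TextEquiv elements are skipped.
        for child in element:
            if local_name(child.tag) != "TextEquiv":
                continue
            unicode_node = next(
                (n for n in child if local_name(n.tag) == "Unicode"), None
            )
            if unicode_node is not None:
                lines.append(unicode_node.text or "")
    return lines


# Made-up sample; the namespace shown is an assumption and may differ per file.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>eerste regel</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>tweede regel</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>eerste regel
tweede regel</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    for line in page_text_lines(ET.fromstring(SAMPLE)):
        print(line)
```

For a file on disk, `ET.parse(path).getroot()` gives the root element to pass to `page_text_lines`.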
For more information on how the Dutch National Archives innovates on digital accessibility, click here.
For open data access of scans and inventories of the National Archives click here.
Disclaimer: due to the variety of languages used and the poor condition of the documents, the HTR results of "1.05.21, Dutch series Guyana" can be of poor quality.
--------------------------------------------------------------
Dataset HTR
(Dataset, name archive, number archive, inventory numbers, link to inventory)
The National Archives of the Netherlands
HTR results VOC, VOC, 1.04.02, 7527-9540, EAD
HTR results 1.04.02, Oost-Indische Testamenten, 1.04.02, 6847-6897, EAD
HTR results 1.05.01.01, Oude WIC, 1.05.01.01, 1-87, EAD
HTR results 1.05.01.02, Tweede WIC, 1.05.01.02, 1-1382, EAD
HTR results 1.05.02, Raad der Koloniën, 1.05.02, 1-192, EAD
HTR results 1.05.03, Sociëteit van Suriname, 1.05.03, 1-566, EAD
HTR results 1.05.05, Sociëteit van Berbice, 1.05.05, 1-445, EAD
HTR results 1.05.06, Verspreide West-Indische stukken, 1.05.06, 1-1413, EAD
HTR results 1.05.21, Dutch series Guyana, 1.05.21, AB.1.1-BB.7.1, EAD
HTR results 2.01.28.01, West-Indisch comité, 2.01.28.01, 1-254, EAD
HTR results 2.01.28.02, Raad der Amerikaanse Bezittingen, 2.01.28.02, 1-264, EAD
HTR results 1.05.11.14, Notarissen Suriname tot 1828 [digitaal duplicaat], EAD
HTR results 2.10.02, Koloniën, EAD (indices only)
Noord-Hollands Archief
HTR results NHA Notarial 1617, Oud notarieel archief Haarlem, 1617,1593-1805, EAD
HTR results NHA Notarial 1972, Nieuw notarieel archief Haarlem, 1972, 5-813, EAD
Brabants Historisch Informatie Centrum
HTR results BHIC 7048, Notarissen in Boxmeer, 1814-1935, 7048, 1-103, 162, EAD
HTR results BHIC 7128, Notarissen in Grave, 1648-1935, 7128, 140-266, EAD
HTR results BHIC 7637, Notarissen in Sint-Oedenrode, 1642-1935, 7637, 17-78A, EAD
Gelders Archief
HTR results GA 0168, Notariële Archieven 1811-1925, 168, 64-69, 943-960, 1366-1395, 2472-2501, 3481-3485, 3904-3926, EAD
Groninger Archieven
HTR results GRA 85, Notarissen te Appingedam (standplaats 1), 1811-1935, 85, 2-157, EAD
HTR results GRA 86, Notarissen te Appingedam (standplaats 2), 1812-1922, 86, 2-71, EAD
Historisch Centrum Overijssel
HTR results HCO 0122, Notarissen in Overijssel, 122, 5-48, 2044-2073, 3019-3047, 3733-3775, EAD
The Utrecht Archives
HTR results HUA 34-1, Notarissen in de provincie Utrecht, 1617-1895, 34-1, 928-930, 2209-2330, EAD
Regionaal Historisch Centrum Limburg
HTR results RHCL 09.009, Notarissen in de Arrondissementen Maastricht en Roermond, 1896-1905, 09.009, 9147-9279, EAD
Tresoar
HTR results Tresoar 26, Notarieel archief, 26, 1001-9028 (met hiaten), EAD
Zeeuws Archief
HTR results ZA 13.2, Notariële Archieven Zeeland 1906-1915, (1886) 1906-1915 (1925), 13.2, 1152-1163, 1261-1320, EAD
Drents Archief
HTR results DA 114.10, Notaris jhr.mr. J.A.G.van der Wijck te Assen, 114.10, 4-7, EAD
HTR results DA 114.11, Notaris mr. D.A.M.de Fremery te Assen, 114.11, 1, EAD
HTR results DA 114.18, Notaris mr. Warmolt van Roijen te Borger, 114.18, 2-7, EAD
HTR results DA 114.19, Notaris mr. Ernst Sigismund. Cornets de Groot te Borger, 114.29, 2-8, EAD
HTR results DA 114.22, Notaris mr. Albertus Slingenberg te Coevorden, 114.22, 1-24, EAD
HTR results DA 114.23, Notaris mr. Gozewienus Weys te Coevorden, 114.23, 4-13, EAD
HTR results DA 114.28, Notaris mr. Johannes Beckeringh van Loenen te Dwingeloo, 114.29, 7-13, EAD
HTR results DA 114.39, Notaris mr. Gerrit ten Raa ten Gieten, 114.39, 6-14, EAD
HTR results DA 114.45, Notaris mr. Hendrik Jan Carsten te Hoogeveen, 114.45, 17-36, EAD
HTR results DA 114.54, Notaris mr. Warmold Lunsingh Tonckens te Meppel, 114.54, 18-26, EAD
Dataset Ground Truth
(Name archive, number archive, inventory numbers, link to inventory, type of dataset)
Dataset: Notarial deeds Ground Truths of the training set
- Oud notarieel archief Haarlem, 1617, 495 random scans from 1593-1805, EAD, GT Transcriptions
- Nieuw notarieel archief Haarlem, 1972, 952 random scans from 5-813, EAD, GT Transcriptions
- (And 168 transcriptions from 7 other archives.)
Dataset: Notarial deeds Images of the training set
- Nieuw notarieel archief Haarlem, 1972, 952 random scans from 5-813, EAD, GT Scans
- Oud notarieel archief Haarlem, 1617, 495 random scans from 1593-1805, EAD, GT Scans
- (And 168 scans from 7 other archives.)
Dataset: VOC Ground Truths of the training set
- VOC, 1.04.02, 4735 random scans from 7527-9540, EAD, GT Transcriptions
Dataset: VOC Images of the training set
- VOC, 1.04.02, 4735 random scans from 7527-9540, EAD, GT Scans
--------------------------------------------------------------
Version 3.0: The first HTR results from the VOC collection are available in .txt format, inventory numbers 7527-9540.
Version 3.1: The HTR results from the VOC collection are also available in PAGE XML format.
Version 4.0: About 30 missing inventory numbers have been added to the VOC transcriptions. The HTR results of the Notarial Deeds from the NHA archives have been added. An example of research based on full-text search can be found here (Dutch): https://kia.pleio.nl/groups/view/55812425/htr-en-ocr/blog/view/55814752/reconstructie-van-een-verijdelde-slavenopstand-met-behulp-van-automatische-handschriftherkenning-en-text-mining
Version 5.0: Around a million pages of HTR results of the following archives have been added.
Version 6.0: The HTR results of Oost-Indische Testamenten have been added.
Version 7.0: The HTR results of the Brabants Historisch Informatie Centrum, Gelders Archief, Groninger Archieven, Historisch Centrum Overijssel, The Utrecht Archives, Regionaal Historisch Centrum Limburg, Tresoar, Zeeuws Archief and Drents Archief have been added.
Version 7.1: A spreadsheet "ijsberg train-val.xlsx" has been added. It contains the division of the Ground Truth of the IJsberg model into training and validation sets.
Version 8.0: HTR results of 1.05.11.14 have been added. The scans were processed with Loghi.
Version 8.1: HTR results of 2.10.02 indices have been added. The scans were processed with Loghi.
Files (106.6 GB)
MD5 checksum | Size |
---|---|
md5:4e8e59402d4d07e95219e9714e49eac7 | 726.8 MB |
md5:75f4cd6e6680ede52e80592e6502c90d | 42.3 MB |
md5:646d1b9dc02990542ccf02d5c2dd76a3 | 887.9 MB |
md5:7fd6ffdcdfdba8202295ea21e77aef9f | 38.2 MB |
md5:e03ed8dd91a29fe58ebbe71d57887de2 | 9.6 GB |
md5:f483930bcf6aa4e7664a567dc4e48bfb | 369.2 MB |
md5:aeb777b6dace9c7d2ebcb0bb30c3c12d | 955.7 MB |
md5:51300d9572875b6bfdf2bd871ba1a0c3 | 36.5 MB |
md5:d782fb43c1eb54f6bb2a25ea39766563 | 4.8 GB |
md5:565ffe019bd30a75972b23188fdad8ac | 180.0 MB |
md5:4cd2f7bd53301ef7e5570f3fda467b09 | 2.9 GB |
md5:3fd4e387e03a5c51159af0b8a1d1b982 | 109.5 MB |
md5:512736ba7965b34782b832bd5ebad48f | 432.0 MB |
md5:afbbf32b9e7e0118d24b5cae4260087e | 18.4 MB |
md5:478aaf4738570c01e3b81103c083d8c4 | 2.7 GB |
md5:d201d030163070c75cbf3d994eb8edff | 111.5 MB |
md5:634475ad53ea76350e17699f5e7a8c11 | 1.4 GB |
md5:f12d5c2f22743563da543262d6441d56 | 53.6 MB |
md5:84a01edcc9f2d87a9ac63bcf8673bc7e | 1.5 GB |
md5:0ca623f664b2955772d2ea11109a9ba3 | 60.5 MB |
md5:f9d7fb65cf1f8196382724a879701bb0 | 743.1 MB |
md5:e3416be039531ce0a30dc29689ef583e | 31.1 MB |
md5:4514e4a8728f8cbfec53aac06b59c695 | 539.9 MB |
md5:fe780be0f0181b7821e9d0d24453c7ff | 22.9 MB |
md5:a643cc0a38a6a3b75a32d850d6695761 | 567.5 MB |
md5:3025357ba6f6fd6a3fcbccf5cbb06621 | 23.5 MB |
md5:5b21411a6476a013b71c38f25b5c9770 | 77.7 MB |
md5:00cf23298139c9ca77fd4ac9f86a95f3 | 2.9 MB |
md5:c41e5fb2d5d09f4c0d8acff1b4abc4fc | 20.5 MB |
md5:67c87a4916f2cce2a59ab2e2cfe195a9 | 752.4 kB |
md5:6eb1aee202dad6e05d65968136114bb9 | 128.3 MB |
md5:a290afd49bed0eb7fb4f3cedac7db1a3 | 5.3 MB |
md5:2eac56dfd21b301b7ed5cf8094c1d269 | 136.3 MB |
md5:62ddee439cde0d0e020fbb1609b3af11 | 5.4 MB |
md5:fc2fe1e53054f61751e84bc5fc79666c | 392.8 MB |
md5:5b721e240e5db5304f7e864b8b6b3042 | 10.9 MB |
md5:0c27cc15981f5e8f53ab356070658f2b | 252.9 MB |
md5:e6562f086f41c9ce26489bf27b24b8f1 | 10.0 MB |
md5:5b5eaf61e86e811ec02e35a2d09dc6b5 | 136.6 MB |
md5:cad4c0b87755ed215e830adf4290a105 | 5.4 MB |
md5:b30060a743711f6a258ab29086ad1062 | 137.2 MB |
md5:00d4eee34abcfc3d2ee2438adda5aafc | 5.1 MB |
md5:b38fbfade0ed9af41e7272d1c9be4f34 | 272.3 MB |
md5:38e199b46dbf9327d1c0dcff2f8bba8a | 10.8 MB |
md5:afd2cce342762c7a35c493292c3bfad2 | 129.5 MB |
md5:1f94615a8c6d75cc3342e6a3cc6e50d7 | 4.9 MB |
md5:e715acff3390a74391543a9d7d0108f3 | 1.8 GB |
md5:fe929d60ff49df4011e7cc856933128e | 75.1 MB |
md5:6f796d02fd3d1694220dd4e460228ad1 | 1.4 GB |
md5:15d2e4bae23d15d227507c85d9194f98 | 62.6 MB |
md5:68accb61b6909d096e3baed0e778cde1 | 513.4 MB |
md5:cb957861f965a65965bf28bc27253f8a | 23.2 MB |
md5:b221cc0fce3ea748db479312fb0528ab | 1.1 GB |
md5:1ab3f8321880dd28d242c00aae77a134 | 44.8 MB |
md5:447b77e577cb60cbf0eb9845e49b7ba2 | 1.0 GB |
md5:0fea68060db3d04532338dcdf7424d47 | 44.1 MB |
md5:46cef4b6dd3002fdd1a060edc2d013bc | 3.3 GB |
md5:d8ee4b47dd640d6cf9b264b800c914ca | 133.4 MB |
md5:98abdaafec31000590390d66d457be2f | 10.2 GB |
md5:80716cb025b328e0309d6e2b00f1d289 | 412.9 MB |
md5:1f005a8cb03d0a1599fec75c88e496da | 1.9 GB |
md5:3d539aeb447e012a251c3cb76ba8873a | 87.2 MB |
md5:f570c9cfb161f5d53bfe8ac8397d6194 | 1.9 GB |
md5:fd236c191a5808d27e5e9cb17a919889 | 82.9 MB |
md5:63765847db358429cdf8c12305085eb4 | 17.9 GB |
md5:23431e2779e9cadce798adba47ff57ea | 710.8 MB |
md5:58b243f61a4cacba25801bd816fc2e83 | 1.3 GB |
md5:a2ea8e8e3421e9b43a6401a8c19749fa | 51.7 MB |
md5:970565e5f3d6cb97be7f6e55b61c2502 | 3.6 GB |
md5:55a3f63583ef293ee2b654e317ec3e4e | 2.6 GB |
md5:bcd7b9195230729d37097cd480ab27e7 | 96.0 kB |
md5:8b8d2fa465c8d1dad71d0fd5817f93da | 54.5 MB |
md5:eaf0c2bf0033cb129d5c747a2be039ea | 7.2 GB |
md5:ca68bf64bd40af7593090d1425766e94 | 144.2 MB |
md5:84dcb6e819ad897e81ac26150c428e3e | 18.5 GB |
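Because several of the archives are multiple gigabytes, it can be worth checking a download against the md5 value listed for it above. The sketch below computes the MD5 digest of a local file in chunks, so large ZIP files do not have to fit in memory. The function names are illustrative assumptions, and the file path is whatever name the archive was saved under.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path: str, listed: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.strip().removeprefix("md5:").lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listed_checksum("downloaded.zip", "md5:4e8e59402d4d07e95219e9714e49eac7")` returns `True` only when the file is byte-for-byte identical to the published one; `downloaded.zip` is a placeholder name.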