Dataset Open Access

Scans and transcriptions of the VOC and the Haarlem notarial deeds archives

Liesbeth Keijser

The National Archives of the Netherlands and the Noord-Hollands Archief started a collaboration with the Transkribus HTR (Handwritten Text Recognition) platform in order to semi-automatically transcribe 2 million pages of old Dutch texts. The material consists of 17th- and 18th-century records of the Dutch East India Company (VOC) and 19th-century notarial deeds from the city of Haarlem.
In order to train the HTR software, human-made transcriptions were needed.

These datasets contain the scans (.jpg images) together with the transcriptions in ALTO XML format (word level) that were made in order to train the HTR model for text recognition.
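As an illustration of what word-level ALTO XML looks like in practice, the following minimal sketch reads the words of each text line from an ALTO document. It relies only on the general ALTO convention that words are `String` elements with a `CONTENT` attribute nested inside `TextLine` elements. The sample document, the namespace version and the function names are assumptions made for this example and are not taken from the dataset; element names are matched without their namespace so that different ALTO versions are handled alike.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return an element tag without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per ALTO TextLine, joining its String/@CONTENT words."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = [
            node.get("CONTENT", "")
            for node in elem.iter()
            if local_name(node.tag) == "String"
        ]
        lines.append(" ".join(words))
    return lines


# Invented sample for illustration only; not a file from this dataset.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Compareerde"/><SP/>
      <String CONTENT="voor"/><SP/>
      <String CONTENT="mij"/>
    </TextLine>
    <TextLine><String CONTENT="Notaris"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_lines(SAMPLE):
        print(line)
```

For a real file, the same function can be applied to the text read from disk; the exact layout of the ALTO files in the archives should be checked before relying on it.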

The first set contains scans and transcriptions from the Verenigde Oost-Indische Compagnie (VOC) archive. Its inventory can be found here: http://www.gahetna.nl/archievenoverzicht/pdf/NL-HaNA_1.04.02.ead.pdf

Inventory numbers
The transcripts are samples of the following inventory numbers: 7527-9540

Country/place
Dutch East Indies (modern-day Indonesia) / Batavia (modern-day Jakarta)

Language
Dutch

Number of transcriptions
4735 (mostly split)

-------------------------------------------------------------

The second set contains scans and transcriptions from the notarial deeds of Haarlem. Its inventories can be found here:
https://noord-hollandsarchief.nl/bronnen/archieven?mivast=236&mizig=210&miadt=236&micode=1972&milang=nl&miview=inv2
https://noord-hollandsarchief.nl/bronnen/archieven?mivast=236&mizig=210&miadt=236&micode=1617&milang=nl&miview=inv2

This set also contains a small number of scans and transcriptions from notarial archives of other Dutch provinces.

Inventory numbers
The transcripts are samples of the following inventory numbers: 1617_1593 until 1617_1805 and 1972_5 until 1972_813

Country/place
The Netherlands / Haarlem

Language
Dutch and sometimes French

Number of transcriptions
1615 (mostly spread)

-------------------------------------------------------------

The following HTR model was used for recognition: "IJsberg". More information about the model can be found here: https://transkribus.eu/wiki/images/d/d6/Public_Models_in_Transkribus.pdf. See the chapter "Dutch Handwriting".

-------------------------------------------------------------

Update: upon request, PAGE XML files of the transcriptions have been added and can be downloaded separately.

Version 3.0: The first HTR results from the VOC collection are available in .txt format (inventory numbers 7527-9540).

Version 3.1: The HTR results from the VOC collection are also available in PAGE XML format.

Version 4.0: About 30 missing inventory numbers have been added to the VOC transcriptions. The HTR results of the notarial deeds from the NHA archives have been added. An example of research based on full-text search can be found here (in Dutch): https://kia.pleio.nl/groups/view/55812425/htr-en-ocr/blog/view/55814752/reconstructie-van-een-verijdelde-slavenopstand-met-behulp-van-automatische-handschriftherkenning-en-text-mining
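For readers who want to work with the PAGE XML results rather than the .txt files, the sketch below shows one possible way to pull the line-level text out of a PAGE document. It assumes the usual PAGE layout in which each `TextLine` has a direct `TextEquiv` child holding a `Unicode` element; the sample document, its namespace version, its ids and the function names are invented for this example and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return an element tag without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def direct_child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for node in elem:
        if local_name(node.tag) == name:
            return node
    return None


def page_lines(xml_text):
    """Return (line id, text) pairs taken from TextLine/TextEquiv/Unicode."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Only the line-level TextEquiv is used; word-level ones are skipped.
        equiv = direct_child(elem, "TextEquiv")
        unicode_node = direct_child(equiv, "Unicode") if equiv is not None else None
        text = unicode_node.text or "" if unicode_node is not None else ""
        lines.append((elem.get("id"), text))
    return lines


# Invented sample for illustration only; not a file from this dataset.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <TextEquiv><Unicode>Batavia den 12 Maart</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Word id="w1"><TextEquiv><Unicode>Aan</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Aan de Heeren</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    for line_id, text in page_lines(SAMPLE):
        print(line_id, text)
```

Keeping the line ids next to the text makes it possible to trace a search hit back to its position on the scan.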

Files (58.5 GB)

HTR results NHA Notarial 1617 PAGE.zip (3.3 GB)
md5:46cef4b6dd3002fdd1a060edc2d013bc

HTR results NHA Notarial 1617 TXT.zip (133.4 MB)
md5:d8ee4b47dd640d6cf9b264b800c914ca

HTR results NHA Notarial 1972 PAGE.zip (10.2 GB)
md5:98abdaafec31000590390d66d457be2f

HTR results NHA Notarial 1972 TXT.zip (412.9 MB)
md5:80716cb025b328e0309d6e2b00f1d289

HTR results VOC PAGE.zip (17.9 GB)
md5:63765847db358429cdf8c12305085eb4

HTR results VOC TXT.zip (710.8 MB)
md5:23431e2779e9cadce798adba47ff57ea

Notarial deeds Ground Truths of the trainingset in PAGE xml.7z (54.5 MB)
md5:8b8d2fa465c8d1dad71d0fd5817f93da

Notarial deeds Images of the training set.7z (7.2 GB)
md5:eaf0c2bf0033cb129d5c747a2be039ea

VOC Ground truths of the trainingset in PAGE xml.7z (144.2 MB)
md5:ca68bf64bd40af7593090d1425766e94

VOC Images of the trainingset.7z (18.5 GB)
md5:84dcb6e819ad897e81ac26150c428e3e
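Because several of these archives are multiple gigabytes in size, it can be worth checking a download against the MD5 value listed with the file. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the command-line usage are assumptions made for this example. The file is read in chunks so that large archives do not have to fit in memory.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file with a listed checksum such as 'md5:23431e27...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Hypothetical usage: python check_md5.py <downloaded file> <listed md5>
    if len(sys.argv) == 3:
        ok = matches_listing(sys.argv[1], sys.argv[2])
        print("checksum matches" if ok else "checksum does NOT match")
    else:
        print("usage: check_md5.py FILE MD5")
```

For example, a downloaded copy of "HTR results VOC TXT.zip" would be compared with md5:23431e2779e9cadce798adba47ff57ea; a mismatch usually indicates an incomplete or corrupted download.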