DARE Database
Description
The DARE Database is a collection of handwritten dates derived from historical sources in Sweden and Denmark. Additional details are available on our GitHub and on arXiv.
The dataset is provided in seven splits, one per data source. Each folder contains the respective minipics and their labels, split into test and training files. The numbers of images and tokens are:
Train images: 2,876,752
Test images: 152,414
Total number of images: 3,029,166
Total number of tokens: 9,682,027
The following table breaks these counts down by data source:
| Datasets | Sequence | Training Observations | Test Observations |
|---|---|---|---|
| Death Certificates (1) | DD-MM-YYYY | 11,627 | 1,000 |
| Death Certificates (2) | DD-MM-YYYY | 155,439 | 8,338 |
| Police Records (1) | DD-MM-YY | 1,006,199 | 53,488 |
| Police Records (2) | DD-MM-YY | 326,478 | 17,103 |
| Swedish Records Birth Dates | DD-MM-YY | 597,756 | 31,389 |
| Swedish Records Death Dates | DD-MM | 547,813 | 28,803 |
| Funeral Records | DD-MM | 231,440 | 12,293 |
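As a quick consistency check, the per-source counts in the table can be summed and compared with the totals listed above. The following is a minimal sketch; the dictionary `COUNTS` and the helper `totals` are illustrative names and not part of the dataset or its tooling.

```python
# Per-source counts copied from the table: (training, test).
COUNTS = {
    "Death Certificates (1)": (11_627, 1_000),
    "Death Certificates (2)": (155_439, 8_338),
    "Police Records (1)": (1_006_199, 53_488),
    "Police Records (2)": (326_478, 17_103),
    "Swedish Records Birth Dates": (597_756, 31_389),
    "Swedish Records Death Dates": (547_813, 28_803),
    "Funeral Records": (231_440, 12_293),
}


def totals(counts):
    """Return (training total, test total, grand total) over all sources."""
    train = sum(tr for tr, _ in counts.values())
    test = sum(te for _, te in counts.values())
    return train, test, train + test


train_total, test_total, grand_total = totals(COUNTS)
print(f"Train images: {train_total:,}")
print(f"Test images:  {test_total:,}")
print(f"Total images: {grand_total:,}")
print(f"Test share:   {test_total / grand_total:.1%}")
```

The sums agree with the stated totals of 2,876,752 training images, 152,414 test images, and 3,029,166 images overall.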
Note that the CIHVR images are excluded because we do not have permission to share them publicly.
The images consist purely of digits with one exception: the month in a date sequence is sometimes written with alphabetic characters, e.g., "February" or "Feb". The original images were acquired from Copenhagen Archives, the National Archives of Denmark, and Lund University. The minipics were created using Coherent Point Drift to extract the regions of interest from the source documents.
A large share of the Swedish cause of death records are labelled as either empty or partly empty. A partly empty label, e.g., ' 29-" ', indicates that the month cell is in fact not empty; rather, the month is the same as in the row above. Historical tabulated records commonly use a special mark to denote "same as above". Cells labelled ' ,-,-, ' for birth dates or ' ,-, ' for death dates are completely empty and could be excluded when training pure digit recognition models. For transcribing historical records, however, empty cells occur frequently and should be taken into account one way or another.
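The distinction between empty, partly empty, and regular labels could be handled with a small filter such as the sketch below. It assumes that labels are plain strings resembling the examples quoted above, that the "same as above" mark appears as a double quote character, and that a label without digits, letters, or that mark is an empty cell. The function name `classify_label` and the category names are hypothetical; the actual label files may need different handling.

```python
def classify_label(label: str) -> str:
    """Classify a label string as 'same_as_above', 'empty', or 'date'.

    Assumptions (not confirmed by the dataset documentation):
    - a double quote in the label is the "same as above" mark;
    - a label with no digits and no letters is a completely empty cell.
    """
    text = label.strip()
    if '"' in text:
        return "same_as_above"
    has_digit = any(ch.isdigit() for ch in text)
    has_letter = any(ch.isalpha() for ch in text)  # months may be written out
    if not has_digit and not has_letter:
        return "empty"
    return "date"


# Example labels modelled on the ones quoted in the description.
examples = [' 29-" ', " ,-,-, ", " ,-, ", "12-03-87", "4-Feb"]
for lab in examples:
    print(repr(lab), "->", classify_label(lab))

# Keep only labels that carry a full, explicit date.
dates_only = [lab for lab in examples if classify_label(lab) == "date"]
```

Whether to drop the empty and "same as above" cells depends on the task: a pure digit recognizer may exclude them, while a transcription pipeline would likely keep them as separate classes.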
Note: If you want to download a small sample to see how the DARE Database is structured, visit our DARE sample Zenodo page.
Files
| Name | Size | MD5 |
|---|---|---|
| DARE.zip | 39.2 GB | 976408a621d716f08cb54b9624953ef2 |
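Because the archive is large, it may be worth verifying the download against the MD5 checksum listed for DARE.zip. The sketch below uses Python's standard `hashlib` module and reads the file in chunks so that the 39.2 GB archive never has to fit in memory. The helper names `md5_of_file` and `verify` are illustrative, and the local file path is whatever location you saved the archive to.

```python
import hashlib

# Checksum as listed on the record page for DARE.zip.
EXPECTED_MD5 = "976408a621d716f08cb54b9624953ef2"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    import sys

    # Usage: python verify_dare.py /path/to/DARE.zip
    if len(sys.argv) == 2:
        print("OK" if verify(sys.argv[1]) else "MD5 mismatch")
```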