Published July 6, 2021 | Version 1.0.0
Dataset Open

Noisy OCR Dataset (NOD)

  • 1. Norwegian Defence Research Establishment

Description

This dataset contains 18,568 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).

Source images

The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.

Artificial noise application

The dataset was created as follows:
- First a binary version of each image was created, so that there were two versions (colour and binary) with no added noise.
- Then six ideal types of image noise ("blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains") were applied to both the colour version and the binary version of each image, creating 12 additional versions of each image. The R code used to generate the noise is included in the repository.
- Lastly, all available combinations of *two* noise filters were applied to the colour and binary images, for an additional 30 versions.

This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents. 
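The version counts above can be checked with a short enumeration. This is only a sketch in Python (the dataset's own noise code is in R), and it assumes that "combinations of two noise filters" means unordered pairs, which is what the stated figure of 30 implies (15 pairs for each of the two base versions). The labels and the function name are illustrative, not file names from the archive.

```python
from itertools import combinations

# Noise types and base versions as named in the description above.
NOISE_TYPES = ["blur", "weak ink", "salt and pepper",
               "watermark", "scribbles", "ink stains"]
BASES = ["colour", "binary"]


def enumerate_versions():
    """Return (base, noise_layers) tuples for every image version."""
    versions = []
    for base in BASES:
        versions.append((base, ()))                  # no added noise
        for noise in NOISE_TYPES:
            versions.append((base, (noise,)))        # one layer of noise
        for pair in combinations(NOISE_TYPES, 2):    # assumed: unordered pairs
            versions.append((base, pair))            # two layers of noise
    return versions


versions = enumerate_versions()
by_layers = {n: sum(1 for _, layers in versions if len(layers) == n)
             for n in (0, 1, 2)}

print(by_layers)                                  # {0: 2, 1: 12, 2: 30}
print(len(versions))                              # 44
print(322 * len(versions), 100 * len(versions))   # 14168 4400
```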

The compressed archive is ~26 GiB, and the uncompressed version is ~193 GiB. The image archives are distributed as .tar.lzma files, which must be decompressed before use.
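One way to unpack a .tar.lzma file without external tools is Python's standard library. The sketch below is a minimal example, not the dataset author's procedure. The function name and the archive name in the usage comment are hypothetical. It streams the archive, so the decompressed tar is never held in memory, but the destination still needs enough disk space for the uncompressed files.

```python
import lzma
import tarfile
from pathlib import Path


def extract_tar_lzma(archive_path, dest_dir):
    """Extract a .tar.lzma archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    names = []
    # lzma.open auto-detects the legacy .lzma and the newer .xz container.
    with lzma.open(archive_path, "rb") as stream:
        # "r|" reads the tar as a forward-only stream of members.
        with tarfile.open(fileobj=stream, mode="r|") as tar:
            for member in tar:
                names.append(member.name)
                tar.extract(member, path=dest)
    return names


# Usage (hypothetical file name):
# extract_tar_lzma("some_archive.tar.lzma", "nod_extracted")
```

As with any tar file, only extract archives from a source you trust, since member paths are written relative to the destination directory.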

References:

Barcha, Pedro. 2017. “Old Books Dataset.” GitHub Repository. GitHub. https://github.com/PedroBarcha/old-books-dataset.

Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. “Yarmouk Arabic OCR Dataset.” In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150–54. IEEE.

Hegghammer, Thomas. 2021. “OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment.” SocArXiv. https://osf.io/preprints/socarxiv/6zfvs

Files

Files (27.6 GB)

Checksum Size
md5:f7f37148ccd1fc07f39a7d7a88f1f2a0 247.1 kB
md5:3d3b2974d6d46ac5bc0f6cf914dd5d27 3.7 kB
md5:806d1726e06c786e25b832bc03955c29 259.4 MB
md5:aa3d251405f15e0ba3282229e8d4058e 11.6 MB
md5:4f73ec4c03eae437509a5cc32e18b2d1 479.4 MB
md5:301071e4c362d5c1eae856b84b07f6aa 217.4 MB
md5:f755e0ca3a93191e13ca68d42e6ccee7 1.4 GB
md5:96ff56724cd82e57ddd7e9e05f3dc243 343.5 MB
md5:370836e2d3423685c617b8be34777ba4 346.7 MB
md5:964cf1c780594da02c2acbbecd03d403 249.0 MB
md5:9a690e961bae39dae3d6326e61e389bb 6.4 MB
md5:dac24c4c73055774a0d678700afa0938 11.6 MB
md5:3536436049bd0fc01ed6fe639e827583 44.7 MB
md5:6de1d6a9872204b88a0241487b589862 68.8 MB
md5:b9e043b605c52550fbac8995e4632e5d 67.2 MB
md5:a459c091f4f9e40a984f4402f3914afb 11.2 MB
md5:3c0aac1793b9c86f3d76672584b71faa 434.2 MB
md5:324dfa565cb658c50590bcb7845290e5 1.4 GB
md5:ddb15874f863b80523b602d72bf4f675 449.6 MB
md5:79d96c03990e9b48de82329498bff9fe 455.0 MB
md5:bea35c17210bfb058640143e58973690 452.0 MB
md5:e13366f12fc3e163fe7f4c4c931e8a02 1.4 GB
md5:850b2c6f8edc98969e5d7dc97b543089 324.2 MB
md5:05ad23d6dae11acb09dfa703bbfd10e5 329.1 MB
md5:9dae0d443bf9b995dd57e727b2ddbeb1 212.0 MB
md5:79025855f8ddccbdbe275b0fdf975793 2.4 GB
md5:ac4e85631871f2ec8507bb549f7834ed 2.4 GB
md5:6b5f122f8cd8b4f788401f0308517617 1.3 GB
md5:a81f2929628f9850f462a91f72864300 291.2 MB
md5:c5dc894894fb08366564abd804f35c05 316.3 MB
md5:5931309198ea855b6f3faf8a77e57872 319.6 MB
md5:ec705632c64a49b7e3fc3be3d32ae805 6.0 MB
md5:d489845d0bfece5b018d36228c5955c8 39.3 MB
md5:e2cfadce3701dcaa1a53f18dc57d4133 34.9 MB
md5:7a9655f6b5c0bf334a3c87989715f883 37.8 MB
md5:cb88325700205b9eaa66d175ac1e60bb 6.6 MB
md5:fdb5b5581ee5f798e3b5ff4c2eabba16 44.6 MB
md5:53a366c22b23dd3514159349eba4d4e6 62.9 MB
md5:0a308203ad97a6546f21bb3a3c48a32b 65.7 MB
md5:899e5e1824f28ed9ea8985cc397ce81b 11.2 MB
md5:e3d96f513084365beecd2a30bb161268 237.5 MB
md5:eaaae7778b1bc4367a5124cbc331bfee 42.3 MB
md5:9ce98db2d1070d9fda38bd77f9e2a969 79.4 MB
md5:5c907edac69a8f62eb18eae1fcf27065 56.0 MB
md5:d9aa8cbc3fc784814df0f809f9bce7be 55.7 MB
md5:a845e80b66fea3507ad39f3c07684ff6 41.9 MB
md5:30d430227662a5ceb48647e0b295604c 38.6 MB
md5:4a8d63c46903ba2654a881a0b1168a71 103.7 MB
md5:f6dff9f8de93f42a1edbe97f3475a0cd 13.8 MB
md5:eea2540989609bff406ddeaa136c28d3 1.0 GB
md5:b74d14e04a2af08de8a8a6c1defed037 36.1 MB
md5:d023b81c89df506c73e0f117979b2d8b 36.7 MB
md5:8f4835f294846a0251d8d6628bb29fa8 43.3 MB
md5:f96dcfdf633dcc76d8839ac72853baba 88.9 MB
md5:0624c028dd075c3f6762be51af068b91 10.0 MB
md5:80f687226675dae7ab046db49b2b9a62 889.4 MB
md5:19e6a213928f211e7c75959075322db4 27.4 MB
md5:7773a09eb00b3b65367fd0068c06d86e 27.5 MB
md5:370c26ff10030ea60c444cf9c85c50fd 31.0 MB
md5:154ff78e6fa1057128a511e59d625f78 113.4 MB
md5:5bbc14a2605b71335e6519b4cf7e80ba 1.0 GB
md5:979b25d6b6ea8129e3df660a76364b0c 53.9 MB
md5:553425003b568ed75e0ffa76c5a11a23 54.0 MB
md5:dbd79028dc7fce24496eac1102bd6105 101.5 MB
md5:7ea4c10d9c518153d57751d808174990 1.0 GB
md5:f56415dbc9b2507a3352cccf6dac6bb0 20.1 MB
md5:269323e8880db1812f13abfff4ef7f82 20.9 MB
md5:82eda38a125c259c9e905a812ed2b064 17.0 MB
md5:d438f318aa5769ee7609268bc2f770a2 791.2 MB
md5:6e26a472fba78b7cbf00ec9a413dedf2 788.9 MB
md5:20681967a5a53f4c89ae3fce169aaea7 1.0 GB
md5:740e783796e4b6dec40f0bbbdcf8ac16 36.2 MB
md5:bc748776bd34cb001f3c7b1ffdbf4f2b 34.8 MB
md5:6dc6cff1f236a6a18b15baf8d52537ff 35.7 MB
md5:1c9a4ceae50540f0e5a10961e8255457 94.6 MB
md5:13f9179c583c40f75ca3e0db9a52c0c8 311.3 MB
md5:b46d11aa3e7822ca50a351954c2bf026 47.9 MB
md5:4f542220e39511b9774ec53c83696740 48.3 MB
md5:84e902d58c0c96e003060ef35ed38027 85.6 MB
md5:10d2f7eec58e964eabb80e74ab7f2c3d 293.6 MB
md5:fd1405ff3fa0f7dd099715f3fa531b7c 16.8 MB
md5:b2b75e4e5074c50ce44ff688f66e6547 17.0 MB
md5:8ab3efc71d8deea794f896ff153ac3f9 12.1 MB
md5:d1ca50a4f1c85960bdd1ff1f3d85a90f 729.2 MB
md5:b21c75c27c4cffee47ee7cea62dbe32b 726.2 MB
md5:33cd3f2de101ff4de5fa3878139bf0ac 869.3 MB
md5:0d63873893986c556c579912de485219 26.3 MB
md5:35ad96f5c98fc37fe2907634ae49fab9 22.2 MB
md5:ba3a2da9959e3b4d42172e4441087b21 21.7 MB
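
The MD5 checksums in the file listing can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard library. The function names are illustrative, and the file path in the usage comment is hypothetical. The file is read in chunks so that multi-gigabyte archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a listing entry such as 'md5:0123...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (hypothetical local path; the checksum comes from the listing):
# matches_listing("downloaded_file", "md5:ba3a2da9959e3b4d42172e4441087b21")
```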