Dataset Open Access

Noisy OCR Dataset (NOD)

Hegghammer, Thomas

This dataset contains 18,504 images of English and Arabic documents with ground truth for use in OCR benchmarking. It consists of two collections, "Old Books" (English) and "Yarmouk" (Arabic), each of which contains an image set reproduced in 44 versions with different types and degrees of artificially generated noise. The dataset was originally developed for Hegghammer (2021).

Source images

The seed of the English collection was the "Old Books Dataset" (Barcha 2017), a set of 322 page scans from English-language books printed between 1853 and 1920. The seed of the Arabic collection was a randomly selected subset of 100 pages from the "Yarmouk Arabic OCR Dataset" (Abu Doush et al. 2018), which consists of 4,587 Arabic Wikipedia articles printed to paper and scanned to PDF.

Artificial noise application

The dataset was created as follows:
- First a greyscale version of each image was created, so that there were two versions (colour and greyscale) with no added noise. The greyscale version is the one called "binary" below and labelled "bin" in the file names.
- Then six ideal types of image noise ("blur", "weak ink", "salt and pepper", "watermark", "scribbles", and "ink stains") were applied to both the colour version and the binary version of the images, creating 12 additional versions of each image. The R code used to generate the noise is included in the repository (noise_generation.R).
- Lastly, all available combinations of *two* noise filters were applied to the colour and binary images: the six filters give 15 distinct pairs, and applying each pair to both base versions yields an additional 30 versions.

This yielded a total of 44 image versions divided into three categories of noise intensity: 2 versions with no added noise, 12 versions with one layer of noise, and 30 versions with two layers of noise. This amounted to an English corpus of 14,168 documents and an Arabic corpus of 4,400 documents. 
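The counts above follow from simple combinatorics, and the numbered labels in the archive names appear to follow the same enumeration order. The sketch below is an illustration in Python, not the author's R code. The short labels ("col", "bin", "blur", "weak", "snp", "wm", "scrib", "ink") are taken from the archive names in the file listing; the ordering rule and the names `BASES`, `NOISE`, and `version_labels` are assumptions made for this example.

```python
from itertools import combinations

# Short labels as they appear in the archive file names.
BASES = ["col", "bin"]                                  # two base versions without noise
NOISE = ["blur", "weak", "snp", "wm", "scrib", "ink"]   # six noise types


def version_labels():
    """Return the 44 version labels in the order assumed from the file listing."""
    # 2 versions with no added noise.
    labels = list(BASES)
    # 12 versions with one layer of noise (6 filters x 2 base versions).
    labels += [f"{base}_{noise}" for base in BASES for noise in NOISE]
    # 30 versions with two layers of noise (15 unordered pairs x 2 base versions).
    labels += [
        f"{base}_{first}_{second}"
        for base in BASES
        for first, second in combinations(NOISE, 2)
    ]
    # Prefix each label with its two-digit position, as in the archive names.
    return [f"{i:02d}_{label}" for i, label in enumerate(labels, start=1)]


labels = version_labels()
print(len(labels))            # versions per source image
print(labels[0], labels[-1])  # first and last label
print(322 * len(labels))      # English corpus size (322 source pages)
print(100 * len(labels))      # Arabic corpus size (100 source pages)
```

The two corpus sizes computed this way are 14,168 and 4,400, which sum to 18,568. This is slightly more than the 18,504 images quoted in the summary; the source does not explain the difference.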

The compressed archives total ~26 GiB, and the uncompressed dataset is ~193 GiB. See this link for how to unpack .tar.lzma files.
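As one possible way to unpack the archives without extra tools, the sketch below uses Python's standard `tarfile` module, whose "r:xz" mode relies on the `lzma` module and also reads legacy .lzma streams. The function name `extract_archive` and the destination path in the usage note are hypothetical, and the layout inside the archives is not assumed.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.lzma archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:xz" decompresses via the lzma module, which auto-detects .xz and .lzma.
    with tarfile.open(archive_path, mode="r:xz") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Python versions with extraction filters: reject unsafe member paths.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names
```

For example, `extract_archive("ground_truth.tar.lzma", "nod/ground_truth")` would unpack the ground truth archive into a local folder. Given the ~193 GiB uncompressed size, it may be sensible to unpack only the versions needed for a given experiment.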

References:

Barcha, Pedro. 2017. “Old Books Dataset.” GitHub Repository. GitHub. https://github.com/PedroBarcha/old-books-dataset.

Doush, Iyad Abu, Faisal AlKhateeb, and Anwaar Hamdi Gharibeh. 2018. “Yarmouk Arabic OCR Dataset.” In 2018 8th International Conference on Computer Science and Information Technology (CSIT), 150–54. IEEE.

Hegghammer, Thomas. 2021. “OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment.” SocArXiv. https://osf.io/preprints/socarxiv/6zfvs.

Files (27.6 GB)
| Name | Checksum | Size |
| --- | --- | --- |
| ground_truth.tar.lzma | md5:f7f37148ccd1fc07f39a7d7a88f1f2a0 | 247.1 kB |
| noise_generation.R | md5:3d3b2974d6d46ac5bc0f6cf914dd5d27 | 3.7 kB |
| old_books_01_col.tar.lzma | md5:806d1726e06c786e25b832bc03955c29 | 259.4 MB |
| old_books_02_bin.tar.lzma | md5:aa3d251405f15e0ba3282229e8d4058e | 11.6 MB |
| old_books_03_col_blur.tar.lzma | md5:4f73ec4c03eae437509a5cc32e18b2d1 | 479.4 MB |
| old_books_04_col_weak.tar.lzma | md5:301071e4c362d5c1eae856b84b07f6aa | 217.4 MB |
| old_books_05_col_snp.tar.lzma | md5:f755e0ca3a93191e13ca68d42e6ccee7 | 1.4 GB |
| old_books_06_col_wm.tar.lzma | md5:96ff56724cd82e57ddd7e9e05f3dc243 | 343.5 MB |
| old_books_07_col_scrib.tar.lzma | md5:370836e2d3423685c617b8be34777ba4 | 346.7 MB |
| old_books_08_col_ink.tar.lzma | md5:964cf1c780594da02c2acbbecd03d403 | 249.0 MB |
| old_books_09_bin_blur.tar.lzma | md5:9a690e961bae39dae3d6326e61e389bb | 6.4 MB |
| old_books_10_bin_weak.tar.lzma | md5:dac24c4c73055774a0d678700afa0938 | 11.6 MB |
| old_books_11_bin_snp.tar.lzma | md5:3536436049bd0fc01ed6fe639e827583 | 44.7 MB |
| old_books_12_bin_wm.tar.lzma | md5:6de1d6a9872204b88a0241487b589862 | 68.8 MB |
| old_books_13_bin_scrib.tar.lzma | md5:b9e043b605c52550fbac8995e4632e5d | 67.2 MB |
| old_books_14_bin_ink.tar.lzma | md5:a459c091f4f9e40a984f4402f3914afb | 11.2 MB |
| old_books_15_col_blur_weak.tar.lzma | md5:3c0aac1793b9c86f3d76672584b71faa | 434.2 MB |
| old_books_16_col_blur_snp.tar.lzma | md5:324dfa565cb658c50590bcb7845290e5 | 1.4 GB |
| old_books_17_col_blur_wm.tar.lzma | md5:ddb15874f863b80523b602d72bf4f675 | 449.6 MB |
| old_books_18_col_blur_scrib.tar.lzma | md5:79d96c03990e9b48de82329498bff9fe | 455.0 MB |
| old_books_19_col_blur_ink.tar.lzma | md5:bea35c17210bfb058640143e58973690 | 452.0 MB |
| old_books_20_col_weak_snp.tar.lzma | md5:e13366f12fc3e163fe7f4c4c931e8a02 | 1.4 GB |
| old_books_21_col_weak_wm.tar.lzma | md5:850b2c6f8edc98969e5d7dc97b543089 | 324.2 MB |
| old_books_22_col_weak_scrib.tar.lzma | md5:05ad23d6dae11acb09dfa703bbfd10e5 | 329.1 MB |
| old_books_23_col_weak_ink.tar.lzma | md5:9dae0d443bf9b995dd57e727b2ddbeb1 | 212.0 MB |
| old_books_24_col_snp_wm.tar.lzma | md5:79025855f8ddccbdbe275b0fdf975793 | 2.4 GB |
| old_books_25_col_snp_scrib.tar.lzma | md5:ac4e85631871f2ec8507bb549f7834ed | 2.4 GB |
| old_books_26_col_snp_ink.tar.lzma | md5:6b5f122f8cd8b4f788401f0308517617 | 1.3 GB |
| old_books_27_col_wm_scrib.tar.lzma | md5:a81f2929628f9850f462a91f72864300 | 291.2 MB |
| old_books_28_col_wm_ink.tar.lzma | md5:c5dc894894fb08366564abd804f35c05 | 316.3 MB |
| old_books_29_col_scrib_ink.tar.lzma | md5:5931309198ea855b6f3faf8a77e57872 | 319.6 MB |
| old_books_30_bin_blur_weak.tar.lzma | md5:ec705632c64a49b7e3fc3be3d32ae805 | 6.0 MB |
| old_books_31_bin_blur_snp.tar.lzma | md5:d489845d0bfece5b018d36228c5955c8 | 39.3 MB |
| old_books_32_bin_blur_wm.tar.lzma | md5:e2cfadce3701dcaa1a53f18dc57d4133 | 34.9 MB |
| old_books_33_bin_blur_scrib.tar.lzma | md5:7a9655f6b5c0bf334a3c87989715f883 | 37.8 MB |
| old_books_34_bin_blur_ink.tar.lzma | md5:cb88325700205b9eaa66d175ac1e60bb | 6.6 MB |
| old_books_35_bin_weak_snp.tar.lzma | md5:fdb5b5581ee5f798e3b5ff4c2eabba16 | 44.6 MB |
| old_books_36_bin_weak_wm.tar.lzma | md5:53a366c22b23dd3514159349eba4d4e6 | 62.9 MB |
| old_books_37_bin_weak_scrib.tar.lzma | md5:0a308203ad97a6546f21bb3a3c48a32b | 65.7 MB |
| old_books_38_bin_weak_ink.tar.lzma | md5:899e5e1824f28ed9ea8985cc397ce81b | 11.2 MB |
| old_books_40_bin_snp_scrib.tar.lzma | md5:e3d96f513084365beecd2a30bb161268 | 237.5 MB |
| old_books_41_bin_snp_ink.tar.lzma | md5:eaaae7778b1bc4367a5124cbc331bfee | 42.3 MB |
| old_books_42_bin_wm_scrib.tar.lzma | md5:9ce98db2d1070d9fda38bd77f9e2a969 | 79.4 MB |
| old_books_43_bin_wm_ink.tar.lzma | md5:5c907edac69a8f62eb18eae1fcf27065 | 56.0 MB |
| old_books_44_bin_scrib_ink.tar.lzma | md5:d9aa8cbc3fc784814df0f809f9bce7be | 55.7 MB |
| yarmouk_01_col.tar.lzma | md5:a845e80b66fea3507ad39f3c07684ff6 | 41.9 MB |
| yarmouk_02_bin.tar.lzma | md5:30d430227662a5ceb48647e0b295604c | 38.6 MB |
| yarmouk_03_col_blur.tar.lzma | md5:4a8d63c46903ba2654a881a0b1168a71 | 103.7 MB |
| yarmouk_04_col_weak.tar.lzma | md5:f6dff9f8de93f42a1edbe97f3475a0cd | 13.8 MB |
| yarmouk_05_col_snp.tar.lzma | md5:eea2540989609bff406ddeaa136c28d3 | 1.0 GB |
| yarmouk_06_col_wm.tar.lzma | md5:b74d14e04a2af08de8a8a6c1defed037 | 36.1 MB |
| yarmouk_07_col_scrib.tar.lzma | md5:d023b81c89df506c73e0f117979b2d8b | 36.7 MB |
| yarmouk_08_col_ink.tar.lzma | md5:8f4835f294846a0251d8d6628bb29fa8 | 43.3 MB |
| yarmouk_09_bin_blur.tar.lzma | md5:f96dcfdf633dcc76d8839ac72853baba | 88.9 MB |
| yarmouk_10_bin_weak.tar.lzma | md5:0624c028dd075c3f6762be51af068b91 | 10.0 MB |
| yarmouk_11_bin_snp.tar.lzma | md5:80f687226675dae7ab046db49b2b9a62 | 889.4 MB |
| yarmouk_12_bin_wm.tar.lzma | md5:19e6a213928f211e7c75959075322db4 | 27.4 MB |
| yarmouk_13_bin_scrib.tar.lzma | md5:7773a09eb00b3b65367fd0068c06d86e | 27.5 MB |
| yarmouk_14_bin_ink.tar.lzma | md5:370c26ff10030ea60c444cf9c85c50fd | 31.0 MB |
| yarmouk_15_col_blur_weak.tar.lzma | md5:154ff78e6fa1057128a511e59d625f78 | 113.4 MB |
| yarmouk_16_col_blur_snp.tar.lzma | md5:5bbc14a2605b71335e6519b4cf7e80ba | 1.0 GB |
| yarmouk_17_col_blur_wm.tar.lzma | md5:979b25d6b6ea8129e3df660a76364b0c | 53.9 MB |
| yarmouk_18_col_blur_scrib.tar.lzma | md5:553425003b568ed75e0ffa76c5a11a23 | 54.0 MB |
| yarmouk_19_col_blur_ink.tar.lzma | md5:dbd79028dc7fce24496eac1102bd6105 | 101.5 MB |
| yarmouk_20_col_weak_snp.tar.lzma | md5:7ea4c10d9c518153d57751d808174990 | 1.0 GB |
| yarmouk_21_col_weak_wm.tar.lzma | md5:f56415dbc9b2507a3352cccf6dac6bb0 | 20.1 MB |
| yarmouk_22_col_weak_scrib.tar.lzma | md5:269323e8880db1812f13abfff4ef7f82 | 20.9 MB |
| yarmouk_23_col_weak_ink.tar.lzma | md5:82eda38a125c259c9e905a812ed2b064 | 17.0 MB |
| yarmouk_24_col_snp_wm.tar.lzma | md5:d438f318aa5769ee7609268bc2f770a2 | 791.2 MB |
| yarmouk_25_col_snp_scrib.tar.lzma | md5:6e26a472fba78b7cbf00ec9a413dedf2 | 788.9 MB |
| yarmouk_26_col_snp_ink.tar.lzma | md5:20681967a5a53f4c89ae3fce169aaea7 | 1.0 GB |
| yarmouk_27_col_wm_scrib.tar.lzma | md5:740e783796e4b6dec40f0bbbdcf8ac16 | 36.2 MB |
| yarmouk_28_col_wm_ink.tar.lzma | md5:bc748776bd34cb001f3c7b1ffdbf4f2b | 34.8 MB |
| yarmouk_29_col_scrib_ink.tar.lzma | md5:6dc6cff1f236a6a18b15baf8d52537ff | 35.7 MB |
| yarmouk_30_bin_blur_weak.tar.lzma | md5:1c9a4ceae50540f0e5a10961e8255457 | 94.6 MB |
| yarmouk_31_bin_blur_snp.tar.lzma | md5:13f9179c583c40f75ca3e0db9a52c0c8 | 311.3 MB |
| yarmouk_32_bin_blur_wm.tar.lzma | md5:b46d11aa3e7822ca50a351954c2bf026 | 47.9 MB |
| yarmouk_33_bin_blur_scrib.tar.lzma | md5:4f542220e39511b9774ec53c83696740 | 48.3 MB |
| yarmouk_34_bin_blur_ink.tar.lzma | md5:84e902d58c0c96e003060ef35ed38027 | 85.6 MB |
| yarmouk_35_bin_weak_snp.tar.lzma | md5:10d2f7eec58e964eabb80e74ab7f2c3d | 293.6 MB |
| yarmouk_36_bin_weak_wm.tar.lzma | md5:fd1405ff3fa0f7dd099715f3fa531b7c | 16.8 MB |
| yarmouk_37_bin_weak_scrib.tar.lzma | md5:b2b75e4e5074c50ce44ff688f66e6547 | 17.0 MB |
| yarmouk_38_bin_weak_ink.tar.lzma | md5:8ab3efc71d8deea794f896ff153ac3f9 | 12.1 MB |
| yarmouk_39_bin_snp_wm.tar.lzma | md5:d1ca50a4f1c85960bdd1ff1f3d85a90f | 729.2 MB |
| yarmouk_40_bin_snp_scrib.tar.lzma | md5:b21c75c27c4cffee47ee7cea62dbe32b | 726.2 MB |
| yarmouk_41_bin_snp_ink.tar.lzma | md5:33cd3f2de101ff4de5fa3878139bf0ac | 869.3 MB |
| yarmouk_42_bin_wm_scrib.tar.lzma | md5:0d63873893986c556c579912de485219 | 26.3 MB |
| yarmouk_43_bin_wm_ink.tar.lzma | md5:35ad96f5c98fc37fe2907634ae49fab9 | 22.2 MB |
| yarmouk_44_bin_scrib_ink.tar.lzma | md5:ba3a2da9959e3b4d42172e4441087b21 | 21.7 MB |
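
Because several archives exceed 1 GB, it can be worth checking a download against the MD5 value in the listing before unpacking it. The following is a minimal sketch using Python's standard `hashlib` module; the function names `md5_of_file` and `matches_listing` are hypothetical, and the checksum string may be passed with or without the "md5:" prefix.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so multi-gigabyte archives are never loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Return True if the file's MD5 equals a listing entry such as 'md5:f7f3...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing("ground_truth.tar.lzma", "md5:f7f37148ccd1fc07f39a7d7a88f1f2a0")` should return True for an intact copy of the ground truth archive.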