Published January 26, 2021 | Version 1.0
Dataset Open

1QIsaa data collection (binarized images, feature files, and plotting scripts) for writer identification test using artificial intelligence and image-based pattern recognition techniques

  • 1. University of Groningen

Description

The Great Isaiah Scroll (1QIsaa) data set for writer identification

This data set is collected for the ERC project:
The Hands that Wrote the Bible: Digital Palaeography and Scribal Culture of the Dead Sea Scrolls
PI: Mladen Popović
Grant agreement ID: 640497

Project website: https://cordis.europa.eu/project/id/640497

Copyright (c) University of Groningen, 2021. All rights reserved.
Disclaimer and copyright notice for all data contained in this .tar.gz file:

1) permission is hereby granted to use the data for research purposes. It is not allowed to distribute this data for commercial purposes.

2) provider gives no express or implied warranty of any kind, and any implied warranties of merchantability and fitness for purpose are disclaimed.

3) provider shall not be liable for any direct, indirect, special, incidental, or consequential damages arising out of any use of this data.

4) the user should refer to the first public article on this data set:

Popović, M., Dhali, M. A., & Schomaker, L. (2020). Artificial intelligence based writer identification generates new evidence for the unknown scribes of the Dead Sea Scrolls exemplified by the Great Isaiah Scroll (1QIsaa). arXiv preprint arXiv:2010.14476.

BibTeX:

@article{popovic2020artificial,
  title={Artificial intelligence based writer identification generates new evidence for the unknown scribes of the Dead Sea Scrolls exemplified by the Great Isaiah Scroll (1QIsaa)},
  author={Popovi{\'c}, Mladen and Dhali, Maruf A and Schomaker, Lambert},
  journal={arXiv preprint arXiv:2010.14476},
  year={2020}
}

5) the recipient should refrain from proliferating the data set to third parties external to his/her local research group. Please refer interested researchers to this site for obtaining their own copy.

Organisation of the data:

The .tar.gz file contains three directories: images, features, and plots. The included 'README' file contains all the instructions.

The 'images' directory contains NetPBM images of the columns of 1QIsaa. The NetPBM format was chosen for its simplicity and because it is uncompressed, which rules out any lossy compression in the processing chain. There are two images for each column of the Great Isaiah Scroll: one is the direct binarized output of the BiNet system (arxiv.org/abs/1911.07930), and the other is the manually cleaned version of that output. The file names for the direct binarized output have the format '1QIsaa_col<columnnr>.pbm', for example '1QIsaa_col15.pbm'. For the cleaned version, the format is '1QIsaa_col<columnnr>_cleaned.pbm', for example '1QIsaa_col15_cleaned.pbm'. Note: the two kinds of image files are not kept in separate subdirectories; they are all extracted to the same directory. Because the file names are unique, extracting them into a single directory causes no conflicts.
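The naming convention above can be handled programmatically. The following is a minimal Python sketch, not part of the distributed scripts, that parses the two file-name formats and pairs the direct and cleaned image of each column. The function names `parse_image_name` and `pair_by_column` are hypothetical, and the sketch assumes that `<columnnr>` consists of decimal digits only.

```python
import re
from collections import defaultdict

# The two naming schemes described above:
#   1QIsaa_col<columnnr>.pbm          (direct BiNet output)
#   1QIsaa_col<columnnr>_cleaned.pbm  (manually cleaned version)
# Assumption: <columnnr> is made of decimal digits only.
NAME_RE = re.compile(r"^1QIsaa_col(\d+)(_cleaned)?\.pbm$")


def parse_image_name(name):
    """Return (column_number, is_cleaned), or None if the name does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2) is not None


def pair_by_column(names):
    """Group file names by column number under the keys 'direct' and 'cleaned'."""
    pairs = defaultdict(dict)
    for name in names:
        parsed = parse_image_name(name)
        if parsed is None:
            continue  # skip README files and anything else
        column, is_cleaned = parsed
        pairs[column]["cleaned" if is_cleaned else "direct"] = name
    return dict(pairs)
```

Passing the listing of the extracted directory (for example from `os.listdir`) to `pair_by_column` gives one entry per column, which makes it easy to check that every column has both versions.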

The 'features' directory contains feature files computed for each of the column images. There are two types of feature files: Hinge and Adjoined. They are distinguishable by their extension, for example, '1QIsaa_col15_cleaned.hinge' and '1QIsaa_col15_cleaned.adjoined'. They are also arranged in separate directories for ease of use.

The 'plots' directory contains a simple Python script that performs PCA on the feature files and visualizes the result in a 3D plot. The script takes the location of the feature files as input. The 'README_plot' file contains examples of how to run it in the terminal.
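For readers who want to see the idea behind such a script, here is a minimal sketch of PCA followed by a 3D scatter plot. It is not the script shipped in the 'plots' directory. The function names `pca_project` and `plot_3d` are hypothetical, PCA is computed here with a plain SVD in NumPy, and the loading of the feature files is left out because their internal layout is not described here.

```python
import numpy as np


def pca_project(features, n_components=3):
    """Project row-wise feature vectors onto their first principal components."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)  # centre each feature dimension
    # SVD of the centred data: the rows of vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T


def plot_3d(points, labels):
    """Scatter the projected points in 3D and annotate each with its label."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(points[:, 0], points[:, 1], points[:, 2])
    for (x, y, z), label in zip(points, labels):
        ax.text(x, y, z, str(label))
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    plt.show()
```

Given a matrix with one feature vector per column image, `plot_3d(pca_project(matrix), column_numbers)` would show how the columns are distributed along the first three principal components.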

Brief description:
According to ImageMagick's 'identify' tool, the original images from the Brill collection are grayscale JPEG files (.jpg) in '8-bit Gray 256c'. These images pass through several preprocessing steps to become suitable for pattern recognition-based techniques. The first step is image binarization. To prevent any classification of the text-column images based on irrelevant background patterns, a specific binarization technique (BiNet) was applied, which keeps the original ink traces intact. After binarization, the images were cleaned further by removing the adjacent columns that partially appear in the target columns' images. Finally, a few minor affine transformations and stretching corrections were performed in a restrictive manner. These corrections are also aimed at aligning the text where the text lines are twisted due to the degradation of the leather writing surface. Hence, the directory contains the cleaned images alongside the direct binarized images. No effort has been made to obtain a balanced set in any way.
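One simple way to inspect the effect of the cleaning step is to compare the number of ink pixels in the direct and cleaned image of the same column. The sketch below is a minimal pure-Python PBM reader written for illustration only. The names `read_pbm` and `ink_count` are hypothetical, and it handles both the plain (P1) and raw (P4) variants because it is not stated here which one the files use. Comments inside a plain P1 raster are not handled.

```python
def read_pbm(data):
    """Parse PBM bytes (P1 or P4) into (width, height, rows of 0/1 ints)."""
    pos = 0
    tokens = []
    # Header: magic number, width, height. A '#' starts a comment line.
    while len(tokens) < 3:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            while data[pos:pos + 1] not in (b"\n", b"\r", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    magic, width, height = tokens[0], int(tokens[1]), int(tokens[2])

    if magic == b"P4":
        pos += 1  # exactly one whitespace byte separates header and raster
        row_bytes = (width + 7) // 8  # each row is padded to a whole byte
        rows = []
        for r in range(height):
            chunk = data[pos + r * row_bytes:pos + (r + 1) * row_bytes]
            # Pixels are packed most significant bit first; 1 means ink.
            rows.append([(chunk[c >> 3] >> (7 - (c & 7))) & 1
                         for c in range(width)])
    elif magic == b"P1":
        # Plain format: ASCII '0' and '1' characters, whitespace ignored.
        bits = [b - 48 for b in data[pos:] if b in b"01"]
        rows = [bits[r * width:(r + 1) * width] for r in range(height)]
    else:
        raise ValueError("not a PBM file")
    return width, height, rows


def ink_count(rows):
    """Count the pixels set to 1 (ink) in a parsed PBM raster."""
    return sum(sum(row) for row in rows)
```

Reading '1QIsaa_col15.pbm' and '1QIsaa_col15_cleaned.pbm' in binary mode and passing their contents to `read_pbm` then allows a direct comparison of image sizes and ink counts. For large images an imaging library would be considerably faster than this sketch.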

Tools:
Binarization:
The BiNet tool is available for scientific use upon request (m.a.dhali(at)rug.nl)

Image Morphing:
In the original article, data augmentation was performed using image morphing. The tool is available on GitHub:
https://github.com/GrHound/imagemorph.c

Features for writer identification:
Lambert Schomaker
http://www.ai.rug.nl/~lambert/allographic-fraglet-codebooks/allographic-fraglet-codebooks.html
http://www.ai.rug.nl/~lambert/hinge/hinge-transform.html
1. Schomaker, L. & Bulacu, M. (2004). Automatic writer identification using connected-component contours and edge-based features of upper-case Western script. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), June 2004, pp. 787-798.
2. Bulacu, M. & Schomaker, L. (2007). Text-independent writer identification and verification using textural and allographic features. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Special Issue - Biometrics: Progress and Directions, 29(4), April 2007, pp. 701-717.

 
The features (hinge, fraglets) have been combined in a single MS Windows application, GIWIS, which is available for scientific use upon request (l.r.b.schomaker(at)rug.nl)

If you have any questions, please contact us:
Maruf A. Dhali <m.a.dhali(at)rug.nl>
Lambert Schomaker <l.r.b.schomaker(at)rug.nl>
Mladen Popović <m.popovic(at)rug.nl>

Please cite our papers if you use this data set:
1. Popović, M., Dhali, M. A., & Schomaker, L. (2020). Artificial intelligence based writer identification generates new evidence for the unknown scribes of the Dead Sea Scrolls exemplified by the Great Isaiah Scroll (1QIsaa). arXiv preprint arXiv:2010.14476.
2. Dhali, M. A., de Wit, J. W., & Schomaker, L. (2019). BiNet: Degraded-manuscript binarization in diverse document textures and layouts using deep encoder-decoder networks. arXiv preprint arXiv:1911.07930.

Files

Files (14.7 MB)

md5:5e92e30b5c4668c726b60378fde896a3

Additional details

Related works

Is cited by
Preprint: arXiv:2010.14476 (arXiv)
References
Preprint: arXiv:1911.07930 (arXiv)

Funding

European Commission
HandsandBible - The Hands that Wrote the Bible: Digital Palaeography and Scribal Culture of the Dead Sea Scrolls 640497