Published June 5, 2024 | Version 2023.1.8
Dataset Open

KITAB Text Reuse Data

  • 1. Northeastern University
  • 2. Aga Khan University
  • 3. Universität Hamburg
  • 4. University of London

Description

KITAB is funded by the European Research Council under the European Union’s Horizon 2020 research and innovation programme, awarded to the KITAB project (Grant Agreement No. 772989, PI Sarah Bowen Savant), hosted at Aga Khan University, London. In addition, it has received funding from the Qatar National Library to aid in the adaptation of the passim algorithm for Arabic.

KITAB’s text reuse data is generated by running passim on the OpenITI corpus (DOI: 10.5281/zenodo.3082463). Each version is the output of a separate run, and the version number corresponds to the corpus release.

To prepare the corpus for a passim run, we normalize the texts, remove most of the non-Arabic characters, and then chunk the texts into passages 300 words in length (using the non-Arabic characters, including white space, to delimit words). The chunks, called milestones, are identified by unique ids. This dataset represents the reuse cases that have been identified among milestones.
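The chunking step can be illustrated with a minimal sketch. It is not KITAB's actual preprocessing code: the character range, the function name `chunk_into_milestones`, and the `ms1`, `ms2`, ... id scheme are assumptions made for illustration, and the input is assumed to be already normalized (for example, with diacritics removed).

```python
import re

# Assumption: a "word" is a maximal run of basic Arabic letters; every other
# character (white space, digits, Latin letters, punctuation) is a delimiter.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def chunk_into_milestones(text, size=300, prefix="ms"):
    """Split text into consecutive passages of `size` words.

    Returns a dict mapping a hypothetical milestone id to its passage.
    The last milestone may hold fewer than `size` words.
    """
    words = ARABIC_WORD.findall(text)
    milestones = {}
    for start in range(0, len(words), size):
        ms_id = f"{prefix}{start // size + 1}"  # hypothetical id scheme
        milestones[ms_id] = " ".join(words[start:start + size])
    return milestones
```

Under these assumptions, a text of 650 words yields three milestones of 300, 300, and 50 words.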

The text reuse dataset consists of one folder per book. Each folder includes CSV files of the text reuse cases (alignments) between the corresponding book and all other books in which passim has found instances of reuse. The files follow the naming convention below, using the book ids:

    <bookVersionID1>_<bookVersionID2>.csv
    (e.g., ‘Shamela000001.mARkdown_Shamela000002.csv’).
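As a sketch of how this naming convention might be handled in code, the helper below recovers the two book version ids from a file name. The function name `pair_from_filename` is hypothetical, and the sketch assumes that book version ids never contain an underscore, so the first underscore separates the two ids.

```python
from pathlib import PurePath


def pair_from_filename(filename):
    """Return (bookVersionID1, bookVersionID2) from '<id1>_<id2>.csv'.

    Assumption: ids contain no underscore, so the first one is the separator.
    """
    name = PurePath(filename).name
    if not name.endswith(".csv"):
        raise ValueError(f"not a CSV file name: {name!r}")
    stem = name[: -len(".csv")]
    id1, separator, id2 = stem.partition("_")
    if not separator or not id1 or not id2:
        raise ValueError(f"file name does not match <id1>_<id2>.csv: {name!r}")
    return id1, id2
```

Applied to the example above, the helper returns the pair ('Shamela000001.mARkdown', 'Shamela000002').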


The CSV files are not the direct output of passim but the result of a post-processing step. The folder structure is shown below (for a total of four books, as an example).

    bookVersionID1
        |- bookVersionID1_bookVersionID4.csv
        |- bookVersionID1_bookVersionID3.csv
    bookVersionID4
        |- bookVersionID4_bookVersionID3.csv


If a book has no folder and does not appear in any CSV file name, passim has not found any text reuse cases for that book. In the above example, there is no folder or CSV file for bookVersionID2, which means no reuse cases were detected between book 2 and any of the other three books.

To save computational resources, we generate text reuse data uni-directionally, which means a pair of documents is compared only once (document1 to document2, not document2 to document1). 
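Because each pair is stored only once, a lookup for two given books has to try both orders. The sketch below shows one way to do that. The function name `find_pair_file` is hypothetical, and the sketch assumes the unzipped dataset follows the folder layout shown above, with each file stored in the folder named after the first id in its file name.

```python
from pathlib import Path


def find_pair_file(root, book_a, book_b):
    """Return the CSV path for a pair of books, or None if no file exists.

    Tries <root>/<a>/<a>_<b>.csv first, then <root>/<b>/<b>_<a>.csv,
    since the data is generated in one direction only.
    """
    root = Path(root)
    for first, second in ((book_a, book_b), (book_b, book_a)):
        candidate = root / first / f"{first}_{second}.csv"
        if candidate.is_file():
            return candidate
    return None
```

A result of None means that no alignment file exists for the pair in either order.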

The alignments in the CSV files are a list of records. Each record shows a pair of matched passages between two books together with statistics, such as the algorithm score, and contextual information, such as the start and end positions of aligned passages so that one can find those passages in the books. A description of the alignment fields is given in the release notes.
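A minimal sketch for loading one alignment file as a list of records follows. Since the field names are documented only in the release notes, the sketch does not assume any column names and simply uses whatever header row the file has. The function name `read_alignments` is hypothetical, and the delimiter is an assumption that may need to be changed (for example, to a tab) for a given release.

```python
import csv


def read_alignments(path, delimiter=","):
    """Read one alignment CSV into a list of dicts keyed by the header row.

    Assumption: the first row of the file holds the field names.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))
```

Inspecting the keys of the first record is a quick way to check the loaded field names against the release notes.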

For each dataset, we also generate statistical data on the alignments between the book pairs. The data is published in an application that facilitates search, filtering, and visualizations. The link to the corresponding application is given in the release notes.

Note on Release Numbering: In a version number such as 2020.1.1, 2020 is the year of the release, the first dotted number (.1) is the ordinal release number within 2020, and the second dotted number (.1) is the overall release number. The first dotted number resets every year, while the second one keeps increasing.
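The numbering scheme can be made concrete with a small sketch that splits a version string into its three parts. The names `ReleaseVersion` and `parse_release` are hypothetical and not part of any KITAB tooling.

```python
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    year: int             # year of the release
    release_in_year: int  # ordinal release number within that year
    overall_release: int  # overall release number across all years


def parse_release(version):
    """Parse a version string of the form '<year>.<n>.<m>'."""
    year, in_year, overall = (int(part) for part in version.split("."))
    return ReleaseVersion(year, in_year, overall)
```

For the version of this record, 2023.1.8, the parts read as the first release of 2023 and the eighth release overall.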

Note: The very first release of the KITAB text reuse data (2019.1.1) is published here because it was too big to publish on Zenodo. For more information on the complete datasets, please contact us via kitab-project@outlook.com (or contact other team members).

Future releases may include only part of the generated data if the whole dataset is too big to publish on Zenodo. However, the data is open access for anyone to use. We provide detailed information on the datasets in the corresponding release notes.

Files

KITAB-TextReuse-pairewise_2023-1-8.zip

Files (10.2 GB)

Checksum                                Size
md5:25541a744fc154dc7e272bdd41689be3    10.1 GB
md5:43c072bbf7ef1a273359b5a5c566e2e2    164.5 MB
md5:748b1cd65a6ba1f409687f35f1473e80    47.8 kB

Additional details

Funding

European Commission
KITAB – Exploring Cultural Memory in the Pre-Modern Islamic World (700–1500): Knowledge, Information Technology, and the Arabic Book (Grant Agreement No. 772989)