KITAB Text Reuse Data

David Smith; Sarah Bowen Savant; Maxim Romanov; Ryan Muther; Masoumeh Seydi; Sohail Merchant

doi:10.5281/zenodo.11501559

Published June 5, 2024 | Version 2023.1.8

Dataset Open

1. Northeastern University
2. Aga Khan University
3. Universität Hamburg
4. University of London

KITAB is funded by the European Research Council under the European Union’s Horizon 2020 research and innovation programme, awarded to the KITAB project (Grant Agreement No. 772989, PI Sarah Bowen Savant), hosted at Aga Khan University, London. In addition, it has received funding from the Qatar National Library to aid in the adaptation of the passim algorithm for Arabic.

KITAB’s text reuse data is generated by running passim on the OpenITI corpus (DOI: 10.5281/zenodo.3082463). Each version is the output of a separate run and the version number corresponds to the corpus releases.

To prepare the corpus for a passim run, we normalize the texts, remove most of the non-Arabic characters, and then chunk the texts into passages 300 words in length (words are delimited by the non-Arabic characters, including white space). The chunks, called milestones, are identified by unique ids. This dataset represents the reuse cases that have been identified among milestones.
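The chunking step can be sketched roughly as follows. This is only an illustration, not the project's preprocessing code: it assumes the text is already normalized (diacritics removed), treats every character outside a basic range of Arabic letters as a word separator, and uses a hypothetical milestone id scheme (`<book id>.ms<number>`).

```python
import re

# Assumption: anything outside this basic range of Arabic letters is a
# separator. The project's actual normalization rules are not given here.
NON_ARABIC = re.compile(r"[^\u0621-\u064A]+")


def chunk_into_milestones(text, book_id, size=300):
    """Split text into chunks of `size` words, each with a unique id."""
    words = [w for w in NON_ARABIC.split(text) if w]
    milestones = {}
    for start in range(0, len(words), size):
        # Hypothetical id scheme: book id plus a running milestone number.
        ms_id = f"{book_id}.ms{start // size + 1}"
        milestones[ms_id] = " ".join(words[start:start + size])
    return milestones
```

With this sketch, a text of 650 words yields three milestones: two of 300 words and a final one of 50 words.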

The text reuse dataset consists of one folder for each book. Each folder contains CSV files of the text reuse cases (alignments) between the corresponding book and all other books in which passim has found instances of reuse. The files follow the naming convention below, using the book ids:

<bookVersionID1>_<bookVersionID2>.csv
(e.g., ‘Shamela000001.mARkdown_Shamela000002.csv’).
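A file name can be split back into its two book ids with a few lines of Python. This is a minimal sketch; the function name `parse_pair_filename` is hypothetical, and it assumes that book version ids never contain an underscore, so that the first underscore separates the two ids.

```python
from pathlib import Path


def parse_pair_filename(filename):
    """Return (bookVersionID1, bookVersionID2) from '<id1>_<id2>.csv'."""
    name = Path(filename).name
    if not name.endswith(".csv"):
        raise ValueError(f"not a CSV file name: {filename}")
    stem = name[: -len(".csv")]
    # Assumption: ids contain no underscore, so the first one separates them.
    id1, sep, id2 = stem.partition("_")
    if not sep or not id1 or not id2:
        raise ValueError(f"unexpected file name: {filename}")
    return id1, id2
```

Applied to the example above, the sketch returns the pair `("Shamela000001.mARkdown", "Shamela000002")`.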

The CSV files are not the immediate output of passim, but rather the result of a post-processing step. The folder structure is as shown below (for a total of four books, for example).

    bookVersionID1
      |- bookVersionID1_bookVersionID4.csv
      |- bookVersionID1_bookVersionID3.csv
    bookVersionID4
      |- bookVersionID4_bookVersionID3.csv

Where a book has no folder and appears in no CSV file, it means that the passim algorithm has not been able to find any text reuse cases for that specific book. In the above example, there is no folder or CSV file for bookVersionID2, which means that no reuse cases were detected between book2 and any of the other three books.
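The folder layout can be explored with a short script. The sketch below is illustrative only: `root` stands for a hypothetical path to the unzipped archive, the function names are made up, and the id parsing assumes that book ids contain no underscore. Note that a book can lack a folder of its own and still appear as the second id in another book's file, as bookVersionID3 does in the example.

```python
from pathlib import Path


def list_alignment_files(root):
    """Map each book folder name to the sorted CSV file names inside it."""
    index = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            index[folder.name] = sorted(p.name for p in folder.glob("*.csv"))
    return index


def books_with_reuse(index):
    """Collect every book id that occurs in at least one CSV file name."""
    found = set()
    for files in index.values():
        for name in files:
            # Assumption: ids contain no underscore.
            first, _, second = name[: -len(".csv")].partition("_")
            found.update((first, second))
    return found
```

Any book id from the corpus that is missing from the result of `books_with_reuse` has no detected reuse in this release.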

To save computational resources, we generate text reuse data uni-directionally, which means a pair of documents is compared only once (document1 to document2, not document2 to document1).
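Because each pair is stored only once, a lookup for two given books has to try both orders. The following sketch shows one way to do that; `root` is a hypothetical path to the unzipped archive and `find_pair_file` is an illustrative helper, not part of any released tooling.

```python
from pathlib import Path


def find_pair_file(root, book_a, book_b):
    """Return the CSV path for a pair of books, whichever order was stored."""
    root = Path(root)
    for first, second in ((book_a, book_b), (book_b, book_a)):
        # Files live in the folder named after the first id of the pair.
        candidate = root / first / f"{first}_{second}.csv"
        if candidate.is_file():
            return candidate
    return None  # no reuse was recorded for this pair
```

The helper returns the same file for `(book_a, book_b)` and `(book_b, book_a)`, so callers do not need to know which direction passim used.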

The alignments in the CSV files are a list of records. Each record shows a pair of matched passages between two books together with statistics, such as the algorithm score, and contextual information, such as the start and end positions of the aligned passages, so that one can find those passages in the books. A description of the alignment fields is given in the release notes.
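A file of alignment records can be loaded generically with Python's standard csv module. This is a minimal sketch that makes no assumption about the field names: they are taken from the header row of the file, and their meaning should be checked against the release notes. The delimiter is left as a parameter because it is not stated here.

```python
import csv


def read_alignments(path, delimiter=","):
    """Read one alignment file into a list of dicts keyed by header names."""
    with open(path, newline="", encoding="utf-8") as handle:
        # Assumption: the first row of the file holds the column names.
        return list(csv.DictReader(handle, delimiter=delimiter))
```

If the files turn out to be tab-separated, the same function can be called with `delimiter="\t"`.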

For each dataset, we also generate statistical data on the alignments between the book pairs. The data is published in an application that facilitates search, filtering, and visualizations. The link to the corresponding application is given in the release notes.

Note on Release Numbering: In Version 2020.1.1, 2020 is the year of the release, the first dotted number (.1) is the ordinal release number within 2020, and the second dotted number (.1) is the overall release number. The first dotted number resets every year, while the second one keeps increasing.
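The numbering scheme can be made explicit with a small parser. The names `ReleaseVersion` and `parse_release` are illustrative only and do not come from the project.

```python
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    year: int             # year of the release
    yearly_ordinal: int   # resets every year
    overall_number: int   # keeps increasing across years


def parse_release(version):
    """Parse a version string such as '2023.1.8' into its three parts."""
    year, yearly, overall = (int(part) for part in version.split("."))
    return ReleaseVersion(year, yearly, overall)
```

For this record, `parse_release("2023.1.8")` gives year 2023, first release of that year, eighth release overall. Sorting by `overall_number` orders releases chronologically regardless of year.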

Note: The very first release of the KITAB text reuse data (2019.1.1) is published here as it was too big to publish on Zenodo. For more information on the complete datasets, please contact us via kitab-project@outlook.com (or other team members).

Future releases may include only part of the generated data if the whole dataset is too large to publish on Zenodo. However, the data is open access for anyone to use. We provide detailed information on the datasets in the corresponding release notes.

Files

Files (10.2 GB)

Name	MD5	Size
KITAB-TextReuse-pairewise_2023-1-8.zip	25541a744fc154dc7e272bdd41689be3	10.1 GB
KITAB-TextReuse-stats_2023-1-8.csv.gz	43c072bbf7ef1a273359b5a5c566e2e2	164.5 MB
KITAB-TextReuse_releaseNotes_2023-1-8.pdf	748b1cd65a6ba1f409687f35f1473e80	47.8 kB
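
Given the size of the archive, it may be worth checking a download against the MD5 checksums listed with the files above. A minimal sketch using Python's standard hashlib module follows; the local file path passed to the function is whatever location the file was saved to.

```python
import hashlib


def md5_of_file(path, block_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in blocks so that a 10 GB archive never sits in memory at once.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

For the release notes, for example, the result of `md5_of_file("KITAB-TextReuse_releaseNotes_2023-1-8.pdf")` should equal `748b1cd65a6ba1f409687f35f1473e80` if the download is intact.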

Additional details

European Commission
KITAB - Exploring Cultural Memory in the Pre-Modern Islamic World (700–1500): Knowledge, Information Technology, and the Arabic Book 772989

	All versions	This version
Views	1,407	886
Downloads	979	947
Data volume	3.4 TB	1.8 TB
