Published July 6, 2021 | Version v1
Dataset Open

People versus Books

  • 1. Aga Khan University
  • 2. University of Leipzig

Contributors

Researcher:

  • 1. Northeastern University

Description

This description pertains to the data prepared for Sarah Bowen Savant’s chapter, “People versus Books,” in Non sola scriptura: Essays on the Qur’an and Islam in Honour of William A. Graham (Routledge).

We are releasing the data that were used to create Graphs 1 and 2 and Tables 1-3 for the chapter.

Note: All the data files (except the text in number 3) are in TSV (tab-separated values) format and can be opened in any text editor or tabular data editor, such as Excel.
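As a minimal sketch of reading one of the TSV files programmatically, the Python snippet below parses tab-separated rows with the standard `csv` module. It assumes that the file has a header row whose names match the column names documented for file 1; the sample rows are hypothetical placeholders, not values from the released data.

```python
import csv
import io

# Hypothetical placeholder rows; the values (and the presence of a header
# row with exactly these names) are assumptions for illustration only.
SAMPLE_TSV = (
    "author\tdied\ttitle\tlength\tisnad_fraction\n"
    "AuthorA\t230\tTitleA\t1000\t12.5\n"
    "AuthorB\t571\tTitleB\t2000\t40.0\n"
)


def read_tsv(handle):
    """Return the rows of a tab-separated file as a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_tsv(io.StringIO(SAMPLE_TSV))

# The csv module yields strings, so numeric columns are converted explicitly.
for row in rows:
    row["length"] = int(row["length"])
    row["isnad_fraction"] = float(row["isnad_fraction"])
```

To read a file from disk instead, pass `open(path, encoding="utf-8", newline="")` to `read_tsv`, where `path` points to one of the unzipped TSV files.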

  1. “IsnadFractions_PeopleversusBooks”. This file represents a filtered version of an output from Ryan Muther’s isnād classifier algorithm. Muther ran the algorithm in July 2020, based on the Version 2020.1.2 release of the corpus, available at: http://doi.org/10.5281/zenodo.3891466. The data file includes:

    • author: the name of the author.

    • died: death date of author. NB: Especially the early dates cannot be relied on.  

    • title: the title of the author’s book, from the OpenITI Corpus.

    • length: length of the book, measured in word-tokens.

    • isnad_fraction: the percentage of the book’s word-tokens that are made up of isnāds.

  2. “GALTags_PeopleversusBooks”. Books in the OpenITI were mapped by Walid A. Akef in 2018 to:

    Brockelmann, Carl, History of the Arabic Written Traditions, trans. Joep Lameer, 2 vols and 3 supplements, Leiden: Brill, 2016-2018. 

    The file includes the following columns:

    • id: book id, from the OpenITI Corpus.

    • gal_tags: the GAL tags, also used in the OpenITI Corpus 

  3. “0571IbnCasakir.TarikhDimashq.JK000916-ara1.mARkdown”. The Ibn ʿAsākir text file, from the Version 2020.1.2 release of the OpenITI Corpus.

  4. “NamedEntities_PeopleversusBooks”. This is a first effort at working on named entities in Ibn ʿAsākir’s Taʾrīkh Madīnat Dimashq and covers only a tiny fraction of the surface forms of names. Most of the names pertain to persons who transmitted from Ibn Saʿd. There may be some duplicate surface forms (which do not affect the method). We use this list to replace the surface forms with transliterated values. The columns are as follows:

    • name: the normalized name.

    • ar_name: the Arabic name, i.e. the surface form.

    • status: true (T)/false (F) values to include/exclude the cases in the replacement process. We have used true values.

  5. “SplittingTerms_PeopleversusBooks”. We started with a list of transmissive terms that R. Kevin Jaques originated and then added more terms, including the various normalized forms of the same term. We used this list to split isnāds into names.

  6. “IbnSadIsnads_PeopleversusBooks”. This file includes the pieces of text that the algorithm tags as isnāds in the text. We extracted the tagged pieces and made a list of isnāds. Almost all of the isnāds start with a transmissive term. We use this file to extract the names and clean some rows to generate a data table that we can use for clustering. The columns are briefly described below:

    • text_ID: this contains the book id from the OpenITI Corpus. This column can be ignored as we are using it for one text in this project. However, it is required in the collection of isnāds from multiple texts.

    • id: a unique identifier assigned to each isnād. The isnād classifier algorithm assigns this id, which can be used to identify each isnād in the text when required.

    • isnad_text: the isnād that we extract from the text. 

    • length: length of the extracted isnād in tokens

  7. “IsnadNames_PeopleversusBooks”. This file is the list of isnāds (number 6 on this list) split by the transmissive terms (number 5 on this list) in order to extract the names in the isnāds. The columns are as follows:
    • text_ID: this contains the book id from the OpenITI Corpus. This column can be ignored as we are using it for one text in this project. However, it is required in the collection of isnāds from multiple texts.
    • isnad_text: the isnād that we extract from the text.
    • ibnSad_cnt: the number of times that the name Ibn Saʿd is mentioned in the corresponding isnād.
    • name_at_position_X: the rest of the columns in this table contain the pieces of the isnād that we get after splitting the isnāds with the list of terms. Each column contains a name or any string that appears between two transmissive terms. Some cells are empty, probably because we missed some transmissive terms.

  8. “IbnSadClusters_PeopleversusBooks”. This file includes clusters of isnāds of length six (i.e. isnāds that include six names). We used the affinity propagation (AP) clustering algorithm, based on the Levenshtein similarity score of the names. The columns are as follows:

    • frequency: the frequency of the isnād in the data

    • cluster_id: the id of the cluster to which the isnād belongs

    • nameX: columns C to H include the names in the isnād at positions 1 to 6, running back to Muhammad b. Saʿd at position 6.

  9. “JK000916-ara1.mARkdown_Shamela0001686-ara1.completed”. This is the passim output from the February 2020 run (which used the same version of the corpus, Version 2020.1.2). For definitions of the fields in this file, please see number 10.

  10. “PassimCol-Definition_PeopleversusBooks”. Description of the columns in passim outputs.
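The splitting and similarity steps described for files 5 to 8 can be illustrated with a short Python sketch. This is not the code used for the chapter: the transliterated splitting terms and the sample isnād are hypothetical, and normalizing the Levenshtein distance by the length of the longer string is one common way to obtain a similarity score, assumed here because the description does not specify the formula.

```python
import re

# Hypothetical transliterated terms for illustration; the released list
# (file 5) contains its own terms and their normalized variants.
SPLITTING_TERMS = ["haddathana", "akhbarana", "an"]


def split_isnad(isnad_text, terms):
    """Split an isnad on transmissive terms and return the pieces between them."""
    # Try longer terms first so a short term never pre-empts a longer one.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # Match a term only when it stands alone between whitespace or string edges.
    pattern = r"(?<!\S)(?:" + alternatives + r")(?!\S)"
    pieces = re.split(pattern, isnad_text)
    return [piece.strip() for piece in pieces if piece.strip()]


def levenshtein(a, b):
    """Return the edit distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Return a score in [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


names = split_isnad("haddathana NameA an NameB an NameC", SPLITTING_TERMS)
```

A matrix of such pairwise scores could then be given to an affinity propagation implementation that accepts precomputed similarities, for example scikit-learn's `AffinityPropagation` with `affinity="precomputed"`; whether the chapter used that library is not stated in this description.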

Files

PeopleversusBooks_Data.zip

Size: 18.2 MB
md5:b785d55548224b850accee1b3c14fdc4

Additional details

Funding

KITAB – Exploring Cultural Memory in the Pre-Modern Islamic World (700–1500): Knowledge, Information Technology, and the Arabic Book 772989
European Commission