Published August 5, 2022 | Version 1
Dataset Open

OCR model for Pracalit for Sanskrit and Newar MSS 16th to 19th C., Ground Truth

  • 1. SOAS University of London

Description

Ground truth data (PNG and XML files) for an OCR model. The dataset will be continually updated.
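Since the ground truth consists of PNG images and XML files, a first step after downloading is usually to match each image with its transcript. The following is a minimal sketch, not a description of the archive's actual layout. It assumes that an image and its XML file share the same file stem and that stems are unique across the archive. The function names `pair_ground_truth` and `list_pairs` are illustrative.

```python
import zipfile
from pathlib import PurePosixPath


def pair_ground_truth(member_names):
    """Pair .png and .xml archive members that share a file stem.

    Assumption: stems are unique across the archive; a repeated stem
    would overwrite an earlier entry.
    """
    images, transcripts = {}, {}
    for name in member_names:
        path = PurePosixPath(name)
        suffix = path.suffix.lower()
        if suffix == ".png":
            images[path.stem] = name
        elif suffix == ".xml":
            transcripts[path.stem] = name
    # Keep only stems that have both an image and a transcript.
    shared = sorted(images.keys() & transcripts.keys())
    return [(images[stem], transcripts[stem]) for stem in shared]


def list_pairs(zip_path):
    """Read member names from a zip archive and pair them."""
    with zipfile.ZipFile(zip_path) as archive:
        return pair_ground_truth(archive.namelist())
```

Calling `list_pairs("export_job_3435367.zip")` on the downloaded archive would return a list of (image, transcript) member names. Files without a counterpart are left out, so comparing the result length against the number of PNG files is a quick sanity check.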

The model was originally trained on Transkribus as a PyLaia model, created from ground truth data based on transcriptions into Pracalit Unicode of four Nepalese manuscripts. The manuscripts used to create this model are the Staatsbibliothek zu Berlin's Hitopadeśa (MIK I 4851) (mixed Newar and Sanskrit, dating to 1561) and Vetālapañcaviṃśati (HS. Or. 6414) (Newar, dating to 1675), the Cambridge Digital Library's Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) (Sanskrit, 18th century), and the Royal Asiatic Society Online Collection's Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) (Newar and Sanskrit, dating to c. 1800).

Training used 441 pages and validation used 242 pages.

This model does not recognise spacing, except for large gaps (i.e. for pictures or string holes). Newar word divider markers may not be represented or may be transcribed as virama. In general, the model is made for MSS written in scriptio continua and will transcribe them in scriptio continua into Pracalit Unicode.
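Because the model outputs scriptio continua, comparing its output against a reference transcription that uses word spacing would count every space as an error. One possible workaround, sketched below, is to strip whitespace from both texts before comparing them. This is an illustration rather than part of the published workflow, and the helper name `to_scriptio_continua` is hypothetical. Note that it also discards the spaces the model may emit for large gaps.

```python
def to_scriptio_continua(text):
    """Remove all whitespace (spaces, tabs, line breaks) from a transcript."""
    # str.split() with no arguments splits on any run of Unicode whitespace.
    return "".join(text.split())


def same_ignoring_spacing(reference, hypothesis):
    """Compare two transcripts while ignoring every whitespace difference."""
    return to_scriptio_continua(reference) == to_scriptio_continua(hypothesis)
```

With this normalisation, a spaced edition and an unspaced model output of the same line compare as equal, so remaining differences reflect character recognition rather than word division.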

Transcription was performed by Dr Alexander O'Neill (SOAS University of London). Transcription of the Vetālapañcaviṃśati (HS. Or. 6414) and Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23) was aided by unpublished materials provided by Dr Felix Otter (Philipps-Universität Marburg), as well as the published transcription in Shakya, Min Bahadur, and Shanta Harsha Bajracharya, eds. "Svayambhū Purāṇa." Lalitpur: Nagarjuna Institute of Exact Methods, 2001. The transcription of Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322) was aided by the transcription provided by the Digital Sanskrit Buddhist Canon Project based on Lokesh Chandra, "Guṇakāraṇḍavyūhasūtram," New Delhi: International Academy of Indian Culture, 1999.

Files

export_job_3435367.zip (503.9 MB)
md5:76d464a6b656482b7c0594c0cb37fb18
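The MD5 checksum listed for the archive can be used to confirm that a download is complete and uncorrupted. The sketch below uses Python's standard `hashlib` module and reads the file in chunks so the roughly 500 MB archive is never loaded into memory at once. The function names and the local file path are assumptions made for illustration.

```python
import hashlib

# Checksum as listed for export_job_3435367.zip on this record.
EXPECTED_MD5 = "76d464a6b656482b7c0594c0cb37fb18"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `verify_download("export_job_3435367.zip")` should return `True` for an intact copy saved under that name in the current directory. A `False` result suggests the download should be repeated.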