Published July 1, 2025 | Version 1.0
Dataset Open

Patrologia Graeca (OCRized and analyzed texts)

  • 1. UCLouvain
  • 2. Calfa / École nationale des chartes-PSL
  • 3. GREgORI / UCLouvain

Description

The CGPG project (Calfa GRE*g*ORI Patrologia Graeca), led by Jean-Marie Auwers (UCLouvain), aims to OCRize the volumes of the Patrologia Graeca that are not yet available in digital form. The project relies on the expertise of GREgORI and Calfa.

The project is sponsored by the ASBL *Byzantion*, the Fondation *Sedes Sapientiae*, the Institut *Religions, Spiritualités, Cultures, Sociétés* (RSCS, UCLouvain), the Centre d'études orientales (CIOL, UCLouvain), and a generous donor who wishes to remain anonymous. Other sponsors have recently expressed their willingness to support the project.

Webpage of the project

This repository contains the Sketch Engine XML files, with linguistic markup.

Raw data are available on GitHub: https://github.com/calfa-co/Patrologia-Graeca

For optimal use in Sketch Engine, configure the corpus (Manage Corpus / Configure / Expert settings) as follows:

DOCSTRUCTURE "doc"
ENCODING "UTF-8"
INFO ""
LANGUAGE "Ancient Greek"
NAME "CGPG_20250629"
PATH "/corpora/ca/user_data/sso_1392/manatee/cgpg_20250629"
VERTICAL "| ca_getvertical '/corpora/ca/user_data/sso_1392/registry/cgpg_20250629' 'docx'"
ATTRIBUTE "word" {
    MAPTO "lemma"
}
ATTRIBUTE "intuitive_form" {
}
ATTRIBUTE "lemma" {
}
ATTRIBUTE "intuitive_lemma" {
}
ATTRIBUTE "pos" {
}
ATTRIBUTE "headword" {
}
STRUCTURE "w" {
    DEFAULTLOCALE "C"
    ENCODING "UTF-8"
    LANGUAGE ""
    NESTED ""
    ATTRIBUTE "id" {
        DYNLIB ""
        DYNTYPE "index"
        ENCODING "UTF-8"
        LOCALE "C"
        MULTISEP ","
        MULTIVALUE "n"
        TYPE "MD_MI"
    }
}
STRUCTURE "doc" {
    DEFAULTLOCALE "C"
    ENCODING "UTF-8"
    LANGUAGE ""
    NESTED ""
    ATTRIBUTE "id" {
        DYNLIB ""
        DYNTYPE "index"
        ENCODING "UTF-8"
        LOCALE "C"
        MULTISEP ","
        MULTIVALUE "n"
        TYPE "MD_MI"
    }
}
STRUCTURE "docx" {
    DEFAULTLOCALE "C"
    ENCODING "UTF-8"
    LANGUAGE ""
    NESTED ""
    ATTRIBUTE "id" {
        DYNLIB ""
        DYNTYPE "index"
        ENCODING "UTF-8"
        LABEL "File ID"
        LOCALE "C"
        MULTISEP ","
        MULTIVALUE "n"
        TYPE "MD_MI"
        UNIQUE "1"
    }
    ATTRIBUTE "filename" {
        DYNLIB ""
        DYNTYPE "index"
        ENCODING "UTF-8"
        LABEL "File name"
        LOCALE "C"
        MULTISEP ","
        MULTIVALUE "n"
        TYPE "MD_MI"
    }
}
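
The configuration above declares six positional attributes per token (word, intuitive_form, lemma, intuitive_lemma, pos, headword) and three structures (w, doc, docx). As a rough illustration of how such data could be read outside Sketch Engine, the Python sketch below assumes the usual Sketch Engine vertical layout: one token per line, attributes separated by tabs in the declared order, and structures written as XML-like tags on their own lines. The exact layout of the files in PG.zip is not described in this record, so the layout, the function name `iter_tokens`, and every value in `SAMPLE` are assumptions made for illustration only.

```python
import re
from typing import Iterable, Iterator

# Positional attributes, in the order declared in the corpus configuration.
ATTRIBUTES = ["word", "intuitive_form", "lemma", "intuitive_lemma", "pos", "headword"]

# Opening (or self-closing) structure tag such as <doc id="...">.
# Closing tags like </doc> do not match and are simply skipped.
OPEN_TAG = re.compile(r'^<(\w+)((?:\s+\w+="[^"]*")*)\s*/?>$')
TAG_ATTR = re.compile(r'(\w+)="([^"]*)"')


def iter_tokens(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per token line, tagged with the id of the enclosing <doc>."""
    doc_id = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("<"):
            # Assumption: any line starting with "<" is a structure tag, not a token.
            match = OPEN_TAG.match(line.strip())
            if match and match.group(1) == "doc":
                doc_id = dict(TAG_ATTR.findall(match.group(2))).get("id")
            continue
        token = dict(zip(ATTRIBUTES, line.split("\t")))
        token["doc_id"] = doc_id
        yield token


# Hypothetical sample: the id, the forms and the "N" tag are invented placeholders,
# not taken from the dataset.
SAMPLE = (
    '<doc id="example">\n'
    "λόγος\tλογος\tλόγος\tλογος\tN\tλόγος\n"
    "</doc>\n"
)

tokens = list(iter_tokens(SAMPLE.splitlines()))
```

To run the same function over a real file, one would pass an open file handle (opened with `encoding="utf-8"`, matching the ENCODING setting) instead of `SAMPLE.splitlines()`, after checking that the file actually follows the assumed layout.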

Bibliography

  1. KINDT B., AUWERS J.-M., La Fondation Sedes Sapientiae soutient le projet de valorisation numérique de la Patrologie grecque, dans Bulletin de la Fondation Sedes Sapientiae, 45 (janvier 2024), p. 19-21 (https://cdn.uclouvain.be/groups/cms-editors-teco/angelique/fondation-sedes-sapientiae/UCL-TECO-Sedes Sapientiae-Bulletin 2024-WEB.pdf).
  2. KINDT B., VIDAL-GORÈNE C., DELLE DONNE S., Analyse automatique du grec ancien par réseau de neurones. Évaluation sur le corpus De Thessalonica Capta, dans BABELAO, 10-11 (2022), p. 525-550 (https://ojs.uclouvain.be/index.php/babelao/article/view/65073).
  3. KINDT B., VIDAL-GORÈNE C., From manuscript to tagged corpora. An automated process for Ancient Armenian or other under resourced languages of the Christian East, in Armeniaca. International Journal of Armenian Studies, 1 (2022), p. 73-96 (https://edizionicafoscari.unive.it/en/edizioni4/riviste/armeniaca/2022/1/from-manuscript-to-tagged-corpora/).
  4. VIDAL-GORÈNE C., CAFIERO F., KINDT B., Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac, 2025, published online on the HAL Science ouverte portal (https://hal.science/hal-05119485).
  5. VIDAL-GORÈNE C., La reconnaissance automatique d'écriture à l'épreuve des langues peu dotées, Programming Historian en français, 5 (2023) (https://doi.org/10.46430/phfr0023).
  6. VIDAL-GORÈNE C., Reconhecimento automático de manuscritos para o teste de idiomas não latinos, O Programming Historian em português, 5 (2024) (https://doi.org/10.46430/phpt0046).

Files

PG.zip (47.8 MB)
md5:7b8d398f7699859e50a9db75bd61de25
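
The MD5 checksum listed for PG.zip can be used to check that a download is complete and unaltered. The following Python sketch uses only the standard library; the function names `md5_of_file` and `verify_archive` are illustrative, and the local path passed to them is an assumption about where the archive was saved.

```python
import hashlib

# Checksum as listed in this record for PG.zip.
EXPECTED_MD5 = "7b8d398f7699859e50a9db75bd61de25"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Return True when the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive("PG.zip")` would return True only if a file of that name in the current directory matches the listed checksum.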
