There is a newer version of the record available.

Published December 9, 2025 | Version 0.6
Dataset Open

GreLa

Authors/Creators

Description

This repository contains the code for creating, maintaining, and enriching the GreLa corpus. The corpus is primarily available via a public web API (see below), which we recommend as the main access point. For long-term archival and offline use, we also provide the underlying database file (>8 GB), split into smaller chunks for easier upload and download. 
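For offline use, the chunks have to be joined back into one database file before it can be opened. The sketch below assumes the chunks are plain byte-wise splits that can be concatenated in order; that is an assumption, not something stated in this record, and the chunk file names in the usage comment are hypothetical. The MD5 helper can be used to compare each downloaded file with the checksum listed for it under Files.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def join_chunks(chunks: list[Path], target: Path) -> None:
    """Concatenate chunk files, in the order given, into one target file.

    Assumes the chunks are simple byte-wise splits of the original file.
    """
    with target.open("wb") as out:
        for chunk in chunks:
            with chunk.open("rb") as part:
                while block := part.read(1 << 20):
                    out.write(block)


# Hypothetical usage (file names are placeholders, not the real chunk names):
# chunks = sorted(Path("downloads").glob("grela_chunk_*"))
# join_chunks(chunks, Path("grela.duckdb"))
```

Sorting the chunk paths before joining matters, because the result is only valid if the parts are written in their original order.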

GreLa is a comprehensive corpus of Greek and Latin literature from the 8th c. BCE to the 17th c. CE.
It currently contains more than 11,000 works, 21,000,000 sentences, and 350,000,000 tokens.

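GreLa is constructed as a merge of the following corpora (named as in the License section, with their grela_source codes in parentheses):

  • LAGT (lagt)
  • Corpus Corporum (cc)
  • NOSCEMUS (noscemus)
  • EMLAP (emlap)
  • latin-lemmatized-texts (vulgate)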

Corpus statistics

grela_source    works_N    sentences_N    tokens_N
lagt            2,160      2,095,265      38,223,149
cc              7,819      14,229,691     254,770,887
noscemus        975        4,637,231      54,542,448
emlap           100        444,211        6,477,016
vulgate         73         35,254         603,091
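As a quick consistency check, the per-subcorpus rows of the statistics table can be summed and compared with the rounded totals quoted in the description. The numbers below are copied from the table; nothing else is assumed.

```python
# Per-subcorpus counts copied from the statistics table:
# (works_N, sentences_N, tokens_N)
stats = {
    "lagt": (2_160, 2_095_265, 38_223_149),
    "cc": (7_819, 14_229_691, 254_770_887),
    "noscemus": (975, 4_637_231, 54_542_448),
    "emlap": (100, 444_211, 6_477_016),
    "vulgate": (73, 35_254, 603_091),
}

# Sum each column across the five subcorpora.
total_works, total_sentences, total_tokens = (
    sum(column) for column in zip(*stats.values())
)

print(f"works:     {total_works:,}")      # 11,127
print(f"sentences: {total_sentences:,}")  # 21,441,652
print(f"tokens:    {total_tokens:,}")     # 354,616,591
```

The sums agree with the stated "more than 11,000 works, 21,000,000 sentences, and 350,000,000 tokens".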

GreLa is implemented as a relational database with three main tables: works, sentences, and tokens.
The schema links tables through:

  • grela_id — unique ID for each work (built as <subcorpus>_<work-id>, e.g., cc_12710)
  • sentence_id — unique ID for each sentence (<grela_id>_<position>, e.g., cc_12710_0, cc_12710_1)
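The ID conventions above can be handled with a few string helpers. This is a minimal sketch: the function names are made up for illustration, and it assumes that the subcorpus code itself contains no underscore.

```python
def make_sentence_id(grela_id: str, position: int) -> str:
    """Build a sentence_id from a work ID and the sentence position."""
    return f"{grela_id}_{position}"


def split_sentence_id(sentence_id: str) -> tuple[str, int]:
    """Split a sentence_id into (grela_id, position) at the last underscore."""
    grela_id, position = sentence_id.rsplit("_", 1)
    return grela_id, int(position)


def split_grela_id(grela_id: str) -> tuple[str, str]:
    """Split a grela_id into (subcorpus, work-id) at the first underscore.

    Assumes the subcorpus code has no underscore in it.
    """
    subcorpus, work_id = grela_id.split("_", 1)
    return subcorpus, work_id


print(make_sentence_id("cc_12710", 1))   # cc_12710_1
print(split_sentence_id("cc_12710_1"))   # ('cc_12710', 1)
print(split_grela_id("cc_12710"))        # ('cc', '12710')
```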

Querying the corpus

The tokens table allows searching by lemma, POS, and positional information (char_start, char_end).
Where available, the ref JSON attribute encodes textual reference metadata (such as book/chapter/verse for biblical or otherwise structured texts). This varies significantly across subcorpora.
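A token-level query might look like the sketch below. To keep it runnable without the multi-gigabyte database file, it uses Python's built-in sqlite3 as a stand-in for DuckDB and a single toy sentence; the same SELECT should carry over to DuckDB. Several things here are assumptions for illustration only: the demo_ IDs, the UD-style POS tags, the ref keys (book, chapter, verse), and the reading of char_start and char_end as end-exclusive offsets into the sentence text.

```python
import json
import sqlite3

# Stand-in database with the tokens columns from the schema documentation.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE tokens (
           sentence_id TEXT, grela_id TEXT, token_text TEXT, lemma TEXT,
           pos TEXT, ref TEXT, char_start INTEGER, char_end INTEGER,
           token_id INTEGER)"""
)

# Toy data: one sentence with hand-made annotations (not real GreLa rows).
sent_text = "In principio creavit Deus caelum et terram"
annotations = [
    ("In", "in", "ADP"), ("principio", "principium", "NOUN"),
    ("creavit", "creo", "VERB"), ("Deus", "deus", "NOUN"),
    ("caelum", "caelum", "NOUN"), ("et", "et", "CCONJ"),
    ("terram", "terra", "NOUN"),
]
ref = json.dumps({"book": "Gn", "chapter": 1, "verse": 1})  # hypothetical keys

cursor = 0
for token_id, (form, lemma, pos) in enumerate(annotations):
    start = sent_text.index(form, cursor)  # offset of this token in the text
    end = start + len(form)                # end-exclusive offset (assumed)
    cursor = end
    conn.execute(
        "INSERT INTO tokens VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        ("demo_1_0", "demo_1", form, lemma, pos, ref, start, end, token_id),
    )

# Search by lemma and POS, then use the offsets and the ref attribute.
rows = conn.execute(
    "SELECT token_text, char_start, char_end, ref FROM tokens "
    "WHERE lemma = ? AND pos = ? ORDER BY token_id",
    ("creo", "VERB"),
).fetchall()

hits = [
    (form, sent_text[start:end], json.loads(ref_json)["verse"])
    for form, start, end, ref_json in rows
]
print(hits)  # [('creavit', 'creavit', 1)]
```

Because ref is missing for many works and its structure differs between subcorpora, real code should check for NULL values and for absent keys before indexing into it.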

The sentences table supports efficient search for multi-word string patterns in raw text.
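A multi-word search over raw sentence text can be sketched with a plain LIKE pattern. As a stand-in for DuckDB, the example uses Python's built-in sqlite3 with two toy rows; the demo_ IDs and the helper name find_phrase are hypothetical. One difference to keep in mind: LIKE is case-insensitive for ASCII letters in SQLite but case-sensitive in DuckDB, which offers ILIKE for case-insensitive matching.

```python
import sqlite3

# Stand-in database with the sentences columns from the schema documentation.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sentences ("
    "sentence_id TEXT, grela_id TEXT, position INTEGER, sent_text TEXT)"
)
conn.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?, ?)",
    [  # toy rows, not real GreLa records
        ("demo_1_0", "demo_1", 0, "In principio creavit Deus caelum et terram."),
        ("demo_1_1", "demo_1", 1, "Dixitque Deus: Fiat lux."),
    ],
)


def find_phrase(connection: sqlite3.Connection, phrase: str) -> list[tuple]:
    """Return (sentence_id, sent_text) for sentences containing the phrase."""
    query = (
        "SELECT sentence_id, sent_text FROM sentences "
        "WHERE sent_text LIKE '%' || ? || '%' "
        "ORDER BY grela_id, position"
    )
    return connection.execute(query, (phrase,)).fetchall()


matches = find_phrase(conn, "creavit Deus")
print(matches)
# [('demo_1_0', 'In principio creavit Deus caelum et terram.')]
```

Note that % and _ inside the search phrase act as wildcards in a LIKE pattern, so phrases containing those characters would need escaping.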

The works table contains rich metadata for each work. The fields not_before and not_after express a chronological interval. Ancient texts often require such interval dating, and GreLa supports temporal uncertainty using Monte Carlo modeling as described in this paper.
Following this method, each work is also assigned a date_random point estimate sampled from its interval.
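The idea behind interval dating can be illustrated with a small simulation. The sketch below assumes that each date is drawn uniformly from the closed interval [not_before, not_after] and that BCE years are stored as negative integers; both are assumptions, and the referenced paper may define the sampling differently. The intervals are invented for the example.

```python
import random
from collections import Counter


def sample_date(not_before: int, not_after: int, rng: random.Random) -> int:
    """Draw one year uniformly from the closed interval (assumed distribution)."""
    return rng.randint(not_before, not_after)


def century_of(year: int) -> int:
    """Map a year to the first year of its 100-year bin (works for negatives)."""
    return (year // 100) * 100


def simulate_century_counts(
    intervals: list[tuple[int, int]], n_runs: int, seed: int = 0
) -> dict[int, float]:
    """Average number of works per century over n_runs random datings."""
    rng = random.Random(seed)
    totals: Counter = Counter()
    for _ in range(n_runs):
        for not_before, not_after in intervals:
            totals[century_of(sample_date(not_before, not_after, rng))] += 1
    return {century: count / n_runs for century, count in sorted(totals.items())}


# Invented (not_before, not_after) intervals for three works.
intervals = [(-50, 50), (100, 199), (150, 250)]
expected = simulate_century_counts(intervals, n_runs=1000)
print(expected)
```

A single date_random value corresponds to one such draw per work; repeating the draw many times, as here, shows how much a per-century count depends on the dating uncertainty.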

Additionally, the works table provides identifiers such as:

  • author_viaf
  • author_wd (Wikidata QID)
  • author_gnd

as well as subcorpus-specific metadata stored uniformly in the subcorpus_specific_metadata JSON field.
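These identifiers can be turned into links to the corresponding authority records. The sketch below assumes the columns hold bare identifiers (for example a Wikidata QID such as Q1398) rather than full URLs, and that missing values come back as None; the sample row, the function name authority_links, and the "genre" key in the JSON field are made up for illustration.

```python
import json

# URL patterns of the three authority services.
URL_TEMPLATES = {
    "author_wd": "https://www.wikidata.org/wiki/{}",
    "author_viaf": "https://viaf.org/viaf/{}",
    "author_gnd": "https://d-nb.info/gnd/{}",
}


def authority_links(work: dict) -> dict[str, str]:
    """Build a link for every identifier column that is present and non-empty."""
    return {
        field: template.format(work[field])
        for field, template in URL_TEMPLATES.items()
        if work.get(field)
    }


# Hypothetical works row: only the Wikidata ID is filled in.
work = {
    "grela_id": "demo_1",
    "author": "Vergilius",
    "author_wd": "Q1398",
    "author_viaf": None,
    "author_gnd": None,
    "subcorpus_specific_metadata": '{"genre": "epic"}',  # hypothetical key
}

links = authority_links(work)
extra = json.loads(work["subcorpus_specific_metadata"] or "{}")

print(links)              # {'author_wd': 'https://www.wikidata.org/wiki/Q1398'}
print(extra.get("genre"))  # epic
```

Using .get() on the decoded JSON avoids errors for keys that exist in one subcorpus but not in another.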

GreLa uses DuckDB, an efficient column-oriented analytical database engine optimized for complex queries over large datasets.

Database Schema Documentation

Table: sentences

Column Name    Data Type    Is Nullable    Default Value
sentence_id    VARCHAR      YES            N/A
grela_id       VARCHAR      YES            N/A
position       INTEGER      YES            N/A
sent_text      VARCHAR      YES            N/A

Table: tokens

Column Name    Data Type    Is Nullable    Default Value
sentence_id    VARCHAR      YES            N/A
grela_id       VARCHAR      YES            N/A
token_text     VARCHAR      YES            N/A
lemma          VARCHAR      YES            N/A
pos            VARCHAR      YES            N/A
ref            JSON         YES            N/A
char_start     INTEGER      YES            N/A
char_end       INTEGER      YES            N/A
token_id       BIGINT       YES            N/A

Table: works

Column Name                    Data Type    Is Nullable    Default Value
grela_source                   VARCHAR      YES            N/A
grela_id                       VARCHAR      YES            N/A
author                         VARCHAR      YES            N/A
title                          VARCHAR      YES            N/A
not_before                     INTEGER      YES            N/A
not_after                      INTEGER      YES            N/A
date_random                    INTEGER      YES            N/A
provenience                    VARCHAR      YES            N/A
place_publication              VARCHAR      YES            N/A
place_geonames                 VARCHAR      YES            N/A
author_viaf                    VARCHAR      YES            N/A
author_wd                      VARCHAR      YES            N/A
author_gnd                     VARCHAR      YES            N/A
title_viaf                     VARCHAR      YES            N/A
subcorpus_specific_metadata    JSON         YES            N/A
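The three tables are meant to be combined through grela_id and sentence_id. The sketch below shows one such join: sentences containing a given lemma, restricted to works whose dating interval overlaps a period of interest. It uses Python's built-in sqlite3 as a stand-in for DuckDB, only a subset of the documented columns, and invented toy rows; the helper name lemma_in_period and the demo_ IDs are hypothetical, and BCE years are assumed to be negative integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE works (grela_id TEXT, title TEXT,
                        not_before INTEGER, not_after INTEGER);
    CREATE TABLE sentences (sentence_id TEXT, grela_id TEXT,
                            position INTEGER, sent_text TEXT);
    CREATE TABLE tokens (sentence_id TEXT, grela_id TEXT, lemma TEXT);
    """
)
# Toy rows, not real GreLa records.
conn.executemany("INSERT INTO works VALUES (?, ?, ?, ?)", [
    ("demo_1", "Toy work A", -50, -19),
    ("demo_2", "Toy work B", 1500, 1550),
])
conn.executemany("INSERT INTO sentences VALUES (?, ?, ?, ?)", [
    ("demo_1_0", "demo_1", 0, "arma virumque cano"),
    ("demo_2_0", "demo_2", 0, "arma et leges"),
])
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?)", [
    ("demo_1_0", "demo_1", "arma"), ("demo_1_0", "demo_1", "vir"),
    ("demo_1_0", "demo_1", "cano"), ("demo_2_0", "demo_2", "arma"),
    ("demo_2_0", "demo_2", "et"), ("demo_2_0", "demo_2", "lex"),
])

QUERY = """
SELECT w.title, s.sent_text, COUNT(*) AS hits
FROM tokens AS t
JOIN sentences AS s ON s.sentence_id = t.sentence_id
JOIN works AS w ON w.grela_id = t.grela_id
WHERE t.lemma = ?
  AND w.not_before <= ?   -- the work starts before the period ends
  AND w.not_after >= ?    -- the work ends after the period starts
GROUP BY w.title, s.sent_text
ORDER BY w.title
"""


def lemma_in_period(connection, lemma, period_start, period_end):
    """Sentences with the lemma in works whose interval overlaps the period."""
    return connection.execute(QUERY, (lemma, period_end, period_start)).fetchall()


ancient = lemma_in_period(conn, "arma", -100, 0)
print(ancient)  # [('Toy work A', 'arma virumque cano', 1)]
```

Filtering on interval overlap rather than on date_random keeps every work that could belong to the period; filtering on date_random instead gives one random realisation of the dating.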

Getting Started

GreLa is accessible via a public web API.
To get started, see the introductory Colab notebook:

👉 GreLa API – getting started

License

The GreLa code, schema, and derived metadata are released under
CC BY-SA 4.0 (see LICENSE.md).

The underlying texts and some annotations inherit the licences of the source corpora:

  • LAGT (Perseus, First 1K Greek, GLAUx, OGA): CC BY-SA 4.0
  • Corpus Corporum: mix of CC BY-SA 4.0 and public-domain texts
  • NOSCEMUS: CC BY 4.0
  • EMLAP: CC BY-SA 4.0
  • latin-lemmatized-texts (Vulgate): public-domain text, CC BY-SA 4.0 annotations

When reusing GreLa data, please:

  1. Cite GreLa and the relevant source corpus (LAGT, Corpus Corporum, NOSCEMUS, EMLAP, latin-lemmatized-texts, …).
  2. Follow both the GreLa CC BY-SA 4.0 licence and the licence(s) of the original corpus for the texts you use.

Version History

  • 0.6

    • input data in unified format
    • EMLAP extended to all 100 works
    • CC input derived from lemmatized XML with ref metadata
    • works table enriched with VIAF, Wikidata ID, GND
    • subcorpus-specific attributes unified into subcorpus_specific_metadata
  • 0.5

    • various minor improvements
  • 0.4

    • significantly improved Greek sentence and token segmentation
    • added ref attribute for Greek works
  • 0.1

    • first version of GreLa

Roadmap

  • ref attribute documentation
  • add collaborators as coauthors based on agreement
  • document licences for all source corpora
  • more identifiers for works and authors (e.g., PHI IDs for Latin texts)
  • provenance metadata for Latin texts
  • standardized spatial metadata for works and authors
  • ULTIMATE GOAL: bilingual (Greek and Latin), database-wide contextual embeddings at token and sentence level, based on a fine-tuned BERT model and allowing (1) diachronic word sense induction and disambiguation and (2) fast retrieval of similar passages, paraphrases, and allusions across the two languages

Files

CCS-ZCU/GreLa-v0.6.zip

Files (8.3 GB)

MD5 checksum                            Size
md5:0dbf8ff3831bf55e591966216999e317    16.9 MB
md5:228b77bf43949c0f0df79e091a582d79    104.9 MB
md5:41d06df806377a9aabb56672e1cd641a    104.9 MB
md5:96798811e34de45cfd64fee8c5cb844c    104.9 MB
md5:dc47aa968372c414b32c3bbe80af86a2    104.9 MB
md5:7fb622ef6b8eb5eb6afea07bf3f51c72    104.9 MB
md5:dd1d2742e50120ad1f20621fe46ddaaa    104.9 MB
md5:2733fb9b156e0e4e0b5d1bf6e9cc10e7    104.9 MB
md5:4021b5dd863440fc04eec1808d0e7ce5    104.9 MB
md5:18028a71ba678da6bbd5b2fcdae29e44    104.9 MB
md5:7c2476ba78b3c2e878ed1558bffa84a2    104.9 MB
md5:138667db583b370f8481ab780379a44c    104.9 MB
md5:111d1cbefc8ab7808eb20fa9b8140567    104.9 MB
md5:12ea1bf0d01cd7e82ba934e349fa132e    104.9 MB
md5:c3cd8140497c2b25b837a3aa3238503a    104.9 MB
md5:b616b9853162a3dd53b154ebce2a0bf1    104.9 MB
md5:c26a86461ce913b84fe52561f7060357    104.9 MB
md5:2e9164eb0b6cba5a194994e0d9727229    104.9 MB
md5:f07d8db3f69b90177d9189a3909a201f    104.9 MB
md5:d76b99795ff5c76a5a34673319c7a00b    104.9 MB
md5:3c74d812125f97668417425ac5bd88c2    104.9 MB
md5:3081ce8a3c8d523cb6756e6e7868ccf3    104.9 MB
md5:bc18c17f0870def05a2903130a919a6b    104.9 MB
md5:a83bcc501cce821667ce5edcc6b81b82    104.9 MB
md5:33bd8d12b77aa3bf0407c410f6ee8320    104.9 MB
md5:441a47386b87ce650c6f605bcb09c885    104.9 MB
md5:5d0cb60b56156d7ad058fd9f7f9e87c0    104.9 MB
md5:60f49c33503695f6c857110d977cefde    104.9 MB
md5:2e0e5c391d4d538e3fd35b3001aaf086    104.9 MB
md5:969878be62febb09e9f15aff3e9cd00c    104.9 MB
md5:23b61610b46dbb329c03028c169e4229    104.9 MB
md5:b0c831b96903bd9a4ecddc92dba6e1cb    104.9 MB
md5:d62d2aa59c3901c94d461dd89ed2ef51    104.9 MB
md5:d119a206339c828becf4edeccd3eccb9    104.9 MB
md5:a313d0925b9c4286645110907b06a6fd    104.9 MB
md5:0e5cc0f32cf1f3538ee556f76f011a22    104.9 MB
md5:9993ab619112d40b421df4bc58be40a9    104.9 MB
md5:ab853c180dc11dff1e618702c4c68da2    104.9 MB
md5:cd879bc20aaa1726a8fcf8780fc04b88    104.9 MB
md5:93b78ba55d22ba83be20cbd5f8f75a83    104.9 MB
md5:bc1703080c7d54f3a25137bab21823c8    104.9 MB
md5:4598b368dda8f1152d8f6c796f1df376    104.9 MB
md5:bc59b9b23abc75e5fb147710b0f88ca0    104.9 MB
md5:12f5a99ed75c61cf70e5d0626b206b31    104.9 MB
md5:68e552748797d12453b30dff3fad1d4a    104.9 MB
md5:07cad2b1e4d915cfbecd49f6c8691322    104.9 MB
md5:5dde235725bab2eebd13f54f66b28706    104.9 MB
md5:3d86df258d15231c2f55e54fb2ce6605    104.9 MB
md5:6f7cb7bebe33f84f02cddac6be034ae4    104.9 MB
md5:038b90ca5e986a8d5a377f307f1d2c26    104.9 MB
md5:67667fee9d958ee9cd031d7acc8bb0e7    104.9 MB
md5:8fd97e0a6deac493ffb282742e4b391f    104.9 MB
md5:e6afc834f1c89284ca25d265f7dab1aa    104.9 MB
md5:ae6e912d7aa48cef44c173a0b04d7fd0    104.9 MB
md5:958ba8c0c67e584c0ad38d09d8edbf57    104.9 MB
md5:08b81fedeccff710bb487980d8a54355    104.9 MB
md5:d877f6647133467aff25959e733594c8    104.9 MB
md5:25817fb7be62a17c3c17157fb8b0270c    104.9 MB
md5:ea6e8464ed12bee29295c737ae7bd790    104.9 MB
md5:849d2b9d8a4a184c16b632730488ddfa    104.9 MB
md5:d91ded04da0b7bb861a27fd8eb23802d    104.9 MB
md5:641a96bf43985ca90dfb7e72486ff400    104.9 MB
md5:a3835282e6d22cf9eaeea3a62b5a7f9f    104.9 MB
md5:e9ca24786cb04893893e14c0cfbbeaa1    104.9 MB
md5:c15987f656f86a8e5b64400ade273a0c    104.9 MB
md5:7e477e2a1c0963c8ffeef64101cd5376    104.9 MB
md5:9e1c47b1260f21de1cfe4b34bc19b552    104.9 MB
md5:7211c467d6581ed816bea0bf5eeff459    104.9 MB
md5:315986979429fb53f2dcf915a160a5ef    104.9 MB
md5:8bc46b3b92e85b4a8518ae9db09ab512    104.9 MB
md5:b4fcaeacde8ef5c8267753331b9f8b78    104.9 MB
md5:922236a808f82100395f0d9118d89847    104.9 MB
md5:bb14a6adcbef18e7af4a287fc44f016e    104.9 MB
md5:270590c8146c3d6118bae5a7c07f5a24    104.9 MB
md5:b54ea456b29ea353edc3ce8cf0e6a6ca    104.9 MB
md5:9b2751fa280a43eca0fe0f77bde36fb7    104.9 MB
md5:403bc53fedb9efcfbfe9bd1dce46438b    104.9 MB
md5:db651fdb249b1d5773c88dbc2256a3fd    104.9 MB
md5:675623139f79cd9bd27848ce45fb1d54    104.9 MB
md5:0d4210f51ce6df3b3b28230dd5e6cbe3    104.9 MB
md5:13489494e122a18c1d98fe35cc94648e    39.9 MB

Additional details

Related works

Is supplement to
Software: https://github.com/CCS-ZCU/GreLa/tree/v0.6 (URL)