GreLa
Description
This repository contains the code for creating, maintaining, and enriching the GreLa corpus. The corpus is primarily available via a public web API (see below), which we recommend as the main access point. For long-term archival and offline use, we also provide the underlying database file (>8 GB), split into smaller chunks for easier upload and download.
GreLa is a comprehensive corpus of Greek and Latin literature from the 8th c. BCE to the 17th c. CE.
It currently contains more than 11,000 works, 21,000,000 sentences, and 350,000,000 tokens.
GreLa is constructed as a merge of the following corpora:
- LAGT: Lemmatized Ancient Greek Texts, combining ancient Greek texts from the Perseus Digital Library, First 1,000 Years of Greek, GLAUx, and OGA (v5.2; DOI: 10.5281/zenodo.17865189).
- Corpus Corporum: a comprehensive corpus of Latin literature.
- NOSCEMUS: a curated database of Early Modern scientific literature (v1; DOI: 10.5281/zenodo.15040256).
- EMLAP: Early Modern Latin Alchemical Prints (v0.7; DOI: 10.5281/zenodo.17834734).
- latin-lemmatized-texts: used here as the source for the lemmatized Vulgate.
Corpus statistics
| grela_source | works_N | sentences_N | tokens_N |
|---|---|---|---|
| lagt | 2,160 | 2,095,265 | 38,223,149 |
| cc | 7,819 | 14,229,691 | 254,770,887 |
| noscemus | 975 | 4,637,231 | 54,542,448 |
| emlap | 100 | 444,211 | 6,477,016 |
| vulgate | 73 | 35,254 | 603,091 |
GreLa is implemented as a relational database with three main tables: works, sentences, and tokens.
The schema links tables through:
- grela_id: unique ID for each work (built as <subcorpus>_<work-id>, e.g., cc_12710)
- sentence_id: unique ID for each sentence (<grela_id>_<position>, e.g., cc_12710_0, cc_12710_1)
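The ID scheme above can be sketched in a few lines of Python. The helper names are illustrative and not part of the GreLa code base. The sketch assumes that the subcorpus prefix never contains an underscore and that the sentence position is always the last underscore-separated field, so it still works if a work ID itself contains underscores.

```python
def make_grela_id(subcorpus: str, work_id: str) -> str:
    """Build a work ID as <subcorpus>_<work-id>."""
    return f"{subcorpus}_{work_id}"


def make_sentence_id(grela_id: str, position: int) -> str:
    """Build a sentence ID as <grela_id>_<position>."""
    return f"{grela_id}_{position}"


def split_sentence_id(sentence_id: str) -> tuple[str, int]:
    """Recover (grela_id, position) by splitting on the LAST underscore."""
    grela_id, position = sentence_id.rsplit("_", 1)
    return grela_id, int(position)


def subcorpus_of(grela_id: str) -> str:
    """Recover the subcorpus prefix by splitting on the FIRST underscore."""
    return grela_id.split("_", 1)[0]


work = make_grela_id("cc", "12710")        # "cc_12710"
first_sentence = make_sentence_id(work, 0)  # "cc_12710_0"
```

Because sentence_id embeds grela_id, a sentence can be traced back to its work without a join, although joining on the grela_id column is the more robust route.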
Querying the corpus
The tokens table allows searching by lemma, POS, and positional information (char_start, char_end).
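As a minimal sketch of such a token search, the snippet below builds a toy copy of the sentences and tokens tables and looks up a lemma with a given POS. To stay runnable without the database file it uses Python's built-in sqlite3 as a stand-in for DuckDB; the SQL itself is plain and should carry over. The toy rows, the UPOS-style tag values, and the reading of char_start/char_end as 0-based offsets into sent_text with an exclusive end are all assumptions, not documented facts about GreLa.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the DuckDB file
con.executescript("""
CREATE TABLE sentences (sentence_id TEXT, grela_id TEXT, position INTEGER, sent_text TEXT);
CREATE TABLE tokens (sentence_id TEXT, grela_id TEXT, token_text TEXT, lemma TEXT,
                     pos TEXT, char_start INTEGER, char_end INTEGER, token_id INTEGER);
""")

# Toy data (hypothetical IDs and tags).
con.execute(
    "INSERT INTO sentences VALUES (?, ?, ?, ?)",
    ("toy_1_0", "toy_1", 0, "Gallia est omnis divisa in partes tres."),
)
con.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("toy_1_0", "toy_1", "Gallia", "Gallia", "PROPN", 0, 6, 1),
        ("toy_1_0", "toy_1", "est", "sum", "AUX", 7, 10, 2),
        ("toy_1_0", "toy_1", "divisa", "divido", "VERB", 17, 23, 3),
    ],
)

# Find every occurrence of a lemma with a given POS, together with its sentence.
rows = con.execute(
    """
    SELECT s.sent_text, t.char_start, t.char_end
    FROM tokens AS t
    JOIN sentences AS s USING (sentence_id)
    WHERE t.lemma = ? AND t.pos = ?
    ORDER BY t.token_id
    """,
    ("divido", "VERB"),
).fetchall()

# Use the character offsets to mark the hit inside the raw sentence text.
hits = [f"{text[:start]}[{text[start:end]}]{text[end:]}" for text, start, end in rows]
```

With the toy data, hits holds the single string "Gallia est omnis [divisa] in partes tres.". If the real offsets turn out to be relative to the whole work rather than the sentence, the slicing step would need to be adjusted.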
Where available, the ref JSON attribute encodes textual reference metadata (such as book/chapter/verse for biblical or otherwise structured texts). This varies significantly across subcorpora.
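Since the structure of ref differs between subcorpora, code that reads it should not rely on fixed keys. The sketch below shows one defensive way to turn a ref value into a readable label. The keys book, chapter, and verse in the sample value are purely hypothetical, and the sketch assumes the value arrives as a JSON string or as NULL.

```python
import json
from typing import Optional


def format_ref(ref_raw: Optional[str]) -> str:
    """Render a ref JSON string as 'key=value' pairs; empty string if missing."""
    if not ref_raw:
        return ""
    ref = json.loads(ref_raw)
    if not isinstance(ref, dict):
        # Fall back to the plain value if the JSON is not an object.
        return str(ref)
    return " ".join(f"{key}={value}" for key, value in ref.items())


# Hypothetical sample value; real keys depend on the subcorpus.
sample_ref = '{"book": 1, "chapter": 2, "verse": 3}'
label = format_ref(sample_ref)
```

For the sample value, label is "book=1 chapter=2 verse=3"; a missing ref yields an empty string instead of raising an error.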
The sentences table supports efficient search for multi-word string patterns in raw text.
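A phrase search over sent_text, joined to works for author and title, might look like the sketch below. It again uses sqlite3 as a runnable stand-in for DuckDB, with toy tables reduced to the columns needed; the rows, author names, and the find_phrase helper are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the DuckDB file
con.executescript("""
CREATE TABLE works (grela_id TEXT, author TEXT, title TEXT);
CREATE TABLE sentences (sentence_id TEXT, grela_id TEXT, position INTEGER, sent_text TEXT);
""")
con.executemany(
    "INSERT INTO works VALUES (?, ?, ?)",
    [("toy_1", "Author A", "Work A"), ("toy_2", "Author B", "Work B")],
)
con.executemany(
    "INSERT INTO sentences VALUES (?, ?, ?, ?)",
    [
        ("toy_1_0", "toy_1", 0, "In principio creavit Deus caelum et terram."),
        ("toy_1_1", "toy_1", 1, "Terra autem erat inanis et vacua."),
        ("toy_2_0", "toy_2", 0, "Dixitque Deus: fiat lux."),
    ],
)


def find_phrase(con: sqlite3.Connection, phrase: str) -> list:
    """Return (sentence_id, author, title) for sentences containing the phrase."""
    return con.execute(
        """
        SELECT s.sentence_id, w.author, w.title
        FROM sentences AS s
        JOIN works AS w USING (grela_id)
        WHERE s.sent_text LIKE ?
        ORDER BY s.grela_id, s.position
        """,
        (f"%{phrase}%",),  # note: % and _ inside the phrase act as wildcards
    ).fetchall()


matches = find_phrase(con, "caelum et terram")
```

One difference to keep in mind when moving the query to DuckDB: SQLite's LIKE ignores ASCII case by default, whereas DuckDB's LIKE is case-sensitive and offers ILIKE for case-insensitive matching.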
The works table contains rich metadata for each work. The fields not_before and not_after express a chronological interval. Ancient texts often require such interval dating, and GreLa supports temporal uncertainty using Monte Carlo modeling as described in this paper.
Following this method, each work is also assigned a date_random point estimate sampled from its interval.
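To illustrate the idea, the sketch below draws random dates from not_before/not_after intervals and repeats the draw many times to see how works distribute over centuries. It assumes a uniform draw over the inclusive interval and negative years for BCE; the referenced paper may use a different sampling scheme, so treat this as one simple reading rather than the method GreLa itself applies. The work IDs and intervals are toy values.

```python
import random
from collections import Counter


def sample_date(not_before: int, not_after: int, rng: random.Random) -> int:
    """Draw one year uniformly from [not_before, not_after], both inclusive."""
    if not_before > not_after:
        raise ValueError("not_before must not exceed not_after")
    return rng.randint(not_before, not_after)


# Toy works: (grela_id, not_before, not_after); negative years mean BCE (assumption).
toy_works = [("toy_1", -450, -400), ("toy_2", 90, 120)]

rng = random.Random(42)  # fixed seed so repeated runs give the same draws
n_runs = 1000
century_counts = Counter()
for _ in range(n_runs):
    for grela_id, not_before, not_after in toy_works:
        year = sample_date(not_before, not_after, rng)
        century_counts[year // 100 * 100] += 1  # floor to the start of the century

# Average number of works per simulation run in each century bin.
century_share = {c: n / n_runs for c, n in sorted(century_counts.items())}
```

A single date_random value corresponds to one such draw per work. Repeating the draw, as above, spreads each work across the bins its interval overlaps instead of forcing it into one.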
Additionally, the works table provides identifiers such as:
- author_viaf
- author_wd (Wikidata QID)
- author_gnd
as well as subcorpus-specific metadata stored uniformly in the subcorpus_specific_metadata JSON field.
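These identifiers can be turned into links to the respective authority files. The sketch below assumes that the columns hold bare identifiers rather than full URLs, which is not stated above; the URL patterns are the public entry points of VIAF, Wikidata, and the GND, and the sample values are made-up placeholders, not real records.

```python
from typing import Optional

# Public URL patterns of the three authority files.
URL_PATTERNS = {
    "author_viaf": "https://viaf.org/viaf/{}",
    "author_wd": "https://www.wikidata.org/wiki/{}",
    "author_gnd": "https://d-nb.info/gnd/{}",
}


def authority_links(row: dict) -> dict:
    """Map each non-empty identifier column in a works row to a URL."""
    links = {}
    for column, pattern in URL_PATTERNS.items():
        value: Optional[str] = row.get(column)
        if value:  # all columns are nullable, so skip missing values
            links[column] = pattern.format(value.strip())
    return links


# Placeholder identifiers for illustration only.
toy_row = {"author_viaf": "123456789", "author_wd": "Q123456", "author_gnd": None}
links = authority_links(toy_row)
```

For the toy row, links contains entries for author_viaf and author_wd only, because author_gnd is missing.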
GreLa uses DuckDB, an efficient column-oriented analytical database engine optimized for complex queries over large datasets.
Database Schema Documentation
Table: sentences
| Column Name | Data Type | Is Nullable | Default Value |
|---|---|---|---|
| sentence_id | VARCHAR | YES | N/A |
| grela_id | VARCHAR | YES | N/A |
| position | INTEGER | YES | N/A |
| sent_text | VARCHAR | YES | N/A |
Table: tokens
| Column Name | Data Type | Is Nullable | Default Value |
|---|---|---|---|
| sentence_id | VARCHAR | YES | N/A |
| grela_id | VARCHAR | YES | N/A |
| token_text | VARCHAR | YES | N/A |
| lemma | VARCHAR | YES | N/A |
| pos | VARCHAR | YES | N/A |
| ref | JSON | YES | N/A |
| char_start | INTEGER | YES | N/A |
| char_end | INTEGER | YES | N/A |
| token_id | BIGINT | YES | N/A |
Table: works
| Column Name | Data Type | Is Nullable | Default Value |
|---|---|---|---|
| grela_source | VARCHAR | YES | N/A |
| grela_id | VARCHAR | YES | N/A |
| author | VARCHAR | YES | N/A |
| title | VARCHAR | YES | N/A |
| not_before | INTEGER | YES | N/A |
| not_after | INTEGER | YES | N/A |
| date_random | INTEGER | YES | N/A |
| provenience | VARCHAR | YES | N/A |
| place_publication | VARCHAR | YES | N/A |
| place_geonames | VARCHAR | YES | N/A |
| author_viaf | VARCHAR | YES | N/A |
| author_wd | VARCHAR | YES | N/A |
| author_gnd | VARCHAR | YES | N/A |
| title_viaf | VARCHAR | YES | N/A |
| subcorpus_specific_metadata | JSON | YES | N/A |
Getting Started
GreLa is accessible via a public web API.
To get started, see the introductory Colab notebook:
License
The GreLa code, schema, and derived metadata are released under CC BY-SA 4.0 (see LICENSE.md).
The underlying texts and some annotations inherit the licences of the source corpora:
- LAGT (Perseus, First 1K Greek, GLAUx, OGA): CC BY-SA 4.0
- Corpus Corporum: mix of CC BY-SA 4.0 and public-domain texts
- NOSCEMUS: CC BY 4.0
- EMLAP: CC BY-SA 4.0
- latin-lemmatized-texts (Vulgate): public-domain text, CC BY-SA 4.0 annotations
When reusing GreLa data, please:
- Cite GreLa and the relevant source corpus (LAGT, Corpus Corporum, NOSCEMUS, EMLAP, latin-lemmatized-texts, …).
- Follow both the GreLa CC BY-SA 4.0 licence and the licence(s) of the original corpus for the texts you use.
Version History
- 0.6
  - input data in unified format
  - EMLAP extended to all 100 works
  - CC input derived from lemmatized XML with ref metadata
  - works table enriched with VIAF, Wikidata ID, GND
  - subcorpus-specific attributes unified into subcorpus_specific_metadata
- 0.5
  - various minor improvements
- 0.4
  - significantly improved Greek sentence and token segmentation
  - added ref attribute for Greek works
- 0.1
  - first version of GreLa
Roadmap
- ref attribute documentation
- add collaborators as coauthors based on agreement
- document licences for all source corpora
- more identifiers for works and authors (e.g., PHI IDs for Latin texts)
- provenance metadata for Latin texts
- standardized spatial metadata for works and authors
- ULTIMATE GOAL: bilingual (Greek and Latin), database-wide contextual embeddings at token and sentence level, based on a fine-tuned BERT model, enabling (1) diachronic word sense induction and disambiguation and (2) fast retrieval of similar passages, paraphrases, and allusions across the two languages
Files
CCS-ZCU/GreLa-v0.6.zip
Total size: 8.3 GB

| Checksum | Size |
|---|---|
| md5:0dbf8ff3831bf55e591966216999e317 | 16.9 MB |
| md5:228b77bf43949c0f0df79e091a582d79 | 104.9 MB |
| md5:41d06df806377a9aabb56672e1cd641a | 104.9 MB |
| md5:96798811e34de45cfd64fee8c5cb844c | 104.9 MB |
| md5:dc47aa968372c414b32c3bbe80af86a2 | 104.9 MB |
| md5:7fb622ef6b8eb5eb6afea07bf3f51c72 | 104.9 MB |
| md5:dd1d2742e50120ad1f20621fe46ddaaa | 104.9 MB |
| md5:2733fb9b156e0e4e0b5d1bf6e9cc10e7 | 104.9 MB |
| md5:4021b5dd863440fc04eec1808d0e7ce5 | 104.9 MB |
| md5:18028a71ba678da6bbd5b2fcdae29e44 | 104.9 MB |
| md5:7c2476ba78b3c2e878ed1558bffa84a2 | 104.9 MB |
| md5:138667db583b370f8481ab780379a44c | 104.9 MB |
| md5:111d1cbefc8ab7808eb20fa9b8140567 | 104.9 MB |
| md5:12ea1bf0d01cd7e82ba934e349fa132e | 104.9 MB |
| md5:c3cd8140497c2b25b837a3aa3238503a | 104.9 MB |
| md5:b616b9853162a3dd53b154ebce2a0bf1 | 104.9 MB |
| md5:c26a86461ce913b84fe52561f7060357 | 104.9 MB |
| md5:2e9164eb0b6cba5a194994e0d9727229 | 104.9 MB |
| md5:f07d8db3f69b90177d9189a3909a201f | 104.9 MB |
| md5:d76b99795ff5c76a5a34673319c7a00b | 104.9 MB |
| md5:3c74d812125f97668417425ac5bd88c2 | 104.9 MB |
| md5:3081ce8a3c8d523cb6756e6e7868ccf3 | 104.9 MB |
| md5:bc18c17f0870def05a2903130a919a6b | 104.9 MB |
| md5:a83bcc501cce821667ce5edcc6b81b82 | 104.9 MB |
| md5:33bd8d12b77aa3bf0407c410f6ee8320 | 104.9 MB |
| md5:441a47386b87ce650c6f605bcb09c885 | 104.9 MB |
| md5:5d0cb60b56156d7ad058fd9f7f9e87c0 | 104.9 MB |
| md5:60f49c33503695f6c857110d977cefde | 104.9 MB |
| md5:2e0e5c391d4d538e3fd35b3001aaf086 | 104.9 MB |
| md5:969878be62febb09e9f15aff3e9cd00c | 104.9 MB |
| md5:23b61610b46dbb329c03028c169e4229 | 104.9 MB |
| md5:b0c831b96903bd9a4ecddc92dba6e1cb | 104.9 MB |
| md5:d62d2aa59c3901c94d461dd89ed2ef51 | 104.9 MB |
| md5:d119a206339c828becf4edeccd3eccb9 | 104.9 MB |
| md5:a313d0925b9c4286645110907b06a6fd | 104.9 MB |
| md5:0e5cc0f32cf1f3538ee556f76f011a22 | 104.9 MB |
| md5:9993ab619112d40b421df4bc58be40a9 | 104.9 MB |
| md5:ab853c180dc11dff1e618702c4c68da2 | 104.9 MB |
| md5:cd879bc20aaa1726a8fcf8780fc04b88 | 104.9 MB |
| md5:93b78ba55d22ba83be20cbd5f8f75a83 | 104.9 MB |
| md5:bc1703080c7d54f3a25137bab21823c8 | 104.9 MB |
| md5:4598b368dda8f1152d8f6c796f1df376 | 104.9 MB |
| md5:bc59b9b23abc75e5fb147710b0f88ca0 | 104.9 MB |
| md5:12f5a99ed75c61cf70e5d0626b206b31 | 104.9 MB |
| md5:68e552748797d12453b30dff3fad1d4a | 104.9 MB |
| md5:07cad2b1e4d915cfbecd49f6c8691322 | 104.9 MB |
| md5:5dde235725bab2eebd13f54f66b28706 | 104.9 MB |
| md5:3d86df258d15231c2f55e54fb2ce6605 | 104.9 MB |
| md5:6f7cb7bebe33f84f02cddac6be034ae4 | 104.9 MB |
| md5:038b90ca5e986a8d5a377f307f1d2c26 | 104.9 MB |
| md5:67667fee9d958ee9cd031d7acc8bb0e7 | 104.9 MB |
| md5:8fd97e0a6deac493ffb282742e4b391f | 104.9 MB |
| md5:e6afc834f1c89284ca25d265f7dab1aa | 104.9 MB |
| md5:ae6e912d7aa48cef44c173a0b04d7fd0 | 104.9 MB |
| md5:958ba8c0c67e584c0ad38d09d8edbf57 | 104.9 MB |
| md5:08b81fedeccff710bb487980d8a54355 | 104.9 MB |
| md5:d877f6647133467aff25959e733594c8 | 104.9 MB |
| md5:25817fb7be62a17c3c17157fb8b0270c | 104.9 MB |
| md5:ea6e8464ed12bee29295c737ae7bd790 | 104.9 MB |
| md5:849d2b9d8a4a184c16b632730488ddfa | 104.9 MB |
| md5:d91ded04da0b7bb861a27fd8eb23802d | 104.9 MB |
| md5:641a96bf43985ca90dfb7e72486ff400 | 104.9 MB |
| md5:a3835282e6d22cf9eaeea3a62b5a7f9f | 104.9 MB |
| md5:e9ca24786cb04893893e14c0cfbbeaa1 | 104.9 MB |
| md5:c15987f656f86a8e5b64400ade273a0c | 104.9 MB |
| md5:7e477e2a1c0963c8ffeef64101cd5376 | 104.9 MB |
| md5:9e1c47b1260f21de1cfe4b34bc19b552 | 104.9 MB |
| md5:7211c467d6581ed816bea0bf5eeff459 | 104.9 MB |
| md5:315986979429fb53f2dcf915a160a5ef | 104.9 MB |
| md5:8bc46b3b92e85b4a8518ae9db09ab512 | 104.9 MB |
| md5:b4fcaeacde8ef5c8267753331b9f8b78 | 104.9 MB |
| md5:922236a808f82100395f0d9118d89847 | 104.9 MB |
| md5:bb14a6adcbef18e7af4a287fc44f016e | 104.9 MB |
| md5:270590c8146c3d6118bae5a7c07f5a24 | 104.9 MB |
| md5:b54ea456b29ea353edc3ce8cf0e6a6ca | 104.9 MB |
| md5:9b2751fa280a43eca0fe0f77bde36fb7 | 104.9 MB |
| md5:403bc53fedb9efcfbfe9bd1dce46438b | 104.9 MB |
| md5:db651fdb249b1d5773c88dbc2256a3fd | 104.9 MB |
| md5:675623139f79cd9bd27848ce45fb1d54 | 104.9 MB |
| md5:0d4210f51ce6df3b3b28230dd5e6cbe3 | 104.9 MB |
| md5:13489494e122a18c1d98fe35cc94648e | 39.9 MB |
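The MD5 checksums listed above can be used to check each download before the chunks are put back together. The sketch below is one way to do this with the Python standard library. It assumes that the database was split byte-wise, so that concatenating the chunks in order restores the original file; if the chunks are parts of a multi-volume archive, the archive tool should be used instead. The file names in the demo are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def md5_of(path: Path, block_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_listing(path: Path, listed: str) -> bool:
    """Compare a file against a listed checksum such as 'md5:0dbf...'."""
    return md5_of(path) == listed.removeprefix("md5:").lower()


def join_chunks(chunk_paths: list, target: Path) -> None:
    """Concatenate chunks in the given order (assumes a plain byte-wise split)."""
    with open(target, "wb") as out:
        for chunk in chunk_paths:
            with open(chunk, "rb") as fh:
                shutil.copyfileobj(fh, out)


# Self-contained demo on two tiny toy chunks instead of the real download.
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for i, payload in enumerate([b"Gre", b"La"]):
        part = Path(tmp) / f"toy.part{i:02d}"  # hypothetical naming scheme
        part.write_bytes(payload)
        parts.append(part)
    joined = Path(tmp) / "toy.bin"
    join_chunks(sorted(parts), joined)
    demo_ok = matches_listing(joined, "md5:" + hashlib.md5(b"GreLa").hexdigest())
```

Checking every chunk before joining makes it easy to re-download only the files that were corrupted in transfer.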
Additional details
Related works
- Is supplement to: Software, https://github.com/CCS-ZCU/GreLa/tree/v0.6
Software
- Repository URL: https://github.com/CCS-ZCU/GreLa