Published October 4, 2024 | Version v4.1
Dataset Open

LAGT

Description

LAGT is s a dataset of lemmatized ancient Greek texts, combining works from the Perseus Digital Library, the First 1000 Years of Greek project, the GLAUx corpus, and a subset of additional early Christian texts added gradually. The scripts used to produce this dataset are available from Github.

In version v4.1,  LAGT includes 1,958 works from more than 475 authors, covering 35,809,325 tokens of raw text. It includes only works from the period from the 8th c. BCE to the 6th c. CE. Since version 4.0, LAGT dataset consists of two parts:

  • Main tabular dataset, containing all metadata and also lemmatized filtered sentences, offered here as a parquet file, to be loaded into python directly as a pandas dataframe object.
  • Morphological data for each document within the corpus with one JSON file per document. Each file is represented as a list of sentences, and each sentence is accompanied by a simplified morphological annotation, containing token, lemma, simplified postag and a positional index of the token. The directory with these files has to be downloaded and unzipped, e.g. in "data/large_files/ subdirectory of a repository or so.

The tabular dataset might be loaded directly into a Python environment as a dataframe using the Pandas library. You can load the dataset directly into your Python environment using the following piece of code:

import pandas as pd
LAGT = pd.read_parquet("https://zenodo.org/records/13889714/files/LAGT_v4-1.parquet?download=1")

Individual works are represented by rows and columns represent attributes, such as the author ID (“doc_id”, e.g. “tlg0086”) and document ID (“doc_id”, e.g. “tlg010”) inherited from the source corpora, the date of creation expressed by means of an interval (“not_before” and “not_after”), manually annotated religious provenience as either pagan, Jewish or Christian (“provenience” attribute) etc., which allow various forms of sorting and filtering. The dating information is coded by means of the “not_before” and “not_after” attributes on the level of authors and with the precision of centuries.

Concerning lemmatization, the dataset contains lemmatized sentences in the "lemmatized_sentences" attribute in form of a list-of-lists, with sublist elements representing individual lemmata. It contains only nouns, proper names, verbs and adjectives.
Wherever available, the lemmata are based on avaialable Treebank data, such as the GLAUx corpus (see below).
Where not, the GreCy model for spaCy is employed for automatic annotation.

The source of the lemmata for individual documents is documented in the "lemmata_source" attribute. Since version 4.0, the lemmata come exclusively either from GLAUx or from grecy.

  • "glaux": lemmata from a large portion of *automatically* annotated ancient Greek texts, extracted from https://github.com/perseids-publications/glaux-trees/tree/master/public/xml
  • "grecy": lemmata obtain from *automatically* annotated ancient Greek texts by means of the *grecy* model for *spaCy*.

Files

LAGT_v4-1_codebook.csv

Files (764.5 MB)

Name Size Download all
md5:c6e0f7ff266b3d680330d1f5a09ea701
270.2 MB Download
md5:d6f74a8f23d23d37804d0d158f7bcc4b
1.7 kB Preview Download
md5:414c541e6a6b0bdf4ea7fdff56c09e6e
302.4 kB Preview Download
md5:73e9ba599a6e618a97b4116d6c02a181
494.0 MB Preview Download

Additional details

Software