CIRCSE/CompHistSem: Computational Historical Semantics
- 1. Università Cattolica del Sacro Cuore, Milan, Italy
Description
Computational Historical Semantics
- co-operative project originally developed by an interdisciplinary team led by Bernhard Jussen and Alexander Mehler at the Goethe University in Frankfurt am Main, and funded by the German Federal Ministry for Education and Research. The project aims to define new methods and tools for historical-semantic analysis
- associated website: https://lta.bbaw.de/, the Latin Text Archive (LTA), hosted by the Berlin-Brandenburg Academy of Sciences and Humanities
- corpus: more than 4000 texts spanning from the 2nd to the 15th century AD, compiled from digitised collections such as the Patrologia Latina Database (http://pld.chadwyck.co.uk/), the Monumenta Germaniae Historica (https://www.mgh.de/), the Corpus Corporum (https://mlat.uzh.ch/) and the Bibliotheca Augustana (http://www.hs-augsburg.de/~harsch/augustana.html)
CompHistSem in LiLa
We linked a subset of 5 texts from the original corpus:
- Capitularia Regum Francorum (6th–9th c. AD), various authors, from MGH Capitularia 1 & 2. Total of 10 826 sentences and 343 024 tokens (including 53 161 punctuation marks)
- De ecclesiasticis officiis (9th c. AD) by Amalarius of Metz, from Patrologia Latina vol. 105. Total of 4 279 sentences and 125 475 tokens (including 20 845 punctuation marks)
- Vita Karoli Imperatoris (9th c. AD) by Eginhard, from MGH Scriptores rerum Germanicarum 25. Total of 247 sentences and 8 393 tokens (including 1 224 punctuation marks)
- Gesta Hludowici imperatoris (9th c. AD) by Thegan of Trier, from MGH Scriptores rerum Germanicarum 64. Total of 451 sentences and 8 355 tokens (including 1 403 punctuation marks)
- Decretum Gratiani I to III (treated as distinct documents), also known as Concordia discordantium canonum (12th c. AD) by Gratian, from Corpus Corporum through Patrologia Latina vol. 187. Total of 31 804 sentences and 572 830 tokens (including 124 656 punctuation marks)
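As a quick consistency check, the per-text figures listed above can be summed and compared with the corpus-wide totals given in this description. The sketch below is illustrative only: the dictionary and function names are made up for the example, and the Capitularia token count is taken to be 343 024, the value that agrees with the stated total.

```python
# Per-text counts as given in the description:
# (sentences, tokens including punctuation, punctuation marks).
# NOTE: 343_024 tokens for the Capitularia is the value consistent
# with the corpus total stated in the description.
COUNTS = {
    "Capitularia Regum Francorum": (10_826, 343_024, 53_161),
    "De ecclesiasticis officiis": (4_279, 125_475, 20_845),
    "Vita Karoli Imperatoris": (247, 8_393, 1_224),
    "Gesta Hludowici imperatoris": (451, 8_355, 1_403),
    "Decretum Gratiani I-III": (31_804, 572_830, 124_656),
}

# Corpus-wide totals as stated in the description.
REPORTED_TOTALS = (47_607, 1_058_077, 201_289)


def column_totals(rows):
    """Sum each column (sentences, tokens, punctuation) over all texts."""
    return tuple(sum(column) for column in zip(*rows))


totals = column_totals(COUNTS.values())
print("computed:", totals)
print("matches reported totals:", totals == REPORTED_TOTALS)
```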
TOTAL: 47 607 sentences for 1 058 077 tokens (including 201 289 punctuation marks), the vast majority of which are lemmatised and tagged for part of speech and morphological features by means of the Frankfurt Latin Lexicon, which uses its own tagset, in line with the grammatical categories traditionally recognised for Latin. All texts except the Decretum Gratiani (Corpus Corporum, transcription under the Creative Commons Attribution-ShareAlike license: https://creativecommons.org/licenses/by-sa/4.0/) are retrievable from the Latin Text Archive and are under the Creative Commons Attribution license: https://creativecommons.org/licenses/by/4.0/. The texts are encoded in the TEI P5 format, i.e. as XML.
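Because the texts are distributed as TEI XML, sentence and token counts of this kind can in principle be recomputed with a standard XML parser. The following is a minimal sketch under stated assumptions, not a description of the actual files: it assumes that sentences are marked with the TEI element `<s>`, words with `<w>` (carrying `lemma` and `pos` attributes) and punctuation with `<pc>`. The sample document and the tag values in it are hypothetical and should be adapted after inspecting the real markup.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature TEI document; element usage and the
# attribute values are assumptions made for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="Karolus" pos="X">Karolus</w>
      <w lemma="rex" pos="X">rex</w>
      <pc>.</pc>
    </s>
  </p></body></text>
</TEI>"""


def count_units(xml_text):
    """Count sentences, tokens and punctuation marks in a TEI string.

    Tokens are counted as words plus punctuation marks, mirroring the
    "tokens (including punctuation marks)" wording of the description.
    """
    root = ET.fromstring(xml_text)
    sentences = root.findall(".//tei:s", TEI_NS)
    words = root.findall(".//tei:w", TEI_NS)
    punctuation = root.findall(".//tei:pc", TEI_NS)
    return {
        "sentences": len(sentences),
        "tokens": len(words) + len(punctuation),
        "punctuation": len(punctuation),
    }


print(count_units(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, with the same element queries.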
Files
Name | Size |
---|---|
CIRCSE/CompHistSem-1.0.0.zip (md5:2522ea312e6f76dd6a8b60c5eba9d47a) | 66.4 MB |
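The MD5 checksum listed for the archive can be used to confirm that a download is intact. The sketch below uses Python's standard `hashlib` module; the function names and the local file name in the usage note are assumptions made for the example.

```python
import hashlib

# Checksum as listed for CompHistSem-1.0.0.zip in the file table.
EXPECTED_MD5 = "2522ea312e6f76dd6a8b60c5eba9d47a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

Assuming the archive was saved locally as `CompHistSem-1.0.0.zip`, calling `verify_archive("CompHistSem-1.0.0.zip")` should return `True` for an uncorrupted download.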
Additional details
Related works
- Is supplement to: https://github.com/CIRCSE/CompHistSem/tree/1.0.0 (URL)