Specialised POS Tagged Syriac Corpus for State Morphology

El-Khaissi, Charbel

doi:10.5281/zenodo.12591591

Published June 29, 2024 | Version v1

Dataset Open

El-Khaissi, Charbel (Data curator)¹

1. Australian National University

Overview

This corpus comprises twelve .TXT files, each representing a Syriac text that has been transcribed and tagged for part of speech (POS). It forms part of a PhD research project on the historical syntax of Aramaic (Syriac) at The Australian National University (2020 to present) in Canberra, Australia. The project investigates noun state morphology, among other topics, and the POS scheme for this corpus reflects that focus.

Method

A detailed summary of this methodology is provided in El-Khaissi (data paper in review with the Journal of Open Humanities Data).

Transcriptions are sourced from the Digital Syriac Corpus.
POS tags are based on word matches using the SEDRA IV API (v1.0.0).
Syriac texts were selected to minimise external influence on Syriac grammar and to maximise coverage of key periods of the Syriac language, from the 2nd to the 13th century AD.

POS Format & Abbreviations

POS tags in the text files use the following format:

<syntax-category>-<state>_<syriac_word>

Thus, an underscore '_' marks the beginning of a tag sequence, while tag values are separated by hyphens '-'. For example (noting that text directionality constraints may affect the displayed order, since Syriac is written right to left):

ܒܘܪܟܬܐ_EMP-N

The following list defines the abbreviations for all POS tags, which are based on the parameters available in the SEDRA IV API (v1.0.0).

Absolute state noun (indeterminate relic)	ABS
Emphatic state noun (new indeterminate)	EMP
Construct state noun (bound noun)	CNS
State not applicable	X
particle	PTCL
pronoun	PRO
preposition	PREP
verb	V
denominative	DEN
noun	N
numeral	NUM
substantive	SBV
adjective	ADJ
proper noun	PN
adverb	ADV
demonym	DNM
participle adjective	PTCPADJ
idiom	IDM
Homonymous word with more than one possible tag (see Quality Control & Limitations below)	DUP
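
The tag format and abbreviations above can be read programmatically. The following is a minimal sketch, not the curator's own tooling: the names `parse_token`, `TaggedWord`, and `STATE_TAGS` are hypothetical, and the parser assumes that a token may place the tag sequence on either side of the underscore, because the format line and the rendered example differ in order. It treats the ASCII side as the tag sequence and the non-ASCII side as the Syriac word.

```python
from typing import NamedTuple, Optional

# State abbreviations from the list above; every other tag value is treated
# as a syntactic category (an assumption of this sketch).
STATE_TAGS = {"ABS", "EMP", "CNS", "X"}


class TaggedWord(NamedTuple):
    word: str
    state: Optional[str]
    categories: tuple


def parse_token(token: str) -> Optional[TaggedWord]:
    """Split one tagged token into its Syriac word and its tag values.

    Returns None when the token carries no recognisable tag sequence.
    """
    left, sep, right = token.partition("_")
    if not sep:
        return None
    # Tags are Latin-script abbreviations and Syriac letters are non-ASCII,
    # so the ASCII side of the underscore is taken to be the tag sequence.
    if right.isascii() and not left.isascii():
        word, tags = left, right
    elif left.isascii() and not right.isascii():
        word, tags = right, left
    else:
        return None
    values = [value for value in tags.split("-") if value]
    if not values:
        return None
    state = next((value for value in values if value in STATE_TAGS), None)
    categories = tuple(value for value in values if value not in STATE_TAGS)
    return TaggedWord(word, state, categories)


print(parse_token("ܒܘܪܟܬܐ_EMP-N"))
```

Under these assumptions, the example token yields the word ܒܘܪܟܬܐ with state `EMP` and category `N`. Untagged words return `None`, so they can be filtered out before analysis.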

Quality Control & Limitations

On average, the POS-tagging process achieved 63.13% saturation per text. Tagging was based on exact matching, which does not take syntactic or semantic context into account. Syriac words that exhibit homonymy are therefore tagged with the value 'DUP' and should be assessed manually in their original context. Of the 297,981 words in the corpus with an available POS tag, 73,188 (24.56%) carry tags that reflect some kind of homonymy, that is, a word with several possible semantic and/or syntactic interpretations.
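
The figures above can be recomputed per file. The sketch below makes several assumptions: tokens are separated by whitespace, an untagged word contains no underscore, and "saturation" means the share of tokens that received a tag. The function name `tag_statistics` is hypothetical.

```python
def tag_statistics(text: str) -> dict:
    """Count tagged and 'DUP' tokens in one whitespace-separated text."""
    tokens = text.split()
    # Assumption: only tagged tokens contain an underscore.
    tagged = [token for token in tokens if "_" in token]
    # A token counts as homonymous when 'DUP' is one of its tag values.
    dup = [t for t in tagged if "DUP" in t.replace("_", "-").split("-")]

    def percent(part: int, whole: int) -> float:
        return round(100 * part / whole, 2) if whole else 0.0

    return {
        "total": len(tokens),
        "tagged": len(tagged),
        "dup": len(dup),
        "saturation_pct": percent(len(tagged), len(tokens)),
        "dup_pct": percent(len(dup), len(tagged)),
    }


# Reproduces the corpus-wide homonymy share from the two counts quoted above.
print(round(100 * 73_188 / 297_981, 2))  # 24.56
```

Averaging `saturation_pct` across the twelve files would give a per-text mean comparable to the 63.13% reported, provided the assumptions above match how the figure was originally calculated.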

Since this dataset was created as part of a research project investigating noun state morphology, additional tags were created targeting various state values. Grammatical features such as number and gender were not required for this investigation and were therefore excluded from the POS-tagging process.

Contact

For any questions, please contact Charbel El-Khaissi <Charbel.El-Khaissi@anu.edu.au>.

Files

Files (7.2 MB)

Name	MD5	Size
12_Commentaries_preprocessed_11032022_pos.txt	0daeccbd8999dbb18e8d1d64fb31947f	3.6 MB
13_Treatise-of-Treatises_preprocessed_11032022_pos.txt	299d615711f6c012e973d6c1b6eeb1f2	570.1 kB
1_Letter-of-Mara-bar-Serapion_Anonymous_preprocessed_11032022_pos.txt	56a99b27127abd9730d75eb4862134bc	23.6 kB
2-3_The-Book-of-the-Laws-of-Countries_Bardaisan_preprocessed_11032022_pos.txt	f620b2a4a8d9299b5789ce9682003b5f	84.9 kB
4_The-Demonstrations_Aphrahat_preprocessed_11032022_pos.txt	0126a46eaaae669522bfb80093ee56f0	1.2 MB
5_Letter-from-Cyril-to-Rabbula_preprocessed_11032022_pos.txt	d811e2f1e8a29277d2adfac69efb2a65	9.2 kB
5_Letter-from-Rabbula-to-Andrew-of-Samosata_preprocessed_11032022_pos.txt	ebccb5fa135843513cdac36c23f525d5	2.5 kB
5_Part-of-a-Letter-from-Rabbula-to-Cyril_preprocessed_11032022_pos.txt	eedd53d94b9b4af59de572fba35c0354	1.5 kB
6_Letter-from-Barlaha-to-Simeon-on-the-Translation-of-the-Psalms_preprocessed_11032022_pos.txt	a6a7703b67b0d6432aee90ded0add729	65.7 kB
6_Letter-from-Simeon-of-Mart-Maryam-in-Response-to-Barlaha_preprocessed_11032022_pos.txt	4c9a32b3f9489dec794ddc36d66f2ff9	73.8 kB
7_Ascetic-Discourses_Isaac-of-Niveneh_pre-processed_18112020_pos.txt	21904b4d07cd1715f4e0a36b4a229dbc	1.2 MB
8-10_On-Divine-Providence_Anton-of-Tagrit_pre-processed_20112020_pos.txt	8488e17750b13434e11f8ec468e69128	275.0 kB
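
The MD5 checksums listed with each file can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The names `md5_of`, `verify`, and `EXPECTED_MD5` are hypothetical, and only two of the twelve checksums are copied in for brevity.

```python
import hashlib
from pathlib import Path

# Two checksums copied from the file listing above; add the rest as needed.
EXPECTED_MD5 = {
    "1_Letter-of-Mara-bar-Serapion_Anonymous_preprocessed_11032022_pos.txt":
        "56a99b27127abd9730d75eb4862134bc",
    "5_Letter-from-Cyril-to-Rabbula_preprocessed_11032022_pos.txt":
        "d811e2f1e8a29277d2adfac69efb2a65",
}


def md5_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: Path, expected: dict = EXPECTED_MD5) -> dict:
    """Map each expected file name to True, False, or None if it is missing."""
    results = {}
    for name, checksum in expected.items():
        path = directory / name
        results[name] = md5_of(path) == checksum if path.is_file() else None
    return results
```

Calling `verify(Path("downloads"))`, where `downloads` is a placeholder for the local folder, reports `True` for each file whose digest matches the listing and `None` for any file not yet downloaded.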
