Published November 9, 2024 | Version 1.0.0
Model Open

Pie Model for Lemmatization, POS Tagging, and Morphological Analysis of Western Armenian

  • 1. École nationale des chartes-PSL / Calfa / LIPN, CNRS UMR 7030
  • 2. Université Sorbonne Paris Nord
  • 3. SeDyL, UMR8202, INALCO, CNRS, IRD

Contributors

Data collector:

  • 1. Yerevan State University

Description

The models were trained for lemmatization, POS tagging, and morphological analysis of Western Armenian. The training data is the Western Armenian corpus from Universal Dependencies (09/2024 release), comprising 124,230 wordforms (96,948 for training, 13,615 for validation, and 13,667 for testing). The sentences span a wide range of genres: blog, fiction, news, nonfiction, reviews, social, spoken, web, and wiki.
Note that the input data should be pre-tokenized.
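As an illustration of what pre-tokenized input could look like, here is a minimal sketch of a naive tokenizer that separates words from punctuation. The function name `pretokenize` and the regular expression are assumptions made for this example, not part of the released models; real input should follow the tokenization conventions of the Universal Dependencies corpus used for training.

```python
import re

# Hypothetical, naive rule: a token is either a run of word characters
# or a single non-space, non-word character (punctuation such as "," or "։").
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def pretokenize(text):
    """Split raw text into a list of tokens (illustrative only)."""
    return TOKEN_RE.findall(text)


def to_pretokenized_line(text):
    """Join the tokens with single spaces, one sentence per line."""
    return " ".join(pretokenize(text))


if __name__ == "__main__":
    print(to_pretokenized_line("Բարեւ, աշխարհ։"))
```

This rule treats every punctuation mark as its own token, which may not match the corpus conventions in every case, so the output should be checked against a sample of the training data.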
 
The model development was part of the ANR project ANR-21-CE38-0006 "DALiH - Digitizing Armenian Linguistic Heritage", led by Victoria Khurshudyan (Inalco, SeDyL, CNRS, IRD), with initial contributions from Calfa. The models were developed for the NLP4DH workshop at the EMNLP 2024 conference and rely on the PIE framework.

Data:

For the training dataset, see:

Results:

For detailed experiments and results, please refer to the linked publication. In the following table, each cell gives the accuracy, followed by the F1-score in parentheses.

task_name     all               ambiguous-tokens  known-tokens      unknown-tokens
abbr          0.9977 (0.8399)   0.7045 (0.5866)   0.998 (0.8563)    0.987 (0.4967)
adptype       0.9969 (0.9628)   0.9147 (0.9216)   0.9968 (0.9631)   0.9982 (0.333)
animacy       0.9851 (0.9655)   0.9586 (0.9258)   0.99 (0.977)      0.7918 (0.7638)
aspect        0.9982 (0.8154)   0.9893 (0.9566)   0.9987 (0.8496)   0.9768 (0.7511)
case          0.9902 (0.9388)   0.9636 (0.9342)   0.993 (0.9555)    0.8812 (0.5959)
connegative   0.9969 (0.4992)   0.5073 (0.3366)   0.9969 (0.4992)   0.9953 (0.4988)
definite      0.9756 (0.9679)   0.9156 (0.864)    0.9781 (0.9707)   0.8783 (0.8836)
degree        0.9632 (0.1962)   0.2944 (0.1516)   0.963 (0.2453)    0.9696 (0.2461)
deixis        0.9979 (0.9506)   0.8254 (0.7873)   0.9979 (0.9523)   0.9964 (0.2496)
hyph          0.9987 (0.9116)   0.9716 (0.9049)   0.9988 (0.9158)   0.9971 (0.4993)
lemma         0.9879 (0.9098)   0.9173 (0.5108)   0.991 (0.9413)    0.865 (0.7453)
mood          0.9964 (0.6131)   0.9744 (0.7198)   0.9968 (0.6152)   0.9833 (0.8434)
number        0.9895 (0.9777)   0.9626 (0.9172)   0.9931 (0.9847)   0.8505 (0.7682)
numform       0.9974 (0.876)    0.8033 (0.597)    0.9975 (0.879)    0.9953 (0.6234)
numtype       0.9967 (0.5264)   0.7854 (0.3376)   0.9967 (0.5264)   0.9964 (0.5552)
person        0.9978 (0.9899)   0.9675 (0.9414)   0.9981 (0.9919)   0.9862 (0.9073)
person[psor]  0.9919 (0.249)    0.5312 (0.2313)   0.9921 (0.249)    0.983 (0.2479)
polarity      0.9969 (0.9928)   0.9804 (0.9687)   0.9973 (0.9936)   0.9776 (0.9602)
polite        0.9991 (0.4998)   0.6014 (0.3755)   0.9991 (0.4998)   0.9982 (0.4995)
pos           0.9933 (0.9897)   0.9874 (0.9611)   0.9967 (0.9949)   0.8606 (0.5483)
poss          0.9974 (0.9541)   0.8635 (0.7017)   0.9974 (0.9551)   0.9975 (0.4994)
prontype      0.9949 (0.9189)   0.882 (0.742)     0.995 (0.9211)    0.9873 (0.317)
reflex        0.9957 (0.871)    0.6424 (0.537)    0.9956 (0.8716)   0.9989 (0.4997)
style         0.9918 (0.1423)   0.4598 (0.126)    0.9921 (0.1423)   0.9801 (0.1414)
subcat        0.9957 (0.9825)   0.9008 (0.8952)   0.9969 (0.9874)   0.9493 (0.875)
tense         0.9979 (0.9913)   0.9833 (0.7189)   0.998 (0.9921)    0.9913 (0.9613)
typo          0.9985 (0.4996)   0.8889 (0.4706)   0.9986 (0.4996)   0.9975 (0.4994)
verbform      0.9956 (0.7889)   0.9731 (0.9462)   0.9961 (0.9874)   0.9783 (0.7649)
voice         0.99 (0.6858)     0.8393 (0.7731)   0.9917 (0.693)    0.9232 (0.4915)
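Since each result is reported as an accuracy with the F1-score in parentheses, it can help to parse the two numbers apart and see where they diverge. The sketch below does this for three values copied from the "all" column above. The helper names `parse_cell` and `rank_by_gap` are hypothetical. Reading a large gap as a sign of rare labels is an assumption about how the F1-score was averaged, which the linked publication should confirm.

```python
import re

# Matches cells of the form "0.9632 (0.1962)": accuracy, then F1 in parentheses.
CELL_RE = re.compile(r"^\s*([0-9.]+)\s*\(([0-9.]+)\)\s*$")


def parse_cell(cell):
    """Return (accuracy, f1) parsed from one table cell."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def rank_by_gap(cells):
    """Sort tasks by (accuracy - f1), largest gap first."""
    gaps = []
    for task, cell in cells.items():
        accuracy, f1 = parse_cell(cell)
        gaps.append((task, accuracy - f1))
    return sorted(gaps, key=lambda item: item[1], reverse=True)


# Three rows from the "all" column of the table above.
SAMPLE = {
    "degree": "0.9632 (0.1962)",
    "pos": "0.9933 (0.9897)",
    "style": "0.9918 (0.1423)",
}

if __name__ == "__main__":
    for task, gap in rank_by_gap(SAMPLE):
        print(f"{task}: gap = {gap:.4f}")
```

On this sample, `style` and `degree` show a much larger gap than `pos`, so their high accuracy should be interpreted with care.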

 

The models can be used on Deucalion, the lemmatization service of the École nationale des chartes-PSL.

Selected Bibliography:

Vidal-Gorène, C., Khurshudyan, V., & Donabédian-Demopoulos, A. (2020, December). Recycling and comparing morphological annotation models for Armenian diachronic-variational corpus processing. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (pp. 90-101).

Vidal-Gorène, C., Tomeh, N., & Khurshudyan, V. (2024, November). Cross-Dialectal Transfer and Zero-Shot Learning for Armenian Varieties: A Comparative Analysis of RNNs, Transformers and LLMs. In Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities (pp. 438-449). Miami, USA: Association for Computational Linguistics.

Files (1.0 GB)

Checksum                               Size
md5:80999440a61bb0331bedd6a2ebe0babf   34.3 MB
md5:07f72a6581f0344194a9afe0e0849ece   34.3 MB
md5:cd4a016f170ddf292adf6624a19555d6   34.3 MB
md5:e3d3981a209c2201eecaa24a1e82eec5   34.3 MB
md5:9430852f44be960a94e371f3f7b34ea1   34.3 MB
md5:89847d8616a06ba2e302f2861c6916ea   34.3 MB
md5:f25dc380da411578fcaf1ea9218deee8   34.3 MB
md5:bde8d2a5e3c136df0df7c847e19554e9   34.3 MB
md5:b838855c07cee71ab955407decfdbec0   34.3 MB
md5:507ed04a34656e9ff3e41c2a9d13f536   34.3 MB
md5:e7d1281b92c4b5f3be9c6b1ae35231b7   34.3 MB
md5:0fac501d991fcd8aa072a4056c4e34e9   54.7 MB
md5:2e87c63b80f0c41465b03279e932cf6c   34.3 MB
md5:76eaf68a6422a1ef762380b6ef1ba20a   34.3 MB
md5:4e95c38678b6a18d92a72034ea42473d   34.3 MB
md5:d425d8fa0914d1ed556debd9bb4d393c   34.3 MB
md5:b7b450b50398003905c864f86a2819ed   34.3 MB
md5:0c1e389006060331b93432e668789fdb   34.3 MB
md5:375aede41f5c5887b34bad60ae8b4835   34.3 MB
md5:e3720b2c094fa4aa43fd7feb635aa90d   34.3 MB
md5:db06e5c5fe34da4808ca4476c953adbd   23.9 MB
md5:8f9b5907ce6af946be1cefb4fbd240a6   34.3 MB
md5:77228ce759c97c0476d2a08da9dece36   34.3 MB
md5:4b49eef8c47abed265469469de8e62c4   34.3 MB
md5:457eb7a705553200aa60b449afc74cbc   34.3 MB
md5:a5a0900db0af354fd33075dde7067009   34.3 MB
md5:937eca8b52603772594cc5193fc2faaf   34.3 MB
md5:a8d7534afca7826c11a9c41d8be2a049   34.3 MB
md5:01dac6c71a459d1cc1d0490f7967c5f6   34.3 MB
md5:85e246928040d3902eaddbe0187f0cb9   34.3 MB
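The md5 checksums listed above can be used to confirm that a model file downloaded intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` and the file name in the usage lines are hypothetical, since the listing does not show the file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (the file name is a placeholder for a downloaded model archive):
# ok = matches_checksum("model.tar", "md5:80999440a61bb0331bedd6a2ebe0babf")
# print("checksum OK" if ok else "checksum mismatch")
```

Reading in chunks keeps memory use constant regardless of file size.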

Additional details

Related works

Is described by
Publication: https://aclanthology.org/2024.nlp4dh-1.42/
Is source of
Software: https://dh.chartes.psl.eu/deucalion/fr

Funding

DALiH – Digitizing Armenian Linguistic Heritage: Armenian Multivariational Corpus and Data Processing ANR-21-CE38-0006
Agence Nationale de la Recherche