Published February 21, 2018 | Version v1
Book chapter Open

Modelling multiword expressions in a parallel Bulgarian-English newsmedia corpus

Description

The paper focuses on the modelling of multiword expressions (MWEs) in Bulgarian-
English parallel news corpora (SETimes; CSLI dataset and PennTreebank dataset).
Observations were made on alignments in which at least one multiword expression
was used per language. The multiword expressions were classified with respect to
the PARSEME lexicon-based (WG1) and treebank-based (WG4) classifications. The
non-MWE counterparts of MWEs are also considered. Our approach is data-driven
because the data for this study were retrieved from parallel corpora rather than
from bilingual dictionaries. The survey shows that the predominant translation relation
between Bulgarian and English is MWE-to-word, and that this relation does not
exclude other translation options. To formalize our observations, a catenae-based
modelling of the parallel pairs is proposed.

Files

9.pdf (233.3 kB)
md5:e3a3f0ded36a4bda12b060378b51784b