Modelling multiword expressions in a parallel Bulgarian-English newsmedia corpus
Description
The paper focuses on the modelling of multiword expressions (MWEs) in Bulgarian-
English parallel news corpora (SETimes; the CSLI and Penn Treebank datasets).
Observations were made on alignments in which at least one multiword expression
was used per language. The multiword expressions were classified with respect to
the PARSEME lexicon-based (WG1) and treebank-based (WG4) classifications. The
non-MWE counterparts of MWEs are also considered. Our approach is data-driven:
the data for this study were retrieved from parallel corpora rather than from
bilingual dictionaries. The survey shows that the predominant translation relation
between Bulgarian and English is MWE-to-word, and that this relation does not
exclude other translation options. To formalize our observations, a catenae-based
modelling of the parallel pairs is proposed.
Files
Name | Size |
---|---|
9.pdf (md5:e3a3f0ded36a4bda12b060378b51784b) | 233.3 kB |