Sequence models and lexical resources for MWE identification in French
- 1. Aix Marseille Univ, Université de Toulon, CNRS, LIS, Marseille, France
Description
We present a simple and efficient sequence tagger capable of identifying continuous multiword expressions (MWEs) of several categories in French texts. It is based on conditional random fields (CRFs) and uses local context information as features, such as the lemmas and parts of speech of the previous and next words. We show that, in some cases, this tagger obtains results close to those of more sophisticated parser-based MWE identification methods, without requiring syntactic trees from a treebank. Moreover, we study how well the CRF can take into account external information coming from both high-quality hand-crafted lexicons and MWE lists automatically obtained from large monolingual corpora. Results indicate that external information systematically helps improve the tagger's performance, compensating for the limited amount of training data.
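The kind of local-context features mentioned in the description can be illustrated with a minimal sketch. This is not the authors' feature set: the function names, the feature keys, and the toy lexicon are hypothetical and chosen only for illustration. The sketch assumes each token is a (lemma, part-of-speech) pair and that the external lexicon is a set of lemma tuples.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]          # (lemma, part-of-speech tag)
Lexicon = Set[Tuple[str, ...]]   # hypothetical lexicon: MWEs as lemma tuples


def starts_lexicon_entry(lemmas: List[str], i: int, lexicon: Lexicon) -> bool:
    """Return True if some lexicon entry matches the lemmas starting at position i."""
    return any(tuple(lemmas[i:i + len(entry)]) == entry for entry in lexicon)


def token_features(sent: List[Token], i: int, lexicon: Lexicon) -> Dict[str, object]:
    """Build a feature dictionary for token i from its local context."""
    lemma, pos = sent[i]
    feats: Dict[str, object] = {"lemma": lemma, "pos": pos}

    # Previous word: lemma and part of speech, or a sentence-start marker.
    if i > 0:
        feats["prev_lemma"], feats["prev_pos"] = sent[i - 1]
    else:
        feats["BOS"] = True

    # Next word: lemma and part of speech, or a sentence-end marker.
    if i < len(sent) - 1:
        feats["next_lemma"], feats["next_pos"] = sent[i + 1]
    else:
        feats["EOS"] = True

    # External information (assumed encoding): does a lexicon MWE start here?
    lemmas = [tok_lemma for tok_lemma, _ in sent]
    feats["lex_start"] = starts_lexicon_entry(lemmas, i, lexicon)
    return feats


# Toy example with a hand-made sentence and a one-entry lexicon.
sentence: List[Token] = [
    ("le", "DET"), ("pomme", "NOUN"), ("de", "ADP"), ("terre", "NOUN"),
]
lexicon: Lexicon = {("pomme", "de", "terre")}
features = [token_features(sentence, i, lexicon) for i in range(len(sentence))]

for feats in features:
    print(feats)
```

Each token yields one feature dictionary, and the resulting sequence of dictionaries, paired with one label per token, is the sort of input a CRF toolkit would typically be trained on. How the lexicon match is encoded as a feature is a design choice that this sketch simplifies to a single boolean.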
Files
Name | Size |
---|---|
10.pdf (md5:38c4a7c490cb354fd530575f126f5f34) | 292.6 kB |