Working paper Open Access
Hartung, Matthias; Orlikowski, Matthias; Veríssimo, Susana
Rolling out text analytics applications, or individual components thereof, to multiple input languages requires scalable workflows and architectures that do not rely on manual annotation or language-specific re-engineering for each target language. These scalability challenges become even more severe when specialized technical domains are targeted in multiple languages. Recent work has shown that cross-lingual projection of sentiment models in deep learning frameworks based on bilingual sentiment embeddings (BLSE) is feasible without any annotated data in the target language, relying only on monolingual embeddings and a bilingual translation dictionary (Barnes et al., 2018). We apply their framework to multilingual text analytics problems in the pharmaceutical domain in order to (i) investigate under which conditions the BLSE approach also scales to technical domains, and (ii) assess the impact of different configurations of the underlying lexical resources. For the language pair English/Spanish, our findings corroborate the strength of cross-lingual projection approaches such as BLSE in technical scenarios, provided that bilingual resources are available that offer broad lexical coverage, on the one hand, and complementary domain- and task-specific knowledge, on the other.