Combining n-grams and deep convolutional features for language variety classification

Martinc, Matej; Pollak, Senja

doi:10.1017/S1351324919000299

Published July 18, 2019 | Version v1

Journal article (open access)

1. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
2. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia; Usher Institute of Population Health Sciences and Informatics, Edinburgh Medical School, Usher Institute, University of Edinburgh, Edinburgh, UK

This paper presents a novel neural architecture capable of outperforming state-of-the-art systems on the task of language variety classification. The architecture is a hybrid that combines character-based convolutional neural network (CNN) features with weighted bag-of-n-grams (BON) features and is therefore capable of leveraging both character-level and document/corpus-level information. We tested the system on the Discriminating between Similar Languages (DSL) language variety benchmark data set from the VarDial 2017 DSL shared task, which contains data from six different language groups, as well as on two smaller data sets (the Arabic Dialect Identification (ADI) Corpus and the German Dialect Identification (GDI) Corpus, from the VarDial 2016 ADI and VarDial 2018 GDI shared tasks, respectively). We managed to outperform the winning system in the DSL shared task by a margin of about 0.4 percentage points and the winning system in the ADI shared task by a margin of about 0.2 percentage points in terms of weighted F1 score without conducting any language group-specific parameter tweaking. An ablation study suggests that weighted BON features contribute more to the overall performance of the system than the CNN-based features, which partially explains the uncompetitiveness of deep learning approaches in the past VarDial DSL shared tasks. Finally, we have implemented our system in a workflow, available in the ClowdFlows platform, in order to make it easily available also to the non-programming members of the research community.
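The margins above are reported in weighted F1. As a minimal sketch, assuming the usual definition (per-class F1 averaged with weights proportional to each class's number of gold examples), the metric can be computed as follows. The function name `weighted_f1` and the labels "A", "B", "C" are hypothetical and chosen only for illustration; this is not the authors' evaluation code.

```python
from collections import Counter


def weighted_f1(gold, predicted):
    """Support-weighted average of per-class F1 scores.

    Assumption: each class's F1 is weighted by its share of gold examples.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    support = Counter(gold)              # gold examples per class
    predicted_count = Counter(predicted)  # predictions per class
    true_positives = Counter(g for g, p in zip(gold, predicted) if g == p)

    total = 0.0
    for label, n_gold in support.items():
        tp = true_positives[label]
        n_pred = predicted_count[label]
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_gold * f1             # weight by class support
    return total / len(gold)


# Toy example with three hypothetical variety labels.
gold = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "A", "B", "C", "C"]
score = weighted_f1(gold, predicted)
print(f"weighted F1 = {score:.4f}")
```

On this scale, a margin of 0.4 percentage points corresponds to a difference of 0.004 in the returned score.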

Files

Martinc2019_NLE.pdf (1.1 MB), md5:760bb79aeeccbdb2047e26d49cb8e2d7

Additional details

European Commission
EMBEDDIA – Cross-Lingual Embeddings for Less-Represented Languages in European News Media (grant 825153)
