Conference paper Open Access

Selective Word Substitution for Contextualized Data Augmentation

Kyriaki Pantelidou; Despoina Chatzakou; Theodora Tsikrika; Stefanos Vrochidis; Ioannis Kompatsiaris

The often observed unavailability of the large amounts of training data typically required by deep learning models to perform well in NLP tasks has given rise to the exploration of data augmentation techniques. Originally, such techniques mainly focused on rule-based methods (e.g. random insertion/deletion of words) or synonym replacement with the help of lexicons. More recently, model-based techniques that involve non-contextual (e.g. Word2Vec, GloVe) or contextual (e.g. BERT) embeddings have been gaining ground as a more effective way of word replacement. For BERT in particular, which has been employed successfully in various NLP tasks, data augmentation is typically performed by applying a masking approach, where an arbitrary number of word positions is selected and the words there are replaced with others of the same meaning. Since the words selected for substitution are bound to affect the final outcome, this work examines different ways of selecting the words to be replaced, emphasizing different parts of a sentence, namely specific parts of speech or words that carry more sentiment information. Our goal is to study the effect that the selection of words to be substituted during data augmentation has on the final performance of a classification model. Evaluation experiments performed on binary classification tasks over two benchmark datasets indicate improvements in effectiveness over state-of-the-art baselines.
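The following is a minimal sketch of the kind of BERT-based, selectively masked word substitution the abstract describes, assuming the Hugging Face transformers and NLTK libraries. The model name, the `augment` helper, and the choice to target adjectives and adverbs are illustrative assumptions for this sketch, not the paper's exact procedure.

```python
# Illustrative sketch (not the authors' implementation): mask words of selected
# POS tags and let a BERT fill-mask model propose contextual replacements.
import random

import nltk
from transformers import pipeline

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT


def augment(sentence: str, target_tags=("JJ", "RB"), n_masks: int = 1) -> str:
    """Replace up to n_masks words whose POS tag matches target_tags
    (here: adjectives/adverbs, an assumed heuristic) with BERT's top prediction."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    # Candidate positions: words whose tag starts with one of the target tags.
    candidates = [i for i, (_, tag) in enumerate(tagged)
                  if any(tag.startswith(t) for t in target_tags)]
    for i in random.sample(candidates, min(n_masks, len(candidates))):
        masked = tokens.copy()
        masked[i] = MASK
        prediction = fill_mask(" ".join(masked), top_k=1)[0]
        tokens[i] = prediction["token_str"].strip()
    return " ".join(tokens)


if __name__ == "__main__":
    print(augment("The movie was absolutely wonderful and very touching."))
```

Restricting the candidate positions (e.g. to sentiment-bearing words instead of POS tags) only changes how `candidates` is computed; the masking and contextual replacement step stays the same.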

This preprint has not undergone peer review (when applicable) or any post-submission improvements or corrections. The Version of Record of this contribution is published in the 27th International Conference on Natural Language & Information Systems and is available online at http://dx.doi.org/10.1007/978-3-031-08473-7_47.