On the Impact of Dataset Size: A Twitter Classification Case Study
- 1. L3S Research Center, Leibniz University Hannover
- 2. VNU University of Science, Hanoi, Vietnam
- 3. University of Tennessee at Chattanooga, Tennessee, USA
Description
The recent advent and evolution of deep learning models and pre-trained embedding techniques have created a breakthrough in supervised learning. Typically, we expect that adding more labeled data improves the predictive performance of supervised models. On the other hand, collecting more labeled data is difficult for several reasons, such as manual labeling costs, data privacy, and computational constraints. Hence, a comprehensive study of the relationship between training set size and the classification performance of different methods would be particularly useful for selecting a learning model for a specific task. However, the literature lacks such a thorough and systematic study. In this paper, we focus on this relationship in the context of short, noisy texts from Twitter. We design a systematic mechanism to comprehensively observe how the performance of supervised learning models improves as the data size increases, on three well-known Twitter tasks: sentiment analysis, informativeness detection, and information relevance. In addition, we study how significantly recent deep learning models outperform traditional machine learning approaches across various data sizes. Our extensive experiments show that (a) recent pre-trained models have overcome big-data requirements, (b) a good choice of text representation has more impact than adding more data, and (c) adding more data is not always beneficial in supervised learning.
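The evaluation mechanism described above can be sketched as a learning-curve protocol: train each model on nested subsets of increasing size and measure performance on a fixed test set. The sketch below uses a toy unigram classifier as a stand-in for the paper's actual learners; the function names (`learning_curve`, `fit`, `predict`) and the "neutral" fallback label are illustrative assumptions, not the authors' implementation.

```python
import random

def learning_curve(train_data, test_data, fractions, fit, predict):
    """Train on nested subsets of increasing size; return (size, accuracy) pairs.

    train_data/test_data: lists of (text, label) pairs.
    fit(examples) returns a model; predict(model, text) returns a label.
    Nested subsets keep smaller samples inside larger ones, so differences
    along the curve reflect data size rather than resampling noise.
    """
    random.seed(0)  # fixed shuffle for reproducible subset selection
    shuffled = train_data[:]
    random.shuffle(shuffled)
    results = []
    for frac in fractions:
        n = max(1, int(frac * len(shuffled)))
        model = fit(shuffled[:n])
        correct = sum(predict(model, x) == y for x, y in test_data)
        results.append((n, correct / len(test_data)))
    return results

# Toy unigram-count classifier standing in for the real models.
def fit(examples):
    counts = {}
    for text, label in examples:
        for tok in text.split():
            counts.setdefault(tok, {}).setdefault(label, 0)
            counts[tok][label] += 1
    return counts

def predict(model, text):
    scores = {}
    for tok in text.split():
        for label, c in model.get(tok, {}).items():
            scores[label] = scores.get(label, 0) + c
    return max(scores, key=scores.get) if scores else "neutral"
```

Plotting accuracy against subset size for each model family then makes it directly visible whether a better text representation shifts the whole curve more than additional labeled data does.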
Files
On the Impact of Dataset Size.pdf (949.0 kB)
md5:65494849edcb43f80b393684e1a900e8