A COMPARATIVE STUDY OF MACHINE LEARNING FOR AGE GROUP IMPUTATION IN LARGE-SCALE E-COMMERCE DATA
Authors/Creators
Description
Current privacy regulations and the prevalence of voluntary non-disclosure have led to significant gaps in demographic data within e-commerce platforms, severely hindering personalized marketing efforts. This study proposes a machine learning-based approach to imputing missing age group information in large-scale e-commerce platforms. Resolving this issue can improve personalization while respecting user privacy constraints. The dataset consisted of real e-commerce platform data, including customer behavior logs, product attributes, and temporal and regional variables. Initially, five classification models were compared: logistic regression, decision tree, random forest, k-nearest neighbors (KNN), and XGBoost. Preliminary experiments showed that logistic regression performed relatively poorly, so it was excluded from the final comparison, and hyperparameter optimization was performed on the remaining four models. Model performance was evaluated on the validation set using accuracy and F1-score as the primary metrics. Experimental results showed that the random forest in its default configuration achieved the highest performance, with approximately 78% accuracy, while XGBoost underperformed in its default setting but improved to a comparable level after optimization. In contrast, the decision tree and KNN gained little from optimization, and in some cases performed worse than with their default settings. Feature importance analysis identified behavioral frequency, a specific event type, gender, and product attributes as key factors. This research contributes by empirically demonstrating the feasibility of constructing age group prediction models from large-scale e-commerce data, thereby offering a practical strategy for addressing missing demographic information under privacy constraints.
Furthermore, the feature importance analysis provides actionable insights for targeted marketing and personalized recommendation system design. In conclusion, this study empirically demonstrates that behavioral logs alone are sufficient to predict demographic attributes with high accuracy. The proposed random forest-based framework offers a cost-effective and privacy-preserving alternative to complex deep learning models for practical deployment in real-world e-commerce systems.
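The two evaluation metrics named in the description, accuracy and F1-score, can be illustrated with a small worked example. The sketch below is not the authors' evaluation code: the age-group labels and predictions are hypothetical, and macro-averaging of the per-class F1 is an assumption, since the description does not say how F1 was aggregated across age groups.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the true class but was missed
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical age-group labels, for illustration only
y_true = ["20s", "20s", "30s", "30s", "40s", "40s"]
y_pred = ["20s", "30s", "30s", "30s", "40s", "20s"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Reporting both metrics matters when age groups are unevenly represented: accuracy can look strong if the model mostly predicts the largest group, whereas a macro-averaged F1 gives every group the same weight.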
Files
35Vol104No4.pdf (1.5 MB)
md5:ef71bdb09a855bd68cb6627a5b055d8c