Comparative Analysis of Logistic Regression and XGBoost for Depression Detection from Reddit Posts
Authors/Creators
Description
Depression is a major mental health disorder that affects
millions of individuals worldwide and often remains
undiagnosed due to social stigma and limited access to
professional care. The increasing use of social media
platforms such as Reddit provides an opportunity to
analyze textual expressions that may contain indicators
of depressive behavior. This study investigates the
effectiveness of machine learning techniques for
depression detection using textual data collected from
Reddit posts.
A Reddit Depression Dataset containing 7,731 posts was
analyzed through exploratory data analysis, text
preprocessing, TF-IDF feature extraction, and VADER
sentiment analysis. The extracted features were
evaluated using two machine learning classifiers:
Logistic Regression and XGBoost. Performance was
assessed using accuracy, precision, recall, F1-score,
confusion matrix analysis, and ROC-AUC.
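
The workflow described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the study's actual code: the toy corpus, the train/test split, and every hyperparameter are assumptions, since the description does not state them. Only the Logistic Regression branch is shown. xgboost's `XGBClassifier` follows the same `fit`/`predict_proba` interface and could replace the final pipeline step, and the VADER sentiment step is omitted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (NOT the Reddit dataset); 1 = depressive, 0 = non-depressive.
texts = [
    "i feel hopeless and empty every single day",
    "nothing matters anymore and i cannot sleep",
    "i am so tired of feeling worthless and alone",
    "i cry every night and feel like a burden",
    "i have no energy and no reason to get up",
    "everything feels pointless and i feel numb",
    "had a great time hiking with friends today",
    "excited about the new game release this weekend",
    "my cat did the funniest thing this morning",
    "just finished a good book and loved the ending",
    "looking forward to the concert next month",
    "cooked a new recipe and it turned out great",
]
labels = [1] * 6 + [0] * 6

# Stratified hold-out split so both classes appear in the test set (assumed protocol).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=4, stratify=labels, random_state=0)

# TF-IDF features feeding a linear classifier; settings are illustrative defaults.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                # hard class labels
y_score = model.predict_proba(X_test)[:, 1]   # probability of the depressive class

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", zero_division=0)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true class, cols: predicted
auc = roc_auc_score(y_test, y_score)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
print(cm)
```

With only twelve sentences the printed scores carry no meaning; the point is the shape of the pipeline and which scikit-learn function produces each reported metric.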
Experimental results demonstrated strong classification
performance. Logistic Regression achieved an accuracy
of 94.44% and an AUC of 0.986, while XGBoost
achieved a slightly higher accuracy of 94.70%. The
findings indicate that TF-IDF lexical features provide
substantial predictive information for distinguishing
depressive from non-depressive posts. Sentiment analysis
further revealed noticeable differences in emotional
polarity between the two classes.
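
For readers less familiar with the reported metrics, the sketch below shows how accuracy, precision, recall, and F1 follow from confusion-matrix counts, and how ROC-AUC can be read as the fraction of (positive, negative) pairs that a classifier ranks correctly. The function names, counts, and scores are hypothetical and chosen only for illustration; they are not the study's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the standard metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """ROC-AUC as the share of positive/negative pairs ranked correctly.

    Ties count as half a correct pair. The pairwise loop is quadratic,
    which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts: 45 true positives, 5 false positives,
# 5 false negatives, 45 true negatives.
metrics = classification_metrics(tp=45, fp=5, fn=5, tn=45)

# Hypothetical scores: one positive (0.4) is ranked below one negative (0.7).
auc = roc_auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.7, 0.3, 0.2])

print(metrics)
print(round(auc, 3))
```

In this example every metric in `metrics` equals 0.9 up to floating-point rounding, and 8 of the 9 positive/negative pairs are ordered correctly, giving an AUC of about 0.889.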
The study presents a reproducible and computationally
efficient framework for depression detection using
publicly available data and open-source tools. The
proposed workflow is suitable for academic research
and educational environments where interpretability,
simplicity, and reproducibility are important
considerations.
Files
| Name | Size | Checksum |
|---|---|---|
| Reddit_Depression_Detection_ML.pdf | 339.9 kB | md5:e69c1b34f7b8ebe73042ee2d909baded |
Additional details
Software
- Programming language: Python
References
- [1] American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (DSM-5), 5th ed. Arlington, VA, USA: American Psychiatric Publishing, 2013.
- [2] World Health Organization, "Depressive disorder (depression)," WHO Fact Sheet, Sep. 2023. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/depression.
- [3] J. W. Pennebaker, M. R. Mehl, and K. G. Niederhoffer, "Psychological aspects of natural language use: Our words, our selves," Annu. Rev. Psychol., vol. 54, pp. 547-577, 2003.
- [4] M. De Choudhury, M. Gamon, S. Counts, and E. Horvitz, "Predicting depression via social media," in Proc. 7th Int. AAAI Conf. Weblogs and Social Media (ICWSM), 2013, pp. 128-137.
- [5] A. Yates, A. Cohan, and N. Goharian, "Depression and self-harm risk assessment in online forums," in Proc. Conf. Empirical Methods in Natural Language Processing (EMNLP), 2017, pp. 2968-2978.
- [6] A. H. Orabi, P. Buddhitha, M. H. Orabi, and D. Inkpen, "Deep learning for depression detection of Twitter users," in Proc. 5th Workshop on Computational Linguistics and Clinical Psychology (CLPsych), 2018, pp. 88-97.
- [7] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in Proc. 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD), 2016, pp. 785-794.
- [8] C. J. Hutto and E. Gilbert, "VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text," in Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM), Ann Arbor, MI, USA, 2014, pp. 216-225.
- [9] G. Coppersmith, M. Dredze, and C. Harman, "Quantifying Mental Health Signals in Twitter," in Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych), Baltimore, MD, USA, 2014, pp. 51-60.
- [10] D. L. Mowery, C. Bryan, and M. Conway, "Feature Studies to Inform the Classification of Depressive Symptoms from Twitter Data for Population Health," in Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology (CLPsych), Vancouver, Canada, 2017, pp. 1-12.
- [11] InfamousCoder, "Depression Reddit Cleaned Dataset," Kaggle. [Online]. Available: https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned. [Accessed: May 2026].
- [12] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python. Sebastopol, CA, USA: O'Reilly Media, 2009.
- [13] M. Honnibal and I. Montani, "spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing," 2017. [Online]. Available: https://spacy.io
- [14] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.