Published June 3, 2026 | Version v1
Preprint Open

Comparative Analysis of Logistic Regression and XGBoost for Depression Detection from Reddit Posts

  • 1. Amicable Knowledge Solution University

Description

Depression is a major mental health disorder that affects millions of individuals worldwide and often remains undiagnosed due to social stigma and limited access to professional care. The increasing use of social media platforms such as Reddit provides an opportunity to analyze textual expressions that may contain indicators of depressive behavior. This study investigates the effectiveness of machine learning techniques for depression detection using textual data collected from Reddit posts.
A Reddit Depression Dataset containing 7,731 posts was analyzed through exploratory data analysis, text preprocessing, TF-IDF feature extraction, and VADER sentiment analysis. The extracted features were evaluated using two machine learning classifiers: Logistic Regression and XGBoost. Performance was assessed using accuracy, precision, recall, F1-score, confusion matrix analysis, and ROC-AUC.
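The TF-IDF feature extraction step can be illustrated with a minimal, dependency-free sketch. The record does not state the tokenizer or the weighting variant that was used, so the `tokenize` and `tfidf_vectors` helpers, the smoothed IDF formula, the L2 normalization, and the toy posts below are all assumptions made for illustration. In practice a library vectorizer such as scikit-learn's `TfidfVectorizer` would typically take this role.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a deliberately simple, assumed preprocessing step)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized.

    Assumed weighting: raw term count times the smoothed IDF
    ln((1 + N) / (1 + df)) + 1, where N is the number of documents
    and df is the number of documents containing the term.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        # Scale each document vector to unit length; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Hypothetical toy posts, not taken from the dataset.
posts = [
    "i feel hopeless and tired all the time",
    "i feel great after the game tonight",
    "the game was great",
]
vectors = tfidf_vectors(posts)

# Terms that occur in fewer posts receive a larger weight within a post.
print(sorted(vectors[0].items(), key=lambda kv: -kv[1])[:3])
```

In this toy example, a term that appears in only one post (such as "hopeless") is weighted more heavily than a term shared across posts (such as "i"), which is the property that lets a linear classifier pick up class-specific vocabulary.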
Experimental results demonstrated strong classification performance. Logistic Regression achieved an accuracy of 94.44% and an AUC of 0.986, while XGBoost achieved a slightly higher accuracy of 94.70%. The findings indicate that TF-IDF lexical features provide substantial predictive information for distinguishing depressive from non-depressive posts. Sentiment analysis further revealed noticeable differences in emotional polarity between the two classes.
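To make the reported metrics concrete, the following sketch shows how accuracy, precision, recall, F1-score, and ROC-AUC can be derived from labels and classifier scores. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions chosen for illustration, and the numbers it produces are unrelated to the results above.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion matrix counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in the
    number of samples, which is acceptable for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores (1 = depressive post).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas ROC-AUC is computed from the ranking of the scores alone, which is one reason the two kinds of metric are usually reported together.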
The study presents a reproducible and computationally efficient framework for depression detection using publicly available data and open-source tools. The proposed workflow is suitable for academic research and educational environments where interpretability, simplicity, and reproducibility are important considerations.

Files

Reddit_Depression_Detection_ML.pdf (339.9 kB)
md5:e69c1b34f7b8ebe73042ee2d909baded

Additional details

Software

Programming language
Python
