lightweight-nsm-v2
# Lightweight NSM: Narcissism Detection Dataset and Evaluation Results (Version 2.0)
## Description
This dataset contains 2,000 annotated social media posts from Mastodon, used to train and evaluate the Lightweight NSM (Narcissistic Social Media) classifier. Posts were collected under two hashtags (#selfie and #selfportrait) and annotated against the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th Edition) criteria for Narcissistic Personality Disorder.
**Version 2.0 Updates:**
- **Full anonymization**: All personally identifiable information (usernames, URLs, @mentions, image URLs) has been removed to ensure GDPR compliance and protect participant privacy
- **Enhanced methodology documentation**: Added detailed description of data cleaning process (40,021 → ~14,000 posts) and stratified random sampling method (1,000 posts per hashtag)
- **CSV format**: Dataset provided in CSV format for easy access, browser preview in Zenodo, and compatibility with spreadsheet applications
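The stratified sampling step mentioned above (a fixed number of posts drawn at random from each hashtag) can be sketched as follows. This is a minimal illustration using only the standard library, not the code from the repository: the `hashtag` field name, the `stratified_sample` helper, and the seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(posts, key, per_group, seed=42):
    """Draw `per_group` posts at random from each group defined by `key`.

    `posts` is a list of dicts; `key` names the stratification field
    (assumed here to be a hashtag column).
    """
    groups = defaultdict(list)
    for post in posts:
        groups[post[key]].append(post)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = []
    for name in sorted(groups):  # sorted for a deterministic group order
        members = groups[name]
        if len(members) < per_group:
            raise ValueError(f"group {name!r} has only {len(members)} posts")
        sample.extend(rng.sample(members, per_group))  # without replacement
    return sample


# Toy data standing in for the cleaned corpus (hypothetical records).
toy_posts = [
    {"id": i, "hashtag": "#selfie" if i % 2 == 0 else "#selfportrait"}
    for i in range(100)
]
toy_sample = stratified_sample(toy_posts, key="hashtag", per_group=10)
print(len(toy_sample))
```

With `per_group=1000` and the cleaned corpus as input, the same pattern would yield the balanced 2,000-post sample described in this record.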
## Dataset Summary
- **Total posts**: 2,000 (1,000 per hashtag)
- **Narcissistic posts**: 330 (16.5%)
- **Non-narcissistic posts**: 1,670 (83.5%)
- **Collection period**: November-December 2025
- **Platform**: Mastodon (7 major instances)
- **Annotation method**: GPT-4.1-mini with DSM-5 criteria (a post is labeled narcissistic when it meets ≥2 criteria)
- **Model performance**: F1-score 0.69 (narcissistic class), Accuracy 0.89
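The ≥2 criteria threshold used for annotation amounts to a simple counting rule. The sketch below shows that rule in isolation; the `label_post` helper and the criterion names are illustrative shorthand chosen for this example, and the actual prompt and output format are defined in `annotation_prompt_gpt4.txt`.

```python
THRESHOLD = 2  # minimum number of DSM-5 criteria, as stated in the summary


def label_post(criteria_flags, threshold=THRESHOLD):
    """Return 1 (narcissistic) if at least `threshold` criteria are flagged.

    `criteria_flags` maps a criterion name to True/False. The names used
    below are hypothetical shorthand, not the dataset's column names.
    """
    met = sum(1 for flag in criteria_flags.values() if flag)
    return 1 if met >= threshold else 0


example_one = {"grandiosity": True, "need_for_admiration": False, "lack_of_empathy": False}
example_two = {"grandiosity": True, "need_for_admiration": True, "lack_of_empathy": False}

print(label_post(example_one))  # one criterion met, below the threshold
print(label_post(example_two))  # two criteria met, at the threshold
```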
## Key Finding
Posts tagged #selfie were labeled narcissistic **roughly 7.5 times as often** (29.1%) as posts tagged #selfportrait (3.9%), which suggests that the two hashtags reflect different motivations: validation-seeking versus artistic self-expression.
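The reported rates can be cross-checked against the dataset summary with a few lines of arithmetic. The sketch assumes that each percentage is computed over the 1,000 sampled posts for that hashtag.

```python
POSTS_PER_HASHTAG = 1000  # from the dataset summary

selfie_rate = 0.291        # share of #selfie posts labeled narcissistic
selfportrait_rate = 0.039  # share of #selfportrait posts labeled narcissistic

# Convert the rates to post counts (assumes both rates use the full 1,000).
selfie_count = round(selfie_rate * POSTS_PER_HASHTAG)
selfportrait_count = round(selfportrait_rate * POSTS_PER_HASHTAG)

total_narcissistic = selfie_count + selfportrait_count
overall_share = total_narcissistic / (2 * POSTS_PER_HASHTAG)
ratio = selfie_count / selfportrait_count

print(selfie_count, selfportrait_count, total_narcissistic)
print(f"overall share: {overall_share:.1%}, ratio: {ratio:.2f}")
```

Under that assumption the counts are 291 and 39, which sum to the 330 narcissistic posts (16.5%) given in the summary, and the ratio 291/39 ≈ 7.46 rounds to the stated 7.5.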
## Files Included
### Underlying Data
1. **mastodon_2000_annotated_anonymized.csv** (663 KB) - Complete anonymized annotated dataset with DSM-5 labels in CSV format
2. **annotated_dataset_2000_posts_anonymized.csv** (663 KB) - Alternative CSV file with the same dataset for compatibility
3. **evaluation_results.csv** (187 bytes) - Model performance metrics (F1, precision, recall, accuracy)
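For readers who want to recompute the metrics stored in `evaluation_results.csv`, the definitions of precision, recall, F1, and accuracy for the positive (narcissistic) class can be sketched with the standard library alone. The label vectors below are toy values for illustration, not the dataset's test predictions, and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 (positive class) and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels: 1 = narcissistic, 0 = non-narcissistic (illustrative only).
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Because the classes are imbalanced (16.5% positive), accuracy alone can look high even when the positive class is poorly detected, which is why the F1-score for the narcissistic class is reported alongside it.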
### Extended Data
4. **confusion_matrix.png** (25 KB, 300 DPI) - Confusion matrix visualization (test set N=400)
5. **feature_importance_top20.png** (35 KB, 300 DPI) - Top 20 most important features for classification
6. **annotation_prompt_gpt4.txt** (4.6 KB) - GPT-4.1-mini annotation prompt with DSM-5 criteria definitions
7. **data_collection_methodology.pdf** (369 KB) - Comprehensive 12-page methodology documentation including data collection, cleaning, sampling, annotation, and model training procedures
8. **README.md** (7.9 KB) - Dataset documentation and usage instructions
## Source Code
The complete source code for data collection, preprocessing, annotation, and model training is available on GitLab:
**Repository**: https://gitlab.com/aleksandra.ata/lightweight-nsm
**License**: Apache 2.0
Includes:
- `Mastodon_Scraper_MultiHashtag_Top50K.ipynb` - Data collection from Mastodon API
- `preprocessing_pipeline_FIXED.ipynb` - Data cleaning and stratified sampling
- `training_evaluation_pipeline.ipynb` - Model training and evaluation
- `best_model.pkl` - Pre-trained Logistic Regression classifier
## Citation
If you use this dataset, please cite:
```
Ata, A., & Kozak, K. (2026). Lightweight NSM: Narcissism Detection Dataset
and Evaluation Results (Version 2.0) [Data set]. Zenodo.
https://doi.org/10.5281/zenodo.18484444
```
## Related Publication
This dataset supports the research article:
> Ata, A., & Kozak, K. (2026). A Lightweight Narcissistic Social Media (NSM) Detection Architecture Using TF-IDF and Logistic Regression. *Open Research Europe* (in press).
## Ethics and Privacy
- All data was collected from publicly available Mastodon posts using the platform's official API
- No private or protected content was accessed
- **Full anonymization applied**: All usernames, profile URLs, @mentions, image URLs, and location markers have been removed
- Posts are identified only by anonymous numerical IDs (anon_XXXXXX)
- Research conducted in accordance with GDPR Article 89 and EU Horizon Europe ethical guidelines
- Informed consent was waived because the data were publicly posted and were fully anonymized prior to publication
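As a rough illustration of the kind of text anonymization described above, the following sketch strips URLs and @mentions with regular expressions and builds IDs in the `anon_XXXXXX` style. It is an assumption-laden approximation, not the pipeline used for this dataset: the patterns, the helper names, and the zero-padded sequential numbering are all chosen for the example.

```python
import re

# Assumed patterns: any http(s) URL, and Mastodon-style handles such as
# @user or @user@instance.tld that are not part of an email address.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+(?:@[\w.-]+)?")


def anonymize_text(text):
    """Remove URLs and @mentions, then collapse leftover whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def anon_id(index):
    """Build an ID in the anon_XXXXXX style (sequential numbering assumed)."""
    return f"anon_{index:06d}"


raw = "Hi @alice@mastodon.social see https://example.com/pic.jpg #selfie"
cleaned = anonymize_text(raw)
print(anon_id(42), cleaned)
```

Regex-based removal of this kind does not catch names or locations written in free text, so it covers only part of what a full anonymization pass would need.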
## Funding
This work was supported by the European Union's Horizon Europe research and innovation programme under grant agreement No. 101095654 (Trustworthy Hybrid Cognitive Systems - THCS).
## Contact
For questions or data removal requests:
**Aleksandra Ata**
Email: aleksandra.ata@dsw.edu.pl
Affiliation: DSW University of Lower Silesia, Wrocław, Poland
## License
This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
---
**Keywords**: narcissism, social media, Mastodon, DSM-5, natural language processing, personality detection, clinical linguistics, mental health, text classification, logistic regression
## Files (1.8 MB)
| Name | Size | MD5 |
|---|---|---|
| annotated_dataset_2000_posts_anonymized.csv | 678.1 kB | 6a1d04f149a166c33034fa5d90306ecf |
| annotation_prompt_gpt4.txt | 4.7 kB | 761a748c4bcbf87a182c3e4124e055ad |
| confusion_matrix.png | 25.5 kB | 94e978868e9c4199353b9da29c92c724 |
| data_collection_methodology.pdf | 377.8 kB | 62055d933977321c23ddb2c6a39b53bc |
| evaluation_results.csv | 187 Bytes | ae66830df7e8c4f3efa4c4ea6422e528 |
| feature_importance_top20.png | 35.1 kB | 5b5f849a330e80d01f3023b043761e16 |
| mastodon_2000_annotated_anonymized.csv | 678.1 kB | 6a1d04f149a166c33034fa5d90306ecf |
| README.md | 8.1 kB | c089300d7b1171d2689802b3b1be1a32 |
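
The MD5 checksums in the file listing can be used to confirm that a download is intact. The sketch below is one way to do that with the standard library; `md5_of_file` and `verify_download` are hypothetical helper names, and the demo runs on a temporary file so that it works without the dataset present.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """True if the file's MD5 matches the checksum from the listing."""
    return md5_of_file(path) == expected_md5.lower()


# Checksum shown in the listing for both 678.1 kB CSV entries.
DATASET_MD5 = "6a1d04f149a166c33034fa5d90306ecf"

# Demo on a temporary file (stands in for a downloaded file).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bin"
    demo_path.write_bytes(b"hello")
    demo_md5 = md5_of_file(demo_path)
    demo_ok = verify_download(demo_path, hashlib.md5(b"hello").hexdigest())

print(demo_md5, demo_ok)
```

For a real check, call `verify_download("mastodon_2000_annotated_anonymized.csv", DATASET_MD5)` after downloading the file. Since the same checksum is listed for both 678.1 kB entries, the two CSV files should be byte-identical.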
## Additional details
- **Related works**: Is new version of Dataset 10.5281/zenodo.18484444 (DOI)
- **Dates**: Updated 2026-02-05 (v3)
- **Software**: Repository URL https://gitlab.com/aleksandra.ata/lightweight-nsm
- **References**: 10.5281/zenodo.18484444