Published February 6, 2026 | Version v3
Dataset Open

lightweight-nsm-v2

# Lightweight NSM: Narcissism Detection Dataset and Evaluation Results (Version 2.0)

## Description

This dataset contains 2,000 annotated social media posts from Mastodon used to train and evaluate the Lightweight NSM (Narcissistic Social Media) classifier. Posts were collected under two hashtags (#selfie and #selfportrait) and annotated against the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th Edition) clinical criteria for Narcissistic Personality Disorder.

**Version 2.0 Updates:**
- **Full anonymization**: All personally identifiable information (usernames, profile URLs, @mentions, image URLs, location markers) has been removed to ensure GDPR compliance and protect participant privacy
- **Enhanced methodology documentation**: Added detailed description of data cleaning process (40,021 → ~14,000 posts) and stratified random sampling method (1,000 posts per hashtag)
- **CSV format**: Dataset provided in CSV format for easy access, browser preview in Zenodo, and compatibility with spreadsheet applications
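
The stratified random sampling mentioned above (1,000 posts per hashtag drawn from the cleaned pool) can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline, which lives in `preprocessing_pipeline_FIXED.ipynb`. The `hashtag` field name, the function name `stratified_sample`, the seed, and the toy pool are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(posts, per_stratum, key="hashtag", seed=42):
    """Draw `per_stratum` posts at random from each stratum (here: hashtag).

    `posts` is a list of dicts; `key` names the stratifying field.
    Field name and seed are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for post in posts:
        strata[post[key]].append(post)

    sample = []
    for name in sorted(strata):  # sorted for a deterministic stratum order
        group = strata[name]
        if len(group) < per_stratum:
            raise ValueError(f"stratum {name!r} has only {len(group)} posts")
        sample.extend(rng.sample(group, per_stratum))  # without replacement
    return sample


# Synthetic pool standing in for the cleaned posts (not real data).
pool = [{"id": i, "hashtag": "selfportrait" if i % 3 == 0 else "selfie"}
        for i in range(300)]
subset = stratified_sample(pool, per_stratum=50)
```

Sampling the same number of posts from each hashtag keeps the two groups directly comparable even if one hashtag dominates the cleaned pool.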

## Dataset Summary

- **Total posts**: 2,000 (1,000 per hashtag)
- **Narcissistic posts**: 330 (16.5%)
- **Non-narcissistic posts**: 1,670 (83.5%)
- **Collection period**: November-December 2025
- **Platform**: Mastodon (7 major instances)
- **Annotation method**: GPT-4.1-mini with DSM-5 criteria (≥2 criteria threshold)
- **Model performance**: F1-score 0.69 (narcissistic class), Accuracy 0.89
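
Because only 16.5% of posts are labelled narcissistic, accuracy on its own is easy to overread. The short check below uses only the counts listed above to show what a trivial classifier that always predicts the majority class would score. It is a worked calculation, not part of the authors' evaluation, and it assumes the test split has roughly the same class balance as the full dataset.

```python
# Counts taken from the Dataset Summary above.
total_posts = 2000
narcissistic = 330
non_narcissistic = 1670

assert narcissistic + non_narcissistic == total_posts

# Class shares.
positive_rate = narcissistic / total_posts
negative_rate = non_narcissistic / total_posts

# A classifier that always answers "non-narcissistic" is right on every
# negative post and wrong on every positive one.
baseline_accuracy = negative_rate
baseline_recall_positive = 0.0  # it never finds a narcissistic post

print(f"positive share:    {positive_rate:.1%}")
print(f"baseline accuracy: {baseline_accuracy:.1%}")
```

Against a majority-class baseline of 83.5%, the reported accuracy of 0.89 is a modest gain, which is why the F1-score for the narcissistic class is the more informative figure.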

## Key Finding

Posts tagged with #selfie were labelled narcissistic about **7.5 times as often** as #selfportrait posts (29.1% vs. 3.9%, i.e. 291 vs. 39 of the 1,000 posts sampled per hashtag), suggesting distinct psychological motivations: validation-seeking for #selfie and artistic self-expression for #selfportrait.
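
The fold difference quoted above follows directly from the two per-hashtag rates, and the implied counts can be cross-checked against the 330 narcissistic posts in the Dataset Summary. A small worked calculation, using only figures stated in this record:

```python
posts_per_hashtag = 1000      # from the Dataset Summary
selfie_rate = 0.291           # 29.1% of #selfie posts
selfportrait_rate = 0.039     # 3.9% of #selfportrait posts

# Convert the rates back into post counts.
selfie_count = round(selfie_rate * posts_per_hashtag)
selfportrait_count = round(selfportrait_rate * posts_per_hashtag)

# The two per-hashtag counts should add up to the 330 narcissistic posts.
total_narcissistic = selfie_count + selfportrait_count

rate_ratio = selfie_rate / selfportrait_rate
print(total_narcissistic, round(rate_ratio, 2))  # prints: 330 7.46
```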

## Files Included

### Underlying Data
1. **mastodon_2000_annotated_anonymized.csv** (663 KB) - Complete anonymized annotated dataset with DSM-5 labels in CSV format
2. **annotated_dataset_2000_posts_anonymized.csv** (663 KB) - Copy of file 1 under an alternative name, provided for compatibility (the record's file listing shows the same MD5 checksum for both CSV files)
3. **evaluation_results.csv** (187 bytes) - Model performance metrics (F1, precision, recall, accuracy)

### Extended Data
4. **confusion_matrix.png** (25 KB, 300 DPI) - Confusion matrix visualization (test set N=400)
5. **feature_importance_top20.png** (35 KB, 300 DPI) - Top 20 most important features for classification
6. **annotation_prompt_gpt4.txt** (4.6 KB) - GPT-4.1-mini annotation prompt with DSM-5 criteria definitions
7. **data_collection_methodology.pdf** (369 KB) - Comprehensive 12-page methodology documentation including data collection, cleaning, sampling, annotation, and model training procedures
8. **README.md** (7.9 KB) - Dataset documentation and usage instructions
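
For readers who want to relate the confusion matrix to the metrics in `evaluation_results.csv`, the standard definitions can be sketched as below. The function name `classification_metrics` and the four counts are hypothetical: the counts are illustrative values that sum to a 400-post test set and are not read from `confusion_matrix.png`.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 (positive class) and accuracy from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# HYPOTHETICAL counts for a 400-post test set (illustration only,
# not the values shown in confusion_matrix.png).
example = classification_metrics(tp=50, fp=25, fn=16, tn=309)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Here the positive class is "narcissistic", so F1 reflects how well the minority class is recovered, while accuracy is dominated by the much larger non-narcissistic class.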

## Source Code

The complete source code for data collection, preprocessing, annotation, and model training is available on GitLab:

**Repository**: https://gitlab.com/aleksandra.ata/lightweight-nsm  
**License**: Apache 2.0

Includes:
- `Mastodon_Scraper_MultiHashtag_Top50K.ipynb` - Data collection from Mastodon API
- `preprocessing_pipeline_FIXED.ipynb` - Data cleaning and stratified sampling
- `training_evaluation_pipeline.ipynb` - Model training and evaluation
- `best_model.pkl` - Pre-trained Logistic Regression classifier

## Citation

If you use this dataset, please cite:

```
Ata, A., & Kozak, K. (2026). Lightweight NSM: Narcissism Detection Dataset 
and Evaluation Results (Version 2.0) [Data set]. Zenodo. 
https://doi.org/10.5281/zenodo.18484444
```

## Related Publication

This dataset supports the research article:

> Ata, A., & Kozak, K. (2026). A Lightweight Narcissistic Social Media (NSM) Detection Architecture Using TF-IDF and Logistic Regression. *Open Research Europe* (in press).

## Ethics and Privacy

- All data was collected from publicly available Mastodon posts using the platform's official API
- No private or protected content was accessed
- **Full anonymization applied**: All usernames, profile URLs, @mentions, image URLs, and location markers have been removed
- Posts are identified only by anonymous numerical IDs (anon_XXXXXX)
- Research conducted in accordance with GDPR Article 89 and EU Horizon Europe ethical guidelines
- Informed consent waived as data was publicly posted and fully anonymized prior to publication
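
The kind of text anonymization described above can be illustrated with a few regular expressions. This is a rough sketch under stated assumptions, not the authors' procedure: the patterns, the helper names `anonymize_text` and `anonymous_id`, and the zero-padded sequential reading of the `anon_XXXXXX` format are all hypothetical. Location markers and other free-text identifiers would need additional handling that is not shown.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Mastodon handles look like @user or @user@instance.tld
MENTION_RE = re.compile(r"@\w+(?:@[\w.-]+)?")


def anonymize_text(text):
    """Strip URLs and @mentions, then collapse leftover whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def anonymous_id(index):
    """Format a sequential index in an anon_XXXXXX style (assumed layout)."""
    return f"anon_{index:06d}"


# Synthetic example post (not taken from the dataset).
raw = ("Thanks @alice@mastodon.social for the tip! "
       "New #selfie https://example.org/p/1.jpg")
clean = anonymize_text(raw)
post_id = anonymous_id(42)
```

Hashtags are left in place because the dataset is stratified by them; only the identifying handle and link are removed.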

## Funding

This work was supported by the European Union's Horizon Europe research and innovation programme under grant agreement No. 101095654 (Trustworthy Hybrid Cognitive Systems - THCS).

## Contact

For questions or data removal requests:

**Aleksandra Ata**  
Email: aleksandra.ata@dsw.edu.pl  
Affiliation: DSW University of Lower Silesia, Wrocław, Poland

## License

This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC-BY 4.0).

---

**Keywords**: narcissism, social media, Mastodon, DSM-5, natural language processing, personality detection, clinical linguistics, mental health, text classification, logistic regression

Files

Files (1.8 MB)

| Name | Size | MD5 |
|---|---|---|
| annotated_dataset_2000_posts_anonymized.csv | 678.1 kB | 6a1d04f149a166c33034fa5d90306ecf |
| annotation_prompt_gpt4.txt | 4.7 kB | 761a748c4bcbf87a182c3e4124e055ad |
| confusion_matrix.png | 25.5 kB | 94e978868e9c4199353b9da29c92c724 |
| data_collection_methodology.pdf | 377.8 kB | 62055d933977321c23ddb2c6a39b53bc |
| evaluation_results.csv | 187 Bytes | ae66830df7e8c4f3efa4c4ea6422e528 |
| feature_importance_top20.png | 35.1 kB | 5b5f849a330e80d01f3023b043761e16 |
| mastodon_2000_annotated_anonymized.csv | 678.1 kB | 6a1d04f149a166c33034fa5d90306ecf |
| README.md | 8.1 kB | c089300d7b1171d2689802b3b1be1a32 |
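
The MD5 checksums listed above can be used to confirm that a download is intact. A minimal sketch with Python's standard `hashlib` module follows; the helper names `md5_of_file` and `verify_download` are illustrative, and the local file path in the commented call is an assumption about where the file was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """True when the file's MD5 matches the value listed on the record page."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (path is an assumption; the checksum is the one listed
# next to the 187-byte file in this record):
# verify_download("evaluation_results.csv",
#                 "ae66830df7e8c4f3efa4c4ea6422e528")
```

MD5 is adequate here as an integrity check against corrupted or truncated downloads; it is not a safeguard against deliberate tampering.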

Additional details

Related works

Is new version of
Dataset: 10.5281/zenodo.18484444 (DOI)

Dates

Updated
2026-02-05
v3
