Published February 6, 2026 | Version v3

Dataset Open

lightweight-nsm-v2

# Lightweight NSM: Narcissism Detection Dataset and Evaluation Results (Version 2.0)

## Description

This dataset contains 2,000 annotated social media posts from Mastodon used to train and evaluate the Lightweight NSM (Narcissism Social Media) classifier. Posts were collected from two hashtags (#selfie and #selfportrait) and annotated using the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th Edition) clinical framework for Narcissistic Personality Disorder.

**Version 2.0 Updates:**
- **Full anonymization**: All personally identifiable information (usernames, URLs, @mentions, image URLs) has been removed to ensure GDPR compliance and protect participant privacy
- **Enhanced methodology documentation**: Added detailed description of data cleaning process (40,021 → ~14,000 posts) and stratified random sampling method (1,000 posts per hashtag)
- **CSV format**: Dataset provided in CSV format for easy access, browser preview in Zenodo, and compatibility with spreadsheet applications
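The per-hashtag sampling step mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline (which lives in `preprocessing_pipeline_FIXED.ipynb`): the function name `stratified_sample`, the `hashtag` record key, the seed, and the toy pool are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(posts, key, n_per_stratum, seed=42):
    """Draw n_per_stratum posts at random from each stratum defined by `key`."""
    rng = random.Random(seed)  # assumed fixed seed so the draw is reproducible
    strata = defaultdict(list)
    for post in posts:
        strata[post[key]].append(post)
    sample = []
    for name in sorted(strata):  # sorted for a deterministic stratum order
        members = strata[name]
        if len(members) < n_per_stratum:
            raise ValueError(f"stratum {name!r} has only {len(members)} posts")
        sample.extend(rng.sample(members, n_per_stratum))  # without replacement
    return sample

# Toy stand-in for the cleaned pool (hypothetical records, not real data).
pool = [{"id": i, "hashtag": "selfie" if i % 2 else "selfportrait"} for i in range(100)]
subset = stratified_sample(pool, key="hashtag", n_per_stratum=10)
print(len(subset))
```

With the real cleaned pool, `n_per_stratum=1000` would correspond to the 1,000 posts per hashtag described above.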

## Dataset Summary

- **Total posts**: 2,000 (1,000 per hashtag)
- **Narcissistic posts**: 330 (16.5%)
- **Non-narcissistic posts**: 1,670 (83.5%)
- **Collection period**: November-December 2025
- **Platform**: Mastodon (7 major instances)
- **Annotation method**: GPT-4.1-mini with DSM-5 criteria (≥2 criteria threshold)
- **Model performance**: F1-score 0.69 (narcissistic class), Accuracy 0.89
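The ≥2 criteria threshold in the annotation method can be read as a simple counting rule. The sketch below assumes each post comes with a per-criterion yes/no flag; the function name `label_post` and the criterion names are hypothetical (the actual prompt wording is in `annotation_prompt_gpt4.txt`).

```python
def label_post(criteria_met, threshold=2):
    """Return 1 (narcissistic) if at least `threshold` criteria are flagged, else 0."""
    flagged = sum(bool(flag) for flag in criteria_met.values())
    return int(flagged >= threshold)

# Hypothetical criterion names, used only for illustration.
example = {
    "grandiosity": True,
    "need_for_admiration": True,
    "lack_of_empathy": False,
}
print(label_post(example))  # two criteria flagged, so this prints 1
```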

## Key Finding

Posts tagged with #selfie were labeled narcissistic **about 7.5 times as often** (29.1%) as #selfportrait posts (3.9%), suggesting distinct psychological motivations between validation-seeking and artistic self-expression.
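The ratio can be checked against the Dataset Summary with a few lines of arithmetic. The per-hashtag counts below are not stated directly in the record; they are derived from the reported percentages and the 1,000 posts per hashtag.

```python
POSTS_PER_HASHTAG = 1000

# Counts implied by the reported rates (derived, not listed in the record).
selfie_positive = round(0.291 * POSTS_PER_HASHTAG)        # 291 posts
selfportrait_positive = round(0.039 * POSTS_PER_HASHTAG)  # 39 posts

ratio = selfie_positive / selfportrait_positive
total_positive = selfie_positive + selfportrait_positive

print(f"rate ratio: {ratio:.2f}")            # about 7.46, i.e. roughly 7.5
print(f"total positives: {total_positive}")  # 330, matching the summary
```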

## Files Included

### Underlying Data
1. **mastodon_2000_annotated_anonymized.csv** (663 KB) - Complete anonymized annotated dataset with DSM-5 labels in CSV format
2. **annotated_dataset_2000_posts_anonymized.csv** (663 KB) - Identical copy of file 1 under an alternative file name (same MD5 checksum), kept for compatibility
3. **evaluation_results.csv** (187 bytes) - Model performance metrics (F1, precision, recall, accuracy)
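A quick way to inspect the annotated CSV with only the standard library is sketched below. The column names (`post_id`, `hashtag`, `label`) are assumptions made for illustration; check the header row of the actual file or the README before relying on them.

```python
import csv
import io
from collections import Counter

def count_by_column(csv_file, column):
    """Count how often each value appears in `column` of an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)

# In-memory stand-in with hypothetical columns. For the real data, pass
# open("mastodon_2000_annotated_anonymized.csv", newline="", encoding="utf-8").
sample = io.StringIO(
    "post_id,hashtag,label\n"
    "anon_000001,selfie,1\n"
    "anon_000002,selfportrait,0\n"
    "anon_000003,selfie,0\n"
)
counts = count_by_column(sample, "label")
print(counts)
```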

### Extended Data
4. **confusion_matrix.png** (25 KB, 300 DPI) - Confusion matrix visualization (test set N=400)
5. **feature_importance_top20.png** (35 KB, 300 DPI) - Top 20 most important features for classification
6. **annotation_prompt_gpt4.txt** (4.6 KB) - GPT-4.1-mini annotation prompt with DSM-5 criteria definitions
7. **data_collection_methodology.pdf** (369 KB) - Comprehensive 12-page methodology documentation including data collection, cleaning, sampling, annotation, and model training procedures
8. **README.md** (7.9 KB) - Dataset documentation and usage instructions
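The metrics reported in `evaluation_results.csv` (F1, precision, recall, accuracy) are all functions of the four cells of a binary confusion matrix such as the one in `confusion_matrix.png`. The sketch below shows those relationships; the counts passed in are toy values chosen for illustration and are not the dataset's actual results.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute positive-class precision, recall, F1, and overall accuracy."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy counts (hypothetical), NOT the values behind confusion_matrix.png.
metrics = binary_metrics(tp=8, fp=2, fn=4, tn=86)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because only 16.5% of posts are labeled narcissistic, a classifier that always predicts the majority class would already reach about 83.5% accuracy, which is why the positive-class F1 is the more informative figure.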

## Source Code

The complete source code for data collection, preprocessing, annotation, and model training is available on GitLab:

**Repository**: https://gitlab.com/aleksandra.ata/lightweight-nsm
**License**: Apache 2.0

Includes:
- `Mastodon_Scraper_MultiHashtag_Top50K.ipynb` - Data collection from Mastodon API
- `preprocessing_pipeline_FIXED.ipynb` - Data cleaning and stratified sampling
- `training_evaluation_pipeline.ipynb` - Model training and evaluation
- `best_model.pkl` - Pre-trained Logistic Regression classifier

## Citation

If you use this dataset, please cite:

```
Ata, A., & Kozak, K. (2026). Lightweight NSM: Narcissism Detection Dataset
and Evaluation Results (Version 2.0) [Data set]. Zenodo.
https://doi.org/10.5281/zenodo.18484444
```

## Related Publication

This dataset supports the research article:

> Ata, A., & Kozak, K. (2026). A Lightweight Narcissistic Social Media (NSM) Detection Architecture Using TF-IDF and Logistic Regression. *Open Research Europe* (in press).

## Ethics and Privacy

- All data was collected from publicly available Mastodon posts using the platform's official API
- No private or protected content was accessed
- **Full anonymization applied**: All usernames, profile URLs, @mentions, image URLs, and location markers have been removed
- Posts are identified only by anonymous numerical IDs (anon_XXXXXX)
- Research conducted in accordance with GDPR Article 89 and EU Horizon Europe ethical guidelines
- Informed consent waived as data was publicly posted and fully anonymized prior to publication
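The kind of text scrubbing described above can be sketched as follows. This is a simplified illustration, not the authors' anonymization code: the regular expressions, the helper names `scrub` and `anon_id`, and the six-digit zero-padded ID format are assumptions, and a production pipeline would need to handle more cases (for example, location markers and email addresses).

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Matches both @user and the federated @user@instance.tld form.
MENTION_RE = re.compile(r"@\w+(?:@[\w.-]+)?")

def scrub(text):
    """Remove URLs and @mentions, then collapse leftover whitespace."""
    text = URL_RE.sub("", text)      # URLs first, since profile URLs contain '@'
    text = MENTION_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def anon_id(index):
    """Build an anonymous post ID (assumed six-digit, zero-padded format)."""
    return f"anon_{index:06d}"

print(scrub("Morning light @alice@example.social https://example.social/p/1 #selfie"))
print(anon_id(42))
```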

## Funding

This work was supported by the European Union's Horizon Europe research and innovation programme under grant agreement No. 101095654 (Trustworthy Hybrid Cognitive Systems - THCS).

## Contact

For questions or data removal requests:

**Aleksandra Ata**
Email: aleksandra.ata@dsw.edu.pl
Affiliation: DSW University of Lower Silesia, Wrocław, Poland

## License

This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC-BY 4.0).

---

**Keywords**: narcissism, social media, Mastodon, DSM-5, natural language processing, personality detection, clinical linguistics, mental health, text classification, logistic regression

Files (1.8 MB)

Name	MD5	Size
annotated_dataset_2000_posts_anonymized.csv	6a1d04f149a166c33034fa5d90306ecf	678.1 kB
annotation_prompt_gpt4.txt	761a748c4bcbf87a182c3e4124e055ad	4.7 kB
confusion_matrix.png	94e978868e9c4199353b9da29c92c724	25.5 kB
data_collection_methodology.pdf	62055d933977321c23ddb2c6a39b53bc	377.8 kB
evaluation_results.csv	ae66830df7e8c4f3efa4c4ea6422e528	187 Bytes
feature_importance_top20.png	5b5f849a330e80d01f3023b043761e16	35.1 kB
mastodon_2000_annotated_anonymized.csv	6a1d04f149a166c33034fa5d90306ecf	678.1 kB
README.md	c089300d7b1171d2689802b3b1be1a32	8.1 kB
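The MD5 checksums listed above can be used to confirm that a download is intact. The sketch below uses Python's standard `hashlib`; the helper names `md5_of` and `verify` are illustrative, and the file is assumed to be in the current working directory.

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_md5):
    """Check a file against an expected MD5 hex digest (case-insensitive)."""
    return md5_of(path) == expected_md5.lower()
```

For example, `verify("evaluation_results.csv", "ae66830df7e8c4f3efa4c4ea6422e528")` should return `True` for an uncorrupted copy of that file.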

Additional details

DOI: 10.5281/zenodo.18484444

Is new version of: Dataset: 10.5281/zenodo.18484444 (DOI)

Updated: 2026-02-05

Resource type: Dataset

Publisher: Zenodo


Technical metadata

Created: February 6, 2026
Modified: February 6, 2026
