Published September 20, 2025
| Version v1
Identifying Constructive Conflict in Online Discussions through Controversial yet Toxicity Resilient Posts
Creators
- 1. Indiana University Bloomington
Contributors
- 1. Indiana University Bloomington
Description
Readme
20-September-2025
Contact: oseckin[at]iu.edu
This repository contains the datasets required to reproduce the results presented in the paper "Identifying Constructive Conflict in Online Discussions through Controversial yet Toxicity Resilient Posts."
- submissions.parquet: PII-redacted Reddit submissions included in the study.
- comments.parquet: PII-redacted Reddit comments included in the study.
- t_manual_annotation.parquet: Manually annotated Reddit posts for toxicity.
- c_manual_annotation.parquet: Manually annotated Reddit posts for controversiality.
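As a sketch of how these files might be combined, the snippet below aggregates per-comment toxicity onto submissions with pandas. The toy data, the per-comment score column, and the `submission_id` link column are assumptions for illustration; the comment schema is not documented in this README.

```python
import pandas as pd

# Toy stand-ins for the real files; in practice you would call
# pd.read_parquet("submissions.parquet") and pd.read_parquet("comments.parquet").
submissions = pd.DataFrame({
    "id": ["a1", "a2"],
    "subreddit": ["AskReddit", "science"],
    "ta_score": [0.12, 0.81],   # fine-tuned DistilBERT toxicity prediction
    "c_score": [0.67, 0.30],    # fine-tuned DistilBERT controversy prediction
})
comments = pd.DataFrame({
    "submission_id": ["a1", "a1", "a2"],  # hypothetical link column
    "openai": [0.05, 0.62, 0.10],         # hypothetical per-comment toxicity
})

# Average comment toxicity per submission, joined back onto the submissions.
agg = comments.groupby("submission_id")["openai"].mean().rename("mean_openai")
merged = submissions.merge(agg, left_on="id", right_index=True, how="left")
print(merged[["id", "mean_openai"]])
```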
Dataset Content for submissions.parquet
Reddit Metadata
- created_utc: Unix timestamp indicating when the Reddit submission was created.
- id: Unique identifier for the Reddit submission.
- subreddit: The name of the subreddit where the submission was posted.
- score: Reddit's upvote/downvote score for the submission.
- train_valid_test: Dataset split indicator showing whether the submission is used for training, validation, or testing.
Toxicity and Controversy Scores
- submission_openai: OpenAI's toxicity score for the submission text (0-1 scale).
- mean_openai: Mean toxicity score across all comments in the submission.
- toxic_ratio: Proportion of comments in the submission that are considered toxic (score ≥ 0.5).
- ta_score: Model-predicted toxicity score generated by the fine-tuned DistilBERT model.
- c_score: Model-predicted controversy score generated by the fine-tuned DistilBERT model.
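The comment-level aggregates follow directly from per-comment toxicity values. A minimal sketch of `mean_openai` and `toxic_ratio` under the stated 0.5 threshold, using made-up scores:

```python
# Hypothetical per-comment OpenAI toxicity scores for one submission.
comment_scores = [0.05, 0.72, 0.31, 0.90]

# mean_openai: average toxicity across the submission's comments.
mean_openai = sum(comment_scores) / len(comment_scores)

# toxic_ratio: share of comments at or above the 0.5 toxicity threshold.
toxic_ratio = sum(s >= 0.5 for s in comment_scores) / len(comment_scores)

print(mean_openai, toxic_ratio)  # 0.495 0.5
```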
Topic Analysis
- bertopic: Topic assignment from BERTopic clustering analysis of submission texts.
Linguistic Features
- is_question: Binary indicator (0/1) for whether the submission contains a question mark.
- question_ratio: Proportion of sentences in the submission that are questions.
- gratitude_ratio: Proportion of sentences in the submission that express gratitude.
- gratitude: Binary indicator (0/1) for whether the submission contains gratitude expressions.
- gratitude_count: Count of gratitude expressions found in the submission.
- pos_tag_counts: Dictionary containing counts of each part-of-speech tag in the submission.
- proper_noun_count: Count of proper nouns (NNP, NNPS) found in the submission.
- proper_noun_ratio: Proportion of words in the submission that are proper nouns.
- text_length: Total number of words in the submission.
- lexical_item_count: Count of unique lexical items (nouns, verbs, adjectives, adverbs) in the submission.
- mtld: Measure of Textual Lexical Diversity (MTLD) score indicating vocabulary richness.
- hedge: Binary indicator (0/1) for whether the submission contains hedging language.
- hedge_count: Count of hedging expressions found in the submission.
- hedge_ratio: Proportion of sentences in the submission that contain hedging language.
- polarity: Sentiment polarity score of the submission text (-1 to 1 scale).
- positive_polarity: Binary indicator (0/1) for whether the submission has positive sentiment.
- negative_polarity: Binary indicator (0/1) for whether the submission has negative sentiment.
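As a rough illustration of how several of these features could be derived from a single text: the snippet below is a simplified stand-in, with a naive sentence splitter and a tiny hedge lexicon; the paper's actual tokenizer, POS tagger, and lexicons may differ.

```python
import re

text = "I think this might be overstated. Why assume the data are clean? Thanks for sharing."

# Naive sentence split on terminal punctuation (a simplification).
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# is_question / question_ratio: presence and share of question sentences.
is_question = int("?" in text)
question_ratio = sum(s.endswith("?") for s in sentences) / len(sentences)

# hedge / hedge_ratio with a tiny illustrative hedge lexicon.
HEDGES = {"might", "maybe", "perhaps", "possibly", "i think"}
hedged = [s for s in sentences if any(h in s.lower() for h in HEDGES)]
hedge = int(bool(hedged))
hedge_ratio = len(hedged) / len(sentences)

# text_length: word count on whitespace tokens.
text_length = len(text.split())

print(is_question, question_ratio, hedge, hedge_ratio, text_length)
```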
Files
(232.5 MB)
| Name | Size | md5 |
|---|---|---|
| | 110.0 kB | md5:319676efe2f4ec4b1bcef02d29b8c2ca |
| | 168.0 MB | md5:9c443e823f1a3e56aa6e14b6b84404c6 |
| | 64.4 MB | md5:4f5b3c071e6bb93976491354bf3bef60 |
| | 67.6 kB | md5:1b62d796bc0190e8bb39be17523fb044 |
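To confirm a download matches its listed checksum, the md5 digest can be computed locally. A small helper, demonstrated on a throwaway temporary file since the dataset files themselves are not bundled here:

```python
import hashlib
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through md5 so large parquet files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a temporary file; compare the output against the table above
# for the real downloads.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(b"hello")
    tmp_path = fh.name

print(md5_of_file(tmp_path))  # 5d41402abc4b2a76b9719d911017c592
```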