Published September 20, 2025 | Version v1
Dataset | Open Access

Identifying Constructive Conflict in Online Discussions through Controversial yet Toxicity Resilient Posts

  • 1. Indiana University Bloomington

Description


Readme

20-September-2025

Contact: oseckin[at]iu.edu

This repository contains the datasets required to reproduce the results presented in the paper "Identifying Constructive Conflict in Online Discussions through Controversial yet Toxicity Resilient Posts."

  • submissions.parquet: PII-redacted Reddit submissions included in the study.
  • comments.parquet: PII-redacted Reddit comments included in the study.
  • t_manual_annotation.parquet: Reddit posts manually annotated for toxicity.
  • c_manual_annotation.parquet: Reddit posts manually annotated for controversiality.
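A minimal sketch of loading one of the files with pandas and selecting a dataset split via the train_valid_test column described below. Only the column names come from this README; the rows here are invented stand-ins so the snippet runs without the released file:

```python
import pandas as pd

# In practice, load the released file directly:
#   submissions = pd.read_parquet("submissions.parquet")
# Here a tiny invented frame stands in for it (column names from this README).
submissions = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "subreddit": ["AskReddit", "science", "AskReddit", "news"],
    "score": [12, 3, 7, 1],
    "train_valid_test": ["train", "train", "valid", "test"],
})

# Reproduce the study's split by filtering on the split-indicator column.
train = submissions[submissions["train_valid_test"] == "train"]
print(train["id"].tolist())  # ['a1', 'b2']
```

The same pattern applies to the other three parquet files; only the column schema differs.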

Dataset Content for submissions.parquet

Reddit Metadata

  • created_utc: Unix timestamp indicating when the Reddit submission was created.
  • id: Unique identifier for the Reddit submission.
  • subreddit: The name of the subreddit where the submission was posted.
  • score: Reddit's upvote/downvote score for the submission.
  • train_valid_test: Dataset split indicator showing whether the submission is used for training, validation, or testing.

Toxicity and Controversy Scores

  • submission_openai: OpenAI's toxicity score for the submission text (0-1 scale).
  • mean_openai: Mean toxicity score across all comments in the submission.
  • toxic_ratio: Proportion of comments in the submission that are considered toxic (score ≥ 0.5).
  • ta_score: Model-predicted toxicity score generated by the fine-tuned DistilBERT model.
  • c_score: Model-predicted controversy score generated by the fine-tuned DistilBERT model.
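The comment-level aggregates above can be recomputed from per-comment toxicity scores. A sketch, assuming each comment carries a toxicity score on the 0-1 scale and using the ≥ 0.5 threshold stated for toxic_ratio (the scores below are invented):

```python
# Invented per-comment toxicity scores for one submission (0-1 scale).
comment_scores = [0.05, 0.72, 0.10, 0.55, 0.20]

# mean_openai: mean toxicity across the submission's comments.
mean_openai = sum(comment_scores) / len(comment_scores)

# toxic_ratio: share of comments at or above the 0.5 toxicity threshold.
toxic_ratio = sum(s >= 0.5 for s in comment_scores) / len(comment_scores)

print(round(mean_openai, 3), toxic_ratio)  # 0.324 0.4
```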

Topic Analysis

  • bertopic: Topic assignment from BERTopic clustering analysis of submission texts.

Linguistic Features

  • is_question: Binary indicator (0/1) for whether the submission contains a question mark.
  • question_ratio: Proportion of sentences in the submission that are questions.
  • gratitude_ratio: Proportion of sentences in the submission that express gratitude.
  • gratitude: Binary indicator (0/1) for whether the submission contains gratitude expressions.
  • gratitude_count: Count of gratitude expressions found in the submission.
  • pos_tag_counts: Dictionary containing counts of each part-of-speech tag in the submission.
  • proper_noun_count: Count of proper nouns (NNP, NNPS) found in the submission.
  • proper_noun_ratio: Proportion of words in the submission that are proper nouns.
  • text_length: Total number of words in the submission.
  • lexical_item_count: Count of unique lexical items (nouns, verbs, adjectives, adverbs) in the submission.
  • mtld: Measure of Textual Lexical Diversity (MTLD) score indicating vocabulary richness.
  • hedge: Binary indicator (0/1) for whether the submission contains hedging language.
  • hedge_count: Count of hedging expressions found in the submission.
  • hedge_ratio: Proportion of sentences in the submission that contain hedging language.
  • polarity: Sentiment polarity score of the submission text (-1 to 1 scale).
  • positive_polarity: Binary indicator (0/1) for whether the submission has positive sentiment.
  • negative_polarity: Binary indicator (0/1) for whether the submission has negative sentiment.
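Several of the surface features above are straightforward to recompute from raw text. A sketch using a naive sentence splitter and a toy hedge lexicon; the study's actual hedge list, tokenizer, and sentence splitter are not specified in this README, so treat these as illustrative assumptions:

```python
import re

text = "I think this might help. Is there a better source? Thanks for sharing."

# Naive sentence split on terminal punctuation (a simplification).
sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

# is_question / question_ratio: presence and share of question sentences.
is_question = int("?" in text)
question_ratio = sum(s.endswith("?") for s in sentences) / len(sentences)

# Hedge features, using a toy lexicon (illustrative, not the study's list).
HEDGES = {"might", "maybe", "perhaps", "possibly", "think"}
words = re.findall(r"[a-z']+", text.lower())
hedge_count = sum(w in HEDGES for w in words)
hedge = int(hedge_count > 0)

# text_length: total word count.
text_length = len(words)

print(is_question, round(question_ratio, 2), hedge, hedge_count, text_length)
# 1 0.33 1 2 13
```

Ratio-style features (question_ratio, hedge_ratio, gratitude_ratio) all follow this shape: count matching sentences or tokens, then divide by the sentence or word total.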

Files (232.5 MB)

  • md5:319676efe2f4ec4b1bcef02d29b8c2ca (110.0 kB)
  • md5:9c443e823f1a3e56aa6e14b6b84404c6 (168.0 MB)
  • md5:4f5b3c071e6bb93976491354bf3bef60 (64.4 MB)
  • md5:1b62d796bc0190e8bb39be17523fb044 (67.6 kB)