Published September 20, 2025 | Version v1
Dataset | Open Access

Identifying Constructive Conflict in Online Discussions through Controversial yet Toxicity Resilient Posts

  • 1. Indiana University Bloomington

Description


Readme

20-September-2025

Contact: oseckin[at]iu.edu

This repository contains the datasets required to reproduce the results presented in the paper "Identifying Constructive Conflict in Online Discussions through Controversial yet Toxicity Resilient Posts."

  • submissions.parquet: PII-redacted Reddit submissions included in the study.
  • comments.parquet: PII-redacted Reddit comments included in the study.
  • t_manual_annotation.parquet: Reddit posts manually annotated for toxicity.
  • c_manual_annotation.parquet: Reddit posts manually annotated for controversiality.
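A minimal sketch of loading one of the files with pandas and selecting a dataset split via the train_valid_test column described below. Only the column names come from this README; the rows here are invented stand-ins so the snippet runs without the released file:

```python
import pandas as pd

# In practice, load the released file directly:
#   submissions = pd.read_parquet("submissions.parquet")
# Here a tiny invented frame stands in for it (column names from this README).
submissions = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "subreddit": ["AskReddit", "science", "AskReddit", "news"],
    "score": [12, 3, 7, 1],
    "train_valid_test": ["train", "train", "valid", "test"],
})

# Reproduce the study's split by filtering on the split-indicator column.
train = submissions[submissions["train_valid_test"] == "train"]
print(train["id"].tolist())  # ['a1', 'b2']
```

The same pattern applies to the other three parquet files; only the column schema differs.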

Dataset Content for submissions.parquet

Reddit Metadata

  • created_utc: Unix timestamp indicating when the Reddit submission was created.
  • id: Unique identifier for the Reddit submission.
  • subreddit: The name of the subreddit where the submission was posted.
  • score: Reddit's upvote/downvote score for the submission.
  • train_valid_test: Dataset split indicator showing whether the submission is used for training, validation, or testing.

Toxicity and Controversy Scores

  • submission_openai: OpenAI's toxicity score for the submission text (0-1 scale).
  • mean_openai: Mean toxicity score across all comments in the submission.
  • toxic_ratio: Proportion of comments in the submission that are considered toxic (score ≥ 0.5).
  • ta_score: Model-predicted toxicity score generated by the fine-tuned DistilBERT model.
  • c_score: Model-predicted controversy score generated by the fine-tuned DistilBERT model.
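The comment-level aggregates above can be recomputed from per-comment toxicity scores. A sketch, assuming each comment carries a toxicity score on the 0-1 scale and using the ≥ 0.5 threshold stated for toxic_ratio (the scores below are invented):

```python
# Invented per-comment toxicity scores for one submission (0-1 scale).
comment_scores = [0.05, 0.72, 0.10, 0.55, 0.20]

# mean_openai: mean toxicity across the submission's comments.
mean_openai = sum(comment_scores) / len(comment_scores)

# toxic_ratio: share of comments at or above the 0.5 toxicity threshold.
toxic_ratio = sum(s >= 0.5 for s in comment_scores) / len(comment_scores)

print(round(mean_openai, 3), toxic_ratio)  # 0.324 0.4
```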

Topic Analysis

  • bertopic: Topic assignment from BERTopic clustering analysis of submission texts.

Linguistic Features

  • is_question: Binary indicator (0/1) for whether the submission contains a question mark.
  • question_ratio: Proportion of sentences in the submission that are questions.
  • gratitude_ratio: Proportion of sentences in the submission that express gratitude.
  • gratitude: Binary indicator (0/1) for whether the submission contains gratitude expressions.
  • gratitude_count: Count of gratitude expressions found in the submission.
  • pos_tag_counts: Dictionary containing counts of each part-of-speech tag in the submission.
  • proper_noun_count: Count of proper nouns (NNP, NNPS) found in the submission.
  • proper_noun_ratio: Proportion of words in the submission that are proper nouns.
  • text_length: Total number of words in the submission.
  • lexical_item_count: Count of unique lexical items (nouns, verbs, adjectives, adverbs) in the submission.
  • mtld: Measure of Textual Lexical Diversity (MTLD) score indicating vocabulary richness.
  • hedge: Binary indicator (0/1) for whether the submission contains hedging language.
  • hedge_count: Count of hedging expressions found in the submission.
  • hedge_ratio: Proportion of sentences in the submission that contain hedging language.
  • polarity: Sentiment polarity score of the submission text (-1 to 1 scale).
  • positive_polarity: Binary indicator (0/1) for whether the submission has positive sentiment.
  • negative_polarity: Binary indicator (0/1) for whether the submission has negative sentiment.
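Several of the surface features above are straightforward to recompute from raw text. A sketch using a naive sentence splitter and a toy hedge lexicon; the study's actual hedge list, tokenizer, and sentence splitter are not specified in this README, so treat these as illustrative assumptions:

```python
import re

text = "I think this might help. Is there a better source? Thanks for sharing."

# Naive sentence split on terminal punctuation (a simplification).
sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

# is_question / question_ratio: presence and share of question sentences.
is_question = int("?" in text)
question_ratio = sum(s.endswith("?") for s in sentences) / len(sentences)

# Hedge features, using a toy lexicon (illustrative, not the study's list).
HEDGES = {"might", "maybe", "perhaps", "possibly", "think"}
words = re.findall(r"[a-z']+", text.lower())
hedge_count = sum(w in HEDGES for w in words)
hedge = int(hedge_count > 0)

# text_length: total word count.
text_length = len(words)

print(is_question, round(question_ratio, 2), hedge, hedge_count, text_length)
# 1 0.33 1 2 13
```

Ratio-style features (question_ratio, hedge_ratio, gratitude_ratio) all follow this shape: count matching sentences or tokens, then divide by the sentence or word total.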

Files (232.5 MB)

  • md5:319676efe2f4ec4b1bcef02d29b8c2ca (110.0 kB)
  • md5:9c443e823f1a3e56aa6e14b6b84404c6 (168.0 MB)
  • md5:4f5b3c071e6bb93976491354bf3bef60 (64.4 MB)
  • md5:1b62d796bc0190e8bb39be17523fb044 (67.6 kB)