Reddit Comments Dataset for Text Style Transfer Tasks

Kopf, Fabian

doi:10.5281/zenodo.8051180

Published June 10, 2023 | Version 1.0.0

Dataset Open

Kopf, Fabian¹

1. Technische Hochschule Augsburg

Contributors

Supervisor:

Zarcone, Alessandra¹

1. Technische Hochschule Augsburg

A dataset of Reddit comments prepared for Text Style Transfer Tasks.

The dataset contains Reddit comments together with translations into a more formal style. The translations were generated with text-davinci-003, using the following prompt:
"Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"
This prompting technique was taken from "A Recipe For Arbitrary Text Style Transfer with Large Language Models" (Reif et al., 2022).
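
The prompt above can be assembled programmatically. The following is a minimal sketch, not the author's script: it assumes that the curly braces are literal delimiters around the comment (as in the prompting style of Reif et al.) and that the rewrite is read from the completion up to the first closing brace. The function names `build_prompt` and `extract_rewrite` are hypothetical.

```python
def build_prompt(original_comment: str) -> str:
    """Wrap a comment in the prompt quoted in the dataset description.

    Assumption: the braces are literal delimiters, and the prompt ends with
    an opening brace that the model is expected to close.
    """
    return (
        "Here is some text: {" + original_comment + "} "
        "Here is a rewrite of the text, which is more neutral: {"
    )


def extract_rewrite(completion: str) -> str:
    """Keep the completion text up to the first closing brace (assumed convention)."""
    return completion.split("}", 1)[0].strip()
```

Plain string concatenation is used instead of `str.format` so that the literal braces do not need escaping.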

The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.

The quality of formal translations was assessed with BERTScore and chrF++:

BERTScore: F1-Score: 0.89, Precision: 0.90, Recall: 0.88
chrF++: 37.16

The average perplexity of the generated formal texts, calculated using GPT-2, is 123.77.
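
For orientation, perplexity is conventionally the exponential of the negative mean log-probability per token. The sketch below shows that calculation on per-token log-probabilities that are assumed to be already available (scoring with GPT-2 itself is not shown). Whether the reported average is a plain mean over per-text perplexities is an assumption; the function names are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one text from its per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def average_perplexity(texts_log_probs):
    """Mean of per-text perplexities (assumed averaging scheme)."""
    scores = [perplexity(log_probs) for log_probs in texts_log_probs]
    return sum(scores) / len(scores)
```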

The dataset consists of 3 components.

reddit_comments.csv

This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
- subreddit (name of the subreddit in which the comment was posted)
- id (ID of the comment)
- submission_id (ID of the submission to which the comment was posted)
- body (the comment itself)
- created_utc (timestamp in seconds)
- parent_id (ID of the comment or submission to which the comment is a reply)
- permalink (URL of the original comment)
- token_size (number of tokens the standard GPT-2 tokenizer splits the comment into)
- perplexity (perplexity that GPT-2 calculates for the comment)

The comments were filtered. This file contains only comments that:
- are split by the GPT-2 tokenizer into more than 10 but fewer than 512 tokens
- are not [removed] or [deleted]
- do not contain URLs
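
The three filter criteria above can be expressed as a single predicate. This is a sketch under stated assumptions, not the filter that was actually used: the URL pattern is a simple heuristic chosen for illustration, and `keep_comment` is a hypothetical name. The token count is taken as given (for example from the token_size column).

```python
import re

# Heuristic URL detector (assumption; the original detection rule is not documented).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def keep_comment(body: str, token_size: int) -> bool:
    """Return True if a comment passes all three filters described above."""
    if body.strip() in {"[removed]", "[deleted]"}:
        return False
    if URL_PATTERN.search(body):
        return False
    # Strictly more than 10 and strictly fewer than 512 tokens.
    return 10 < token_size < 512
```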

This file was used as a source for the other two file types.

Labeled Files (training_labeled.csv and eval_labeled.csv)

These files contain the formal translations of the Reddit comments.

From each subreddit, the 150 comments with the highest GPT-2 perplexity were translated into a formal version. This filter was chosen so that as many of the translated comments as possible have high stylistic salience.

They are structured as follows:
- Subreddit (name of the subreddit where the comment was posted)
- Original Comment
- Formal Comment
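
The selection step described above (the 150 highest-perplexity comments per subreddit) can be sketched as follows. The rows are assumed to be dictionaries with the subreddit and perplexity columns of reddit_comments.csv, for example as produced by `csv.DictReader`; the function name is hypothetical.

```python
import heapq
from collections import defaultdict


def top_perplexity_per_subreddit(rows, k=150):
    """Group rows by subreddit and keep the k rows with the highest perplexity."""
    by_subreddit = defaultdict(list)
    for row in rows:
        by_subreddit[row["subreddit"]].append(row)
    return {
        name: heapq.nlargest(k, group, key=lambda r: float(r["perplexity"]))
        for name, group in by_subreddit.items()
    }
```

Perplexity values are converted with `float` because CSV readers return strings.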

Labeled Files with Style Examples (train_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)

These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.

These files can be used to train models to perform style transfer based on given examples.
The task is to rewrite the formal translation of the Reddit comment in the style of the three given examples.

An entry in this file is structured as follows:

"data":[
   {
      "input_sentence":"The original Reddit comment",
      "style_samples":[
         "sample1",
         "sample2",
         "sample3"
      ],
      "results_sentence":"The formal translated input_sentence",
      "subreddit":"The subreddit from which the comments originated"
   },
   "..."
]
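
To illustrate the task direction, the sketch below turns entries of this structure into (source, target) pairs: the formal translation plus the style samples form the source, and the original comment is the target. It assumes that the top level of the file is a JSON object with a "data" key, as the excerpt suggests; the function names and the "formal_text" key are hypothetical.

```python
import json


def to_training_pair(entry):
    """Build (source, target): formal text plus style samples -> original comment."""
    source = {
        "formal_text": entry["results_sentence"],
        "style_samples": list(entry["style_samples"]),
    }
    target = entry["input_sentence"]
    return source, target


def load_pairs(json_text):
    """Parse a JSON document of the shape {"data": [...]} into training pairs."""
    entries = json.loads(json_text)["data"]
    return [to_training_pair(e) for e in entries if isinstance(e, dict)]
```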

Files (20.8 MB)

Name	MD5	Size
eval_labeled.csv	7c8b39f857a676da3c2d854b9d0e67c6	116.4 kB
eval_labeled_with_style_samples.json	224d0355e2e2cc528e3a8c8d45279785	374.8 kB
reddit_comments.csv	f85479b9c2537b4afea068209e697665	18.4 MB
train_labeled_with_style_samples.json	b240b84375a02afbe7b3f1a066f3cfab	1.4 MB
training_labeled.csv	82028f1d38e752c41e7800a72d5b30d1	419.9 kB
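
The MD5 checksums in the file list can be used to check a download. The following is a minimal sketch using Python's standard `hashlib`; the helper names are hypothetical, and the files are assumed to be stored locally under their listed names.

```python
import hashlib
import os

# Checksums copied from the file list above.
EXPECTED_MD5 = {
    "eval_labeled.csv": "7c8b39f857a676da3c2d854b9d0e67c6",
    "eval_labeled_with_style_samples.json": "224d0355e2e2cc528e3a8c8d45279785",
    "reddit_comments.csv": "f85479b9c2537b4afea068209e697665",
    "train_labeled_with_style_samples.json": "b240b84375a02afbe7b3f1a066f3cfab",
    "training_labeled.csv": "82028f1d38e752c41e7800a72d5b30d1",
}


def md5_hex(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def file_chunks(path, chunk_size=1 << 20):
    """Yield a file's contents in chunks so large files are not loaded at once."""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def verify(path):
    """Compare a local file's MD5 with the listed checksum for its file name."""
    return md5_hex(file_chunks(path)) == EXPECTED_MD5[os.path.basename(path)]
```

For example, `verify("reddit_comments.csv")` returns True only if the local copy matches the listed checksum.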

Additional details

Reif, Emily et al. (2022) A Recipe For Arbitrary Text Style Transfer with Large Language Models

	All versions	This version
Views	1,631	1,351
Downloads	1,215	887
Data volume	5.5 GB	3.9 GB
