Published June 10, 2023 | Version 1.0.0
Dataset Open

Reddit Comments Dataset for Text Style Transfer Tasks

Creators

  • 1. Technische Hochschule Augsburg

Contributors

  • 1. Technische Hochschule Augsburg

Description

A dataset of Reddit comments prepared for Text Style Transfer Tasks.

The dataset contains Reddit comments together with translations of those comments into more formal language. The translations were generated with text-davinci-003, using the following prompt:
"Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"
This prompting technique was taken from "A Recipe For Arbitrary Text Style Transfer with Large Language Models" (Reif et al., 2022).
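
As a minimal sketch, the prompt above could be filled in for a single comment as follows. The function names `build_prompt` and `extract_rewrite` are hypothetical. The sketch assumes that the curly braces are literal delimiters around the comment and that the final "{" is left open for the model to complete; the description does not spell this out.

```python
def build_prompt(original_comment: str) -> str:
    """Fill the rewrite prompt quoted above with one comment."""
    # Assumption: the braces are literal delimiters around the comment,
    # and the trailing "{" is intentionally left open for the model.
    return (
        "Here is some text: {" + original_comment + "} "
        "Here is a rewrite of the text, which is more neutral: {"
    )


def extract_rewrite(completion: str) -> str:
    """Keep the text up to the first closing brace of a model completion."""
    # Assumption: the model closes the brace it was given. If it does not,
    # the whole completion is returned unchanged (apart from whitespace).
    return completion.split("}", 1)[0].strip()
```

Sending the prompt to a model is deliberately left out, since the request parameters used for this dataset are not documented here.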

The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.

The quality of formal translations was assessed with BERTScore and chrF++:

  • BERTScore: F1-Score: 0.89, Precision: 0.90, Recall: 0.88
  • chrF++: 37.16

The average perplexity of the generated formal texts, calculated using GPT-2, is 123.77.
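
For orientation, perplexity is commonly defined as the exponential of the negative mean log-probability per token. The sketch below uses that standard definition; it assumes the per-token natural-log probabilities have already been obtained from GPT-2, and it assumes the reported average is an arithmetic mean over texts, which the description does not state. Both function names are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one text from its per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


def average_perplexity(texts_log_probs):
    """Arithmetic mean of per-text perplexities (an assumed aggregation)."""
    values = [perplexity(log_probs) for log_probs in texts_log_probs]
    return sum(values) / len(values)
```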


The dataset consists of three components.

reddit_commments.csv

This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
- subreddit (name of the subreddit in which the comment was posted)
- id (ID of the comment)
- submission_id (ID of the submission to which the comment was posted)
- body (the comment itself)
- created_utc (timestamp in seconds)
- parent_id (ID of the comment or submission to which the comment is a reply)
- permalink (URL of the original comment)
- token_size (number of tokens into which the standard GPT-2 tokenizer splits the comment)
- perplexity (perplexity of the comment as calculated by GPT-2)

The comments were filtered. This file contains only comments that:
- are split by the GPT-2 tokenizer into more than 10 but fewer than 512 tokens
- are not [removed] or [deleted]
- do not contain URLs
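
The three filter conditions can be sketched as a single predicate. This is an illustration under assumptions, not the original filtering code: `keep_comment` is a hypothetical name, the URL pattern is a guess at what counts as a URL, and the token counter is passed in as a callable so that any tokenizer (for example the GPT-2 tokenizer) can be plugged in.

```python
import re

# Assumption: a URL is anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
PLACEHOLDERS = {"[removed]", "[deleted]"}


def keep_comment(body, count_tokens, min_tokens=10, max_tokens=512):
    """Return True if a comment body passes all three filters listed above."""
    text = body.strip()
    if text in PLACEHOLDERS:
        return False
    if URL_PATTERN.search(text):
        return False
    # Both bounds are exclusive: more than 10 and fewer than 512 tokens.
    return min_tokens < count_tokens(text) < max_tokens
```

For a quick check without a real tokenizer, a whitespace counter such as `lambda s: len(s.split())` can stand in for `count_tokens`, although its counts will differ from those of GPT-2.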

This file was used as a source for the other two file types.

Labeled Files (training_labeled.csv and eval_labeled.csv)

These files contain the formal translations of the Reddit comments.

From each Subreddit, the 150 comments with the highest GPT-2 perplexity were translated into a formal version. This selection was chosen so that as many of the translated comments as possible have high stylistic salience.
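
The per-subreddit selection can be sketched as follows, assuming each row is a dictionary with the `subreddit` and `perplexity` columns described for the comments file. The function name `top_by_perplexity` is hypothetical, and how ties were broken in the original selection is not known.

```python
import heapq
from collections import defaultdict


def top_by_perplexity(rows, k=150):
    """Group rows by subreddit and keep the k rows with the highest perplexity."""
    by_subreddit = defaultdict(list)
    for row in rows:
        by_subreddit[row["subreddit"]].append(row)
    # heapq.nlargest returns the selected rows sorted from highest to lowest.
    return {
        name: heapq.nlargest(k, group, key=lambda r: float(r["perplexity"]))
        for name, group in by_subreddit.items()
    }
```

Perplexity values are converted with `float` because values read from a CSV file arrive as strings.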

The labeled files have the following columns:
- Subreddit (name of the subreddit where the comment was posted)
- Original Comment
- Formal Comment
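
A labeled file could be read with the standard `csv` module roughly as shown below. The sketch assumes the header row uses exactly the three column names listed above; `read_labeled` is a hypothetical helper name.

```python
import csv
import io


def read_labeled(handle):
    """Read (subreddit, original, formal) triples from an open CSV handle."""
    reader = csv.DictReader(handle)
    return [
        (row["Subreddit"], row["Original Comment"], row["Formal Comment"])
        for row in reader
    ]


# Made-up sample row, used only to show the expected shape of the data.
sample = io.StringIO(
    'Subreddit,Original Comment,Formal Comment\r\nnews,"lol, no",No.\r\n'
)
triples = read_labeled(sample)
```

For the real files, the handle would come from something like `open("training_labeled.csv", newline="", encoding="utf-8")`; the encoding is an assumption.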

Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)

These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.

These files can be used to train models to perform style transfers based on given examples.
The task is to rewrite the formal translation of the Reddit comment in the style of the three given example comments.

An entry in these files is structured as follows:

"data":[
   {
      "input_sentence":"The original Reddit comment",
      "style_samples":[
         "sample1",
         "sample2",
         "sample3"
      ],
      "results_sentence":"The formal translated input_sentence",
      "subreddit":"The subreddit from which the comments originated"
   },
   "..."
]
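
To illustrate the task direction, the sketch below turns entries of this shape into (source, target) pairs, where the source is the formal translation plus the style samples and the target is the original comment. It assumes the `"data"` list sits inside a top-level JSON object, which the fragment above does not show; `to_training_pairs` and the keys of the source dictionary are hypothetical.

```python
import json


def to_training_pairs(document):
    """Build (source, target) pairs from a parsed style-samples file."""
    pairs = []
    for entry in document["data"]:
        if not isinstance(entry, dict):
            continue  # skip placeholders such as the "..." in the example
        source = {
            "formal_text": entry["results_sentence"],
            "style_samples": list(entry["style_samples"]),
        }
        # The target is the original comment the model should reproduce.
        pairs.append((source, entry["input_sentence"]))
    return pairs
```

A file would be parsed first with `json.load` and the resulting object passed to `to_training_pairs`.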

 

Files

Files (20.8 MB)

  • md5:7c8b39f857a676da3c2d854b9d0e67c6 (116.4 kB)
  • md5:224d0355e2e2cc528e3a8c8d45279785 (374.8 kB)
  • md5:f85479b9c2537b4afea068209e697665 (18.4 MB)
  • md5:b240b84375a02afbe7b3f1a066f3cfab (1.4 MB)
  • md5:82028f1d38e752c41e7800a72d5b30d1 (419.9 kB)

Additional details

References

  • Reif, Emily et al. (2022) A Recipe For Arbitrary Text Style Transfer with Large Language Models