Reddit Comments Dataset for Text Style Transfer Tasks
Description
A dataset of Reddit comments prepared for Text Style Transfer Tasks.
The dataset contains Reddit comments together with rewrites of those comments in a more formal style. The rewrites were generated with text-davinci-003, using the following prompt:
"Here is some text: {original_comment} Here is a rewrite of the text, which is more neutral: {"
This prompting technique is taken from Reif et al. (2022), A Recipe For Arbitrary Text Style Transfer with Large Language Models.
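As a rough illustration of how such a prompt could be assembled, here is a minimal sketch. The function names `build_prompt` and `extract_rewrite` are hypothetical and not part of the dataset. The description does not say whether the comment itself was wrapped in literal curly braces (as in the examples of Reif et al.) or whether `{original_comment}` is only a placeholder, so the sketch exposes this as the `wrap_in_braces` option. Cutting the completion at the first closing brace is likewise an assumption.

```python
def build_prompt(original_comment: str, wrap_in_braces: bool = True) -> str:
    """Fill the prompt template with one Reddit comment.

    wrap_in_braces is an assumption: the dataset description leaves open
    whether the comment was enclosed in literal braces.
    """
    text = "{" + original_comment + "}" if wrap_in_braces else original_comment
    return (
        "Here is some text: " + text
        + " Here is a rewrite of the text, which is more neutral: {"
    )


def extract_rewrite(completion: str) -> str:
    """Keep the model output up to the first closing brace (assumed delimiter)."""
    return completion.split("}", 1)[0].strip()
```

The prompt deliberately ends with an opening brace, so that the model continues with the rewritten text.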
The dataset contains comments from the following Subreddits: antiwork, atheism, Conservative, conspiracy, dankmemes, gaybros, leagueoflegends, lgbt, libertarian, linguistics, MensRights, news, offbeat, PoliticalCompassMemes, politics, teenagers, TrueReddit, TwoXChromosomes, wallstreetbets, worldnews.
The quality of formal translations was assessed with BERTScore and chrF++:
- BERTScore: F1-Score: 0.89, Precision: 0.90, Recall: 0.88
- chrF++: 37.16
The average perplexity of the generated formal texts, calculated with GPT-2, is 123.77.
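For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows only this arithmetic. It assumes the per-token natural-log probabilities have already been obtained from a language model such as GPT-2, and that the reported average is a plain mean over per-text perplexities; neither detail is stated in the description, and the function names are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one text from its per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mean_perplexity(texts_log_probs):
    """Plain mean of per-text perplexities (assumed averaging scheme)."""
    values = [perplexity(lp) for lp in texts_log_probs]
    return sum(values) / len(values)
```

A text whose tokens each receive probability 0.25 has a perplexity of about 4 under this definition, so higher values indicate text the model finds less predictable.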
The dataset consists of three components.
reddit_commments.csv
This file contains a collection of randomly selected comments from 20 Subreddits. For each comment, the following information was collected:
- subreddit (name of the subreddit in which the comment was posted)
- id (ID of the comment)
- submission_id (ID of the submission to which the comment was posted)
- body (the comment itself)
- created_utc (timestamp in seconds)
- parent_id (ID of the comment or submission to which the comment is a reply)
- permalink (URL of the original comment)
- token_size (number of tokens into which the standard GPT-2 tokenizer splits the comment)
- perplexity (perplexity of the comment as calculated by GPT-2)
The comments were filtered. This file contains only comments that:
- are split by the GPT-2 tokenizer into more than 10 but fewer than 512 tokens
- are not [removed] or [deleted]
- do not contain URLs
This file was used as a source for the other two file types.
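The filter criteria listed above can be sketched as a single predicate. This is an illustration only: the function name `keep_comment` and the URL regular expression are assumptions, since the description does not say how URLs were detected. The `body` and `token_size` arguments correspond to the columns of the same name.

```python
import re

# Assumed URL heuristic; the actual detection rule is not documented.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def keep_comment(body: str, token_size: int) -> bool:
    """Return True if a comment passes all three filters."""
    # More than 10 but fewer than 512 GPT-2 tokens.
    if not (10 < token_size < 512):
        return False
    # Placeholder bodies left behind by removed or deleted comments.
    if body.strip() in {"[removed]", "[deleted]"}:
        return False
    # Comments containing URLs are dropped.
    if URL_PATTERN.search(body):
        return False
    return True
```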
Labeled Files (training_labeled.csv and eval_labeled.csv)
These files contain the formal translations of the Reddit comments.
From each subreddit, the 150 comments with the highest GPT-2 perplexity were translated into a formal version. This selection was chosen so that, as far as possible, the translated comments are those with high stylistic salience.
They are structured as follows:
- Subreddit (name of the subreddit where the comment was posted).
- Original Comment
- Formal Comment
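The per-subreddit selection described above (the 150 highest-perplexity comments from each subreddit) could be reproduced roughly as follows. This is a sketch, not the authors' code: it assumes the rows come from reddit_commments.csv as dictionaries (for example via `csv.DictReader`) with the `subreddit` and `perplexity` columns, and the function name `top_by_perplexity` is hypothetical.

```python
import heapq
from collections import defaultdict


def top_by_perplexity(rows, per_subreddit=150):
    """Group rows by subreddit and keep the highest-perplexity ones in each group."""
    by_subreddit = defaultdict(list)
    for row in rows:
        by_subreddit[row["subreddit"]].append(row)
    return {
        name: heapq.nlargest(
            per_subreddit, items, key=lambda r: float(r["perplexity"])
        )
        for name, items in by_subreddit.items()
    }
```

The `float` conversion is there because CSV readers return every field as a string.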
Labeled Files with Style Examples (training_labeled_with_style_samples.json and eval_labeled_with_style_samples.json)
These files contain an original Reddit comment, three sample comments from the same subreddit, and the formal translation of the original Reddit comment.
These files can be used to train models to perform style transfers based on given examples.
The task is to transform the formal translation of the Reddit comment, using the three given examples, into the style of the examples.
Entries in these files are structured as follows:
"data": [
    {
        "input_sentence": "The original Reddit comment",
        "style_samples": [
            "sample1",
            "sample2",
            "sample3"
        ],
        "results_sentence": "The formal translated input_sentence",
        "subreddit": "The subreddit from which the comments originated"
    },
    "..."
]
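To make the task direction concrete, the following sketch turns such entries into (source, target) pairs: the formal text plus the three style samples form the input, and the original comment is the target. It assumes that the top level of each JSON file is an object with a `data` key, which the excerpt above suggests but does not show. The sample payload is made-up placeholder text, and `to_training_pairs` is a hypothetical helper.

```python
import json

# Placeholder payload mirroring the documented entry structure.
SAMPLE = """
{"data": [{
    "input_sentence": "The original Reddit comment",
    "style_samples": ["sample1", "sample2", "sample3"],
    "results_sentence": "The formal translated input_sentence",
    "subreddit": "news"
}]}
"""


def to_training_pairs(payload):
    """Build (source, target) pairs for example-based style transfer."""
    pairs = []
    for entry in payload["data"]:
        source = {
            "formal_text": entry["results_sentence"],
            "style_samples": entry["style_samples"],
        }
        # The target is the original comment, i.e. the text in the subreddit's style.
        pairs.append((source, entry["input_sentence"]))
    return pairs


pairs = to_training_pairs(json.loads(SAMPLE))
```

For a real file, `json.loads(SAMPLE)` would be replaced by `json.load` on an opened training_labeled_with_style_samples.json.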
Files (20.8 MB)
| MD5 | Size |
|---|---|
| 7c8b39f857a676da3c2d854b9d0e67c6 | 116.4 kB |
| 224d0355e2e2cc528e3a8c8d45279785 | 374.8 kB |
| f85479b9c2537b4afea068209e697665 | 18.4 MB |
| b240b84375a02afbe7b3f1a066f3cfab | 1.4 MB |
| 82028f1d38e752c41e7800a72d5b30d1 | 419.9 kB |
Additional details
References
- Reif, Emily et al. (2022) A Recipe For Arbitrary Text Style Transfer with Large Language Models