Published August 26, 2023 | Version 1.0
Dataset Open

OpenLLMText Dataset

  • 1. Carnegie Mellon University

Description

The dataset contains approximately 300k text entries collected from 5 different sources (Human, ChatGPT, PaLM, LLaMA, GPT2-XL).

60k of them are human-written, randomly selected from the OpenWebText dataset. These entries were collected from user-generated content on Reddit before 2019.

60k of them are ChatGPT (gpt3.5-turbo) paragraph-by-paragraph rephrasings of the human-written data.

60k of them are PaLM (Pathways Language Model, text-bison-001) paragraph-by-paragraph rephrasings of the human-written data.

60k of them are LLaMA-7B (Large Language Model Meta AI) paragraph-by-paragraph rephrasings of the human-written data.

60k of them are adapted from the GPT2-output dataset released by OpenAI (GPT2-XL).

Files

ChatGPT.zip

Files (349.3 MB)

MD5 checksum                            Size
md5:67e25c815fdcc29df1732281931cf778    45.5 MB
md5:26be006582244507c28db72b3270d111    102.6 MB
md5:e7b7425b03cf852d98c95432f5da6bfe    97.7 MB
md5:39c1ad8b86b52c28a99074900a79fbbe    56.7 MB
md5:c296f4c7e056996b23df56a2a50dd29d    3.0 MB
md5:1c46ddb0553073d81b30892db4af9e11    35.6 MB
md5:a398f1a608be080afd55428443c92ae4    8.2 MB
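
The MD5 values listed for the files can be used to check that a download arrived intact. The following is a minimal sketch in Python using only the standard library. The function names `md5_of_file` and `is_listed` are hypothetical helpers introduced here for illustration, and the local file path is an assumption. Because the listing does not show which checksum belongs to which file name, the sketch only checks whether a file's digest matches any of the seven listed values.

```python
import hashlib

# MD5 values copied from the file listing (the "md5:" prefix is dropped).
LISTED_MD5 = {
    "67e25c815fdcc29df1732281931cf778",
    "26be006582244507c28db72b3270d111",
    "e7b7425b03cf852d98c95432f5da6bfe",
    "39c1ad8b86b52c28a99074900a79fbbe",
    "c296f4c7e056996b23df56a2a50dd29d",
    "1c46ddb0553073d81b30892db4af9e11",
    "a398f1a608be080afd55428443c92ae4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path, listed=LISTED_MD5):
    """Return True if the file's MD5 matches any listed checksum."""
    return md5_of_file(path) in listed


# Hypothetical usage, assuming an archive was saved as "ChatGPT.zip":
# print(is_listed("ChatGPT.zip"))
```

Reading in chunks keeps memory use flat even for the larger archives, which are around 100 MB each. MD5 is adequate here for detecting corrupted or truncated downloads, but it is not a defense against deliberate tampering.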