OpenLLMText Dataset
Description
The dataset contains approximately 300k text entries collected from 5 different sources (Human, ChatGPT, PaLM, LLaMA, GPT2-XL).
60k of them are human-written, randomly selected from the OpenWebText dataset. These entries were collected from user-generated content on Reddit before 2019.
60k of them are ChatGPT (gpt3.5-turbo) paragraph-by-paragraph rephrasings of the human-written data.
60k of them are PaLM (Pathways Language Model, text-bison-001) paragraph-by-paragraph rephrasings of the human-written data.
60k of them are LLaMA-7B (Large Language Model Meta AI) paragraph-by-paragraph rephrasings of the human-written data.
60k of them are adapted from the GPT2-output dataset released by OpenAI (GPT2-XL).
Files
ChatGPT.zip
Total size: 349.3 MB
MD5 checksum | Size |
---|---|
67e25c815fdcc29df1732281931cf778 | 45.5 MB |
26be006582244507c28db72b3270d111 | 102.6 MB |
e7b7425b03cf852d98c95432f5da6bfe | 97.7 MB |
39c1ad8b86b52c28a99074900a79fbbe | 56.7 MB |
c296f4c7e056996b23df56a2a50dd29d | 3.0 MB |
1c46ddb0553073d81b30892db4af9e11 | 35.6 MB |
a398f1a608be080afd55428443c92ae4 | 8.2 MB |
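The MD5 checksums listed above can be used to confirm that a downloaded archive arrived intact. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `matches_listed_md5` are hypothetical and not part of the dataset, and the listing does not state which checksum belongs to which file, so the pairing of a local file with a checksum is left to the reader.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, expected_hex):
    """Compare a file's MD5 with a checksum copied from the file listing.

    Hypothetical helper: surrounding whitespace and letter case in
    expected_hex are ignored.
    """
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_listed_md5(local_path, listed_checksum)` returns `True` only if the file at `local_path` hashes to the checksum copied from the listing. MD5 is adequate for detecting corrupted downloads, though it is not a defense against deliberate tampering.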