OpenLLMText Dataset

Chen, Yutian; Kang, Hao; Zhai, Yiyan; Li, Liangze; Singh, Rita; Raj, Bhiksha

doi:10.5281/zenodo.8285326

Published August 26, 2023 | Version 1.0

Dataset Open

1. Carnegie Mellon University

The dataset contains approximately 300k text entries from 5 sources: Human, ChatGPT, PaLM, LLaMA, and GPT2-XL.

60k entries are human-written, randomly selected from the OpenWebText dataset. These entries were collected from user-generated content on Reddit before 2019.

60k entries are ChatGPT (gpt-3.5-turbo) paragraph-by-paragraph rephrasings of the human-written data.

60k entries are PaLM (Pathways Language Model, text-bison-001) paragraph-by-paragraph rephrasings of the human-written data.

60k entries are LLaMA-7B (Large Language Model Meta AI) paragraph-by-paragraph rephrasings of the human-written data.

60k entries are adapted from the GPT2-output dataset released by OpenAI (GPT2-XL).

Files

Files (349.3 MB)

Name	MD5	Size
ChatGPT.zip	67e25c815fdcc29df1732281931cf778	45.5 MB
GPT2.zip	26be006582244507c28db72b3270d111	102.6 MB
Human.zip	e7b7425b03cf852d98c95432f5da6bfe	97.7 MB
LLaMA.zip	39c1ad8b86b52c28a99074900a79fbbe	56.7 MB
OpenAI-baseline-response.zip	c296f4c7e056996b23df56a2a50dd29d	3.0 MB
PaLM.zip	1c46ddb0553073d81b30892db4af9e11	35.6 MB
ZeroGPT-baseline-response.zip	a398f1a608be080afd55428443c92ae4	8.2 MB
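
The MD5 digests in the file listing can be used to check that each archive downloaded intact. The following is a minimal sketch, assuming the archives were saved under their listed names in a single local directory; the helper names `md5_of_file` and `verify_downloads` are illustrative and not part of the dataset.

```python
import hashlib
from pathlib import Path

# MD5 digests copied from the file listing of this record.
EXPECTED_MD5 = {
    "ChatGPT.zip": "67e25c815fdcc29df1732281931cf778",
    "GPT2.zip": "26be006582244507c28db72b3270d111",
    "Human.zip": "e7b7425b03cf852d98c95432f5da6bfe",
    "LLaMA.zip": "39c1ad8b86b52c28a99074900a79fbbe",
    "OpenAI-baseline-response.zip": "c296f4c7e056996b23df56a2a50dd29d",
    "PaLM.zip": "1c46ddb0553073d81b30892db4af9e11",
    "ZeroGPT-baseline-response.zip": "a398f1a608be080afd55428443c92ae4",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each listed archive to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results


if __name__ == "__main__":
    # Assumes the archives sit in the current working directory.
    for name, status in verify_downloads().items():
        print(f"{name}: {status}")
```

Reading in chunks keeps memory use flat even for the larger archives such as GPT2.zip (102.6 MB). An archive reported as "mismatch" is most likely an incomplete download and should be fetched again.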
