Published April 10, 2023 | Version v1
Dataset Open

Quora Chat Dutch

Description

This dataset contains 54,444 conversations between een AI assistant and a (fake) "Human" (generated) in Dutch. They are translations of Baize's machine-generated answers to the Quora dataset.

Data Instances

{
  'id': 10,
  'input': 'Het gesprek tussen de mens en de AI-assistent.\n'
           '[|Human|] Ik vroeg me af wanneer de Moto G3 de Android 7.0-update '
           'zal krijgen?\n'
           '[|AI|] De Moto G3 (Moto G 3rd gen 2015) komt in aanmerking voor de '
           'Android 7.0 (Nougat) update.\n'
           '[|Human|] Geweldig! Weet je wanneer de update beschikbaar zal '
           'zijn?\n'
           '[|AI|] De update werd uitgebracht in december 2016 voor de Moto G4 '
           'en G4 Plus. Er is echter nog geen woord over wanneer de Moto G3 de '
           'update zal ontvangen.\n'
           '[|Human|] Is er nieuws of het binnenkort de update zal ontvangen?\n'
           '[|AI|] Het spijt me, ik heb momenteel geen informatie daarover. '
           'Het is het beste om uit te kijken naar updates van Motorola.\n'
           '[|Human|] ',
  'topic': 'Wanneer krijgt Moto G3 (Moto G 3e gen 2015) de Android 7.0 '
           '(Nougat) update?'
}

 

Data Fields

  • id: the ID of the item. The following IDs are not included because they could not be translated: [1352, 14599, 25985, 27167, 28254, 33731, 41550, 44364, 47293, 49060, 51383, 53740]
  • input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent.\n and has at least one occurrence of both [|AI|] and [|Human|].
  • topic: the topic description

Dataset Creation

Both the translations and the topics were translated with OpenAI's API for gpt-3.5-turbomax_tokens=1024, temperature=0 as parameters.

The prompt template to translate the input is:

CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.

Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as `[|AI|]`) and the human (`[|Human|]`) talking in turns and responding to each other;
2. do not translate the speaker identifiers `[|AI|]` and `[|Human|]` but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.

Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.\n\n"""

 

The prompt to translate the topic is:

TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct," \
                           " summarizing manner. Translate accurately and formally. Do not provide any explanation" \
                           " about the translation and do not include the original title.\n\n"

 

The system message was:

You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

 

Note that 12 items (0.02%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1352, 14599, 25985, 27167, 28254, 33731, 41550, 44364, 47293, 49060, 51383, 53740].

The translation quality has not been verified. Use at your own risk!

Licensing Information

Licensing info for quora is listed as unknown. Baize is listed as GPLv3. So this dataset is also listed as such. A specific restriction is in place: as per OpenAI's terms of use, this dataset cannot be used to build a commercial system that competes with OpenAI's services.

This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

If you use this dataset, you must also follow the Sharing and Usage policies.

As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

This dataset is also available on the Hugging Face hub which is the canonical repository for this dataset. See the link for more info.

Files

Files (82.3 MB)

Name Size Download all
md5:318276cadf01b3b240ca0e77deb47ed9
82.3 MB Download