Published April 12, 2023 | Version v1
Dataset Open

Stack Overflow Chat Dutch

Description

This dataset contains 56,964 conversations between een AI assistant and a (fake) "Human" (generated) in Dutch, specifically in the domain of programming (Stack Overflow). They are translations of Baize's machine-generated answers to the Stack Overflow dataset.

Data Instances

 {'id': 9,
  'input': 'Het gesprek tussen de mens en de AI-assistent.\n'
           '[|Human|] Ik heb wat hulp nodig met Spark SQL union en het ordenen '
           'van kolommen.\n'
           '[|AI|] Natuurlijk, ik help je graag. Waar heb je specifiek hulp '
           'bij nodig?\n'
           '[|Human|] Ik wil de output van een UNION-query ordenen op een '
           'kolom die niet aanwezig is in de SELECT-instructie. Is dat '
           'mogelijk?\n'
           '[|AI|] Ja, het is mogelijk met Spark SQL. Je kunt de ordinale '
           'positie van de kolom in de ORDER BY-clausule gebruiken in plaats '
           'van de kolomnaam. Als je bijvoorbeeld wilt ordenen op de derde '
           'kolom in de SELECT-instructie, kun je ORDER BY 3 gebruiken.\n'
           '[|Human|] Dat is nuttig. Kan ik ook een alias gebruiken in de '
           'ORDER BY-clausule?\n'
           '[|AI|] Ja, je kunt een alias gebruiken in de ORDER BY-clausule. De '
           'alias moet worden gedefinieerd in de SELECT-instructie. Als je '
           'bijvoorbeeld de derde kolom als "col3" alias geeft, kun je ordenen '
           'met ORDER BY col3.\n'
           '[|Human|] Oké, ik denk dat ik het begrijp. Kan ik dezelfde alias '
           'gebruiken voor kolommen uit beide tabellen in de UNION?\n'
           '[|AI|] Nee, je kunt dezelfde alias niet gebruiken voor kolommen '
           'uit beide tabellen in de UNION. Elke tabel in de UNION moet een '
           'unieke alias hebben voor de kolommen.\n'
           '[|Human|] ',
  'topic': 'Spark SQL UNION - ORDER BY kolom niet in SELECT'},

 

Data Fields

  • id: the ID of the item. The following 82 IDs are not included because they could not be translated: [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992]
  • input: the machine-generated conversation between AI and "Human". Always starts with Het gesprek tussen de mens en de AI-assistent.\n and has at least one occurrence of both [|AI|] and [|Human|].
  • topic: the topic description

Dataset Creation

Both the translations and the topics were translated with OpenAI's API for gpt-3.5-turbomax_tokens=1024, temperature=0 as parameters.

The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a conversation between an AI assistant and a human from {src_lang} into {tgt_lang}.

Here are the requirements that you should adhere to:
1. maintain the format: the conversation consists of the AI (marked as `[|AI|]`) and the human (`[|Human|]`) talking in turns and responding to each other;
2. do not translate the speaker identifiers `[|AI|]` and `[|Human|]` but always copy them into the translation in appropriate places;
3. ensure accurate translation and keep the correctness of the conversation;
4. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
5. translate the human's text using informal, but standard, language;
6. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
7. if the human asks to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in {tgt_lang}, and then also generate a corrected output version for the AI in {tgt_lang};
8. if the human asks to translate text from one to another language, then you only translate the human's question to {tgt_lang} but you keep the translation that the AI provides in the language that the human requested;
9. do not translate code fragments but copy them as they are. If there are English examples, variable names or definitions in code fragments, keep them in English.

Now translate the following conversation with the requirements set out above. Do not provide an explanation and do not add anything else.\n\n"""

 

The prompt to translate the topic is:

TOPIC_TRANSLATION_PROMPT = "Translate the following title of a conversation from {src_lang} to {tgt_lang} in a succinct," \
                           " summarizing manner. Translate accurately and formally. Do not provide any explanation" \
                           " about the translation and do not include the original title.\n\n"

 

The system message was:

You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

 

Note that 82 items (0.1%) were not successfully translated. The translation was missing the AI identifier [|AI|] and/or the human one [|Human|]. The IDs for the missing items are [1713, 1937, 1960, 4326, 4356, 8357, 8542, 8827, 9137, 9782, 11560, 11961, 12244, 12362, 12488, 13259, 13621, 14445, 14835, 15006, 17746, 18808, 19285, 19426, 19491, 21270, 21661, 22098, 23352, 23840, 23869, 25148, 25928, 27102, 27856, 28387, 29942, 30041, 30251, 32396, 32742, 32941, 33628, 34116, 34648, 34859, 35977, 35987, 36035, 36456, 37028, 37238, 37640, 38107, 38735, 39015, 40984, 41115, 41567, 42397, 43219, 43783, 44599, 44980, 45239, 47676, 48922, 49534, 50282, 50683, 50804, 50919, 51076, 51211, 52000, 52183, 52489, 52595, 53884, 54726, 55795, 56992].

The translation quality has not been verified. Use at your own risk!

Licensing Information

Licensing info for Stack Overflow Questions is listed as Apache 2.0. If you use the current dataset, you should also adhere to the original license.

This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

If you use this dataset, you must also follow the Sharing and Usage policies.

As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

This dataset is also available on the Hugging Face hub with the same DOI and license. See that README for more info.

Files

Files (82.8 MB)

Name Size Download all
md5:3a5dfbe9aec2aa9a8dea2f10442dbe52
82.8 MB Download