ChatSubs

Description

The ChatSubs dataset contains dialogue data in Spanish and three of Spain's co-official languages (Catalan, Basque, and Galician). It was built from OpenSubtitles: we gathered the movie subtitles in our languages of interest and processed them to generate clearly segmented dialogues and their turns. We also openly share the code developed for data processing.

The result is a dataset of 206,706 JSON files with more than 20 million dialogues and 96 million turns. It is one of the largest dialogue corpora available, as other similar datasets in better-resourced languages do not reach 500k dialogues or present less clearly defined conversations. Thus, the ChatSubs dataset is an ideal resource for research teams interested in training dialogue models in Spanish, Catalan, Basque, and Galician.

Corpus structure

The archive that we share contains four datasets, one for each language: open_subtitles_ca (Catalan), open_subtitles_es (Spanish), open_subtitles_eu (Basque), and open_subtitles_gl (Galician). Each folder also contains a metadata file (export.txt), taken from the original dump of the raw subtitles, with information about the subtitle files.

Each JSON file is named after its unique IDSubtitleFile, the same identifier as the original subtitle file it was extracted from.

The location of each file is derived from its filename. Take the 1953288724.jsonl file from the open_subtitles_ca dataset as an example: the last four digits of the filename (8724) are reversed to give 4278, and each digit of the reversed sequence names one level of nested subfolders under the root open_subtitles_ca. The full path is therefore open_subtitles_ca/4/2/7/8/1953288724.jsonl.
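
The path rule described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the released code: the function name subtitle_path and its parameters are hypothetical, and because the description does not say how identifiers with fewer than four digits are stored, the sketch rejects them instead of guessing.

```python
from pathlib import PurePosixPath


def subtitle_path(root: str, subtitle_id: str, ext: str = ".jsonl") -> str:
    """Build the relative path of a file from its IDSubtitleFile.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    if len(subtitle_id) < 4 or not subtitle_id.isdigit():
        # Assumption: shorter or non-numeric IDs are not covered by the
        # dataset description, so they are rejected here.
        raise ValueError(f"expected a numeric ID with at least 4 digits: {subtitle_id!r}")

    # Take the last four digits and reverse them, e.g. "8724" -> "4278".
    reversed_digits = subtitle_id[-4:][::-1]

    # Each reversed digit becomes one subfolder level under the root.
    return str(PurePosixPath(root, *reversed_digits, subtitle_id + ext))


print(subtitle_path("open_subtitles_ca", "1953288724"))
```

For the example file, the call returns open_subtitles_ca/4/2/7/8/1953288724.jsonl, matching the path given above.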

The code for generating the corpus

https://github.com/conversa-ai/ChatSubs

Acknowledgments 

This dataset and publication are a result of the project CONVERSA (TED2021-132470B-I00) funded by MCIN/AEI/10.13039/501100011033 and by "European Union NextGenerationEU/PRTR".