Published August 27, 2025 | Version 1.0
Dataset Restricted

The 3DSeTwitch Corpus – A Three-Dimensional Corpus Annotated for Sexist Phenomena. Version 1.0

Description

The 3DSeTwitch corpus, developed within the framework of the OLiNDiNUM project (Observatoire LINguistique du DIscours NUMérique), is a multimodal dataset comprising 47 French-language streams (222 hours, 10 minutes, 46 seconds) collected from 20 popular Twitch channels (10 male and 10 female streamers) active between October 2021 and April 2022. The corpus synchronizes streamer speech with live chat messages to enable integrated analysis.

Research objective

The corpus was designed to support the identification and analysis of sexist hate speech and to investigate how such discourse circulates on the Twitch.tv platform.

Methodology

Data selection

Male streamers were selected based on their popularity using two statistical tools: Sullygnome and Twitch Stat’s.
Female streamers were selected based on their popularity according to a specialized ranking published on the Influenzzz platform.
For both groups, popularity thresholds were applied (≥ 100,000 views for male streamers; ≥ 10,000 views for female streamers), with a maximum of 5 streams per channel. VODs were chosen based on their availability during the target period.

Data extraction and processing

Data were extracted using TwitchDownloader, saving each video (in .mp4 format) and its chat (in .json and .txt formats).
The tool developed by Steven Coats (2024) was then used to:

  • Automatically transcribe audio with WhisperX (Radford et al., 2022);
  • Align the data in HTML files structured in 4 columns: timestamp, transcribed speech, user pseudonym, chat message;
  • Generate .png graphs showing streamer speech density and chat activity per minute.
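As an illustration of how the aligned files might be read back, the following minimal Python sketch parses a four-column table and counts chat messages per minute. It rests on assumptions that the record does not confirm: that each HTML file contains a table built from tr/td elements, that every data row has exactly four cells in the order listed above, and that timestamps are written as HH:MM:SS. The names AlignedTableParser, parse_aligned_html and chat_messages_per_minute are hypothetical and are not part of the tool by Coats (2024).

```python
from collections import Counter
from html.parser import HTMLParser


class AlignedTableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row.

    Assumption: the aligned file is an HTML <table> whose data rows hold
    four cells (timestamp, speech, user label, chat message). The markup
    actually produced by the alignment tool may differ.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a 4-tuple of strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if len(self._row) == 4:  # skip header rows and malformed rows
                self.rows.append(tuple(self._row))
            self._row = None


def parse_aligned_html(html_text):
    """Return a list of (timestamp, speech, user, message) tuples."""
    parser = AlignedTableParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


def chat_messages_per_minute(rows):
    """Count non-empty chat messages per minute (assumes HH:MM:SS timestamps)."""
    counts = Counter()
    for timestamp, _speech, _user, message in rows:
        if not message:
            continue
        parts = timestamp.split(":")
        minute_index = int(parts[0]) * 60 + int(parts[1])
        counts[minute_index] += 1
    return dict(counts)
```

Only the standard library is used here so that the sketch runs without extra dependencies; if the real markup turns out to be more complex, a dedicated HTML library may be a better fit.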

Corpus structure

The corpus is organized in two main folders: F-all-3dSeTwitch (for female streamers) and M-all-3dSeTwitch (for male streamers), each containing one subfolder per channel.
Each stream includes:

  • .html file (aligned speech/chat transcription);
  • .png file (speech/chat density graph).

An Excel file consolidates metadata about channels and streams.
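The folder layout described above could be traversed with a short Python sketch such as the one below. It assumes that the stream files sit directly inside each channel subfolder and that a stream's .png graph shares the base name of its .html file; neither detail is stated in this record. The function name list_streams and the record keys are hypothetical.

```python
from pathlib import Path

# Folder names as given in the corpus description.
GROUP_FOLDERS = {"F": "F-all-3dSeTwitch", "M": "M-all-3dSeTwitch"}


def list_streams(corpus_root):
    """Return one record per aligned .html file found under the corpus root.

    Assumptions: each channel subfolder holds the stream files directly,
    and a stream's .png graph has the same base name as its .html file.
    """
    records = []
    root = Path(corpus_root)
    for group, folder in GROUP_FOLDERS.items():
        group_dir = root / folder
        if not group_dir.is_dir():
            continue  # tolerate a partial download of the corpus
        channel_dirs = sorted(p for p in group_dir.iterdir() if p.is_dir())
        for channel_dir in channel_dirs:
            for html_path in sorted(channel_dir.glob("*.html")):
                png_path = html_path.with_suffix(".png")
                records.append({
                    "group": group,
                    "channel": channel_dir.name,
                    "html": html_path,
                    "png": png_path if png_path.exists() else None,
                })
    return records
```

Calling list_streams on the folder that contains both group folders yields a flat list that can then be joined with the metadata from the Excel file.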

Version 1.0: This release contains only raw, structured data. No annotation has been performed at this stage. A future version will include annotations of sexist phenomena.

Usage, access, and legal framework

Usage and access

This corpus was created for scientific research purposes. It is distributed under the Creative Commons CC-BY-NC-SA 4.0 license:
https://creativecommons.org/licenses/by-nc-sa/4.0/.
This means that reuse is permitted for non-commercial purposes, provided the authors are credited and any derivative works are shared under the same license.
The corpus is accessible via the Zenodo platform to members of the higher education and research community (ESR, Enseignement supérieur et recherche).
For any other reasoned request, please contact: arobert@unisa.it.
Only the transcriptions (speech and chat messages) are distributed. The original videos are not part of the release. The videos of Ultia’s streams are the only ones that can be obtained: they are not published directly, but can be provided upon justified request at the address above.

Copyright

Streamers are considered public figures. Their speech was recorded in an open, online environment. This use falls under the copyright exception for research purposes (art. L122-5 of the French Intellectual Property Code, Directive 2019/790/EU, art. 3).
Anyone wishing to exercise a right of withdrawal may send a reasoned request to the above address. Requests will be reviewed in accordance with applicable law (copyright and GDPR).

Personal data protection (GDPR)

Chat usernames have been pseudonymized.
Each pseudonym has been replaced with a generic, non-identifying label:

  • Broadcaster (for the streamer)
  • User + number (for viewers)
  • Modo + number (for moderators or bots)

Direct mentions (e.g. @pseudo) have also been modified.
No attempt at re-identification will be made. Processing complies with Article 89 of the GDPR on scientific research purposes.
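The labelling scheme described in this section could be sketched in Python as follows. This is an illustration only, not the procedure actually applied to the corpus. It assumes that numbers are assigned in order of first appearance, that usernames are compared case-insensitively, that the label is written without a space (for example User1), and that a direct mention is replaced by the label of the mentioned account; the record only states that mentions were modified. The class name Pseudonymizer and its methods are hypothetical.

```python
import re

# Matches a Twitch-style mention such as @pseudo.
MENTION = re.compile(r"@(\w+)")


class Pseudonymizer:
    """Map raw usernames to generic labels (illustrative scheme only)."""

    def __init__(self, broadcaster, moderators=()):
        self.broadcaster = broadcaster.lower()
        self.moderators = {name.lower() for name in moderators}
        self.labels = {}  # lowercased raw name -> assigned label
        self.counters = {"User": 0, "Modo": 0}

    def label(self, username):
        """Return the stable label for a username, creating it if needed."""
        key = username.lower()
        if key == self.broadcaster:
            return "Broadcaster"
        if key not in self.labels:
            prefix = "Modo" if key in self.moderators else "User"
            self.counters[prefix] += 1
            self.labels[key] = f"{prefix}{self.counters[prefix]}"
        return self.labels[key]

    def scrub_message(self, text):
        """Replace each @mention with the label of the mentioned account."""
        return MENTION.sub(lambda m: "@" + self.label(m.group(1)), text)
```

Keeping one mapping per stream means the same viewer always receives the same label within that stream, so conversational threads stay readable without exposing the original username.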

Cited references

Coats, S. (2024). A framework for analysis of speech and chat content in YouTube and Twitch streams. In Céline Poudat and Mathilde Guernut (eds.), Proceedings of the 11th Conference on CMC and Social Media Corpora for the Humanities, 16–19. Nice, France: CORLI.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. arXiv:2212.04356 [eess.AS]. https://doi.org/10.48550/arXiv.2212.04356

Files

Restricted

The record is publicly accessible, but files are restricted.

Additional details

References

  • Robert, A. & Pietrandrea, P. (2024). The 3DSeTwitch corpus – A three-dimensional corpus annotated for sexist phenomena. In Céline Poudat and Mathilde Guernut (eds.), Proceedings of the 11th Conference on CMC and Social Media Corpora for the Humanities, 110–112. Nice, France: CORLI.