Published January 15, 2025
Version v1
Dataset
Restricted
MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts, GPT-4-Turbo, and Crowdworkers
Creators
Description
We provide text metadata, image frames, and thumbnails of YouTube videos classified as harmful or harmless by domain experts, GPT-4-Turbo, and crowdworkers. Harmful videos are categorized into one or more of six harm categories: Information harms (IH), Hate and Harassment harms (HH), Clickbait harms (CB), Addictive harms (ADD), Sexual harms (SXL), and Physical harms (PH).
This repository includes the text metadata and a link to external cloud storage for the image data.
Text Metadata
| Folder | Subfolder | #Videos |
| --- | --- | --- |
| Ground Truth | Harmful_full_agreement (classified as harmful by all three actors) | 5,109 |
| Ground Truth | Harmful_subset_agreement (classified as harmful by at least two actors) | 14,019 |
| Domain Experts | Harmful | 15,115 |
| Domain Experts | Harmless | 3,303 |
| GPT-4-Turbo | Harmful | 10,495 |
| GPT-4-Turbo | Harmless | 7,818 |
| Crowdworkers (Amazon Mechanical Turk) | Harmful | 12,668 |
| Crowdworkers (Amazon Mechanical Turk) | Harmless | 4,390 |
| Unannotated large pool | - | 60,906 |
Note. The term "actor" refers to the three annotating entities: domain experts, GPT-4-Turbo, and crowdworkers.
Explanations about the indicators
1. Ground truth - harmful_full_agreement & harmful_subset_agreement
- links
- video_id
- channel
- description
- transcript
- date
- maj_harmcat: In the full_agreement version, a harm category identified by all three actors; in the subset_agreement version, a harm category identified by at least two actors.
- all_harmcat: All harm categories assigned by any of the actors, with no agreement required.
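The two agreement indicators can be reproduced from per-actor category labels. A minimal sketch (the function and variable names below are illustrative, not part of the dataset's API):

```python
from collections import Counter

# The six harm categories used in the dataset.
HARM_CATEGORIES = {"IH", "HH", "CB", "ADD", "SXL", "PH"}

def agreement_categories(expert, gpt, crowd, min_votes=3):
    """Harm categories assigned by at least `min_votes` of the three actors.

    min_votes=3 corresponds to full agreement (maj_harmcat in
    harmful_full_agreement); min_votes=2 to subset agreement.
    """
    counts = Counter()
    for labels in (expert, gpt, crowd):
        counts.update(set(labels) & HARM_CATEGORIES)  # one vote per actor
    return {cat for cat, n in counts.items() if n >= min_votes}

def all_categories(expert, gpt, crowd):
    """All harm categories assigned by any actor (all_harmcat)."""
    return (set(expert) | set(gpt) | set(crowd)) & HARM_CATEGORIES
```

For example, if the expert and crowdworker both flag a video as IH and HH while GPT-4-Turbo flags only IH, full agreement yields `{"IH"}` and subset agreement yields `{"IH", "HH"}`.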
2. Domain Experts, GPT-4-Turbo, Crowdworkers
- links
- video_id
- channel
- description
- transcript
- date
- harmcat
3. Unannotated large pool
- links
- video_id
- channel
- description
- transcript
- date
Note. Some data from the external dataset does not include date information. In such cases, the date was marked as 1990-01-01.
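Consumers of the metadata may want to convert that sentinel back into a missing value. A minimal sketch, assuming dates are stored as ISO `YYYY-MM-DD` strings (the helper name is illustrative):

```python
from datetime import date

# Sentinel used in the metadata when no date information is available.
MISSING_DATE = date(1990, 1, 1)

def parse_video_date(raw):
    """Parse a metadata date string; return None for the missing-date sentinel."""
    d = date.fromisoformat(raw)
    return None if d == MISSING_DATE else d
```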
We retrieved transcripts using the YouTubeTranscriptApi. If a video has no text in its transcript field, the API failed to retrieve a transcript, possibly because the video contains no detectable language.
Some image frames are also included in the pickle files.
Image data
The image frames and thumbnails are available at this link: https://ucdavis.app.box.com/folder/302772803692?s=d23b20snl1slwkuh4pgvjs31m7r1xae2
1. Image frames (imageframes_1-20.zip): Image frames are organized into 20 zip folders due to the large size of the image frames. Each zip folder contains subfolders named after the unique video IDs of the annotated videos. Inside each subfolder, there are 15 sequentially numbered image frames (from 0 to 14) extracted from the corresponding video. The image frame folders do not distinguish between videos classified as harmful or non-harmful.
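Assuming the frames use a common image extension such as `.jpg` (unverified; check the extracted archives for the actual format), the per-video frame paths can be enumerated as follows (the helper name is illustrative):

```python
import os

FRAMES_PER_VIDEO = 15  # frames are numbered 0-14 within each video-ID folder

def frame_paths(root, video_id, ext=".jpg"):
    """Paths of the 15 frames for one video under an extracted archive.

    `ext` is an assumption; adjust it to match the files actually shipped.
    """
    return [os.path.join(root, video_id, f"{i}{ext}")
            for i in range(FRAMES_PER_VIDEO)]
```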
2. Thumbnails (Thumbnails.zip): The zip folder contains the thumbnails of the individual videos used in classification, each named with the video's unique video ID. This folder does not distinguish between videos classified as harmful or harmless.
Related works (in preprint)
For details about the harm classification taxonomy and the performance comparison between crowdworkers, GPT-4-Turbo, and domain experts, please see https://arxiv.org/abs/2411.05854.
Files
Additional details
Related works
- Is supplement to
- Publication: arXiv:2411.05854 (arXiv)