Published June 17, 2025 | Version v1
Dataset Open

YTCommentVerse: A Multi-Category Multi-Lingual YouTube Comment Corpus

  • 1. Deakin University
  • 2. Stanford University

Description

Introduction

We introduce YTCommentVerse, a large-scale multilingual, multi-category dataset of YouTube comments. It contains over 32 million comments posted on 178,000 videos by more than 20 million unique users, spanning 15 distinct YouTube content categories such as Music, News, Education and Entertainment. Each comment in the dataset includes video and comment IDs, commenter channel details, upvote counts and a category label. With comments in over 50 languages, YTCommentVerse provides a rich resource for exploring sentiment, toxicity and engagement patterns across diverse cultural and topical contexts. By combining multiple languages, detailed categories and per-comment metadata, it helps fill a major gap in publicly available social media datasets, particularly for analyzing video-sharing platforms.

Data Description

Each entry in the dataset corresponds to one comment on a specific YouTube video and has the following columns: videoID, commentID, commenterName, commenterChannelID, comment, votes, originalChannelID, category. Each field is explained below:

videoID: the ID of the YouTube video the comment was posted on.
commentID: the ID of the comment.
commenterName: the name of the commenter.
commenterChannelID: the channel ID of the commenter.
comment: the comment text.
votes: the number of upvotes the comment received.
originalChannelID: the ID of the channel that posted the video.
category: the category of the YouTube video.
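
The column list above can be mirrored as a typed record. The following is a minimal Python sketch, not part of the dataset release: the class name `CommentRecord`, the helper `from_dict`, and the decision to coerce `votes` to an integer are all illustrative assumptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CommentRecord:
    """One dataset entry; field names follow the documented columns."""

    videoID: str
    commentID: str
    commenterName: str
    commenterChannelID: str
    comment: str
    votes: int
    originalChannelID: str
    category: str

    @classmethod
    def from_dict(cls, raw: dict) -> "CommentRecord":
        # Reject entries that lack any of the eight documented columns.
        expected = [f.name for f in fields(cls)]
        missing = [name for name in expected if name not in raw]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        values = {name: raw[name] for name in expected}
        # Assumption: votes might be serialized as a string, so coerce it.
        values["votes"] = int(values["votes"])
        return cls(**values)
```

Extra keys in the input are ignored, so the sketch keeps working if the files carry columns beyond the eight described here.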

Data Anonymization

The data is anonymized by removing all Personally Identifiable Information (PII). 

Data sample

{
"videoID": "ab9fe84e2b2406efba4c23385ef9312a",
"commentID": "488b24557cf81ed56e75bab6cbf76fa9",
"commenterName": "b654822a96eae771cbac945e49e43cbd",
"commenterChannelID": "2f1364f249626b3ca514966e3ef3aead",
"comment": "ich fand den Handelwecker am besten",
"votes": 2,
"originalChannelID": "oc_2f1364f249626b3ca514966e3ef3aead",
"category": "entertainment"
}
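
Records shaped like the sample above lend themselves to simple per-category aggregation. The sketch below assumes the data is available as JSON Lines (one JSON object per line); this page does not state the on-disk format, so both that assumption and the function name `votes_by_category` are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def votes_by_category(lines: Iterable[str]) -> Counter:
    """Sum the `votes` field per `category` over JSON Lines input."""
    totals: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        totals[record["category"]] += int(record["votes"])
    return totals
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, which avoids loading a multi-gigabyte file into memory at once.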

Multilingual data

| Language | Text |
|--------------|---------------------------------------------------|
| English | You girls are so awesome!! |
| Russian | Точно так же Я стрелец |
| Hindi | आज भी भाई की आवाज में वही पुरानी बात है.... |
| Chinese | 無論如何,你已經是台灣YT訂閱數之首 |
| Bengali | খুনি হাসিনাকে ভারতের প্রধানমন্... |
| Spanish | jajajaj esto tiene que ser una brom |
| Portuguese | nossa senhora!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!... |
| Malayalam | നമസ്കാരം |
| Telugu | నమస్కారం |
| Japanese | こんにちは |
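
The dataset description lists no language column, so a first pass over the multilingual comments may need a heuristic. The sketch below groups text by its dominant Unicode script using only the Python standard library. It is an illustrative assumption rather than the authors' method, and script is only a rough proxy for language (English, Spanish and Portuguese all map to Latin).

```python
import unicodedata
from collections import Counter
from typing import Optional


def dominant_script(text: str) -> Optional[str]:
    """Return the most common script among the letters of `text`.

    The script is approximated by the first word of each character's
    Unicode name, e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
    Returns None when the text contains no letters.
    """
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces, combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For real language identification, a dedicated language-detection model would be needed; this heuristic only separates comments by writing system.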

BibTeX

```bibtex
@inproceedings{dutta2025ytcommentverse,
  title={YTCommentVerse: A Multi-Category Multi-Lingual YouTube Comment Corpus},
  author={Dutta, Hridoy Sankar and Khan, Biswadeep},
  booktitle={Proceedings of the 34th ACM International Conference on Information and Knowledge Management},
  pages={6351--6355},
  year={2025}
}
```

Files

Size: 10.8 GB
md5:4a28043e3971da028add9b46665b4c50
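
After downloading, the archive can be checked against the md5 value listed above. The following is a minimal sketch using Python's `hashlib`; the function names and the chunk size are illustrative assumptions, and the local file path is whatever the download was saved as.

```python
import hashlib

# Checksum published on this page for the 10.8 GB file.
EXPECTED_MD5 = "4a28043e3971da028add9b46665b4c50"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read 1 MiB at a time so the whole file never sits in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: str) -> bool:
    """True if the file at `path` has the checksum listed on this page."""
    return md5_of_file(path) == EXPECTED_MD5
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.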