Published February 22, 2023 | Version 1.0

Dataset Open

Reddit and StackOverflow dataset (Programming languages)

1. Università degli Studi di Salerno

This dataset contains anonymized data collected from Reddit (via the Pushshift API) and StackOverflow (from a dataset hosted on Kaggle).

Each folder includes the data split by trimester. The schema of StackOverflow and Reddit-related files follows:

Fields from StackOverflow
- question_id
- answer_id
- creation_date - time of creation of the answer
- score - score of the question/answer
- tags - all tags flagged for a question
- answer_count - number of answers for a question
- start_question - question's time of creation
- last_activity_date - last update on the question
- new_id - hashed id of the answerer
- q_new_id - hashed id of the questioner

Fields from Reddit
- comment_id
- submission_id
- score - score of the question/submission
- subreddit
- created_utc - time of creation in UTC (not updated when a comment is later modified)
- new_id - hashed id
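
As a rough illustration of how the StackOverflow fields relate to each other, the sketch below computes the delay between a question (start_question) and each of its answers (creation_date). It assumes the records are available as CSV text with a header row and timestamps in a `YYYY-MM-DD HH:MM:SS` form; the actual file format, timestamp encoding, and tag separator are not stated in this description, and the sample rows are invented for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows using the StackOverflow field names listed above.
# All values (and the "|" tag separator) are invented for illustration.
SAMPLE = """question_id,answer_id,creation_date,score,tags,answer_count,start_question,last_activity_date,new_id,q_new_id
1,10,2020-01-01 12:30:00,3,python|pandas,2,2020-01-01 12:00:00,2020-01-02 08:00:00,a1,q1
1,11,2020-01-01 14:00:00,1,python|pandas,2,2020-01-01 12:00:00,2020-01-02 08:00:00,a2,q1
"""


def answer_delays(csv_text, fmt="%Y-%m-%d %H:%M:%S"):
    """Map each answer_id to the seconds elapsed since its question was created."""
    delays = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        asked = datetime.strptime(row["start_question"], fmt)
        answered = datetime.strptime(row["creation_date"], fmt)
        delays[row["answer_id"]] = (answered - asked).total_seconds()
    return delays


delays = answer_delays(SAMPLE)
```

If the real files store timestamps differently (for example as Unix epochs), only the parsing step inside `answer_delays` would need to change.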

The .txt files represent the structure of the corresponding hypergraphs.
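
The layout of the .txt files is not described here. As a minimal sketch of one plausible hypergraph construction (an assumption, not necessarily the one used for this dataset), each Reddit submission can be treated as a hyperedge whose nodes are the hashed ids (new_id) of the users who commented on it. The records below are invented.

```python
from collections import defaultdict

# Invented records that reuse the Reddit field names from the schema above.
comments = [
    {"comment_id": "c1", "submission_id": "s1", "new_id": "u1"},
    {"comment_id": "c2", "submission_id": "s1", "new_id": "u2"},
    {"comment_id": "c3", "submission_id": "s2", "new_id": "u1"},
    {"comment_id": "c4", "submission_id": "s2", "new_id": "u3"},
    {"comment_id": "c5", "submission_id": "s2", "new_id": "u3"},
]


def build_hyperedges(records):
    """Group commenters by submission: one hyperedge (a set of users) per submission."""
    edges = defaultdict(set)
    for rec in records:
        edges[rec["submission_id"]].add(rec["new_id"])
    return dict(edges)


hyperedges = build_hyperedges(comments)
```

Using sets means a user who comments several times on the same submission appears only once in that hyperedge.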

Files

Files (134.1 MB)

Name	Size
data.zip md5:405c95a36c527c85d3708fe3a473386c	134.1 MB
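
The MD5 checksum listed for data.zip can be used to check that a download is intact. The sketch below streams a file through `hashlib.md5` and compares the result with the listed value; the local path `data.zip` in the usage note is an assumption about where the archive was saved.

```python
import hashlib

# Checksum copied from the file listing above.
EXPECTED_MD5 = "405c95a36c527c85d3708fe3a473386c"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected
```

For example, `verify("data.zip")` should return `True` for an uncorrupted copy saved under that name.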

Resource type

Dataset

Publisher

Zenodo

Languages

English

Creative Commons Attribution 4.0 International

The Creative Commons Attribution license allows re-distribution and re-use of a licensed work on the condition that the creator is appropriately credited.

Technical metadata

Created: March 2, 2023
Modified: March 7, 2023