Published February 22, 2023 | Version 1.0
Dataset Open

Reddit and StackOverflow dataset (Programming languages)

  • 1. Università degli Studi di Salerno

Description

This dataset contains anonymized data collected from Reddit (via the Pushshift API) and StackOverflow (via a dataset hosted on Kaggle).

Each folder contains the data split by trimester. The schemas of the StackOverflow and Reddit files are as follows:

  • Fields from StackOverflow
    • question_id
    • answer_id
    • creation_date - time the answer was created
    • score - score of the question/answer
    • tags - all tags flagged for a question
    • answer_count - number of answers for a question
    • start_question - question's time of creation
    • last_activity_date - last update on the question
    • new_id - hashed id of the answerer
    • q_new_id - hashed id of the questioner
  • Fields from Reddit
    • comment_id
    • submission_id
    • score - score of the question/submission
    • subreddit
    • created_utc - time of creation (not the time of the last modification)
    • new_id - hashed id
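
As a rough illustration of the Reddit schema above, the following Python sketch maps one row to a typed record. It assumes the files are comma-separated with a header row matching the field names, and that created_utc is stored as Unix epoch seconds; neither is stated in this record, so adjust to the actual files. The names RedditComment and read_reddit_rows are hypothetical, and the sample rows are synthetic.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class RedditComment:
    comment_id: str
    submission_id: str
    score: int
    subreddit: str
    created_utc: int  # assumption: Unix epoch seconds
    new_id: str       # hashed id


def read_reddit_rows(handle):
    """Parse rows whose header matches the Reddit field names listed above."""
    rows = []
    for raw in csv.DictReader(handle):
        rows.append(
            RedditComment(
                comment_id=raw["comment_id"],
                submission_id=raw["submission_id"],
                score=int(raw["score"]),
                subreddit=raw["subreddit"],
                created_utc=int(raw["created_utc"]),
                new_id=raw["new_id"],
            )
        )
    return rows


# Synthetic rows for illustration only; not taken from the dataset.
SAMPLE = (
    "comment_id,submission_id,score,subreddit,created_utc,new_id\n"
    "c1,s1,3,python,1577836800,u_a\n"
    "c2,s1,-1,python,1577836900,u_b\n"
)

comments = read_reddit_rows(io.StringIO(SAMPLE))
```

The same pattern would apply to the StackOverflow files by swapping in the ten StackOverflow field names.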

The .txt files represent the structure of the corresponding hypergraphs.
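
This record does not describe the layout of the .txt files or how hyperedges are defined. Purely as an assumption, one plausible construction treats each Reddit submission as a hyperedge connecting the hashed ids of everyone who commented on it. The sketch below shows that idea on synthetic pairs; build_hyperedges is a hypothetical helper, not the authors' method.

```python
from collections import defaultdict


def build_hyperedges(records):
    """Map each submission_id to the set of hashed user ids that commented on it."""
    edges = defaultdict(set)
    for submission_id, new_id in records:
        edges[submission_id].add(new_id)
    return dict(edges)


# Synthetic (submission_id, new_id) pairs for illustration only.
pairs = [("s1", "u_a"), ("s1", "u_b"), ("s2", "u_a"), ("s1", "u_a")]
hyperedges = build_hyperedges(pairs)
```

Using sets means a user who comments several times on the same submission appears only once in that hyperedge.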

Files

data.zip (134.1 MB)
md5:405c95a36c527c85d3708fe3a473386c
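
The md5 checksum listed for data.zip can be used to confirm that a download is intact. A minimal sketch using Python's standard hashlib module follows; the function names and the local path passed to them are assumptions for illustration.

```python
import hashlib

# Checksum listed for data.zip in this record.
EXPECTED_MD5 = "405c95a36c527c85d3708fe3a473386c"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, verify_download("data.zip") should return True for an uncorrupted copy saved under that name in the current directory.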