There is a newer version of the record available.

Published April 14, 2020 | Version 2020-03-15
Dataset Open

SOTorrent Dataset

  • 1. The University of Adelaide

Description

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

Notes

The dataset is based on the official Stack Overflow data dump released 2020-03-02 (https://archive.org/details/stackexchange) and the Google BigQuery GitHub data set queried 2020-03-15 and last updated 2020-03-13 according to table info (https://cloud.google.com/bigquery/public-data/github). Please read the license files (LICENSE.md) before using the dataset.

Files

LICENSE.md

Files (71.3 GB)

MD5 checksum                            Size
md5:72f7c698d1b5b623bd20052fbefaab82    282.0 MB
md5:90ed78ee408e9c13dbd4feb310fbb09a    4.5 GB
md5:81c98cf483269153193377411dea73fe    332.1 MB
md5:215e519c59139945b893e4617a8d9e32    3.1 MB
md5:c5620928537d9b260ec8445c583420af    72.8 MB
md5:5fe9e9166f3f8cd670f7f71010e7f9f8    17.1 kB
md5:08d2dfb2b0961ae48060ec12fb06c540    3.8 kB
md5:d2df8cc1aabc8fb5a3d1f43522291527    8.1 GB
md5:30f0070717f68ff6f506b76dfc6dd7a5    17.1 GB
md5:ee822884e407d8a6bdf199337a88bd55    20.7 GB
md5:5a3401fa4095e097ced0e63cdfb9a5dc    89.3 MB
md5:5484e592534e76f3a9d8c8b06fdf8775    34.2 MB
md5:1cfc9e06ee255ccf475a9d9227581d3c    14.5 GB
md5:39be94177bc1b2df7f20cb213c607de7    288.2 MB
md5:630914c03a633c4673c8619f731de203    903.1 MB
md5:ff4af5b25a7f852ad2881a0e065dbbfb    1.2 GB
md5:229830d6bbffb09039d2593856371459    470.1 MB
md5:5e089db11bd776d8b601f922f6b42552    2.5 kB
md5:284760a74d96c638959ce6afe51a7be6    4.3 kB
md5:b5503ce2cf5521f9ba5fd2d1addf312a    255.6 MB
md5:9893f72df2469a16f11c9ea8b149b8ba    795.6 kB
md5:6686f09daff57ba8a60532f2df0baa64    594.3 MB
md5:5547c3454d9d90032b3ddf03c2e82b39    625.0 MB
md5:7fbf12f3d79da3c4b6c4e7d6ddbab953    1.3 GB
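
Each file in the listing above is published with an MD5 checksum, and several of the archives are multiple gigabytes in size, so it can be worth checking a download before importing it. The following is a minimal sketch of such a check in Python. The function names `md5_of_file` and `matches_listing` are illustrative and not part of the dataset or its tooling, and the file is read in chunks on the assumption that it may not fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's digest with a listed value such as 'md5:<hex>'.

    The 'md5:' prefix is optional; comparison is case-insensitive.
    """
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

To use it, call `matches_listing(local_path, "md5:...")` with the path of the downloaded file and the checksum shown next to that file in the listing. A result of `False` suggests an incomplete or corrupted download.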