There is a newer version of the record available.

Published January 25, 2020 | Version 2020-01-24
Dataset Open

SOTorrent Dataset

  • 1. University of Trier

Description

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

Notes

The dataset is based on the official Stack Overflow data dump released 2019-12-02 (https://archive.org/details/stackexchange) and the Google BigQuery GitHub dataset (https://cloud.google.com/bigquery/public-data/github), which was queried on 2020-01-24 and, according to its table info, last updated on 2020-01-24. Please read the license files (LICENSE.md) before using the dataset.

Files

LICENSE.md

Files (69.4 GB)

Checksum                                  Size
md5:8525975afd498a1af8fe81c15f9085f6      273.6 MB
md5:687d25ada03b31818a292807f53f3e1a      4.5 GB
md5:5a024a3efd17d2e28410e2ba27a531f0      325.6 MB
md5:79a2ecd42706122efd81c340d7b06e05      3.1 MB
md5:837226c21ba79e619fd9e3d437624e6e      70.9 MB
md5:b64594fa84a0505b7d7b66378e714fa0      17.0 kB
md5:dfbe32c51003b1a18ef64ae54c8555db      3.4 kB
md5:b55e4733f4e88103475e2ed758ee8edf      7.9 GB
md5:90c1bde074b189758a42ffb3746f3c87      16.7 GB
md5:cd76b00f810ed821289343c50239f8a4      20.2 GB
md5:bf641791cf864c0f85abd23fa089931b      86.8 MB
md5:605397f36f87eb37c8a4bd0e32dd67e6      34.2 MB
md5:f2773978c1f8fc3da36acbfa783a2336      14.1 GB
md5:0a7e526bf8ebda43609018327cb4ed06      884.1 MB
md5:de3c30b9c0a7b7b2d8588691af1e8b10      1.1 GB
md5:0eedb8a5e22ff338bad342ff170c8885      434.1 MB
md5:937db90888a5a8ba86eb04d11bd6a414      2.5 kB
md5:028c8f5bc40b513ba30d6f28d6c1f471      4.2 kB
md5:bb2c49e9e0b675d195d6e9306edc4479      244.8 MB
md5:3fc462667fa2e0514bc7b09bd59fa8fe      783.5 kB
md5:a99f647d34ace4f1ed5fccf27139f8d7      580.2 MB
md5:5460819a40327bab4737b1a2b2b1fb85      590.1 MB
md5:e515cf6ab066a496ab132e345c39a4cc      1.2 GB
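
Because several of the files are multiple gigabytes in size, it can be worth checking a download against the MD5 checksum listed above before importing it. The following is a minimal sketch in Python using only the standard library; the helper names `md5_of_file` and `matches_listing` are illustrative assumptions and are not part of the dataset or its tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's digest with a listed value such as 'md5:8525...'.

    The 'md5:' prefix mirrors the way checksums appear in the file list;
    a bare hex digest is accepted as well.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing("some_download", "md5:8525975afd498a1af8fe81c15f9085f6")` returns `True` only if the local file is byte-identical to the published one; `"some_download"` is a placeholder path, since the file names are not shown in the list above.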