There is a newer version of the record available.

Published June 26, 2019 | Version 2019-06-21
Dataset Open

SOTorrent Dataset

  • 1. University of Trier

Description

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To make it possible to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and of individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

Notes

The dataset is based on the official Stack Overflow data dump released 2019-06-03 (https://archive.org/details/stackexchange) and on the Google BigQuery GitHub data set (https://cloud.google.com/bigquery/public-data/github), which was queried on 2019-06-21 and, according to its table info, was last updated on the same day. Please read the license files (LICENSE.md) before using the dataset.

Files (66.1 GB)

Checksum                              Size
md5:2fdbf6f7fdd710a08619479a4ffbfa86  7.4 kB
md5:e532beb0a89f225380653ddbb716965e  359 Bytes
md5:b2793067d60650da1cc3d1896a8dfc6a  1.2 kB
md5:059bfcd26cf884d3a00bfb122d60c70d  501 Bytes
md5:0eb0ba7f4450937872aac9630ec5eac0  7.0 kB
md5:26eaabb94945ac4708f582877b2c9a37  4.7 kB
md5:e1ba99272a56f827eeb81ad212f422b9  1.0 kB
md5:dd588ef364491baec91c35cef6378910  1.6 kB
md5:ec13516f3b53d324ecf05c5b830cf4b6  256.2 MB
md5:c268a0c2182261dbebc69d91dd20f249  4.4 GB
md5:5b80f832a1c87dbc9457945e6907925e  313.0 MB
md5:2f5b14e6bd6d31658179b74be2ed103e  835.1 MB
md5:59b3f9f42565f9f3058ee052a1ac5cff  39.2 kB
md5:93ec9825463cc8b87a8ce96d03785ded  7.5 GB
md5:3f04ded35d02ea2c254a8cb313fdc5d4  15.9 GB
md5:0db64c3557d6c8b755adf9d3c1440d0e  19.2 GB
md5:7ea4cf58a93d7ee92e9f58d80abfd97d  81.7 MB
md5:ec80900be6dedb06624c4960c7b64534  172.2 MB
md5:b2225e045205c3fbe61b169225b138f0  13.4 GB
md5:7fd433a0c33af7c5f0856f2271a97ad1  742.7 MB
md5:e2b7cd90a986a8d8f1c94915e3480918  1.1 GB
md5:048a683328ada232f8409a3c44d6d763  486 Bytes
md5:b5d3b7770e117e17db8444be2a0a1201  762.8 kB
md5:35946e0eceb38a2162b211d496234399  546.9 MB
md5:247a2a323b3591575f7990a635cb17b3  534.0 MB
md5:750c72e7634e7fc9c87f01c3ed5d9d8b  1.2 GB
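
Because several of the files are many gigabytes in size, it can be worth checking a finished download against the MD5 checksum listed above before importing it. The following is a minimal sketch of such a check using only the Python standard library. The function names `md5_of_file` and `matches_listing` are hypothetical helpers introduced for illustration; they are not part of the dataset or of any SOTorrent tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte files do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum):
    """Compare a local file against a listing entry such as 'md5:2fdb...'.

    The 'md5:' prefix is optional and the comparison ignores case and
    surrounding whitespace (an assumption made for convenience)."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

Calling `matches_listing(local_path, "md5:...")` with the path of a downloaded file and the matching entry from the listing returns `True` only if the file arrived intact. MD5 is used here solely because it is the checksum the record publishes; it detects transfer corruption but is not a safeguard against deliberate tampering.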