There is a newer version of the record available.

Published November 23, 2020 | Version 2020-08-31
Dataset Open

SOTorrent Dataset

  • 1. The University of Adelaide

Description

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

Notes

The dataset is based on the official Stack Overflow data dump released on 2020-06-02 (https://archive.org/details/stackexchange) and on the Google BigQuery GitHub dataset, which was queried on 2020-11-02 and, according to its table info, last updated on 2020-10-29 (https://cloud.google.com/bigquery/public-data/github). Please read the license files (LICENSE.md) before using the dataset.

Files

LICENSE.md

Files (70.3 GB)

Checksum                                Size
md5:05ba91805c4adbcaad3f0e3431b187ad    241.9 MB
md5:35cb88b27e3e8f067a4f3dc7aeea0758    4.3 GB
md5:397a8b11f7da2ea2d4ebe686c1763df4    321.5 MB
md5:e971cbce9345cb82e583376149578f13    3.7 MB
md5:28de1e66ca6f0447610de19fb3143dce    1.1 GB
md5:bbdaa0f83b3451c86cd327b3693131bb    17.1 kB
md5:e26088b178c7f46d5b193f39edfbc59a    3.3 kB
md5:460128ffc7198c7df18ac5751f2384f4    8.1 GB
md5:0525ca03007c6263663b3c378b4e8cc9    17.1 GB
md5:82c568479cba60c3405e630ecd88c296    19.7 GB
md5:565d708bbe951eee2da433ebb06c233c    65.4 MB
md5:185ac8842b2f78e5273515bc0b97454a    201.6 MB
md5:b7808b5b121500d3957bf5a306010c95    13.7 GB
md5:cbc07ab53e3e5295887b0b4446d5731e    299.4 MB
md5:991d94064c03a03e8aae8e0edb44ef5f    947.3 MB
md5:891278000d14c910def543a591568826    1.1 GB
md5:7afc6ecbe4af57fbf6b3edaa77dcfbe4    496.9 MB
md5:73495fb56484539cb68b67c85a77957e    2.5 kB
md5:e2657e2ea58390a1074b8230c56c8abb    4.3 kB
md5:ce3cb0a1960ccdc6977df899f61a8469    254.7 MB
md5:c1cbebc355abaa602f4917d0bfc11fbf    736.4 kB
md5:03d0cf59c8cae5f077a9954676145517    628.5 MB
md5:526a0fa8d79dac0956ed92d57e42e622    554.5 MB
md5:d8038bb5b0d6834b3b4b545a32354a7e    1.2 GB
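
Because several of the files are multiple gigabytes in size, it may be worth checking a download against the MD5 checksum listed above before importing it. The following is a minimal sketch of one way to do that in Python, not part of the dataset's own tooling. The function names `md5_of_file` and `matches_listed_checksum` are hypothetical helpers introduced here for illustration, and the file path in the usage note is a placeholder for wherever the download was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use bounded for multi-gigabyte files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum given as 'md5:<hex>' or plain '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("path/to/downloaded_file", "md5:05ba91805c4adbcaad3f0e3431b187ad")` would return `True` only if the local file's digest equals the first checksum in the list; the path shown is an assumption and must be replaced with the actual download location.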