There is a newer version of the record available.

Published November 25, 2020 | Version 2020-11-16
Dataset Open

SOTorrent Dataset

  • 1. The University of Adelaide

Description

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

Notes

The dataset is based on the official Stack Overflow data dump released 2020-09-08 (https://archive.org/details/stackexchange) and the Google BigQuery GitHub dataset (https://cloud.google.com/bigquery/public-data/github), which was queried on 2020-11-22 and, according to its table info, last updated on 2020-11-19. Please read the license files (LICENSE.md) before using the dataset.

Files

LICENSE.md

Files (72.6 GB)

Checksum                              Size
md5:800a086b3f76a99bc1cf18a951f22ed5  250.3 MB
md5:6eaaf59e0af0ffdd00e83dca37cda89c  4.4 GB
md5:f89ad472f8913f71799d132d1d642938  330.7 MB
md5:e4f831da7943cf6dcb782659d312a204  3.7 MB
md5:604d3fb7f90eda24cae2d17c57e1b49b  1.1 GB
md5:bbdaa0f83b3451c86cd327b3693131bb  17.1 kB
md5:e26088b178c7f46d5b193f39edfbc59a  3.3 kB
md5:9ccb71f676457c047f36cc26cf6efb05  8.4 GB
md5:90eb8eefc52a76f944bfd96d15d53b11  17.7 GB
md5:ca5612aab43b2b71de4a3cc8a9a44204  20.3 GB
md5:446f4a1c57a890f91dbad825070c4439  67.5 MB
md5:00822fbba61a33e2bcdbdd88392f8d5b  201.8 MB
md5:34241255532a352332cd60ec98830ad0  14.1 GB
md5:377f01a326dbae96d4115c76e2d55a1e  307.5 MB
md5:19ffa011a4fc23fefa308f6594581d84  977.9 MB
md5:ce09536c591276411c011e8fafc04a7c  1.2 GB
md5:90692c4f759842753100b24cd348d8d2  533.6 MB
md5:4132fb11dac86395ca4f0d5bda607bd1  2.4 kB
md5:e2657e2ea58390a1074b8230c56c8abb  4.3 kB
md5:30c5693b9dc2d133ee3f265c309beed7  268.3 MB
md5:738b9c9d214d47dd28a696fe3dedcf9a  747.3 kB
md5:e7e2196d3f052a7a93d406a53d40082b  646.3 MB
md5:1651c9c9445042acb0837229ac426c65  587.9 MB
md5:1e7fc50fba6228488219f7514a676543  1.2 GB
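
Because several of the files are many gigabytes in size, it can be worth checking a download against the MD5 checksums listed above before importing it. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_listing` and the file name in the usage comment are hypothetical and not part of the dataset or its tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file with a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (the file name is a placeholder, pair it with the checksum
# shown for that file on the record page):
# ok = matches_listing("downloaded_file.7z",
#                      "md5:800a086b3f76a99bc1cf18a951f22ed5")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.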