There is a newer version of this record available.

Dataset Open Access

SOTorrent Dataset

Baltes, Sebastian

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

The dataset is based on the official Stack Overflow data dump released 2020-09-08 (https://archive.org/details/stackexchange) and the Google BigQuery GitHub data set (https://cloud.google.com/bigquery/public-data/github), which was queried on 2020-11-22 and, according to its table info, last updated on 2020-11-19. Please read the license files (LICENSE.md) before using the dataset.
Files (72.6 GB)

Name                         Size       Checksum
Badges.sql.7z                250.3 MB   md5:800a086b3f76a99bc1cf18a951f22ed5
Comments.sql.7z              4.4 GB     md5:6eaaf59e0af0ffdd00e83dca37cda89c
CommentUrl.sql.7z            330.7 MB   md5:f89ad472f8913f71799d132d1d642938
GHCommits.sql.7z             3.7 MB     md5:e4f831da7943cf6dcb782659d312a204
GHMatches.sql.7z             1.1 GB     md5:604d3fb7f90eda24cae2d17c57e1b49b
LICENSE.md                   17.1 kB    md5:bbdaa0f83b3451c86cd327b3693131bb
load_sotorrent.sh            3.3 kB     md5:e26088b178c7f46d5b193f39edfbc59a
PostBlockDiff.sql.7z         8.4 GB     md5:9ccb71f676457c047f36cc26cf6efb05
PostBlockVersion.sql.7z      17.7 GB    md5:90eb8eefc52a76f944bfd96d15d53b11
PostHistory.sql.7z           20.3 GB    md5:ca5612aab43b2b71de4a3cc8a9a44204
PostLinks.sql.7z             67.5 MB    md5:446f4a1c57a890f91dbad825070c4439
PostReferenceGH.sql.7z       201.8 MB   md5:00822fbba61a33e2bcdbdd88392f8d5b
Posts.sql.7z                 14.1 GB    md5:34241255532a352332cd60ec98830ad0
PostTags.sql.7z              307.5 MB   md5:377f01a326dbae96d4115c76e2d55a1e
PostVersion.sql.7z           977.9 MB   md5:19ffa011a4fc23fefa308f6594581d84
PostVersionUrl.sql.7z        1.2 GB     md5:ce09536c591276411c011e8fafc04a7c
PostViews.sql.7z             533.6 MB   md5:90692c4f759842753100b24cd348d8d2
README.md                    2.4 kB     md5:4132fb11dac86395ca4f0d5bda607bd1
sql.7z                       4.3 kB     md5:e2657e2ea58390a1074b8230c56c8abb
StackSnippetVersion.sql.7z   268.3 MB   md5:30c5693b9dc2d133ee3f265c309beed7
Tags.sql.7z                  747.3 kB   md5:738b9c9d214d47dd28a696fe3dedcf9a
TitleVersion.sql.7z          646.3 MB   md5:e7e2196d3f052a7a93d406a53d40082b
Users.sql.7z                 587.9 MB   md5:1651c9c9445042acb0837229ac426c65
Votes.sql.7z                 1.2 GB     md5:1e7fc50fba6228488219f7514a676543
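
Because several of the archives are many gigabytes in size, it can be worth checking a download against the MD5 checksum listed above before importing it. The following is a minimal sketch of such a check, not part of the dataset's own tooling: the helper names `md5_of_file` and `verify_download` are hypothetical, and the example assumes that `Tags.sql.7z` has been saved to the current working directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Compare a file's MD5 digest with an expected value.

    `expected` may be written as 'md5:<hex>' (as in the file list)
    or as a bare hex string; the comparison is case-insensitive.
    """
    expected_hex = expected.strip().lower().split(":", 1)[-1]
    return md5_of_file(path) == expected_hex


if __name__ == "__main__":
    # Assumption: the archive was downloaded into the current directory.
    archive = Path("Tags.sql.7z")
    checksum = "md5:738b9c9d214d47dd28a696fe3dedcf9a"  # from the file list
    if archive.exists():
        status = "OK" if verify_download(archive, checksum) else "MISMATCH"
        print(f"{archive.name}: {status}")
    else:
        print(f"{archive.name} not found; download it first")
```

A mismatch usually indicates an incomplete or corrupted download, in which case fetching the file again is the simplest remedy.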
                   All versions   This version
Views              11,220         295
Downloads          32,949         376
Data volume        252.1 TB       2.1 TB
Unique views       8,293          253
Unique downloads   6,556          132
