There is a newer version of this record available.

Dataset Open Access

SOTorrent Dataset

Baltes, Sebastian

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

The dataset is based on the official Stack Overflow data dump released 2020-06-02 (https://archive.org/details/stackexchange) and the Google BigQuery GitHub data set queried 2020-11-02 and last updated 2020-10-29 according to table info (https://cloud.google.com/bigquery/public-data/github). Please read the license files (LICENSE.md) before using the dataset.

Files (70.3 GB)

Name                        Size      MD5
Badges.sql.7z               241.9 MB  05ba91805c4adbcaad3f0e3431b187ad
Comments.sql.7z             4.3 GB    35cb88b27e3e8f067a4f3dc7aeea0758
CommentUrl.sql.7z           321.5 MB  397a8b11f7da2ea2d4ebe686c1763df4
GHCommits.sql.7z            3.7 MB    e971cbce9345cb82e583376149578f13
GHMatches.sql.7z            1.1 GB    28de1e66ca6f0447610de19fb3143dce
LICENSE.md                  17.1 kB   bbdaa0f83b3451c86cd327b3693131bb
load_sotorrent.sh           3.3 kB    e26088b178c7f46d5b193f39edfbc59a
PostBlockDiff.sql.7z        8.1 GB    460128ffc7198c7df18ac5751f2384f4
PostBlockVersion.sql.7z     17.1 GB   0525ca03007c6263663b3c378b4e8cc9
PostHistory.sql.7z          19.7 GB   82c568479cba60c3405e630ecd88c296
PostLinks.sql.7z            65.4 MB   565d708bbe951eee2da433ebb06c233c
PostReferenceGH.sql.7z      201.6 MB  185ac8842b2f78e5273515bc0b97454a
Posts.sql.7z                13.7 GB   b7808b5b121500d3957bf5a306010c95
PostTags.sql.7z             299.4 MB  cbc07ab53e3e5295887b0b4446d5731e
PostVersion.sql.7z          947.3 MB  991d94064c03a03e8aae8e0edb44ef5f
PostVersionUrl.sql.7z       1.1 GB    891278000d14c910def543a591568826
PostViews.sql.7z            496.9 MB  7afc6ecbe4af57fbf6b3edaa77dcfbe4
README.md                   2.5 kB    73495fb56484539cb68b67c85a77957e
sql.7z                      4.3 kB    e2657e2ea58390a1074b8230c56c8abb
StackSnippetVersion.sql.7z  254.7 MB  ce3cb0a1960ccdc6977df899f61a8469
Tags.sql.7z                 736.4 kB  c1cbebc355abaa602f4917d0bfc11fbf
TitleVersion.sql.7z         628.5 MB  03d0cf59c8cae5f077a9954676145517
Users.sql.7z                554.5 MB  526a0fa8d79dac0956ed92d57e42e622
Votes.sql.7z                1.2 GB    d8038bb5b0d6834b3b4b545a32354a7e
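
Because several of the archives are multiple gigabytes in size, it can be worth checking each download against the MD5 checksum listed for it above before importing anything. The following is a minimal sketch of such a check, not part of the dataset's own tooling: the function names `md5_of_file` and `verify_downloads` and the `EXPECTED_MD5` dictionary are hypothetical, and only a few of the listed checksums are copied in for illustration.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file list above (a subset, for illustration only).
EXPECTED_MD5 = {
    "LICENSE.md": "bbdaa0f83b3451c86cd327b3693131bb",
    "README.md": "73495fb56484539cb68b67c85a77957e",
    "load_sotorrent.sh": "e26088b178c7f46d5b193f39edfbc59a",
    "Tags.sql.7z": "c1cbebc355abaa602f4917d0bfc11fbf",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Return a dict mapping each file name to 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == checksum:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling `verify_downloads(".")` from the download directory returns one status per listed file. A `mismatch` most likely indicates a truncated or corrupted download that should be fetched again.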

                  All versions  This version
Views             6,681         44
Downloads         27,867        40
Data volume       224.4 TB      128.8 GB
Unique views      5,071         37
Unique downloads  4,746         14
