Dataset Open Access

SOTorrent Dataset

Baltes, Sebastian

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and of individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

The dataset is based on the official Stack Overflow data dump released on 2020-12-08 (https://archive.org/details/stackexchange) and on the Google BigQuery GitHub data set (https://cloud.google.com/bigquery/public-data/github), which was queried on 2021-01-04 and, according to its table info, was last updated on 2020-12-31. Please read the license files (LICENSE.md) before using the dataset.
Files (74.2 GB)
Name                        Size      Checksum
Badges.sql.7z               257.2 MB  md5:bedf16cc05609b3bbabb8f7584fdbeee
Comments.sql.7z             4.4 GB    md5:37ca8eae9ab80db2263fb8acdc480575
CommentUrl.sql.7z           338.1 MB  md5:dcd7131275c632f28d38001b9d7902df
GHCommits.sql.7z            3.7 MB    md5:cf0be5d8d931c05176bbe8681df63782
GHMatches.sql.7z            1.2 GB    md5:46c3109cd38e690eaedfdada8e61a74e
LICENSE.md                  17.1 kB   md5:bbdaa0f83b3451c86cd327b3693131bb
load_sotorrent.sh           3.3 kB    md5:9dd420259404d87f838711fdd07d9fe3
PostBlockDiff.sql.7z        8.6 GB    md5:1689c7f6e1491660f64973360765b028
PostBlockVersion.sql.7z     18.0 GB   md5:a4b2d229fdd48298780138e27a11cbed
PostHistory.sql.7z          20.8 GB   md5:f69138ae0474ef3c9ce894185f5cef3b
PostLinks.sql.7z            69.3 MB   md5:37557ecf805ca1787fcbd319a8cf8397
PostReferenceGH.sql.7z      202.3 MB  md5:a0eda3be367de9d428214ce2f9b934f3
Posts.sql.7z                14.5 GB   md5:c2bd50ffdd527ae815a58d07b4ea7fbc
PostTags.sql.7z             318.9 MB  md5:64d6b1b438ef450ac77e967ef03fa9ec
PostVersion.sql.7z          996.2 MB  md5:ca29cf4dc2894370dee636e3693c8b3e
PostVersionUrl.sql.7z       1.2 GB    md5:9db34f011d4ab2b2a832e32acf398744
PostViews.sql.7z            574.9 MB  md5:6d05b891270e1cee6e1f10108978d5fc
README.md                   2.4 kB    md5:795aa3c2e1974b159a1180ac11080835
sql.7z                      4.3 kB    md5:10deb245cfaa62b1d010955214e84f0f
StackSnippetVersion.sql.7z  279.2 MB  md5:fdff39045a2f91db0842c2dc5153fcb2
Tags.sql.7z                 755.0 kB  md5:eefbc34e506f90e40ed637269761eb81
TitleVersion.sql.7z         660.3 MB  md5:f1209b8b86c3113981566254574ad0d5
Users.sql.7z                620.3 MB  md5:a11bc075c3f3518a72b9b4fc9c81941a
Votes.sql.7z                1.3 GB    md5:ad8a1a7909bc3b27ec4779cd85e19b58
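
Because several of the archives are many gigabytes in size, it can be worth checking each download against the MD5 checksum given in the file listing before loading it into a database. The following is a minimal sketch of such a check, not part of the dataset's own tooling; the function names (`md5_of_file`, `normalize_checksum`, `verify_download`) and the local file path in the usage note are assumptions made for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def normalize_checksum(listed):
    """Accept a checksum with or without the 'md5:' prefix used in the listing."""
    value = listed.strip().lower()
    prefix = "md5:"
    if value.startswith(prefix):
        value = value[len(prefix):]
    return value


def verify_download(path, listed_checksum):
    """Return True if the file at `path` matches the listed MD5 checksum."""
    return md5_of_file(path) == normalize_checksum(listed_checksum)
```

For example, assuming `Tags.sql.7z` has been downloaded into the current directory, `verify_download("Tags.sql.7z", "md5:eefbc34e506f90e40ed637269761eb81")` should return `True` for an intact copy. MD5 is used here only because it is the checksum the listing provides; it detects corrupted transfers but is not a safeguard against deliberate tampering.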
                  All versions  This version
Views             11,220        2,045
Downloads         32,949        3,198
Data volume       252.1 TB      21.5 TB
Unique views      8,293         1,681
Unique downloads  6,556         1,176
