Dataset Open Access

SOTorrent Dataset

Baltes, Sebastian

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

The dataset is based on the official Stack Overflow data dump released on 2020-03-02 (https://archive.org/details/stackexchange) and the Google BigQuery GitHub dataset (https://cloud.google.com/bigquery/public-data/github), which was queried on 2020-03-15 and, according to its table info, last updated on 2020-03-13. Please read the license file (LICENSE.md) before using the dataset.

Files (71.3 GB)

Name                         Size       MD5
Badges.xml.7z                282.0 MB   72f7c698d1b5b623bd20052fbefaab82
Comments.xml.7z              4.5 GB     90ed78ee408e9c13dbd4feb310fbb09a
CommentUrl.csv.7z            332.1 MB   81c98cf483269153193377411dea73fe
GHMatches.csv.7z             72.8 MB    c5620928537d9b260ec8445c583420af
LICENSE.md                   17.1 kB    5fe9e9166f3f8cd670f7f71010e7f9f8
load_sotorrent.sh            3.8 kB     08d2dfb2b0961ae48060ec12fb06c540
PostBlockDiff.csv.7z         8.1 GB     d2df8cc1aabc8fb5a3d1f43522291527
PostBlockVersion.csv.7z      17.1 GB    30f0070717f68ff6f506b76dfc6dd7a5
PostHistory.xml.7z           20.7 GB    ee822884e407d8a6bdf199337a88bd55
PostLinks.xml.7z             89.3 MB    5a3401fa4095e097ced0e63cdfb9a5dc
PostReferenceGH.csv.7z       34.2 MB    5484e592534e76f3a9d8c8b06fdf8775
Posts.xml.7z                 14.5 GB    1cfc9e06ee255ccf475a9d9227581d3c
PostTags.csv.7z              288.2 MB   39be94177bc1b2df7f20cb213c607de7
PostVersion.csv.7z           903.1 MB   630914c03a633c4673c8619f731de203
PostVersionUrl.csv.7z        1.2 GB     ff4af5b25a7f852ad2881a0e065dbbfb
PostViews.csv.7z             470.1 MB   229830d6bbffb09039d2593856371459
README.md                    2.5 kB     5e089db11bd776d8b601f922f6b42552
sql.7z                       4.3 kB     f71b04a70304c0d0f0d25e0fdbb3eb02
StackSnippetVersion.csv.7z   255.6 MB   b5503ce2cf5521f9ba5fd2d1addf312a
Tags.xml.7z                  795.6 kB   9893f72df2469a16f11c9ea8b149b8ba
TitleVersion.csv.7z          594.3 MB   6686f09daff57ba8a60532f2df0baa64
Users.xml.7z                 625.0 MB   5547c3454d9d90032b3ddf03c2e82b39
Votes.xml.7z                 1.3 GB     7fbf12f3d79da3c4b6c4e7d6ddbab953
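
Because several of the archives are multiple gigabytes in size, it can be worth checking each download against the MD5 checksum listed with the file above before unpacking it. The following is a minimal sketch of such a check, not part of the dataset's own tooling: the helper names `md5_of_file` and `verify_download` are hypothetical, and the script assumes the files were saved under their original names in the current working directory. Only two of the smaller files are listed as examples; further entries can be added from the file list.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks so that
    multi-gigabyte archives do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Compare a file's MD5 digest with the expected checksum.
    Accepts the checksum with or without a leading 'md5:' prefix."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Checksums copied from the file list; extend this mapping as needed.
EXPECTED_CHECKSUMS = {
    "README.md": "md5:5e089db11bd776d8b601f922f6b42552",
    "LICENSE.md": "md5:5fe9e9166f3f8cd670f7f71010e7f9f8",
}

if __name__ == "__main__":
    for name, checksum in EXPECTED_CHECKSUMS.items():
        if not os.path.exists(name):
            print(f"{name}: not found, skipping")
        elif verify_download(name, checksum):
            print(f"{name}: checksum matches")
        else:
            print(f"{name}: checksum MISMATCH, consider downloading again")
```

A mismatch usually points to an incomplete or corrupted download. MD5 is sufficient here for detecting transfer errors, but it is not a safeguard against deliberate tampering.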

Statistic          All versions   This version
Views              4,952          212
Downloads          23,451         5,913
Data volume        177.4 TB       72.2 TB
Unique views       3,892          174
Unique downloads   3,849          680
