There is a newer version of this record available.

Dataset Open Access

SOTorrent Dataset

Baltes, Sebastian

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

The dataset is based on the official Stack Overflow data dump released 2019-12-02 (https://archive.org/details/stackexchange) and the Google BigQuery GitHub data set queried 2019-12-25 and last updated 2019-12-20 according to table info (https://cloud.google.com/bigquery/public-data/github). Please read the license files (LICENSE.md) before using the dataset.

Files (69.4 GB)

Name                        MD5                               Size
Badges.xml.7z               8525975afd498a1af8fe81c15f9085f6  273.6 MB
Comments.xml.7z             687d25ada03b31818a292807f53f3e1a  4.5 GB
CommentUrl.csv.7z           5a024a3efd17d2e28410e2ba27a531f0  325.6 MB
GHMatches.csv.7z            4253cb938a2610d126a6174e03050e74  68.2 MB
LICENSE.md                  c5b54a667fb6230a661f093e30067c4e  17.0 kB
PostBlockDiff.csv.7z        b55e4733f4e88103475e2ed758ee8edf  7.9 GB
PostBlockVersion.csv.7z     90c1bde074b189758a42ffb3746f3c87  16.7 GB
PostHistory.xml.7z          cd76b00f810ed821289343c50239f8a4  20.2 GB
PostLinks.xml.7z            bf641791cf864c0f85abd23fa089931b  86.8 MB
PostReferenceGH.csv.7z      ed5c62a58ec897b8bdf30504c1134b8e  34.3 MB
Posts.xml.7z                f2773978c1f8fc3da36acbfa783a2336  14.1 GB
PostVersion.csv.7z          0a7e526bf8ebda43609018327cb4ed06  884.1 MB
PostVersionUrl.csv.7z       de3c30b9c0a7b7b2d8588691af1e8b10  1.1 GB
PostViews.csv.7z            0eedb8a5e22ff338bad342ff170c8885  434.1 MB
README.md                   b43cd321205e3611164401b7b123988c  2.5 kB
sql.7z                      e9da53351ecc419c12662b495a05f237  4.2 kB
StackSnippetVersion.csv.7z  bb2c49e9e0b675d195d6e9306edc4479  244.8 MB
Tags.xml.7z                 3fc462667fa2e0514bc7b09bd59fa8fe  783.5 kB
TitleVersion.csv.7z         a99f647d34ace4f1ed5fccf27139f8d7  580.2 MB
Users.xml.7z                5460819a40327bab4737b1a2b2b1fb85  590.1 MB
Votes.xml.7z                e515cf6ab066a496ab132e345c39a4cc  1.2 GB
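
Because several of the archives are many gigabytes in size, it can be worth checking a download against the MD5 checksum listed for it before unpacking. The following is a minimal sketch of such a check, not part of the dataset's own tooling. The helper names `md5_of_file` and `verify_download` are hypothetical, the checksum table contains only three of the listed files for brevity, and the files are assumed to have been saved under their original names.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record (subset only;
# extend the table with the other listed files as needed).
EXPECTED_MD5 = {
    "README.md": "b43cd321205e3611164401b7b123988c",
    "LICENSE.md": "c5b54a667fb6230a661f093e30067c4e",
    "Tags.xml.7z": "3fc462667fa2e0514bc7b09bd59fa8fe",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use constant, which matters for the
    multi-gigabyte archives in this record.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the checksum listed for its name.

    Raises KeyError if no checksum is known for the file name.
    """
    name = Path(path).name
    if name not in expected:
        raise KeyError(f"no checksum listed for {name!r}")
    return md5_of_file(path) == expected[name]
```

For example, `verify_download("Tags.xml.7z")` returns `True` only if the local file hashes to the checksum listed above. A mismatch usually indicates an incomplete or corrupted download, or a file taken from a different version of this record.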
Metric            All versions  This version
Views             11,220        219
Downloads         32,949        224
Data volume       252.1 TB      1.0 TB
Unique views      8,293         175
Unique downloads  6,556         105
