Dataset Open Access

SOTorrent Dataset

Baltes, Sebastian; Dumani, Lorik

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

If you use this dataset in your work, please cite our MSR 2018 paper (BibTeX) or our MSR 2019 mining challenge proposal.

The dataset is based on the official Stack Overflow data dump released on 2019-06-03 (https://archive.org/details/stackexchange) and on the Google BigQuery GitHub data set, which was queried on 2019-06-21 and, according to its table info, last updated on 2019-06-21 (https://cloud.google.com/bigquery/public-data/github). Please read the license file (LICENSE.md) before using the dataset.
Files (66.1 GB)

Name                            Size       MD5
1_create_database.sql           7.4 kB     2fdbf6f7fdd710a08619479a4ffbfa86
2_create_sotorrent_user.sql     359 Bytes  e532beb0a89f225380653ddbb716965e
3_load_so_from_xml.sql          1.2 kB     b2793067d60650da1cc3d1896a8dfc6a
4_create_indices.sql            501 Bytes  059bfcd26cf884d3a00bfb122d60c70d
5_create_sotorrent_tables.sql   7.0 kB     0eb0ba7f4450937872aac9630ec5eac0
6_load_sotorrent.sql            4.7 kB     26eaabb94945ac4708f582877b2c9a37
7_load_gh_references.sql        1.0 kB     e1ba99272a56f827eeb81ad212f422b9
8_create_sotorrent_indices.sql  1.6 kB     dd588ef364491baec91c35cef6378910
Badges.xml.7z                   256.2 MB   ec13516f3b53d324ecf05c5b830cf4b6
Comments.xml.7z                 4.4 GB     c268a0c2182261dbebc69d91dd20f249
CommentUrl.csv.7z               313.0 MB   5b80f832a1c87dbc9457945e6907925e
GHMatches.csv.7z                835.1 MB   2f5b14e6bd6d31658179b74be2ed103e
LICENSE.md                      39.2 kB    59b3f9f42565f9f3058ee052a1ac5cff
PostBlockDiff.csv.7z            7.5 GB     93ec9825463cc8b87a8ce96d03785ded
PostBlockVersion.csv.7z         15.9 GB    3f04ded35d02ea2c254a8cb313fdc5d4
PostHistory.xml.7z              19.2 GB    0db64c3557d6c8b755adf9d3c1440d0e
PostLinks.xml.7z                81.7 MB    7ea4cf58a93d7ee92e9f58d80abfd97d
PostReferenceGH.csv.7z          172.2 MB   ec80900be6dedb06624c4960c7b64534
Posts.xml.7z                    13.4 GB    b2225e045205c3fbe61b169225b138f0
PostVersion.csv.7z              742.7 MB   7fd433a0c33af7c5f0856f2271a97ad1
PostVersionUrl.csv.7z           1.1 GB     e2b7cd90a986a8d8f1c94915e3480918
README.md                       486 Bytes  048a683328ada232f8409a3c44d6d763
Tags.xml.7z                     762.8 kB   b5d3b7770e117e17db8444be2a0a1201
TitleVersion.csv.7z             546.9 MB   35946e0eceb38a2162b211d496234399
Users.xml.7z                    534.0 MB   247a2a323b3591575f7990a635cb17b3
Votes.xml.7z                    1.2 GB     750c72e7634e7fc9c87f01c3ed5d9d8b
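
The file listing above gives an MD5 checksum for every file. Because several archives are many gigabytes in size, it may be worth checking each download against its checksum before importing it. The following is a minimal sketch of such a check in Python, not part of the dataset itself. The helper names (md5_of_file, matches_checksum) and the assumption that the files sit in the current working directory are illustrative; the two checksums are copied from the listing.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given either as bare hex
    or in the 'md5:<hex>' form."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Checksums copied from the file listing; extend with the other files as needed.
EXPECTED = {
    "README.md": "md5:048a683328ada232f8409a3c44d6d763",
    "LICENSE.md": "md5:59b3f9f42565f9f3058ee052a1ac5cff",
}

if __name__ == "__main__":
    # Assumption: the downloaded files are in the current directory.
    for name, checksum in EXPECTED.items():
        target = Path(name)
        if not target.exists():
            print(f"{name}: not found, skipped")
            continue
        status = "OK" if matches_checksum(target, checksum) else "MISMATCH"
        print(f"{name}: {status}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again before it is decompressed or loaded.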