There is a newer version of this record available.

Dataset Open Access

SOTorrent Dataset

Baltes, Sebastian; Dumani, Lorik

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large number of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. The dataset also links SO content to other platforms by aggregating URLs from text blocks and by collecting links from GitHub files to SO posts. Our vision is that, in the future, researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.

The dataset is based on the official Stack Overflow data dump released 2017-12-01 (https://archive.org/details/stackexchange) and the Google BigQuery GitHub data set queried 2018-02-15 (https://cloud.google.com/bigquery/public-data/github). Please read the license files (LICENSE.md) before using the dataset.
Files (80.3 GB)

Name                            Size        Checksum
1_create_database.sql           4.3 kB      md5:c6e59e6cf26244266b6140604d84fb28
2_load_so_from_xml.sql          1.2 kB      md5:df89b09b4f261ff17a5e331bf1364a50
3_create_indices.sql            255 Bytes   md5:dfc610b177d240d7e1ed825787e54ada
4_create_sotorrent_tables.sql   4.0 kB      md5:922463870599861e355fa583a2d54ff8
5_create_sotorrent_user.sql     506 Bytes   md5:4d25b7bb873aa15b8b0ff7f8853a8147
6_load_sotorrent.sql            2.5 kB      md5:d64e2fdfa2e925e8e4c42372d41eae28
7_load_postreferencegh.sql      398 Bytes   md5:4b436a6c0ba2d8d1b3650cbd8bb8f025
8_create_sotorrent_indices.sql  1.1 kB      md5:6bf8397907bfd5e7e3294b9cce84fcee
Badges.xml.gz                   265.9 MB    md5:08f14a0cdccd3f01f98d2dc0f72af702
Comments.xml.gz                 5.3 GB      md5:5468330f602b84da1a452f94dcccc362
LICENSE.md                      38.4 kB     md5:a342f12af3354b2656da59fda2c8b3cb
PostBlockDiff.csv.gz            8.3 GB      md5:311d3ac916b7d9cab333d049a32435e4
PostBlockVersion.csv.gz         18.3 GB     md5:a94da5fc45ff48fc691bb3e1d88c9d5f
PostHistory.xml.gz              28.4 GB     md5:ef60bc0774724df24079dfa2488aeed2
PostLinks.xml.gz                86.4 MB     md5:a513ff2491b9e4b72456d26df4a4695a
PostReferenceGH.csv.gz          337.5 MB    md5:318b21d05919312a5119fab57689da27
Posts.xml.gz                    16.1 GB     md5:d2d99c634e4c0cf112daace2f1e62cd4
PostVersion.csv.gz              917.7 MB    md5:855479cdebfc8702e8a715ddef4a0326
PostVersionUrl.csv.gz           696.6 MB    md5:ce92a21fbd16949f936fb64c1e302a8c
README.md                       1.1 kB      md5:6802d794990a7dc0e0a297da2321cd5b
Tags.xml.gz                     993.8 kB    md5:21856cd5f720cf7f065fa82dde75405e
Users.xml.gz                    510.4 MB    md5:16ff25feb3807f480d36d56795f29f4b
Votes.xml.gz                    1.2 GB      md5:b9dcb2e851b6faa274da2e36a954db12
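
Because several of the archives are tens of gigabytes, it may be worth checking each download against the md5 value listed with it before loading anything into a database. The following is a minimal Python sketch of such a check, not part of the dataset's own tooling. The function names `md5_of_file` and `verify_download` are hypothetical, and the example assumes README.md was saved to the current working directory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's digest with a listed checksum.

    Accepts the value with or without the "md5:" prefix used in the
    file listing, in either letter case.
    """
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


if __name__ == "__main__":
    # Assumption: README.md sits in the current directory; the checksum
    # is the one shown for README.md in the file listing.
    readme = Path("README.md")
    if readme.exists():
        ok = verify_download(readme, "md5:6802d794990a7dc0e0a297da2321cd5b")
        print("README.md:", "ok" if ok else "checksum mismatch")
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for files such as PostHistory.xml.gz (28.4 GB).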
                  All versions  This version
Views             1,263         48
Downloads         4,927         237
Data volume       21.9 TB       653.2 GB
Unique views      1,020         44
Unique downloads  7939
