There is a newer version of this record available.

Dataset Open Access

SOTorrent Data Set

Baltes, Sebastian; Dumani, Lorik

Stack Overflow (SO) is the largest Q&A website for software developers, providing a huge amount of copyable code snippets. Recent studies have shown that developers regularly copy those snippets into their software projects, often without the required attribution. Besides possible licensing issues, maintenance issues may arise because the snippets evolve on SO while the developers who copied them remain unaware of the changes. To help researchers investigate the evolution of code snippets on SO and their relation to other platforms like GitHub, we built SOTorrent, an open data set based on data from the official SO data dump and the Google BigQuery GitHub data set. SOTorrent provides access to the version history of SO content at the level of whole posts and of individual text or code blocks. Moreover, it links SO content to external resources in two ways: (1) by extracting linked URLs from text blocks of SO posts and (2) by providing a table with links to SO posts found in the source code of all projects in the BigQuery GitHub data set.
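To illustrate the first kind of link, the following is a minimal sketch of pulling URLs out of the text of a post. It is not the extraction code used to build SOTorrent: the regular expression, the name `extract_urls`, and the trailing-punctuation rule are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: an http(s) scheme followed by characters that are
# unlikely to belong to the sentence or markup around the URL.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text_block):
    """Return the URLs found in one text block, in order of appearance."""
    urls = []
    for match in URL_PATTERN.findall(text_block):
        # Drop sentence punctuation that the pattern may have swallowed.
        urls.append(match.rstrip(".,;:!?"))
    return urls
```

A real extractor would also have to deal with Markdown reference-style links and bare domains, which this sketch ignores.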

The dataset is based on the official Stack Overflow data dump released 2017-12-01 (https://archive.org/details/stackexchange) and the Google BigQuery GitHub data set queried 2017-11-20 (https://cloud.google.com/bigquery/public-data/github). Please read all three license files (LICENSE_1.txt, LICENSE_2.txt, LICENSE_3.txt) before using the dataset.
Files (76.6 GB)
Name, Size, MD5 checksum
1_create_database.sql, 4.3 kB, md5:c6e59e6cf26244266b6140604d84fb28
2_create_sotorrent_user.sql, 506 Bytes, md5:4d25b7bb873aa15b8b0ff7f8853a8147
3_load_so_from_xml.sql, 1.2 kB, md5:df89b09b4f261ff17a5e331bf1364a50
4_create_indices.sql, 255 Bytes, md5:dfc610b177d240d7e1ed825787e54ada
5_create_sotorrent_tables.sql, 3.8 kB, md5:792d89fef5bbdb7cab67e52db2054bf1
6_import_sotorrent.sql, 2.1 kB, md5:c4923ba6b505cda40e81f71562b38503
7_create_sotorrent_indices.sql, 752 Bytes, md5:3e95f2763006b8a4b86188c14e0e1940
Badges.xml.gz, 265.9 MB, md5:08f14a0cdccd3f01f98d2dc0f72af702
Comments.xml.gz, 5.3 GB, md5:5468330f602b84da1a452f94dcccc362
LICENSE_1, 22.8 kB, md5:7fe0a3c070cf6da7b9b11bb02adad522
LICENSE_2, 248 Bytes, md5:8338f2e6a3dc5c724de1cb6ad8a6f17b
LICENSE_3, 20.6 kB, md5:30560b322dbbddacfc6292942a42732d
PostBlockDiff.csv.gz, 7.2 GB, md5:d96919b855795684aa4d7f779e31cf29
PostBlockDiffOperation.csv.gz, 79 Bytes, md5:52c862330e4850d08848fe141ffe2dc3
PostBlockType.csv.gz, 60 Bytes, md5:2627b668db395aa447be78057b327065
PostBlockVersion.csv.gz, 16.1 GB, md5:1a9f208abd7c25a8cf31c5773a845af8
PostHistory.xml.gz, 28.4 GB, md5:ef60bc0774724df24079dfa2488aeed2
PostLinks.xml.gz, 86.4 MB, md5:a513ff2491b9e4b72456d26df4a4695a
PostReferenceGH.csv.gz, 269.8 MB, md5:53318cf851181d2e0c08cba3f7ce3147
Posts.xml.gz, 16.1 GB, md5:d2d99c634e4c0cf112daace2f1e62cd4
PostType.csv.gz, 135 Bytes, md5:2e73dcd4f791e547254bca618857c746
PostVersion.csv.gz, 571.0 MB, md5:966a3601a19e5016becdd1e555df66bc
PostVersionUrl.csv.gz, 655.3 MB, md5:a98a845fce686c0d196ad6d3f3ade395
README.md, 868 Bytes, md5:dd279bd65996102fb6bee854ccba8137
Tags.xml.gz, 993.8 kB, md5:21856cd5f720cf7f065fa82dde75405e
Users.xml.gz, 510.4 MB, md5:16ff25feb3807f480d36d56795f29f4b
Votes.xml.gz, 1.2 GB, md5:b9dcb2e851b6faa274da2e36a954db12
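
Each file in the list carries an MD5 checksum, which can be used to confirm that a download (some of them tens of gigabytes) arrived intact. The sketch below shows one way to do that with Python's standard `hashlib` module. The function names `md5_of_file` and `matches_listing` are hypothetical helpers and not part of the dataset's tooling.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a local file against a checksum written as 'md5:<hex>'
    (the form used in the file list) or as a bare hex digest."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listing("README.md", "md5:dd279bd65996102fb6bee854ccba8137")` should return `True` for an undamaged copy of the README, assuming it was saved under that name in the current directory.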
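Views: 1,193 (all versions), 28 (this version)
Downloads: 4,741 (all versions), 383 (this version)
Data volume: 20.8 TB (all versions), 1.7 TB (this version)
Unique views: 963 (all versions), 27 (this version)
Unique downloads: 747 (all versions), 25 (this version)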