Dataset Open Access

Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects — Supplementary Material

Baltes, Sebastian

Data collector(s): Kiefer, Richard
Researcher(s): Diehl, Stephan

Background: Stack Overflow (SO) is the largest Q&A website for software developers, providing a large number of copyable code snippets. Using those snippets raises various maintenance and legal issues. SO’s license (CC BY-SA 3.0) requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO’s license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution.

Aim: Our main goal was to analyze how often code from SO posts is used in public GitHub projects, but not attributed as required by the license. Further, we wanted to investigate if developers are aware of SO’s license and its implications, and to what degree they adhere to the attribution requirements defined in SO’s terms of service.

Method: We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java code snippets from SO answers in public GitHub projects. We followed three different approaches to triangulate an estimate for the ratio of unattributed usages and conducted two online surveys with software developers to complement our results.

Results: For the different sets of projects that we analyzed, the share of projects containing files with a reference to SO varied between 3.3% and 11.9%. We found that at most 1.8% of all analyzed repositories containing code from SO used the code in a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a quarter of the copied code snippets from SO are attributed as required, i.e., using a link in a source code comment. About half of the surveyed developers admitted copying code from SO without attribution. Furthermore, about two thirds of them were not aware of the license of SO code snippets and its implications.

Files (86.7 MB)
Name                            Size       MD5
0_README.txt                    278 Bytes  d4c593aa9863d298566d6115540b0988
1_preliminary-study.zip         22.2 kB    01c4ca7389b0df8671d16a99a4d3b00e
2_phase-1.zip                   14.1 MB    227983bcef60b70cd9c0fe36b73775c9
3_phase-2.zip                   1.8 MB     de95dfae08d75ca2a2614d512fac93e7
4_phase-3.zip                   10.4 MB    7f41f87f128ebd5c97a438cc1f6f85a5
5_licensing-conflicts.zip       349.0 kB   c66bf0946550de8ff406473a8768c6d4
6_attribution-requirements.zip  181.3 kB   69268305e6380a8c4403dfb9f315b365
7_awareness-study.zip           1.4 MB     ac35cd2478ce46354d268ee8e63e89eb
8_limitations.zip               58.5 MB    9f039b49889ceb2cb64fbf6eb3e7c9be
LICENSE.txt                     518 Bytes  e5ba95a3e8465c07991d2ce7629d0013
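
The MD5 checksums in the file listing above can be used to confirm that a download is complete and unmodified. The following is a minimal sketch in Python, not part of the supplementary material itself. It assumes the files have been saved under their listed names in a single local directory; the function names md5_of_file and verify_downloads and the directory name are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "0_README.txt": "d4c593aa9863d298566d6115540b0988",
    "1_preliminary-study.zip": "01c4ca7389b0df8671d16a99a4d3b00e",
    "2_phase-1.zip": "227983bcef60b70cd9c0fe36b73775c9",
    "3_phase-2.zip": "de95dfae08d75ca2a2614d512fac93e7",
    "4_phase-3.zip": "7f41f87f128ebd5c97a438cc1f6f85a5",
    "5_licensing-conflicts.zip": "c66bf0946550de8ff406473a8768c6d4",
    "6_attribution-requirements.zip": "69268305e6380a8c4403dfb9f315b365",
    "7_awareness-study.zip": "ac35cd2478ce46354d268ee8e63e89eb",
    "8_limitations.zip": "9f039b49889ceb2cb64fbf6eb3e7c9be",
    "LICENSE.txt": "e5ba95a3e8465c07991d2ce7629d0013",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, checksum in expected.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == checksum
    return results


if __name__ == "__main__":
    # "downloads" is an assumed local directory name.
    for name, ok in verify_downloads("downloads").items():
        print(f"{name}: {'OK' if ok else 'MISSING OR MISMATCH'}")
```

Reading in chunks keeps memory use low for the larger archives such as 8_limitations.zip. A missing file and a checksum mismatch are both reported as a failure, so a failed entry should be re-downloaded before use.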