Published September 26, 2024 | Version 1.0
Dataset Open

CodeSCAN: ScreenCast ANalysis for Video Programming Tutorials

Description

CodeSCAN is the first large-scale and diverse dataset of coding screenshots with pixel-perfect annotations. It features:

  • 24 popular programming languages (according to GitHub)
  • 100 random repositories per language (with MIT, BSD-3 or WTFPL license), i.e. 2,400 repositories in total
  • 5 files per repository, i.e. 12,000 files in total
  • ~100 different themes and 25 different fonts
  • Diverse layout changes, such as menu bar visibility, sidebar position and output window content
  • Numerous realistic interactions, such as searching, typing and selecting within a file

Check our project page for details. The dataset is for academic research use only.

Files

codescan.zip (7.9 GB)
md5:6cc76d8865d37c1e381cd5eb741a57f6

Additional details

Related works

Is supplement to
arXiv:2409.18556 (arXiv)

Software