Published July 25, 2023 | Version 1.0.0
Dataset Open

Notably Inaccessible – Data Driven Understanding of Data Science Notebook (In)Accessibility

  • 1. University of Washington

Description

Overview

This dataset artifact contains the intermediate datasets from pipeline executions necessary to reproduce the results of the paper.
We share this artifact in hopes of providing a starting point for other researchers to extend the analysis on notebooks, discover more about their accessibility, and offer solutions to make data science more accessible. The scripts needed to generate these datasets and analyse them are shared in the GitHub repository for this work.

The dataset expands to files of approximately 60 GB when extracted, so please confirm that you have sufficient disk space before extracting the compressed files.

Some of the included files take a significant amount of script run time to generate or reproduce.
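
As a minimal sketch of the caution advised above, the following Python snippet checks free disk space against the uncompressed size recorded in a zip archive before extracting it. The function names and the `margin` safety factor are illustrative assumptions, not part of the released scripts.

```python
import shutil
import zipfile
from pathlib import Path


def uncompressed_size(archive_path):
    """Return the total uncompressed size, in bytes, of all members of a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def extract_if_space(archive_path, target_dir, margin=1.1):
    """Extract the archive only if the target volume has enough free space.

    `margin` is an illustrative safety factor (an assumption, not a value
    recommended by the dataset authors).
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    needed = int(uncompressed_size(archive_path) * margin)
    free = shutil.disk_usage(target).free
    if free < needed:
        raise OSError(f"Need about {needed} bytes, but only {free} bytes are free")
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return needed
```

For example, `extract_if_space("a11y-scan-dataset.zip", "data/")` would raise an `OSError` instead of filling the disk partway through extraction; the `data/` target directory here is hypothetical.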

Dataset Contents

We briefly summarize the included files in our dataset. Please refer to the documentation for specific information about the structure of the data in these files, the scripts to generate them, and runtimes for various parts of our data processing pipeline.

  1. epoch_9_loss_0.04706_testAcc_0.96867_X_resnext101_docSeg.pth: We share this model file, originally provided by Jobin et al., to enable the classification of figures found in our dataset. Please place this into the `model/` directory.
  2. model-results.csv: This file contains results from the classification performed on the figures found in the notebooks in our dataset.

    Performing this classification may take up to a day.

  3. a11y-scan-dataset.zip: This archive contains two files that expand to approximately 60 GB when extracted. Please ensure that you have sufficient disk space to uncompress this zip archive. The archive contains:

    • a11y/a11y-detailed-result.csv: This dataset contains the accessibility scan results from the scans run on the 100k notebooks across themes.

      The detailed result file is very large (> 60 GB) and time-consuming to construct.
    • a11y/a11y-aggregate-scan.csv: This file is an aggregate of the detailed result that contains the number of each type of error found in each notebook.

      This file is also shared outside the compressed directory.
  4. errors-different-counts-a11y-analyze-errors-summary.csv: This file contains the counts of errors that occur in notebooks across different themes.

  5. nb_processed_cell_html.csv: This file contains metadata corresponding to each cell extracted from the HTML exports of our notebooks.

  6. nb_first_interactive_cell.csv: This file contains the necessary metadata to compute the first interactive element, as defined in our paper, in each notebook.

  7. nb_processed.csv: This file contains the data obtained by processing the notebooks: the number of images, imports, languages, and cell-level information.

  8. processed_function_calls.csv: This file contains information about the notebooks and the imports and function calls used within them.
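
Because some of the CSV files listed above are too large to load into memory at once, one option is to stream them row by row. The sketch below counts how often each value of a column occurs using only the Python standard library. The column name passed in is a placeholder: the actual column names are described in the documentation and are not assumed here.

```python
import csv
from collections import Counter


def count_values(csv_path, column):
    """Stream a CSV row by row and count how often each value of `column` occurs."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Fail early if the requested column is not in the header row.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(f"{column!r} is not a column in {csv_path}")
        for row in reader:
            counts[row[column]] += 1
    return counts
```

Only one row is held in memory at a time, so the same function applies to the aggregate file and to the much larger detailed result file.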

Files

Notably Inaccessible.zip (1.8 GB)
md5:6c74dc3b43226ffae8007b9fd2760d3a
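
The MD5 checksum listed above can be used to confirm that the archive downloaded intact. A minimal sketch using Python's `hashlib`, reading the file in chunks so that it does not need to fit in memory; the function name and chunk size are illustrative choices:

```python
import hashlib

# Checksum copied from the file listing for Notably Inaccessible.zip.
EXPECTED_MD5 = "6c74dc3b43226ffae8007b9fd2760d3a"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After downloading, comparing `md5_of_file("Notably Inaccessible.zip")` with `EXPECTED_MD5` indicates whether the local copy matches the published file.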

Additional details

Related works

Is part of
Conference paper: 10.1145/3597638.3608417 (DOI)

Funding

U.S. National Science Foundation
Using Passive Sensing to Assess the Impact of Real-Time Discrimination against Women and Underrepresented Minorities in Engineering 2009977

References

  • Jobin, K.V., Mondal, A. and Jawahar, C.V., 2019, September. Docfigure: A dataset for scientific document figure classification. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW) (Vol. 1, pp. 74-79). IEEE.