Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study"

Pepe, Federica; Nardone, Vittoria; Mastropaolo, Antonio; Canfora, Gerardo; Bavota, Gabriele; Di Penta, Massimiliano

doi:10.5281/zenodo.10058142

Published October 31, 2023 | Version v2

Dataset Open

1. University of Sannio
2. University of Molise
3. Università della Svizzera italiana
4. Università degli Studi del Sannio

This replication package contains datasets and scripts related to the paper: "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*"

## Root directory

- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements

- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)

- `script`: directory containing all the scripts used to collect and process data. For further details, see the README file inside that directory.

## Dataset

- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed

- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library

- `Dataset/Dataset_github-Prj_model-Used.csv`: contains usage pairs: project, model

- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project

- `Dataset/Dataset_model-download_num-prj_correlation.csv`: contains, for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads
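
The correlation between model usage and downloads can be recomputed from the last file in this list. The package does this in `statistics.r`; the sketch below is only an illustrative Python alternative. It assumes a Spearman rank correlation (this README does not name the coefficient) and uses the hypothetical column headers `num_projects` and `downloads`, which should be replaced with the headers actually present in the CSV.

```python
import csv
import io


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def correlation_from_csv(handle, x_col="num_projects", y_col="downloads"):
    """Read two numeric columns (hypothetical names) and correlate them."""
    rows = list(csv.DictReader(handle))
    xs = [float(row[x_col]) for row in rows]
    ys = [float(row[y_col]) for row in rows]
    return spearman(xs, ys)
```

To use it, open the CSV with `open(path, newline="")` and pass the file object to `correlation_from_csv`; an `io.StringIO` works equally well for a quick check.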

## RQ1

- `RQ1/RQ1_dataset-list.txt`: list of HF datasets

- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets

- `RQ1/RQ1_analyzeDatasetTags.py`: Python script that analyzes model tags for the presence of datasets. It requires `modelsInfo.zip` to be unzipped into a directory with the same name (`modelsInfo`) at the root of the replication package folder. The script writes its output to stdout; redirect it to a file so that it can be analyzed by the `RQ1/RQ1_countDataset.py` script

- `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis

- `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`

- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`
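
The two RQ1 scripts are meant to be chained: unzip the model cards, run the tag analysis with stdout redirected to a file, then pass that file to the counting script. A minimal sketch of that sequence follows. It makes several assumptions not confirmed by this README: the archive holds the JSON files at its top level, both scripts are launched from the package root, the counting script also writes to stdout, and the output file names are placeholders.

```python
import subprocess
import sys
import zipfile
from pathlib import Path


def extract_model_cards(root):
    """Unzip modelsInfo.zip into <root>/modelsInfo.

    Assumption: the archive stores the JSON files at its top level. If it
    already contains a modelsInfo/ folder, extract into <root> instead.
    """
    root = Path(root)
    target = root / "modelsInfo"
    with zipfile.ZipFile(root / "modelsInfo.zip") as archive:
        archive.extractall(target)
    return target


def run_rq1(root, tags_file="datasetTags.out", counts_file="datasetCounts.out"):
    """Run the two RQ1 scripts in order (output file names are hypothetical)."""
    root = Path(root)
    extract_model_cards(root)
    # Step 1: the tag analysis prints to stdout, so capture it in a file.
    with open(root / tags_file, "w") as out:
        subprocess.run(
            [sys.executable, "RQ1/RQ1_analyzeDatasetTags.py"],
            cwd=root, stdout=out, check=True,
        )
    # Step 2: the counting script takes that file as its argument.
    # Assumption: it prints its result to stdout as well.
    with open(root / counts_file, "w") as out:
        subprocess.run(
            [sys.executable, "RQ1/RQ1_countDataset.py", tags_file],
            cwd=root, stdout=out, check=True,
        )
```

Calling `run_rq1(".")` from the root of the unpacked replication package would run both steps; `check=True` stops the sequence if either script exits with an error.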

## RQ2

- `RQ2/tableBias.pdf`: table detailing the number of occurrences of each type of bias, by model task

- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling

- `RQ2/RQ2_isBiased.csv`: file used to compute the inter-rater agreement on whether or not a model documents bias

- `RQ2/RQ2_biasAgrLabels.csv`: file used to compute the inter-rater agreement on bias categories

- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category
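
The two agreement files above feed an inter-rater agreement computation, which the package performs in `statistics.r`. This README does not say which agreement statistic is used, so the sketch below assumes Cohen's kappa for two raters purely as an illustration; it takes two equally long lists of labels rather than reading the CSV files, whose column layout is not described here.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)
```

For example, `cohen_kappa(["yes", "yes", "no", "no"], ["yes", "no", "no", "no"])` gives an observed agreement of 0.75 against a chance agreement of 0.5, hence a kappa of 0.5.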

## RQ3

- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses

- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different permissiveness

- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, indicates (among other fields) the license tag and name

- `RQ3/RQ3_models_license.csv`: for each model, indicates (among other fields) whether the model has a license and, if so, what kind of license

- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)

- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: project-model pairs, with their respective licenses and permissiveness levels
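
The contingency table can in principle be rebuilt from the project-model pairs. The sketch below shows the idea on plain tuples, with model licenses as rows and project licenses as columns, matching the orientation described above. It is not the authors' script: the input format is hypothetical, and whether the published table is keyed by individual license or by permissiveness level is not stated in this README.

```python
from collections import Counter


def contingency_table(pairs):
    """Count (model_license, project_license) pairs.

    Returns (row_labels, column_labels, table), where table[i][j] is the
    number of pairs with model license row_labels[i] and project license
    column_labels[j].
    """
    counts = Counter(pairs)
    rows = sorted({model for model, _ in counts})
    cols = sorted({project for _, project in counts})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(model, project)] for project in cols] for model in rows]
    return rows, cols, table
```

Feeding it the permissiveness levels of each pair (for instance `("PERMISSIVE", "RESTRICTIVE")` for a permissively licensed model used by a restrictively licensed project) yields a level-by-level table; feeding it license names yields a license-by-license one.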

## scripts

Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.

Files

- `hf-study-replication.zip` (143.0 MB), md5: `8e471c41504e6cb2ca7d0f749a4ffd7a`

