DOCUMENTATION OF RESEARCH DATA AND CODE

The life cycle of the data and the code is described in detail in chapter 3 (Methodology) of the thesis. The Data Management Plan (DMP) can also be helpful.

If a property has more than one value for a repository, the values were combined in one cell and separated by " | ".


CODE

re3data_extract.ipynb
- Jupyter Notebook in R fetching data about repositories from re3data's API

re3data_normalize.ipynb
- Jupyter Notebook in R normalizing single fields and splitting up the size property

re3data_analysis.ipynb
- Jupyter Notebook in R uncollapsing rows for the size sublevel and analysing fields, with on-the-fly modifications and plots


DATASET RE3DATA

REAL_repository_info_v1.csv
- Output of re3data_extract.ipynb
- rows: 3037 repository records
- columns: 26 properties from re3data Schema 2.2 https://doi.org/10.2312/re3.006
    "re3data.orgIdentifier"
    "repositoryName"
    "repositoryURL"
    "repositoryIdentifier"
    "description"
    "type"
    "size"
    "updated"
    "startDate"
    "endDate"
    "subject"
    "contentType"
    "providerType"
    "keyword"
    "databaseAccessType"
    "dataUploadType"
    "softwareName"
    "api"
    "apiType"
    "pidSystem"
    "enhancedPublication"
    "certificate"
    "metadataStandardName"
    "remarks"
    "entryDate"
    "lastUpdate"

REAL_repository_info_v2.csv
- Output of re3data_normalize.ipynb
- 28 columns: [from REAL_repository_info_v1.csv]
  + added columns:
    "size_number": count of a size unit; numeric, e.g. 272705, except for ranges, e.g. 10-50
    "size_unit": unit of a size entity; string, e.g. records
  modified columns:
    "softwareName": -unknown OR -none
    "pidSystem": -unknown OR -none
    "enhancedPublication": -unknown OR -none
    "startDate": simplified to YYYY
    "endDate": simplified to YYYY

REAL_repository_info_v3.xlsx
- Intellectual Typing and Data Manipulation
- 33 columns: [from REAL_repository_info_v2.csv]
  + added columns:
    "size_unit_type": typing of size_unit in the dimension of quantity or volume
        Controlled vocabulary: quantity, volume
    "size_unit_discipline": typing of size_unit in the dimension of discipline
        Controlled vocabulary: generic, disciplinary
    "size_unit_content": typing of size_unit in the dimension of content
        Controlled vocabulary: neutral, other
    "size_non_unit_type": typing of the remaining size values besides number and unit
        Controlled vocabulary: accuracy, detail, entirety, partof, omittable
    "size_quality": quality issues in size
        Controlled vocabulary: dot error, comma error, distiribution error, language error, number error, orthographic error, range, semantic error, syntax error
    "updated_corrected": correction of updated where updated_quality reports an issue; date format, e.g. 1/1/2021
    "updated_quality": quality issues in updated
        Controlled vocabulary: granularity, format error, missing, no size, to high
  - removed columns:
    "subject"
    "keyword"

REAL_repository_info_v3.csv
- same as REAL_repository_info_v3.xlsx

REAL_repository_info_v4.csv
- Output of re3data_analysis.ipynb
- can be joined with REAL_repository_info_v3.csv by "re3data.orgIdentifier"
- rows: 4293 (one row per size entry of a repository)
- 4 columns: [from REAL_repository_info_v3.csv]
    "re3data.orgIdentifier"
    "size_number"
    "size_unit"
    "size_unit_content"
  + added column:
    "size_unit_content_other": detailed typing of size_non_unit_type for value other
        Controlled vocabulary: abstract, file, method, nodata, physical
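
EXAMPLE (illustrative sketch, not part of the published code)

The notebooks listed above are written in R. The short Python sketch below is only meant to illustrate two conventions described in this documentation: splitting cells that hold several values separated by " | ", and joining the size rows of REAL_repository_info_v4.csv with the repository records of REAL_repository_info_v3.csv by "re3data.orgIdentifier". The helper names (split_values, read_rows, join_sizes) and the sample rows are invented for illustration; only the column names and the separator are taken from this documentation.

```python
import csv
import io

SEPARATOR = " | "  # multi-value separator described in this documentation
KEY = "re3data.orgIdentifier"  # join column shared by the v3 and v4 files

# Invented sample rows; the real files have more columns and other values.
V3_SAMPLE = """
re3data.orgIdentifier,repositoryName,type
example-0001,Example Repository A,disciplinary | institutional
example-0002,Example Repository B,other
"""

V4_SAMPLE = """
re3data.orgIdentifier,size_number,size_unit,size_unit_content,size_unit_content_other
example-0001,120,datasets,neutral,
example-0001,10-50,files,other,file
example-0002,7,records,neutral,
"""


def split_values(cell):
    """Split a combined cell into its single values; an empty cell gives []."""
    return [value.strip() for value in cell.split(SEPARATOR) if value.strip()]


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text.strip())))


def join_sizes(v3_rows, v4_rows, key=KEY):
    """Attach the matching v3 repository record to every v4 size row."""
    by_id = {row[key]: row for row in v3_rows}
    # Every size row is kept, even if no repository record matches it.
    return [{**by_id.get(size[key], {}), **size} for size in v4_rows]


joined = join_sizes(read_rows(V3_SAMPLE), read_rows(V4_SAMPLE))

for row in joined:
    print(row[KEY], split_values(row["type"]), row["size_number"], row["size_unit"])
```

To work with the real files, read_rows could be replaced by csv.DictReader over an opened file. The join is driven by the size rows, so a repository with several sizes appears once per size.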