DOCUMENTATION OF RESEARCH DATA AND CODE

The life cycle of the data and the code is described in detail in chapter 3 (Methodology) of the thesis. The Data Management Plan (DMP) may also be helpful.

If a field holds more than one value, the values are combined in one cell and separated by " | ".


CODE

re3data_extract.ipynb - Jupyter Notebook in R fetching data about repositories from re3data's API

re3data_normalize.ipynb - Jupyter Notebook in R normalizing single fields and splitting up the size property

re3data_analysis.ipynb - Jupyter Notebook in R uncollapsing rows for the size sublevel and analysing fields, with on-the-fly modifications and plots


DATASET RE3DATA

REAL_repository_info_v1.csv
- Output of re3data_extract.ipynb
- rows: 3037 repository records
- columns: 26 properties from re3data Schema 2.2
    "re3data.orgIdentifier" "repositoryName" "repositoryURL" "repositoryIdentifier"
    "description" "type" "size" "updated" "startDate" "endDate" "subject"
    "contentType" "providerType" "keyword" "databaseAccessType" "dataUploadType"
    "softwareName" "api" "apiType" "pidSystem" "enhancedPublication" "certificate"
    "metadataStandardName" "remarks" "entryDate" "lastUpdate"

REAL_repository_info_v2.csv
- Output of re3data_normalize.ipynb
- 28 columns: [from REAL_repository_info_v1.csv]
    + "size_number": count for a size unit; numeric, e.g. 272705, except for ranges, e.g. 10-50
    + "size_unit": unit of a size entity; string, e.g. records
    "softwareName": -unknown OR -none
    "pidSystem": -unknown OR -none
    "enhancedPublication": -unknown OR -none
    "startDate": simplified to YYYY
    "endDate": simplified to YYYY

REAL_repository_info_v3.xlsx
- Intellectual Typing and Data Manipulation
- 33 columns: [from REAL_repository_info_v2.csv]
    + "size_unit_type": typing of size_unit in the dimension of quantity or volume. Controlled vocabulary: quantity, volume
    + "size_unit_discipline": typing of size_unit in the dimension of discipline. Controlled vocabulary: generic, disciplinary
    + "size_unit_content": typing of size_unit in the dimension of content. Controlled vocabulary: neutral, other
    + "size_non_unit_type": typing of the remaining size values besides number and unit. Controlled vocabulary: accuracy, detail, entirety, partof, omittable
    + "size_quality": quality issues in size. Controlled vocabulary: dot error, comma error, distiribution error, language error, number error, orthographic error, range, semantic error, syntax error
    + "updated_corrected": correction of "updated" for the issues found by "updated_quality"; date format, e.g. 1/1/2021
    + "updated_quality": quality issues in "updated". Controlled vocabulary: granularity, format error, missing, no size, to high
    - "subject"
    - "keyword"

REAL_repository_info_v3.csv
- same as REAL_repository_info_v3.xlsx

REAL_repository_info_v4.csv
- Output of re3data_analysis.ipynb
- can be joined with REAL_repository_info_v3.csv by "re3data.orgIdentifier"
- rows: 4293 size entries, one per size value of a repository
- 4 columns: [from REAL_repository_info_v3.csv]
    "re3data.orgIdentifier" "size_number" "size_unit" "size_unit_content"
    + "size_unit_content_other": detailed typing of size_unit_content for the value other. Controlled vocabulary: abstract, file, method, nodata, physical
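
EXAMPLE (illustrative only)

The join by "re3data.orgIdentifier" and the " | " separator for multi-valued cells can be sketched as follows. The notebooks of this project are written in R; this is a minimal Python sketch using only the standard library, not the code used in the thesis. The function names split_multi and join_sizes, the identifier r3d-example-1 and all sample rows are invented for illustration; only the column names, the separator and the vocabulary values come from this documentation.

```python
import csv
import io

SEPARATOR = " | "  # multi-value separator stated in this documentation


def split_multi(cell):
    """Split a cell holding several values into a list; empty cells give []."""
    return [value.strip() for value in cell.split(SEPARATOR) if value.strip()]


def join_sizes(v3_rows, v4_rows, key="re3data.orgIdentifier"):
    """Attach each v4 size row to its v3 repository record via the shared key."""
    by_id = {row[key]: row for row in v3_rows}
    joined = []
    for size in v4_rows:
        repo = by_id.get(size[key])
        if repo is None:
            continue  # size row without a matching repository record
        merged = dict(repo)
        merged.update(size)  # v4 values win where column names overlap
        joined.append(merged)
    return joined


# Invented sample data standing in for REAL_repository_info_v3.csv and
# REAL_repository_info_v4.csv; real files would be opened with open(...).
V3_SAMPLE = """re3data.orgIdentifier,repositoryName,contentType
r3d-example-1,Example Repository,Images | Raw data
"""
V4_SAMPLE = """re3data.orgIdentifier,size_number,size_unit,size_unit_content,size_unit_content_other
r3d-example-1,120,datasets,neutral,
r3d-example-1,3,instruments,other,physical
"""

v3_rows = list(csv.DictReader(io.StringIO(V3_SAMPLE)))
v4_rows = list(csv.DictReader(io.StringIO(V4_SAMPLE)))
joined = join_sizes(v3_rows, v4_rows)

for row in joined:
    print(row["repositoryName"], row["size_number"], row["size_unit"],
          split_multi(row["contentType"]))
```

Because REAL_repository_info_v4.csv holds one row per size value, the join yields several rows for a repository with several sizes. Multi-valued cells are split only when needed, so the joined rows keep the same cell format as the CSV files.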