ConnOSS and Metadata Extraction for Research Software
Authors/Creators
Description
Metadata and software descriptors help realize the FAIR principles for software. Various efforts exist around research software metadata (e.g., CodeMeta, Bioschemas, maSMP schema) as well as metadata extraction (e.g., SOMEF, HERMES, MAUS). Despite these efforts, existing tools and schemas remain fragmented, cover limited metadata, and are rarely built for large-scale, automated processing or enrichment with modern AI techniques. To address this gap, the Connected Open Source Software (ConnOSS) project aims to provide a consistent infrastructure for metadata extraction and publication, enabling researchers to create harmonized software descriptions and facilitating metadata harvesting by registries and aggregators. The project will analyze and extend existing research software metadata schemas, identify metadata sources, and develop a harmonized extraction pipeline for platforms like GitHub and GitLab. It plans to train machine learning models on a curated corpus to extract, enrich, and validate metadata from README files, addressing current automation gaps. A publication workflow is then intended to make the metadata accessible to humans and machines via GitHub/GitLab pages. In this poster, we introduce ConnOSS and present a preliminary comparison of existing research software metadata extractors, which will later be used to define the requirements for the ConnOSS metadata extractor.
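To illustrate the kind of README-to-metadata mapping described above, the following is a minimal sketch and not the ConnOSS extractor. It uses simple regular-expression heuristics as a stand-in for the planned machine learning models and maps a Markdown README to a few CodeMeta terms (`name`, `description`, `codeRepository`). The function name `extract_readme_metadata`, the sample README, and the choice of the CodeMeta 3.0 context are assumptions made for illustration only.

```python
import json
import re

# Assumption: the CodeMeta 3.0 JSON-LD context is used for the output record.
CODEMETA_CONTEXT = "https://w3id.org/codemeta/3.0"


def extract_readme_metadata(readme_text: str) -> dict:
    """Heuristically map a Markdown README to a few CodeMeta terms.

    Hypothetical helper for illustration; real extractors use far
    richer rules or trained models.
    """
    record = {"@context": CODEMETA_CONTEXT, "@type": "SoftwareSourceCode"}

    # Take the first level-1 heading as the software name.
    title = re.search(r"^#\s+(.+?)\s*$", readme_text, flags=re.MULTILINE)
    if title:
        record["name"] = title.group(1)

    # Take the first non-heading paragraph as the description.
    for block in re.split(r"\n\s*\n", readme_text):
        text = " ".join(line.strip() for line in block.strip().splitlines())
        if text and not text.startswith("#"):
            record["description"] = text
            break

    # Take the first GitHub or GitLab URL as the code repository.
    repo = re.search(
        r"https?://(?:github\.com|gitlab\.com)/[\w.-]+/[\w.-]+", readme_text
    )
    if repo:
        record["codeRepository"] = repo.group(0)

    return record


# Made-up README used only to exercise the sketch.
SAMPLE_README = """# demo-tool

A small example tool for parsing log files.

Source: https://github.com/example/demo-tool
"""

if __name__ == "__main__":
    print(json.dumps(extract_readme_metadata(SAMPLE_README), indent=2))
```

A record like this could be stored as a `codemeta.json` file alongside the source code, which is one way harvesters might pick it up. Fields that the heuristics cannot find are simply left out rather than guessed.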
This work is a contribution to the deRSE 2026 Conference; see https://events.hifis.net/event/2945/contributions/21334/
This work has been supported by the German Research Foundation (DFG) through the project ConnOSS with project number 561044496.