Published December 15, 2025 | Version v3.0
Software Open

StockholmUniversityRDMteam/RDMtoolkit4suRe-use: Public version 3.0

  • 1. Stockholm University

Description

Abstract (English)

Stockholm University _harvest combine_ for research data published in external repositories  

This toolkit contains the tools used to harvest research data deposited by Stockholm University affiliated researchers in repositories such as Dataverse, Dryad, Figshare and Zenodo, and to transform them for archival purposes.

This version of the toolkit also adds our tools for processing machine-actionable DMPs (maDMPs) created with our so-called SU-VR template in DMP Online, so that they comply with the RDA DMP Common Standard maDMP-schema v1.1 and the more recent v1.2.

This project corresponds in part to the repository kept at https://gitlab.com/JoakimPhilipson/StockholmUniversityRDMteamTransfer/wikis/Stockholm-University-Research-Data-Harvestor-and-Transformer/

We initially provided the tools used for this purpose in the management of our Figshare institutional instance. These have now developed from two basic files, one .xquery and one .xslt (plus a bash script, .sh, for creating directories or subfolders and moving the original metadata files of items). Their purpose is to harvest METS files from the su.figshare.com instance by means of OAI-PMH and to transform them into SIP metadata files compliant with the FGS (Förvaltningsgemensam specifikation) - CSPackage (Common Specification) model of Riksarkivet (RA), the National Archives of Sweden. Since the METS metadata obtained from Figshare via OAI-PMH is insufficient for this purpose, it is supplemented by metadata from the Figshare API and other sources.
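As a rough illustration of the OAI-PMH harvesting step, the following Python sketch builds a ListRecords request URL. It is not taken from the toolkit, whose harvesting is done with bash and XQuery. The base URL in the usage note is a placeholder, and the "mets" metadata prefix is an assumption that should be checked against the ListMetadataFormats response of the endpoint actually used.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     resumption_token=None):
    """Build an OAI-PMH ListRecords request URL.

    base_url, metadata_prefix and set_spec are supplied by the caller;
    nothing here is specific to any one repository.
    """
    params = {"verb": "ListRecords"}
    if resumption_token:
        # In OAI-PMH 2.0 resumptionToken is an exclusive argument:
        # it must not be combined with metadataPrefix or set.
        params["resumptionToken"] = resumption_token
    else:
        if not metadata_prefix:
            raise ValueError("metadata_prefix is required for a first request")
        params["metadataPrefix"] = metadata_prefix
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url("https://example.org/oai", "mets")` gives the first request, and each later page is requested by passing only the `resumption_token` returned in the previous response.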

The script files (.txt, .sh, .xq and .xsl) for harvesting from the different repositories have been numbered to reflect their order of use.

The former FGS1.2 (2017), Förvaltningsgemensam specifikation för paketstruktur för e-arkiv. Specifikation, RAFGS1V1.2, which defined the earlier FGS-CSPackage structure, has now been replaced by FGS Paketstruktur 2.0, Förvaltningsgemensam specifikation för paketstruktur för e-arkiv, RAFGS1V2.0. This is an implementation of the Common Specification for Information Packages, E-ARK CSIP version 2.1.0, and the Specification for Submission Information Packages, E-ARK SIP version 2.1.0.

Most of the files needed for Figshare harvest and transformation are now in the figsHarvesTransform directory. Only older versions and files that are common to harvest and transformation for all repositories sit directly at the root, notably the important parameter file filext2mimetypeMapMAIN.xml.

In the present workflow for Figshare, the 2et4extractFigsFileInfo.xq file has two different "modes", with different sections activated automatically depending on whether the input file represents a feed with several item records or an individual item:

(i) The first mode splits an OAI-PMH METS feed into separate items, "$artId_originalMD", and by default produces the "file_infoFeed", which is now amended and renamed automatically by 3dir-mvOrigMDfigMETS.sh to become well-formed XML. Together with the corresponding original METS feed, this serves as a kind of table of contents (ToC) in the final figsMETSfeedNNpacs directory. In this mode, section 0 is turned on and section 5 (for fetching data files) is turned off.

(ii) The second mode creates the necessary supplementary metadata file file_info.xml, to be used as a parameter in the subsequent 5figMETS2fgs.xsl transformation of each item record to a METS.xml that is compliant with FGS 2.0. This is also when the actual data files belonging to an item are fetched to the corresponding package subdirectory. In this mode, section 5 (fetching the data files) is therefore turned on.
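The two modes can be pictured with a small Python sketch. It is only an illustration, not the XQuery logic of 2et4extractFigsFileInfo.xq: the rule used here (more than one OAI-PMH record element means "feed") and the function names are assumptions, and SAMPLE_FEED is made-up test data.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
RECORD = f"{{{OAI_NS}}}record"
IDENTIFIER = f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier"

# Made-up miniature feed, used only to exercise the functions below.
SAMPLE_FEED = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example:1</identifier></header></record>
    <record><header><identifier>oai:example:2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""


def detect_mode(xml_text):
    """Return 'feed' if the input holds several records, else 'item'.

    Assumption: counting <record> elements is enough to tell the two
    kinds of input apart.
    """
    root = ET.fromstring(xml_text)
    return "feed" if len(list(root.iter(RECORD))) > 1 else "item"


def split_records(xml_text):
    """Split a feed into (identifier, record_xml) pairs, one per item."""
    root = ET.fromstring(xml_text)
    items = []
    for record in root.iter(RECORD):
        identifier = record.findtext(IDENTIFIER, default="unknown")
        items.append((identifier, ET.tostring(record, encoding="unicode")))
    return items
```

In feed mode each pair returned by `split_records` would be written to its own item metadata file; in item mode the split is skipped and the file-level work (file_info.xml, fetching data files) takes place instead.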

The XSLT file 5figMETS2fgs.xsl now has two parameters to be specified in the setup:

  • file_info_data = 'doc('${cfdu}/file_info.xml')'
  • filext2mimetypeMap = 'doc('${cfdu}/filext2mimetypeMapMAIN.xml')'

Both are essential and must be set up as XPath expressions (tick the box). The transformation should preferably be run only on individual items that have been split up from the original feeds. To work together, the specific 'file_info.xml' parameter document (created by 2et4extractFigsFileInfo.xq) and the item _originalMD.xml file (split up from the original OAI-PMH feed) on which the XSLT is run should both be in the same folder. (This is essentially the same for all four repositories represented here.)
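A quick pre-flight check of the "same folder" requirement could look like the Python sketch below. It assumes that ${cfdu} resolves to the directory of the file being transformed, so that both parameter documents are expected next to the item metadata file; the function name is hypothetical and not part of the toolkit.

```python
from pathlib import Path


def missing_parameter_docs(item_md,
                           names=("file_info.xml",
                                  "filext2mimetypeMapMAIN.xml")):
    """List parameter documents that are absent from the item's folder.

    item_md is the path to an item metadata file such as
    <id>_originalMD.xml; an empty list means the transformation
    has both documents it refers to.
    """
    folder = Path(item_md).parent
    return [name for name in names if not (folder / name).is_file()]
```

Running this over each package folder before the transformation makes a missing file_info.xml visible early, rather than as an empty or incomplete output.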

For dataverseHarvesTransform, the first parameter should instead be:

  • file_info_data = 'doc('${cfdu}/nativeMDfile_info.xml')'.

dryadHarvesTransform has instead an extra parameter that is probably no longer needed:

  • checkSums=doc('${cfdu}/checkSums.xml').

This now serves more as a fallback, since file metadata for newer Dryad datasets already comes with a built-in SHA256 checksum, which is used as the first choice in the dryad6DataCite2FGS.xsl script.
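The "first choice, then fallback" rule can be sketched in Python as follows. This is an illustration only, not the logic of dryad6DataCite2FGS.xsl: the dictionary stands in for the contents of checkSums.xml, whose actual structure is not described here, and the function name is hypothetical.

```python
def pick_checksum(embedded_sha256, fallback_checksums, file_name):
    """Choose a checksum for one data file.

    embedded_sha256:    value delivered with the file metadata, or None
    fallback_checksums: dict of file name -> checksum (a stand-in for
                        checkSums.xml)
    Returns (checksum, source), where source is 'metadata',
    'fallback' or 'missing'.
    """
    if embedded_sha256:
        # Newer datasets: the checksum that comes with the metadata wins.
        return embedded_sha256.lower(), "metadata"
    if file_name in fallback_checksums:
        # Older datasets: use the separately supplied value.
        return fallback_checksums[file_name].lower(), "fallback"
    return None, "missing"
```

Returning the source alongside the value makes it easy to see how often the fallback is still used, which is one way to judge whether the extra parameter can be retired.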

The bash script file 3dir-mvOrigMDfigMETS.sh in figsHarvesTransform creates the package folders and moves the original metadata XML files (*MD.xml) that were split up from the original METS feed into their respective package folders. The associated data files subsequently end up in the same folders during the next run of the xq file.
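The folder-and-move step can be sketched in Python as below. This is not the content of 3dir-mvOrigMDfigMETS.sh; in particular, naming each package folder after the part of the file name before "_originalMD.xml" is an assumption made for the example.

```python
import shutil
from pathlib import Path


def move_into_package_folders(work_dir, suffix="_originalMD.xml"):
    """Create one folder per item and move its metadata file into it.

    Assumption: the folder name is the file name minus `suffix`.
    Returns the new paths of the moved files.
    """
    work_dir = Path(work_dir)
    moved = []
    # glob() is not recursive, so files already moved are not matched again.
    for md_file in sorted(work_dir.glob(f"*{suffix}")):
        item_id = md_file.name[: -len(suffix)]
        if not item_id:
            continue  # skip a file that consists of the suffix only
        package_dir = work_dir / item_id
        package_dir.mkdir(exist_ok=True)
        target = package_dir / md_file.name
        shutil.move(str(md_file), str(target))
        moved.append(target)
    return moved
```

After this step every item has its own folder holding its metadata file, ready to receive the data files fetched in the second run.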

This version also comes with a similar directory of script files for Zenodo, zenHarvesTranform: 1zenSUBdataciteFeedFirst.sh, 2zenMiniSplit.xq, 3zenDir-mvOrigMD, 4extractZenFileInfo.xq and 5zeno2fgs.xsl. These scripts are used for Zenodo in much the same way as those for Figshare, except that there is now a recently introduced automatic feed fetcher as the first step, 1zenSUBdataciteFeedFirst.sh, and a separate 2zenMiniSplit.xq script for doing the split.

The corresponding main tools for Dataverse (which has only two bash scripts and one XSLT script; no XQuery is needed here) and Dryad have now also been added to this repository. For each repository there is also a schemas/ folder containing some of the validation schemas for the resulting output mets.xml file.

 

Files

StockholmUniversityRDMteam/RDMtoolkit4suRe-use-v3.0.zip

Files (332.5 kB)

Additional details

Dates

Issued
2025-12-22