1. Preservation in Collaboration with NFDI4Chem
1a. Collaboration and stakeholders
The Chemotion repository team collaborates with other services within the
National Research Data Infrastructure for Chemistry (NFDI4Chem) in managing and
operating the digital preservation system. This is particularly important
because the NFDI4Chem consortium maintains and supports certain software needed
to read and reuse data in the Chemotion repository, and because NFDI4Chem is in
direct contact with standardization organizations such as IUPAC. In addition,
collaborators within NFDI4Chem offer services based on the data of the
Chemotion repository, so changes need to be coordinated with them.
1b. Stakeholders: Meeting frequency
The steering committee (including a representative of the Chemotion repository)
meets on a weekly basis to discuss important aspects related to the services
within NFDI4Chem; the working group “repositories” meets every two months.
1c. Training support
The repository admin team is supported by the NFDI4Chem repository team, which
trains the repository admin team and advises it on preservation standards and
measures to be taken. The NFDI4Chem advisors are invited to the yearly
assessment of the preservation measures (see point 5).
2. General assessment of data relevant for preservation
The data stored in the Chemotion repository needs to be preserved to give
future generations of chemists easy access to data that cannot be reproduced
easily. The data is needed for spectral comparisons or the reproduction of
chemical reactions. In most cases, the data and metadata remain relevant as
long as there are no fundamental changes in the way chemical compounds are
characterized or in the way new chemical structures are synthesized. It is the
depositor's responsibility to ensure that the data is in the accepted formats,
that the metadata follows the accepted schema, and that the metadata is
accurate and complete. The depositor's responsibility ends when the data is
accepted for publication in the repository. After acceptance, the repository
operator is responsible for the data. As a prerequisite for preservation, five
topics are of high importance:
(1) Already at ingest, the data has to be added according to the provided
metadata schema and be well-structured, and data files need to be available in
the preferred standardized formats or be converted to them.
(2) Easy access to the data and the required software has to be provided,
allowing users to read, understand and evaluate data online.
(3) The download of the data files and other data stored in the database in a
readable and reusable form has to be supported.
(4) Access to the metadata in updated metadata schemas has to be guaranteed.
(5) A strategy for the versioning of data and metadata needs to be
implemented.
3. Measures to preserve data
To reach the preservation goals, the following measures are taken:
(1) The preservation of the data is supported by well-established, open, and
widely used file formats and metadata schemas. Therefore, whenever possible,
vendor-specific or proprietary formats are converted to open formats for
long-term preservation. The initial conversion takes place during the ingest
of the data by the provider and is supported by automatic processing routines.
(2) Ensuring that data can be read, understood and evaluated online: The
supported data and file standards are chosen so that the data can be read by
open source viewers that are actively maintained by the chemistry community in
the long run. Data files need to be stored in an open file format so that, if
the standard is versioned, the data can be reprocessed and migrated to the new
version. Alternatively, if backwards compatibility is not possible, suitable
data readers need to be additionally supported and maintained. The repository
has to ensure that a viewer for standardized data (currently e.g. a JCAMP-DX
viewer) is enabled and supported.
(3) Enabling the download of the data files and other data stored in the
database in a readable and reusable form: The data can be downloaded from the
repository via the UI. The analytical data is available as one zip archive per
analytical dataset. Re-use of the data is ensured by community-maintained open
source software (ChemSpectra and NMRium), which can read the data
independently of the repository infrastructure. Metadata can be downloaded
from the UI of the repository as well (in DataCite XML and JSON-LD format). In
2025, the repository operators will add a repository downloader service for
fast download of customized data collections from the repository.
(4) Access to the metadata in updated metadata schemas: The Chemotion
repository supports the DataCite Metadata Schema. Changes to the schema at
DataCite will require an adaptation of the supported schema in the Chemotion
repository as soon as possible, and at the latest within one year after the
release of a new major version. The migration of old metadata to the new
schema will be supported, and the timeline is to be defined by the repository
admin team in close collaboration with NFDI4Chem to ensure the compatibility
of the changes with services that re-use the data stored in the Chemotion
repository.
(5) Versioning of data and metadata: The deletion of data from the Chemotion
repository is not a standard scenario and should happen only in a few cases
(defined in the directive of the repository). The standard way to improve data
in the repository is to create a new version of the data. This ensures a
transparent, user-driven adaptation of data with a full record of the changes.
The versioning of data is implemented and will be enabled in production in
2025. It includes the versioning of the DOI in cases where the metadata is
changed.
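As an illustration of measures (2) and (3), the following minimal sketch shows
how a downloaded dataset archive could be inspected with nothing but the Python
standard library. It rests on several assumptions that are not part of this
document: the archive layout and file names are hypothetical, the example zip
is built in memory instead of being downloaded, and the helper names
(read_jcamp_header, jcamp_headers_in_zip, build_example_zip) are made up for
this sketch. It reads only the labeled data records ("##LABEL=value") of a
JCAMP-DX file and is not a full JCAMP-DX reader; for actual re-use, ChemSpectra
and NMRium remain the intended tools.

```python
import io
import zipfile


def read_jcamp_header(text):
    """Collect the labeled data records (##LABEL=value) of a JCAMP-DX text."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("##"):
            continue  # data lines and continuation lines are skipped here
        label, sep, value = line[2:].partition("=")
        if not sep:
            continue  # not a labeled data record
        # "$$" starts a comment in JCAMP-DX; drop it from the value
        value = value.split("$$", 1)[0].strip()
        # a repeated label overwrites the earlier value in this simple sketch
        header[label.strip().upper()] = value
    return header


def jcamp_headers_in_zip(zip_bytes):
    """Return {member name: header dict} for every .jdx/.dx file in a zip."""
    headers = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith((".jdx", ".dx")):
                text = archive.read(name).decode("utf-8", errors="replace")
                headers[name] = read_jcamp_header(text)
    return headers


def build_example_zip():
    """Build a tiny in-memory zip that stands in for a downloaded dataset."""
    jdx = "\n".join([
        "##TITLE=example spectrum",
        "##JCAMP-DX=5.01 $$ hypothetical content",
        "##DATA TYPE=NMR SPECTRUM",
        "##END=",
    ])
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("dataset/spectrum.jdx", jdx)
        archive.writestr("dataset/notes.txt", "not a JCAMP-DX file")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, header in jcamp_headers_in_zip(build_example_zip()).items():
        print(name, header.get("TITLE"), header.get("JCAMP-DX"))
```

Because the format is open and text-based, a check of this kind could be part
of the yearly assessment to confirm that stored files remain readable without
vendor software.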
4. Financing
Currently, all stakeholders finance the digital preservation system and the
human resources responsible for it. If financial resources become limited or
are no longer available, all stakeholders will be responsible for finding
alternatives. A basic budget is reserved in the budget of the host institution
KIT, managed by the institute IBCS.
5. Operation of the digital preservation system
The administration team of the Chemotion repository meets on a monthly basis
to discuss important changes that may have an impact on the preservation of
data and metadata in the Chemotion repository. The results are shared, if
applicable, with NFDI4Chem and the persons in charge of the different
services. The team meets once a year for a detailed assessment of data and
metadata to guarantee the aims described in point 2. If needed, action points
are described and discussed within the NFDI4Chem community within three months
at the latest, and a timeline is defined with all stakeholders, in particular
the services within NFDI4Chem. The outcome of the assessment and the resulting
measures are stored by the admin team and can be made available on request.
Measures are also announced and published in the documentation of the
Chemotion repository. In addition, the data is checked for re-usability
continuously based on community feedback. Feedback mechanisms are available
via the UI of the repository (feedback options per dataset at different levels
of detail).
6. Responsibility for preservation
The administration team of the Chemotion repository is responsible for the
preservation of the data.
7. Review of digital preservation process
The measures shall comply with established best practices and standards in the
area of digital preservation. Therefore, the measures shall be evaluated once
a year, after the yearly data assessment; the documentation of these
evaluations is available at all times to all employees working with long-term
archiving in NFDI4Chem.
8. Archival and exit strategy
Archival of data: To date, the size of the data hosted by the Chemotion
repository is small enough to provide full and fast access to the data. Should
the data volume increase substantially, the Chemotion team will decide on a
process for archiving data on tape. This process is also planned in case the
repository has to run with limited resources or in other unforeseeable cases.
KIT provides access to tape storage, e.g. via the system bwDataArchive, which
makes exit scenarios that include archiving in bwDataArchive a feasible
option. Storing the data on tape, whether due to storage limitations or an
exit scenario, will include the preservation of all data and metadata. In
addition, data stored in the Chemotion repository can be deposited as it is in
RADAR (Research Data Repository), which serves as a backup service in case of
an exit scenario. RADAR cannot replace the functionality of the Chemotion
repository, but it can be used to preserve data and metadata. Example datasets
have already been stored in RADAR to confirm the suitability and feasibility
of the planned process.
9. Review of data preservation policy
In order to ensure that this document is always up to date, it shall be
reviewed annually and updated as required.