# An Overview of Open Science in Italy - Data Set

Data sets accompanying the paper "An Overview of Open Science in Italy", an overview of the current state of open science implementation in Italy.

## Organisations

The data set atenei.csv consists of the list of Italian universities recognised by the Ministero dell'Università e della Ricerca, documented at https://dati-ustat.mur.gov.it/dataset/metadati, enriched with the Research Organization Registry identifier and the OpenAIRE identifier.

The data set altri-istituti-ricerca.csv consists of the list of organisations resulting from the aggregation of data from three sources: a crowdsourcing activity that originated within the Open Access Italia mailing list, the catalogue of policies of the open-science.it portal, and the survey by the Open Science Working Group of CoPER on the monitoring of institutional policies for the management of scientific data. It is structured into 4 columns: (i) ROR, (ii) NomeEsteso, (iii) URL, and (iv) OpenAIREID. They are used for the Research Organization Registry identifier, the name, the web page of the organisation, and the identifier assigned by OpenAIRE respectively.

## Policies

The data set policies.csv consists of the list of documents (among policies, guidelines and regulations) that are related to the observed organisations. It is structured into 23 columns: (i) ROR, (ii) Name, (iii) URL, (iv) HasDocument, (v) DocumentType, (vi) DocumentStatus, (vii) DocumentDesignation, (viii) DocumentLink, (ix) DocumentDate, (x) DocumentDateType, (xi) Product.AnyResearchProduct, (xii) Product.AnyResearchProductHow, (xiii) Product.Literature, (xiv) Product.LiteratureHow, (xv) Product.Data, (xvi) Product.DataHow, (xvii) Product.Software, (xviii) Product.SoftwareHow, (xix) Product.Other, (xx) Product.OtherHow, (xxi) DataManagementPlan, (xxii) OpenInfrastructure, and (xxiii) FAIR.

The first three columns (ROR, Name, URL) are taken from the data set organisations.csv and have the same semantics.

The column HasDocument is used for filtering, since it distinguishes the organisations that have a document (policy, guideline and/or regulation) from those that do not. Its values are boolean (‘TRUE’ or ‘FALSE’).

The six columns DocumentType, DocumentStatus, DocumentDesignation, DocumentLink, DocumentDate, and DocumentDateType are used for identifying and for describing an organisation’s document. DocumentType is used for distinguishing among the different types of documents: (a) policy, intended as a high-level declaration of intent, (b) guideline (‘linee guida’), intended as a practical description of a process, and (c) regulation (‘regolamento’), intended as a legally binding document. DocumentStatus is used for distinguishing between current documents (‘current’) and superseded ones (‘superseded’). DocumentDesignation and DocumentLink are used for the title of the document and its URL respectively. DocumentDate and DocumentDateType contain a date and a source or an event linked to it. While an effective date (‘entrata in vigore’) was always preferred, not all documents provide it (sometimes the only reference is the file metadata).

The columns with a ‘Product.’ prefix are used for analysing which research products are considered by each collected document. The prefix is followed by one of the research product categories identified for the study: ‘AnyResearchProduct’ is used for documents having a broader scope, encompassing all research products, while ‘Literature’, ‘Data’, ‘Software’, and ‘Other’ are used for documents which explicitly mention literature, data, software, or other research products that are not covered by the previous categories respectively. The columns having a ‘How’ suffix report the enabling measures, in terms of technological solutions, identified by each document for each type of product considered.

The columns DataManagementPlan, OpenInfrastructure, and FAIR (standing for Findable, Accessible, Interoperable, and Reusable), are used for indicating references to the creation of a data management plan, to the use of open infrastructures, or to a FAIR data management respectively.
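The column semantics above can be illustrated with a minimal sketch that keeps only current documents and summarises them. The sample rows are invented placeholders covering a subset of the 23 columns, and the sketch assumes a comma-separated file, lowercase English values for DocumentType and DocumentStatus, and ‘TRUE’/‘FALSE’ values in the FAIR column; none of these details is confirmed here, so adjust them to the actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows: the ROR values, names, and flags are invented,
# and only a subset of the 23 columns is shown.
SAMPLE = """ROR,Name,HasDocument,DocumentType,DocumentStatus,Product.Data,FAIR
https://ror.org/example1,University A,TRUE,policy,current,TRUE,TRUE
https://ror.org/example1,University A,TRUE,policy,superseded,FALSE,FALSE
https://ror.org/example2,University B,TRUE,regulation,current,TRUE,FALSE
https://ror.org/example3,Institute C,FALSE,,,,
"""


def is_true(value):
    """Interpret the 'TRUE'/'FALSE' strings used for boolean columns."""
    return (value or "").strip().upper() == "TRUE"


def current_documents(rows):
    """Keep rows that describe a document whose status is 'current'."""
    return [
        row
        for row in rows
        if is_true(row["HasDocument"])
        and row["DocumentStatus"].strip().lower() == "current"
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
current = current_documents(rows)

# Number of current documents per document type.
by_type = Counter(row["DocumentType"] for row in current)

# Number of current documents flagged as referring to FAIR data management
# (assumes the FAIR column is boolean, which is an assumption of this sketch).
fair_count = sum(is_true(row["FAIR"]) for row in current)
```

To run the same logic on the real file, replace `io.StringIO(SAMPLE)` with a file object opened with `open("policies.csv", newline="", encoding="utf-8")`, assuming UTF-8 encoding.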

## Production

The data set ita_publications_openaire.csv.zip includes records about Italian publications selected from the OpenAIRE Graph based on affiliation relationships. The data set was created via the Zeppelin Note available at DOI 10.5281/zenodo.10640721.

The data set consists of 5 columns:
(i) ita_publications.id: the OpenAIRE identifier of the publication, as documented at https://graph.openaire.eu/docs/data-model/entities/research-product#id;
(ii) ita_publications.publicationdate: the publication date, as documented at https://graph.openaire.eu/docs/data-model/entities/research-product#publicationdate;
(iii) ita_publications.accessright: the label of the field bestaccessright, documented at https://graph.openaire.eu/docs/data-model/entities/research-product#bestaccessright;
(iv) ita_publications.journalname: the name of the field container, documented at https://graph.openaire.eu/docs/data-model/entities/research-product#publication;
(v) ita_publications.instance: the different manifestations of the same research publication, as documented at https://graph.openaire.eu/docs/data-model/entities/research-product#instance.
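Since this data set is distributed as a zipped CSV, it can be read without extracting it to disk. The following is a minimal sketch using only the Python standard library; the helper names `iter_zipped_csv` and `count_by_access_right` are hypothetical, the archive is assumed to contain a single comma-separated, UTF-8 encoded file, and the access right labels in the demonstration are invented placeholders.

```python
import csv
import io
import zipfile


def iter_zipped_csv(zip_source, member=None, **csv_kwargs):
    """Yield one dict per row from a CSV file stored inside a ZIP archive.

    zip_source: a path or a binary file-like object.
    member: name of the CSV inside the archive; defaults to the first entry.
    """
    with zipfile.ZipFile(zip_source) as archive:
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.DictReader(text, **csv_kwargs)


def count_by_access_right(rows, column="ita_publications.accessright"):
    """Count rows per value of the access right column."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Demonstration on a tiny in-memory archive with invented rows.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "sample.csv",
        "ita_publications.id,ita_publications.accessright\n"
        "id-1,OPEN\n"
        "id-2,CLOSED\n"
        "id-3,OPEN\n",
    )
buffer.seek(0)
counts = count_by_access_right(iter_zipped_csv(buffer))
```

Because rows are yielded one at a time, the same helper should also work on a large archive by passing its path, for example `iter_zipped_csv("ita_publications_openaire.csv.zip")`, provided the assumptions above hold for the real file.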

The data set openaire_observatory_it.csv reports the aggregated data with respect to the Italian open access (OA) and closed research products, from 2015 to 2023, retrieved from the OpenAIRE Open Science Observatory. It consists of 13 columns: (i) year, (ii) P: Gold OA, (iii) P: Both Gold & Green OA, (iv) P: Green OA, (v) P: Unknown OA, (vi) P: Closed, (vii) P: Total, (viii) Dataset: OA, (ix) Dataset: Non-OA, (x) SW: OA, (xi) SW: Non-OA, (xii) Other: OA, and (xiii) Other: Non-OA.

In particular, the columns with a ‘P:’ prefix refer to aggregated indicators with regard to Italian publications, while the columns with a ‘Dataset:’, ‘SW:’, and ‘Other:’ prefix refer to aggregated indicators with regard to Italian data sets, software, and research products that do not fall within the scope of the other categories respectively.

The columns referring to publications also distinguish among the different OA models (P: Gold OA, P: Both Gold & Green OA, P: Green OA). The column P: Unknown OA is used for the OA publications whose OA model is not known. 
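As an illustration of how the publication columns might be combined, the sketch below computes the yearly share of OA publications. It assumes that the four OA columns are mutually exclusive, that their sum plus P: Closed equals P: Total, and that the counts are stored as plain integers; these are assumptions of the example, not statements about the file. The sample figures are invented.

```python
import csv
import io

# Columns counted as open access in this sketch.
OA_COLUMNS = [
    "P: Gold OA",
    "P: Both Gold & Green OA",
    "P: Green OA",
    "P: Unknown OA",
]

# Invented figures, not taken from the OpenAIRE Open Science Observatory.
SAMPLE = """year,P: Gold OA,P: Both Gold & Green OA,P: Green OA,P: Unknown OA,P: Closed,P: Total
2015,10,20,30,5,35,100
2016,20,30,20,10,20,100
"""


def open_access_share(row):
    """Return the fraction of publications that are OA under any model."""
    open_count = sum(int(row[column]) for column in OA_COLUMNS)
    total = int(row["P: Total"])
    return open_count / total if total else 0.0


shares = {
    int(row["year"]): open_access_share(row)
    for row in csv.DictReader(io.StringIO(SAMPLE))
}
```

With the invented figures, `shares` maps each year to a value between 0 and 1. If the real counts include thousands separators or missing values, the `int` conversions would need to be adapted.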

The data set openaire_observatory_eu.csv consists of the aggregated data on the top eleven European countries for OA research production (publications, data sets, software, and other research products), from 2015 to 2023, retrieved from the OpenAIRE Open Science Observatory. It consists of 9 columns: (i) rank, (ii) publications_country, (iii) publications, (iv) datasets_country, (v) datasets, (vi) software_country, (vii) software, (viii) other research products_country, and (ix) other research products.

The column rank is used for the ranking position.

The columns publications_country and publications are used for the name of the countries and the corresponding number of open access publications respectively.

The columns datasets_country and datasets are used for the name of the countries and the corresponding number of open access data sets respectively.

The columns software_country and software are used for the name of the countries and the corresponding number of open access software respectively. 

The columns other research products_country and other research products are used for the name of the countries and the corresponding number of open access research products falling into the other research products category respectively.
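Because each category has its own pair of country and count columns, a single row can name different countries for different categories. A minimal sketch of reshaping this wide layout into one record per rank and category follows; the reading that each category carries an independent ranking is inferred from the column layout, and the country names and counts are invented placeholders.

```python
import csv
import io

# The four research product categories used as column names.
CATEGORIES = ["publications", "datasets", "software", "other research products"]

# Invented sample rows using the 9 column names described above.
SAMPLE = """rank,publications_country,publications,datasets_country,datasets,software_country,software,other research products_country,other research products
1,Country A,900,Country B,90,Country A,9,Country C,50
2,Country B,800,Country A,80,Country C,8,Country A,40
"""


def to_long(rows):
    """Turn each wide row into one record per research product category."""
    long_rows = []
    for row in rows:
        for category in CATEGORIES:
            long_rows.append(
                {
                    "rank": int(row["rank"]),
                    "category": category,
                    "country": row[f"{category}_country"],
                    "count": int(row[category]),
                }
            )
    return long_rows


long_rows = to_long(csv.DictReader(io.StringIO(SAMPLE)))

# Rank of one country in every category where it appears.
country_a = [
    (record["category"], record["rank"])
    for record in long_rows
    if record["country"] == "Country A"
]
```

The long form makes it straightforward to look up the position of a single country across categories, which the wide layout does not support directly.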

The data set coki_years.csv consists of the aggregated indicators on the Italian OA production from 2000 to 2023. It consists of 36 columns, the semantics of which is documented at https://open.coki.ac/data/.

The data set coki_it.csv consists of a selection of indicators on open and closed publications (from 2015 to 2023) taken from coki_years.csv. It consists of 5 columns (year, n_outputs_publisher_open_only, n_outputs_both, n_outputs_other_platform_open_only, and n_outputs_closed), the semantics of which is documented at https://open.coki.ac/data/.
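The relationship between the two COKI files can be sketched as a column and year selection. This is only an illustration of the described subset, not necessarily the procedure used to produce coki_it.csv; it assumes that coki_years.csv has a year column with the same name, and the sample rows and the `extra_indicator` column are invented.

```python
import csv
import io

# The 5 columns listed for coki_it.csv.
SELECTED = [
    "year",
    "n_outputs_publisher_open_only",
    "n_outputs_both",
    "n_outputs_other_platform_open_only",
    "n_outputs_closed",
]

# Invented rows; 'extra_indicator' stands in for the remaining columns.
SAMPLE = (
    "year,n_outputs_publisher_open_only,n_outputs_both,"
    "n_outputs_other_platform_open_only,n_outputs_closed,extra_indicator\n"
    "2014,1,2,3,4,x\n"
    "2015,5,6,7,8,y\n"
    "2023,9,10,11,12,z\n"
)


def select_indicators(rows, first_year=2015, last_year=2023):
    """Keep the selected columns for rows whose year is within the range."""
    return [
        {column: row[column] for column in SELECTED}
        for row in rows
        if first_year <= int(row["year"]) <= last_year
    ]


def to_csv(rows):
    """Serialise the selected rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SELECTED, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


selected = select_indicators(csv.DictReader(io.StringIO(SAMPLE)))
output = to_csv(selected)
```

Here `output` holds the header followed by the rows for 2015 and 2023, while the 2014 row and the extra column are dropped.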

## Services

The data sets services_re3data.csv, services_opendoar.csv, services_eoscmarketplace.csv, services_fairsharing.csv, services_openaire.csv, and services_coki.csv consist of the records identifying Italian services for open science. The first four were downloaded from re3data, OpenDOAR, the EOSC Marketplace, and FAIRsharing respectively.

The data set services_re3data.csv consists of 3 columns: (i) ID, (ii) Name, and (iii) DOI. ID and DOI are used for the re3data identifier and the linked Digital Object Identifier for the service respectively, while Name conveys the name of the service.

The data set services_opendoar.csv consists of 5 columns: (i) ID, (ii) Name, (iii) URL, (iv) Technology, and (v) Type. The first three are for the OpenDOAR identifier, the name of the service and the URL to the service landing page respectively. The column Technology lists the software underneath the services, while Type is used for distinguishing between institutional and disciplinary repositories.

The data set services_eoscmarketplace.csv consists of 3 columns: (i) Title, (ii) Type, and (iii) Description. The columns Title and Description are used for the name and the description of the services respectively. The column Type distinguishes services from data sources.

The data set services_fairsharing.csv consists of 4 columns: (i) FAIRsharing URL, (ii) FAIRsharing DOI, (iii) Record Name, and (iv) Record homepage URL. The first two are for the URL and DOI assigned by FAIRsharing. Record Name and Record homepage URL are used for the name and the original landing page of the service respectively.