This readme file was generated on 2024-05-06 by Gail Steinhart.

GENERAL INFORMATION

Dataset title: Data for: The State of Open Infrastructure Grant Funding, 2024 State of Open Infrastructure Report

The data were summarized and reported in the “2024 State of Open Infrastructure Report” section “The state of open infrastructure grant funding,” available at https://doi.org/10.5281/zenodo.10934089.

Contributors:
David Riordan, https://orcid.org/0000-0002-6257-1859 - data curation, software
Chun-Kai Huang, https://orcid.org/0000-0002-9656-5932 - data curation, software
Cameron Neylon, https://orcid.org/0000-0002-0068-716X - data curation, software
Gail Steinhart, https://orcid.org/0000-0002-2441-1651 - conceptualization, data curation, project administration

Date of data collection: 2024-02-01 to 2024-03-15

Funding acknowledgement: This work was generously supported by the Andrew W. Mellon Foundation and the Arcadia Fund.

SHARING/ACCESS INFORMATION

Provided by Invest in Open Infrastructure, http://investinopen.org/

Shared under a CC0 license, https://creativecommons.org/public-domain/cc0/

Recommended citation for this dataset: Riordan, D., Huang, C.-K., Neylon, C., & Steinhart, G. (2024). Data for: The State of Open Infrastructure Grant Funding, 2024 State of Open Infrastructure Report. Zenodo. https://doi.org/10.5281/zenodo.10934085.

DATA & FILE OVERVIEW

This dataset comprises 3 files in CSV format:
sooi_2024_funding_award_data.csv - processed award data collected as described in the Methodological Information section below.
sooi_2024_funding_data_dictionary.csv - a data dictionary with definitions of all variables in the award dataset.
sooi_2024_award_categories.csv - definitions of award categories.

This dataset incorporates additional data from “Data for: Characteristics of Selected Open Infrastructures, 2024 State of Open Infrastructure Report,” available at https://zenodo.org/doi/10.5281/zenodo.10835677.

The code repositories listed here were used to create this dataset (and are referenced in context in this README):
https://github.com/investinopen/state_of_open_funder_data_scrapers - web scrapers for funder data
https://github.com/investinopen/state_of_open_nsf_funding - XML processing script for NSF data
https://github.com/The-Academic-Observatory/openaire-ingest - COKI Academic Observatory OpenAIRE data ingest script
https://github.com/investinopen/state_of_open_queries - search queries

METHODOLOGICAL INFORMATION

The published analysis of these data presents a shorter, simplified description of our data collection and processing methods; we provide a more complete account here.

Data collection

We focused on funder-reported and centrally reported data as the sources of record. We compiled a list of funders of interest from IOI’s earlier exploration of funding for open infrastructures (Dunks, 2022) and from the funding sources reported by the 58 infrastructures listed in IOI’s initial launch of Infra Finder (https://infrafinder.investinopen.org/). We chose to focus on open infrastructures (OIs) in Infra Finder so that we could tie our analysis back to additional attributes of those OIs included in the tool, and potentially leverage the data available there.

We harvested available award data directly where possible by developing a web scraper for each funder, using the scrapy-playwright extension and the Chromium browser for interactive websites, and Jupyter with Papermill for running file-based pipelines. Data were output as JSON files.
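As an illustration of this approach only (and not a reproduction of any of the scrapers in the repository linked below), a minimal scrapy-playwright spider might look like the following sketch; the URL, CSS selectors, and output field names are hypothetical.

    # Minimal sketch of a funder awards spider using scrapy-playwright.
    # The listing URL, selectors, and field names are hypothetical.
    import scrapy


    class ExampleFunderSpider(scrapy.Spider):
        name = "example_funder"
        custom_settings = {
            # Route requests through Playwright so JavaScript-rendered award lists load.
            "DOWNLOAD_HANDLERS": {
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "PLAYWRIGHT_BROWSER_TYPE": "chromium",
            # Write scraped awards to a JSON file, mirroring the JSON output described above.
            "FEEDS": {"example_funder_awards.json": {"format": "json"}},
        }

        def start_requests(self):
            # Hypothetical awards listing page; real funder sites differ.
            yield scrapy.Request("https://example.org/grants", meta={"playwright": True})

        def parse(self, response):
            # Hypothetical selectors for a grant listing.
            for row in response.css("div.grant-card"):
                yield {
                    "title": row.css("h3::text").get(),
                    "recipient": row.css(".recipient::text").get(),
                    "amount": row.css(".amount::text").get(),
                    "start_date": row.css(".start-date::text").get(),
                }

A spider like this could be run with “scrapy runspider funder_spider.py” once scrapy, scrapy-playwright, and the Playwright Chromium browser are installed.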
We used this process to harvest grant award data for the following funders:
- Alfred P. Sloan Foundation
- Andrew W. Mellon Foundation
- Arcadia Fund
- Bill & Melinda Gates Foundation
- Chan Zuckerberg Initiative
- Gordon and Betty Moore Foundation
- Institute of Museum and Library Services
- Leona M. and Harry B. Helmsley Charitable Trust
- National Endowment for the Humanities
- Robert Wood Johnson Foundation
- Social Sciences and Humanities Research Council
- The Wellcome Trust

The scrapers are available at https://github.com/investinopen/state_of_open_funder_data_scrapers.

Data from OpenAIRE were collated from the OpenAIRE Research Graph data dump of 16 January 2024 (Manghi et al., 2024) into the COKI Academic Observatory system (Hosking et al., 2023) on February 13, 2024, using a manual ingest script. This provides an image of each element of the OpenAIRE Research Graph schema as a database table within that system. Information on grants, funding, and organizations was collected from the “project”, “organization”, and “relation” tables. The ingest script is available at https://github.com/The-Academic-Observatory/openaire-ingest.

Data on National Science Foundation (USA, NSF) grants were collected from the NSF website (https://www.nsf.gov/awardsearch/download.jsp) in the form of XML files for the years 2010-2024. The XML files were processed with a custom script that converts them to newline-delimited JSON (JSON-NL) and infers a BigQuery table schema. The XML processing script is available at https://github.com/investinopen/state_of_open_nsf_funding.

Data for all funders were manually uploaded to BigQuery after inferring an appropriate schema (as for the NSF data above), producing a separate table for each funder; data for all funders except OpenAIRE and NSF shared a consistent format. All data tables were then passed through a formatting query to bring them into a consistent format and schema.

Changes in data availability since 2022-23

Two of the 22 funders with data available from IOI’s earlier 2022-23 analysis, Arnold Ventures and the Simons Foundation, no longer offer straightforward access to award data. Arnold Ventures has been an important funder of the Center for Open Science (home of the Open Science Framework, OSF), and many of its earlier grants were to that organization. We reviewed more recent 990 forms for Arnold Ventures and did not encounter any additional awards to OSF in 2021 or 2022, although it is possible there were awards to other OIs of interest (we did not search for them).

Awards from the National Institutes of Health (NIH) were also reported in IOI’s 2022-23 dataset; in this iteration we were unable to apply our harvesting methods to the NIH’s award database because of its sheer size, but we were able to retrieve some award data from OpenAIRE. This almost certainly results in missing award information for relevant grants from NIH.

Finally, we reviewed the funder-reported data in IOI’s earlier dataset (Dunks, 2022) for awards that we did not capture with our current methods. We added missing information if we could find it (most often the title, description, and funder’s award ID). If we could not verify that an award was to an OI on our list, we did not include it in our updated dataset.

Selecting relevant awards

A predefined list of search terms for each of the columns “DESCRIPTION”, “TITLE”, and “RECIPIENT” is used to filter each data table (including OpenAIRE) down to grants of plausible interest. The search terms include variations of OI names, acronyms, and controls for spacing. When at least one of these columns matches one of the search terms defined for it, the row (i.e., grant or project entry) is identified as a plausible grant of interest.
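As a rough sketch of how such a filter can be expressed against a BigQuery table, a query of the following kind can be issued from Python. The project, dataset, and table names and the example search terms are hypothetical; the queries actually used are in the repository linked in the next paragraph.

    # Sketch of the keyword filtering step, assuming one BigQuery table per funder.
    # Table names and search terms are hypothetical.
    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical search terms: variations of an OI name, its acronym, and spacing.
    TERMS = {
        "TITLE": [r"(?i)open\s?journal\s?systems", r"\bOJS\b"],
        "DESCRIPTION": [r"(?i)open\s?journal\s?systems", r"\bOJS\b"],
        "RECIPIENT": [r"(?i)public\s?knowledge\s?project"],
    }

    def column_filter(column, patterns):
        # A column matches if any of its regular-expression search terms matches.
        return " OR ".join(f"REGEXP_CONTAINS({column}, r'{p}')" for p in patterns)

    # A row is kept when at least one column matches one of its search terms.
    where_clause = " OR ".join(f"({column_filter(col, pats)})" for col, pats in TERMS.items())

    sql = f"""
    SELECT *
    FROM `my-project.funding.example_funder`  -- hypothetical project, dataset, and table
    WHERE {where_clause}
    """

    for row in client.query(sql).result():
        print(row["TITLE"])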
The resulting grants were combined into a results table (with OpenAIRE results kept in a separate table) for manual checking. The search queries are available at https://github.com/investinopen/state_of_open_queries.

Once we had reduced the original dataset to a smaller collection of awards of plausible interest, we manually reviewed award titles and descriptions to determine whether they were relevant, and excluded those with no clear relationship to any of the OIs of interest.

Deduplicating awards

Awards that were exact duplicates were flagged for exclusion. OpenAIRE awards were also flagged as duplicates and excluded if they had the same award amount and ID and all other fields except either the OI or the recipient matched. We assume these are multi-institution awards and that the amount represents the total award to all collaborating institutions. The likely effect of this assumption is that while the total amount of a multi-institution award is available for analysis, attribution of the award to participating institutions or to related OIs is incomplete. Grants that were near-perfect matches but had different award amounts were included if they were related to an OI of interest; we assume that in this case sub-awards are correctly assigned to participating institutions.

Data manipulation and enhancement

Dates

Award dates in the data harvested directly from funders are not always straightforward to interpret. Funders provide different types of start dates for their grant awards, and in many cases the exact type of event represented by a date field is unspecified. For each award, our start_date field is a best-guess selection of the date that best represents the beginning of the award’s funding period; in reality, this date may represent the start date of the project, the date of notice or award, or the date that funds were disbursed. Where the funder provides a clear start date and end date for the work covered by the grant, we populate both fields from those data. If a period of coverage is provided instead (e.g., 1 year or 9 months), we calculate the end date from the start date and that period. For all awards, regardless of the source of the data, we created a normalized award year field, which is the year of the start date, or the funder-reported award year (if no start date is provided but an award year is). Dates were reformatted from the source format to YYYY-MM-DD (full dates) or YYYY (year only), depending on the field.

Currency conversions

For awards made in currencies other than US dollars (USD), we used the European Central Bank’s currency converter (https://data.ecb.europa.eu/currency-converter) to convert amounts to USD, using the start date of the grant, or 1 January of the award year if a specific date is not available. If no date information is available at all, no conversion is made and the award does not factor into any analysis of award amounts, although it is included in award counts. Award amounts in the original IOI dataset were converted to USD using the 2010-2020 average exchange rates reported by Exchange Rates UK (https://www.exchangerates.org.uk/).
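The date handling described above can be summarized in a short sketch; the field and function names are hypothetical, and this is not the code used to produce the dataset.

    # Sketch of date normalization and conversion-date selection (hypothetical field names).
    from datetime import date
    from typing import Optional

    from dateutil.relativedelta import relativedelta


    def normalized_award_year(start_date: Optional[date], award_year: Optional[int]) -> Optional[int]:
        # The normalized award year is the year of the start date, or the funder-reported
        # award year if no start date is available.
        if start_date is not None:
            return start_date.year
        return award_year


    def end_date_from_period(start_date: date, period_months: int) -> date:
        # Where only a period of coverage is given (e.g., 9 months), derive the end date from it.
        return start_date + relativedelta(months=period_months)


    def conversion_date(start_date: Optional[date], award_year: Optional[int]) -> Optional[date]:
        # Date used for currency conversion: the grant start date, else 1 January of the award
        # year; None means no conversion (the award is counted but its amount is not analyzed).
        if start_date is not None:
            return start_date
        if award_year is not None:
            return date(award_year, 1, 1)
        return None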
Award categories

We assigned each award to a category based on its title and description. We also use the categories to differentiate between awards that constitute direct support to an OI and those that do not, but that demonstrate the impact these infrastructures have on research and scholarship. These assignments are somewhat subjective. For example, it may not be completely clear from an award description whether a named infrastructure is enabling a new project, in which case the award might be categorized as “Adjacent,” or whether a completely new instance of a repository infrastructure is being created, in which case the appropriate category would be “Adoption.” Similarly, we attempt to distinguish between new feature development and routine code maintenance and updating (“Research and development” for the former, “Operations” for the latter), but this distinction is not always completely clear.

Other assumptions

Finally, we make a few more assumptions in the following special cases:
- There are a few OpenAIRE awards with no geographic location. Where these are made by a national funding body, we assume the awards are in-country and assign the country of the funder to the award (a sketch of this rule follows this list).
- Some organizations are the recipients of awards that do not specifically name the OIs they maintain, but sustaining those OIs is a significant part of what the organization does. We include these awards and consider them plausibly related to the OIs of those organizations. The chief examples here are the Public Knowledge Project (home of Open Journal Systems, Open Monograph Press, and Open Preprint Systems) and the Center for Open Science (home of the Open Science Framework).
- In the 2022 IOI dataset, we attributed all awards made to the Public Knowledge Project (PKP) to Open Journal Systems. This may not be entirely correct, but all three PKP OIs are in the solution category “publishing systems,” so while we may misattribute some awards to the wrong OI, they are all assigned to the correct infrastructure category and organization.
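A minimal sketch of the country assignment rule mentioned in the list above, with hypothetical record fields and an illustrative funder-to-country mapping (not the actual processing code):

    # Illustrative mapping of national funding bodies to countries (examples only).
    FUNDER_COUNTRIES = {
        "Australian Research Council": "Australia",
        "Swiss National Science Foundation": "Switzerland",
    }


    def assign_country(award: dict) -> dict:
        # If an award has no country and its funder is a national funding body,
        # assume the award is in-country and assign the funder's country.
        if not award.get("country") and award.get("funder") in FUNDER_COUNTRIES:
            award["country"] = FUNDER_COUNTRIES[award["funder"]]
        return award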