Opening up translational data impact through the Data Citation Corpus

Richardson, Reese; Puebla, Iratxe; Portenoy, Jason; Gutzman, Karen; Holmes, Kristi

doi:10.5281/zenodo.15241443

Published April 10, 2025 | Version v1

Poster Open

1. Northwestern University
2. DataCite
3. OurResearch
4. Northwestern University - Chicago

Metadata for 5 million data citations in the Make Data Count Data Citation Corpus serve as the source for this exploratory analysis of how datasets from biomedical fields are used across a number of facets, including date, affiliations, funders, and domain. To gather insights about translational data impact, we focused on array-based gene expression profiling datasets from the Gene Expression Omnibus (GEO) in the Corpus that were produced between 2005 and 2009 in Homo sapiens (n=3,427). This time period was selected to allow 15-20 years for follow-up studies and publications to accrue. GEO is a public functional genomics data repository maintained by the National Library of Medicine’s National Center for Biotechnology Information. It archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. GEO identifiers were selected for this exploratory work because they are unambiguous (see Limitations) and can be easily filtered to human data, which helps highlight clinically relevant studies. Scatter plots of sample count versus number of citations for these datasets (Figure 1) allowed identification of two candidate datasets to explore in more depth based on citations in papers that report on clinical trials (Figure 2): “Breast cancer relapse free survival” (GSE2034, 2005) and “Strong Time Dependence of the 76-Gene Prognostic Signature” (GSE7390, 2007). The primary articles about these two datasets were also cited in subsequent articles, including articles reporting on clinical trials. Dataset citations were collected through the Data Citation Corpus v3.0, while inter-article citations were collected through iCite v32.
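The selection step described above (count citations per GEO series, then keep human datasets from 2005 to 2009 together with their sample counts) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the record keys `dataset` and `publication`, the metadata keys `organism`, `year`, and `samples`, and both function names are hypothetical, and the toy records are invented placeholders rather than Corpus values. The only fact relied on is that GEO series accessions take the form `GSE` followed by digits.

```python
import re
from collections import Counter

# GEO series accessions look like "GSE" followed by digits.
GSE_PATTERN = re.compile(r"^GSE\d+$")


def count_geo_citations(records):
    """Count distinct citing publications per GEO series accession.

    `records` is an iterable of dicts with hypothetical keys
    "dataset" (accession string) and "publication" (citing article id).
    """
    seen = set()
    counts = Counter()
    for rec in records:
        accession = rec["dataset"].strip().upper()
        if not GSE_PATTERN.match(accession):
            continue  # skip identifiers that are not GEO series
        pair = (accession, rec["publication"])
        if pair in seen:
            continue  # count each dataset-publication pair once
        seen.add(pair)
        counts[accession] += 1
    return counts


def select_datasets(counts, metadata, first_year=2005, last_year=2009,
                    organism="Homo sapiens"):
    """Return (accession, sample_count, citation_count) rows for datasets
    matching the organism and year window, most cited first.

    `metadata` maps accession -> dict with hypothetical keys
    "organism", "year", and "samples".
    """
    rows = []
    for accession, n_citations in counts.items():
        meta = metadata.get(accession)
        if meta is None or meta["organism"] != organism:
            continue
        if not first_year <= meta["year"] <= last_year:
            continue
        rows.append((accession, meta["samples"], n_citations))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy placeholder records (not real Corpus values).
records = [
    {"dataset": "GSE101", "publication": "10.1000/a"},
    {"dataset": "GSE101", "publication": "10.1000/b"},
    {"dataset": "GSE101", "publication": "10.1000/a"},  # duplicate pair
    {"dataset": "gse202", "publication": "10.1000/c"},
    {"dataset": "PRJNA1", "publication": "10.1000/d"},  # not a GEO series
]
metadata = {
    "GSE101": {"organism": "Homo sapiens", "year": 2005, "samples": 100},
    "GSE202": {"organism": "Mus musculus", "year": 2007, "samples": 12},
}

counts = count_geo_citations(records)
selected = select_datasets(counts, metadata)
```

Each row in `selected` pairs a sample count with a citation count, which is the shape a scatter plot like the one in Figure 1 would need.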

Files

ACTS25_Richardson-Holmes.pdf (3.7 MB)
md5:026ccd34367c7c65290198ed02eeac8b

Additional details

Funding

National Center for Advancing Translational Sciences
NUCATS CTSA UM1 at Northwestern University (UM1TR005121)
