Opening up translational data impact through the Data Citation Corpus
Description
This exploratory analysis uses the metadata for 5 million data citations in the Make Data Count Data Citation Corpus to examine how datasets from biomedical fields are used across several facets, including date, affiliations, funders and domain. To gather insights about translational data impact, we focused on array-based gene expression profiling datasets from the Gene Expression Omnibus (GEO) in the Corpus that were produced in Homo sapiens between 2005 and 2009 (n=3,427). This period was selected to give follow-up studies and publications 15 to 20 years to accrue. GEO is a public functional genomics data repository maintained by the National Library of Medicine’s National Center for Biotechnology Information. It archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. GEO identifiers were selected for this exploratory work because they are unambiguous (see Limitations) and can be readily filtered to human data, which helps highlight clinically relevant studies. We created scatter plots of sample count versus number of citations for these datasets (Figure 1). From these plots we identified two candidate datasets, “Breast cancer relapse free survival” (GSE2034, 2005) and “Strong Time Dependence of the 76-Gene Prognostic Signature” (GSE7390, 2007), to explore in more depth based on their citations in papers that report on clinical trials (Figure 2). The primary articles about these two datasets were also cited in subsequent articles, including articles reporting on clinical trials. Dataset citations were collected through the Data Citation Corpus v3.0, while inter-article citations were collected through iCite v32.
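The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual pipeline: the `DatasetRecord` fields, the function names, and the example rows are all hypothetical, and the sketch assumes that citation counts from the Corpus have already been joined with GEO metadata (organism, dataset type, year, sample count). The accessions and numbers below are invented placeholders, not real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical joined record: Corpus citation count plus GEO metadata."""
    accession: str
    organism: str
    dataset_type: str
    year: int
    sample_count: int
    citation_count: int


# Assumed label for array-based expression profiling in the joined metadata.
ARRAY_TYPE = "Expression profiling by array"


def select_candidates(records, start_year=2005, end_year=2009):
    """Keep human, array-based GEO series within the chosen year window."""
    return [
        r for r in records
        if r.accession.startswith("GSE")
        and r.organism == "Homo sapiens"
        and r.dataset_type == ARRAY_TYPE
        and start_year <= r.year <= end_year
    ]


def scatter_points(records):
    """Return (sample_count, citation_count) pairs, ready for a scatter plot."""
    return [(r.sample_count, r.citation_count) for r in records]


def top_cited(records, n=2):
    """Return the n most-cited records as candidates for a closer look."""
    return sorted(records, key=lambda r: r.citation_count, reverse=True)[:n]


# Invented placeholder rows; accessions and counts are not real data.
EXAMPLE_RECORDS = [
    DatasetRecord("GSE0001", "Homo sapiens", ARRAY_TYPE, 2005, 120, 40),
    DatasetRecord("GSE0002", "Homo sapiens", ARRAY_TYPE, 2007, 80, 25),
    DatasetRecord("GSE0003", "Mus musculus", ARRAY_TYPE, 2006, 12, 90),
    DatasetRecord("GSE0004", "Homo sapiens",
                  "Expression profiling by high throughput sequencing",
                  2009, 8, 70),
    DatasetRecord("GSE0005", "Homo sapiens", ARRAY_TYPE, 2012, 50, 60),
    DatasetRecord("GSE0006", "Homo sapiens", ARRAY_TYPE, 2009, 20, 3),
]

if __name__ == "__main__":
    selected = select_candidates(EXAMPLE_RECORDS)
    print(scatter_points(selected))
    print([r.accession for r in top_cited(selected)])
```

Ranking by citation count alone is a simplification. The analysis described above chose its two datasets from the scatter plots and from citations in papers reporting on clinical trials, which would require an additional field not modelled here.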
Files
ACTS25_Richardson-Holmes.pdf (3.7 MB)
md5:026ccd34367c7c65290198ed02eeac8b