Poster Open Access

Detecting Informal Data Use in Literature

Sara Lafia; Elizabeth Moss; Andrea Thomer; Libby Hemphill

The Inter-university Consortium for Political and Social Research (ICPSR) is developing a computational approach to detect informal data use and construct reliable data impact metrics. Formal data citations that use unique identifiers are readily discoverable; informal references to data, however, are difficult to detect because they are phrased in many different ways and tend to occur in article footnotes, tables, figures, or other places that are not indexed for search. Identifying data citations is an essential step toward characterizing the impact of research data (i.e., who reuses research data and for what purposes). We use text features, including the presence of indicator terms, the article section, and the frequency of acronyms, to predict which portions of an article are likely to indicate data use. We then use a natural language processing (NLP) pipeline to extract candidate data references from those portions. In production, our model will support the review of publications for ingestion into the ICPSR Bibliography of Data-related Literature as part of a broader effort to measure the impact of research data.
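The feature step described above can be illustrated with a minimal sketch. It is not the ICPSR model: the indicator terms, the section names, and the function names `extract_features` and `candidate_references` are all hypothetical, chosen only to show how the three kinds of features (indicator terms, article section, acronym frequency) might be computed for one passage, and how acronyms could serve as rough candidate data references.

```python
import re
from dataclasses import dataclass

# Hypothetical indicator terms; the abstract does not list the real vocabulary.
INDICATOR_TERMS = {"data", "dataset", "survey", "sample", "respondents"}

# Hypothetical section labels where informal data references tend to appear.
DATA_SECTIONS = {"methods", "data", "footnote", "table", "figure"}

# Assumption: an acronym is a run of two or more capital letters.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")
WORD_RE = re.compile(r"[A-Za-z]+")


@dataclass
class PassageFeatures:
    indicator_count: int   # number of indicator terms in the passage
    in_data_section: bool  # whether the passage sits in a data-prone section
    acronym_rate: float    # acronyms per word token


def extract_features(text: str, section: str) -> PassageFeatures:
    """Compute the three illustrative features for one passage."""
    tokens = WORD_RE.findall(text)
    indicator_count = sum(1 for t in tokens if t.lower() in INDICATOR_TERMS)
    acronyms = ACRONYM_RE.findall(text)
    acronym_rate = len(acronyms) / len(tokens) if tokens else 0.0
    return PassageFeatures(
        indicator_count=indicator_count,
        in_data_section=section.lower() in DATA_SECTIONS,
        acronym_rate=acronym_rate,
    )


def candidate_references(text: str) -> list[str]:
    """Return the distinct acronyms in a passage as rough candidates."""
    return sorted(set(ACRONYM_RE.findall(text)))


if __name__ == "__main__":
    passage = "We analyze data from the ANES and the GSS survey."
    print(extract_features(passage, "Methods"))
    print(candidate_references(passage))
```

In a fuller pipeline these features would feed a trained classifier, and candidate extraction would rely on an NLP library rather than a regular expression; both are simplified here to keep the sketch self-contained.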
