Presentation Open Access

Keywords for data discovery

Brigitte Mathiak

Finding research data is often described as difficult or challenging (Brickley, Burgess, & Noy, 2019; Chapman et al., 2020), especially in comparison to literature search (Kern & Mathiak, 2015). From observation (Krämer, Papenmeier, Carevic, Kern, & Mathiak, 2021) and surveys (Gregory, Groth, Scharnhorst, & Wyatt, 2020; Friedrich, 2020) we know that data discovery is a complex process that involves reviewing the literature, using data portals, reading documentation, and leveraging personal networks. The glue that holds all these steps together, however, is ordinary web search, e.g. via Google. Because there are no central, fully indexed repositories, individual data repositories bear the responsibility for making their data visible to web search. In this paper we explore how research data is found via general web search. We retrieve the queries made to Google via the Google Search Console and analyze them using clustering techniques. The clustering is based on two keyword features: the keywords' probabilities in the queries and their Comparable Click Through Rate (CCTR). The latter is a normalized version of the click-through rate (CTR) that makes keywords comparable with one another. We use the query logs of three data portals from the Social Sciences domain, run by two different institutions, in addition to a JSON file with mentions of datasets in research papers taken from the Social Science Open Access Repository (SSOAR). The use case we are most interested in is known-item search, in which a dataset is retrieved by a name that has been communicated through the literature or in person. These names are often ambiguous, such as acronyms or common nouns, so researchers add further keywords to find the dataset’s website. The results of our analysis provide a set of keywords which, when systematically added in the proper locations of research data landing pages, can help to make those pages more discoverable.
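As a rough illustration of the two keyword features, the sketch below computes, for each keyword, its probability of occurring in a query and a click-through rate aggregated over the queries that contain it. The abstract does not define CCTR, so the `relative_ctr` value here (keyword CTR divided by the overall CTR) is only an assumed stand-in for a normalized CTR, not the measure used in the paper. The function name, the input format of `(query, clicks, impressions)` tuples, and the sample queries are likewise hypothetical.

```python
from collections import defaultdict


def keyword_features(rows):
    """Compute per-keyword features from (query, clicks, impressions) tuples.

    Hypothetical helper: the input layout is an assumption, loosely modelled
    on a query export, and is not taken from the paper.
    """
    rows = list(rows)
    total_queries = len(rows)
    total_clicks = sum(c for _, c, _ in rows)
    total_impressions = sum(i for _, _, i in rows)
    overall_ctr = total_clicks / total_impressions if total_impressions else 0.0

    n_queries = defaultdict(int)    # number of queries containing the keyword
    clicks = defaultdict(int)       # clicks summed over those queries
    impressions = defaultdict(int)  # impressions summed over those queries
    for query, c, i in rows:
        # Count each keyword once per query, ignoring case.
        for kw in set(query.lower().split()):
            n_queries[kw] += 1
            clicks[kw] += c
            impressions[kw] += i

    features = {}
    for kw in n_queries:
        ctr = clicks[kw] / impressions[kw] if impressions[kw] else 0.0
        features[kw] = {
            "probability": n_queries[kw] / total_queries,
            "ctr": ctr,
            # ASSUMPTION: a simple normalization, not the paper's CCTR.
            "relative_ctr": ctr / overall_ctr if overall_ctr else 0.0,
        }
    return features


# Made-up sample data; "abc" stands in for an ambiguous dataset acronym.
sample_rows = [
    ("abc download", 30, 100),
    ("abc data", 10, 100),
    ("survey data", 0, 200),
]
features = keyword_features(sample_rows)
```

The resulting feature vectors could then be passed to any standard clustering routine; which algorithm and which normalization the authors actually used is not stated in the abstract.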
