2019 search and interaction log from the data catalogue: Research Data Australia
Description
To better support users' data discovery, we analysed a data search log to understand how data seekers interact with a data search system when they search for data. The log comes from the research data discovery portal Research Data Australia (RDA), the data discovery service of the Australian Research Data Commons (ARDC). ARDC is supported by the Australian Government through the National Collaborative Research Infrastructure Strategy Program.
Please read the research paper "Large-scale Analysis of Query Logs to Profile Users for Dataset Search" for a detailed description and analysis of the datasets, and see the software "Python code for processing and clustering a data search log" for the data processing and analysis.
The search log covers all user-facing activity logged from January to December 2019. During this period, the catalogue contained about 150,000 metadata records of datasets.
The dataset (2019_search_log_sessioned.txt) was generated from the raw log data with the following steps:
- Remove entries that were likely generated by machines rather than human users. The recorded machine activity may come from downstream aggregators that harvested metadata from RDA by sending queries directly to the catalogue URL instead of using the API endpoint.
- Identify search sessions from a user. A search session includes all activities a user conducts with a search system in order to satisfy an (information/data) search need. We followed these steps to identify search sessions. First, we identified a user by IP address, treating each unique IP address as a single user. We recognise the limitation of this approach, as several users may share the same IP address; however, the IP address is the only information available for identifying a user.
Past research in log analysis usually applies one or both of the following methods to identify a session: a 30-minute window of activity from the same IP address, and/or a gap of more than 30 minutes of inactivity between an activity event and the event immediately preceding it. We examined both methods carefully on our log data and found that both produced large, unwanted sessions from machine activity. We therefore took a blunt approach, keeping from an IP address only a session with a maximum duration of 30 minutes.
- We also removed sessions in which 40% of activities resulted in 'page not found', or in which all activities were about accessing grants. Within a session, we removed "duplicated" activities that were identical to the immediately preceding activity and occurred less than one second after it (this could have resulted from reloading a page).
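The session rules described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' published code: the event layout `(ip, timestamp, action)`, the action label `"page_not_found"`, and all function names are hypothetical. It also assumes one particular reading of the 30-minute rule (a session is cut once an event falls more than 30 minutes after the session's first event) and treats the 40% figure as "at least 40%".

```python
from collections import defaultdict
from datetime import datetime, timedelta

MAX_SESSION = timedelta(minutes=30)  # maximum session duration (from the description)
DUP_WINDOW = timedelta(seconds=1)    # faster repeats are treated as page reloads
NOT_FOUND_SHARE = 0.4                # "at least 40%" is an assumption


def split_sessions(events):
    """Group (ip, timestamp, action) events into per-IP sessions of at most 30 minutes."""
    by_ip = defaultdict(list)
    for event in events:
        by_ip[event[0]].append(event)
    sessions = []
    for ip_events in by_ip.values():
        ip_events.sort(key=lambda e: e[1])
        current = []
        for event in ip_events:
            # Assumed rule: cut the session once the cap since its first event is exceeded.
            if current and event[1] - current[0][1] > MAX_SESSION:
                sessions.append(current)
                current = []
            current.append(event)
        if current:
            sessions.append(current)
    return sessions


def mostly_not_found(session, share=NOT_FOUND_SHARE):
    """True if the share of 'page_not_found' events reaches the threshold."""
    not_found = sum(1 for e in session if e[2] == "page_not_found")
    return not_found / len(session) >= share


def drop_rapid_duplicates(session):
    """Drop events with the same action as the preceding event, under one second later."""
    kept = []
    prev = None
    for event in session:
        is_dup = (
            prev is not None
            and event[2] == prev[2]
            and event[1] - prev[1] < DUP_WINDOW
        )
        if not is_dup:
            kept.append(event)
        prev = event
    return kept


# Small made-up example (not taken from the real log).
t0 = datetime(2019, 1, 1, 9, 0, 0)
events = [
    ("10.0.0.1", t0, "search:water"),
    ("10.0.0.1", t0 + timedelta(milliseconds=300), "search:water"),  # reload
    ("10.0.0.1", t0 + timedelta(minutes=10), "view:123"),
    ("10.0.0.1", t0 + timedelta(minutes=45), "search:soil"),  # past the 30-minute cap
    ("10.0.0.2", t0, "page_not_found"),
    ("10.0.0.2", t0 + timedelta(minutes=1), "search:reef"),
]
sessions = [s for s in split_sessions(events) if not mostly_not_found(s)]
sessions = [drop_rapid_duplicates(s) for s in sessions]
```

In this made-up example, the first IP address yields two sessions, because the last event falls outside the 30-minute cap, and the second IP address is dropped because half of its activities are 'page not found'. The machine-entry and grants-only filters are left out because the description does not give enough detail to sketch them.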
The dataset (id_to_title_subject.csv) lists the title and subject headings for each record ID.
Files
Total size: 212.1 MB
MD5 checksum | Size |
---|---|
md5:f46a5ecc2212eb58efa205fe02752849 | 170.1 MB |
md5:2c4e558c48613c2b5a4d74b825695ef0 | 42.0 MB |
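The MD5 checksums in the file listing can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of` and `is_listed` are hypothetical, and because the listing as shown here does not pair each checksum with a file name, the check only tests whether a file's digest matches either listed value.

```python
import hashlib

# Checksums copied from the file listing, with the "md5:" prefix removed.
LISTED_MD5 = {
    "f46a5ecc2212eb58efa205fe02752849",  # 170.1 MB file
    "2c4e558c48613c2b5a4d74b825695ef0",  # 42.0 MB file
}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_listed(path):
    """True if the file's MD5 digest matches one of the listed checksums."""
    return md5_of(path) in LISTED_MD5
```

For example, `is_listed("2019_search_log_sessioned.txt")` should return `True` for an uncorrupted copy saved under that name in the working directory.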
Additional details
Related works
- Is cited by
- Journal article: 10.1108/JD-12-2021-0245 (DOI)
- Is source of
- Software: 10.5281/zenodo.6321621 (DOI)