Published October 9, 2019 | Version v1
Dataset Open

Web robot detection - Server logs

  • 1. Aristotle University of Thessaloniki

Description

This dataset contains server logs from the search engine of the Library and Information Center of the Aristotle University of Thessaloniki, Greece (http://search.lib.auth.gr/). The search engine lets users check the availability of books and other written works and search for digitized material and scientific publications. The logs span an entire month, from March 1 to March 31, 2018, and consist of 4,091,155 requests, with an average of 131,973 requests per day and a standard deviation of 36,996.7 requests. In total, the requests come from 27,061 unique IP addresses and 3,441 unique user-agent strings. The server logs are in JSON format and are anonymized by masking the last 6 digits of each IP address and by hashing the last part of each requested URL (everything after the last /). The dataset also contains a processed form of the server logs: a labelled dataset of log entries grouped into sessions, along with their extracted features (simple semantic features). We make this dataset, the first of its kind in this domain, publicly available to provide a common ground for testing web robot detection methods, as well as other methods that analyze server logs.
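The anonymization described above can be illustrated with a minimal sketch. It is not the authors' code: the function names `mask_ip` and `hash_last_segment`, the `x` fill character, and the choice of a truncated SHA-256 digest are all assumptions made for illustration, since the description does not state which mask character or hash function was used.

```python
import hashlib


def mask_ip(ip: str, digits: int = 6, fill: str = "x") -> str:
    """Replace the last `digits` numeric characters of an IP address with `fill`.

    Separators such as '.' are kept, so the overall shape of the address survives.
    """
    out = []
    remaining = digits
    for ch in reversed(ip):
        if ch.isdigit() and remaining > 0:
            out.append(fill)
            remaining -= 1
        else:
            out.append(ch)
    return "".join(reversed(out))


def hash_last_segment(url: str) -> str:
    """Hash everything after the last '/' of a URL.

    SHA-256 truncated to 16 hex characters is an assumption;
    the dataset description does not name the hash function.
    """
    head, sep, tail = url.rpartition("/")
    if not sep or not tail:
        # No '/' at all, or nothing after the last '/': nothing to hash.
        return url
    digest = hashlib.sha256(tail.encode("utf-8")).hexdigest()[:16]
    return f"{head}/{digest}"


# Hypothetical values, not taken from the dataset.
print(mask_ip("155.207.112.45"))           # 155.20x.xxx.xx
print(hash_last_segment("/Record/abc123"))
```

Masking only the trailing digits keeps the leading part of the address, so requests from the same network can still be grouped, while hashing the final URL segment hides the specific record requested but keeps identical requests comparable.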

 

Files

public_v2.json

Files (3.2 GB)

Size       Checksum
3.2 GB     md5:2f126f1f33a8cb7851c2863094998788
4.3 MB     md5:8c2c2daef9aeb5c21f153b55caedc005
15.9 MB    md5:08c9b2efeddd5c15e1e4f66d127408f8
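The MD5 checksums listed above can be used to check that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local file path is whatever name the file was saved under. The listing here does not pair checksums with file names, so that pairing should be confirmed on the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat, which matters for a multi-gigabyte file.
    """
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with a published value.

    Accepts the value with or without the 'md5:' prefix used in the listing.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum(local_path, "md5:2f126f1f33a8cb7851c2863094998788")` returns `True` only if the file at `local_path` is byte-for-byte identical to the 3.2 GB file published with that checksum.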

Additional details

Related works

Is compiled by
Journal article: 10.1007/s10489-020-01754-9 (DOI)

References

  • Lagopoulos, A., & Tsoumakas, G. (2020). Content-aware web robot detection. Applied Intelligence, 50(11), 4017-4028.