Dataset Open Access

Web robot detection - Server logs

Lagopoulos, Athanasios; Tsoumakas, Grigorios

This dataset contains server logs from the search engine of the Library and Information Center of the Aristotle University of Thessaloniki, Greece (http://search.lib.auth.gr/). The search engine lets users check the availability of books and other written works and search for digitized material and scientific publications. The logs span an entire month, from March 1 to March 31, 2018, and consist of 4,091,155 requests, with an average of 131,973 requests per day and a standard deviation of 36,996.7 requests. In total, the requests come from 27,061 unique IP addresses and 3,441 unique user-agent strings. The server logs are in JSON format and are anonymized by masking the last 6 digits of each IP address and by hashing the last part of each requested URL (the part after the last /). The dataset also contains a processed form of the server logs: a labelled dataset of log entries grouped into sessions, along with their extracted features (simple and semantic features). We make this dataset, the first in this domain, publicly available to provide a common ground for testing web robot detection methods, as well as other methods that analyze server logs.
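The anonymization described above can be illustrated with a minimal sketch. The description does not specify the exact scheme, so everything here is an assumption made for illustration: the mask character `x`, the choice of SHA-256 as the hash, and the hypothetical helper names `mask_ip` and `hash_last_segment`. The published logs may have been produced differently.

```python
import hashlib


def mask_ip(ip: str, n_digits: int = 6, mask_char: str = "x") -> str:
    """Replace the last n_digits digits of an IP string with mask_char.

    Separators (dots) are kept, so the address keeps its shape.
    The mask character is an assumption, not taken from the dataset.
    """
    chars = list(ip)
    remaining = n_digits
    # Walk from the end of the string and mask digits only.
    for i in range(len(chars) - 1, -1, -1):
        if remaining == 0:
            break
        if chars[i].isdigit():
            chars[i] = mask_char
            remaining -= 1
    return "".join(chars)


def hash_last_segment(url: str) -> str:
    """Replace the part of a URL after the last '/' with a hex digest.

    SHA-256 is an assumed hash function; the dataset description
    does not name the one actually used.
    """
    head, sep, tail = url.rpartition("/")
    if not sep or not tail:
        # No slash, or nothing after the last slash: nothing to hash.
        return url
    digest = hashlib.sha256(tail.encode("utf-8")).hexdigest()
    return head + sep + digest
```

For example, `mask_ip("192.168.123.45")` returns `"192.16x.xxx.xx"` under these assumptions, and `hash_last_segment("/record/12345")` keeps the `/record/` prefix while replacing `12345` with a 64-character digest.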

 

Files (3.2 GB)

Name                    Size      MD5
public_v2.json          3.2 GB    2f126f1f33a8cb7851c2863094998788
semantic_features.csv   4.3 MB    8c2c2daef9aeb5c21f153b55caedc005
simple_features.csv     15.9 MB   08c9b2efeddd5c15e1e4f66d127408f8
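Since the largest file is 3.2 GB, it may be worth checking downloads against the MD5 checksums listed for the files. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_downloads` are hypothetical, and the sketch assumes the files were saved under their original names in one directory.

```python
import hashlib
from pathlib import Path

# Checksums as listed for the files in this record.
EXPECTED_MD5 = {
    "public_v2.json": "2f126f1f33a8cb7851c2863094998788",
    "semantic_features.csv": "8c2c2daef9aeb5c21f153b55caedc005",
    "simple_features.csv": "08c9b2efeddd5c15e1e4f66d127408f8",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks
    so that the 3.2 GB log file never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True if it exists in
    `directory` and its MD5 matches the listed checksum."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results
```

Calling `verify_downloads("path/to/downloads")` returns a dictionary with one boolean per file; a `False` entry means the file is missing or does not match its checksum.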
  • Lagopoulos, A., & Tsoumakas, G. (2020). Content-aware web robot detection. Applied Intelligence, 50(11), 4017-4028.

