Towards a framework for detecting advanced Web bots

Automated programs (bots) are responsible for a large percentage of website traffic. These bots can be used either for benign purposes, such as Web indexing, Website monitoring (validation of hyperlinks and HTML code), feed fetching and data extraction for commercial use, or for malicious ones, including, but not limited to, content scraping, vulnerability scanning, account takeover, distributed denial of service attacks, marketing fraud, carding and spam. To ensure their security, Web servers try to identify bot sessions and apply special rules to them, such as throttling their requests or delivering different content. The methods currently used for the identification of bots are based either purely on rule-based bot detection techniques or on a combination of rule-based and machine learning techniques. While current research has developed highly adequate methods for Web bot detection, the adequacy of these methods when faced with Web bots that try to remain undetected has not been studied. For this reason, we created a Web bot detection framework and evaluated its ability to detect conspicuous bots separately from its ability to detect advanced Web bots. We assessed the performance of the proposed framework using real HTTP traffic from a public Web server. Our experimental results show that, using HTTP Web logs, the proposed framework detects Web bots that do not try to hide their bot identity with high reliability (balanced accuracy in a false-positive intolerant server > 95%). However, detecting advanced Web bots that present a browser fingerprint and may also present humanlike behaviour is considerably more difficult.


INTRODUCTION
The vast amount of content hosted on the Internet has rendered the use of Web bots necessary. Web bots are programs that automate the browsing process and perform specific commands on behalf of the author. Popular uses of Web bots include Web indexing, Website monitoring (validation of hyperlinks and HTML code), data extraction for commercial purposes and feed fetching Web content. To perform these actions, bots visit Web servers repeatedly and, in some cases, for a prolonged period of time [10].
However, allowing bots unrestricted access to Web server content and services is not a good practice. This is because bots are a powerful tool that has often been abused for malicious purposes, such as Web content scraping, vulnerability scanning, marketing fraud, carding, account takeovers, spamming, denial of service attacks and more [10]. Furthermore, it is possible to operate bots from mobile phones and IoT devices (usually without the device owner's knowledge or consent) which makes them a low cost mechanism for distributed attacks [10]. Moreover, advanced Web bots can also avoid detection by imitating humanlike behavior [10]. As a result, in addition to being a crucial part of the infrastructure of the Internet, Web bots have become a ubiquitous threat.
The need to mitigate this threat has inspired a whole new area of research. Once a visitor is identified as a Web bot the process is straightforward − websites simply need to place special restrictions, so that bots cannot perform malicious acts. To detect Web bots, current state of the art approaches both in academia [13,15,31] and in commercial solutions [1,10] propose, besides the rule-based Web bot detection techniques, the use of machine learning to distinguish bots from human visitors. Such approaches rely on the collection of Web server logs that contain both human and Web bot sessions. These data can be used for modelling human and bot activity.
Current research on machine learning based Web bot detection performs automatic annotation of the extracted sessions from Web logs to create the ground truth needed to generate such models [25,26]. Such approaches do not consider that the bots may try to hide their bot nature and/or try to imitate humanlike behaviour [10]. Thus, we performed a more in-depth analysis of the Web bot detection problem by examining the accuracy of machine learning based algorithms in identifying conspicuous Web bots separately from their accuracy in detecting advanced Web bots, which try to hide their bot nature.
We created a Web bot detection framework which examines HTTP Web logs from a public Web server. The log data were split into three categories: (i) simple Web bots which do not try to hide their bot nature, (ii) advanced Web bots which use a browser fingerprint and may present a humanlike browsing behaviour and (iii) human sessions, which we assume to be all the sessions that do not belong to the above categories. Furthermore, differentiating from relevant literature, we studied the behaviour of our framework in a false-positive intolerant Web server. We opted to do this because, in a real-world scenario, miscategorising human visitors is undesirable since it affects their browsing experience.
The main contributions of this paper are:
• The proposal of a modular machine learning based Web bot detection framework that (i) can be easily combined with any HTTP Web server and (ii) can effortlessly incorporate new machine learning-based Web bot detection algorithms.
• The identification of the unique challenges when state-of-the-art Web bot detection techniques are utilised for detecting advanced bots as opposed to simple bots.
• The identification of the most important features among the ones proposed in literature for the detection of simple and advanced Web bots.
The remainder of this paper is structured as follows: Section 2 provides the background on the Web bot detection problem and covers the related work. Section 3 describes the Web bot detection framework. Section 4 presents our evaluation methodology and experimental setup and Section 5 contains the evaluation results. Finally, Section 6 discusses the significance of our results to the Web bot detection problem and Section 7 summarizes our work and examines the future evolution of this framework.

BACKGROUND AND RELATED WORK
The Web bot detection problem poses the question of how we can accurately distinguish whether a Web visitor is a bot or a human. Researchers have further categorised bots based on their functionality [12] or purpose (benign/malicious) [4,23,31].
In the past, it was common to detect Web bots by examining the signature of the visitor's request, i.e. the request headers, and whether JavaScript, cookies, and Web sessions are supported. However, tools such as Selenium provide APIs that allow bots to mimic the signature and support the majority of the features of most common browsers, including the support of JavaScript, cookies and sessions.
Currently, the best-known techniques for Web bot detection are based on the CAPTCHA (i.e. Completely Automated Public Turing test to tell Computers and Humans Apart) [28], such as the reCAPTCHA offered by Google. The CAPTCHA is a Turing test that is based on a visual challenge, accompanied by an aural one for the visually impaired. The test relies on the assumption that a human can extract letters from either a distorted image or an audio file, or select an object in an image, while a Web bot cannot.
However, a variety of techniques have been proposed to bypass some of the most popular CAPTCHA challenges, such as the use of public online speech-to-text engines to bypass Google's reCAPTCHA [6]. Finally, the CAPTCHA has received a lot of criticism, especially from people with disabilities, who sometimes struggle to complete the challenge, and from people who feel that their everyday work is slowed down.
To solve the aforementioned problems, current research focuses on the use of machine learning based detection techniques to distinguish Web bots from humans, rather than solely relying on rule-based detection techniques. The first step in generating machine learning models that can be used for the detection of Web bots is the extraction of sessions from Web logs [5,18,22,24,25]. After that, several features are extracted from each session and used to identify whether the visitors are bots or humans. These features include the access frequency of Web pages [27], the type of Web content accessed (i.e. HTML, text, JavaScript, image, css, etc.) [13,22], the access patterns [3] and the HTTP errors produced [5,27]. These features are used as input to generate machine learning models.
The most common formulations of machine learning based Web bot detection in research are classification [25,26] and clustering [2,9,27]. The detection can take place either off-line (i.e. deciding after the end of a session whether it came from a bot) or online, by performing an estimation during the session [8,21]. In both cases, a ground truth of human and Web bot sessions is required. In most recent research, the annotation process relies on comparing each visitor's agent name [26] and IP address [5,7,13,22,25] with the agent names and IPs of known Web bots according to lists hosted on external servers. Such lists mostly contain identifiers for bots which are benign in nature, such as search engine bots, although some malicious bots can be found there as well. Furthermore, some researchers examine whether the visitors access the text file (robots.txt) which instructs Web robots which Web pages to crawl/scrape [15,27,31].
Although the aforementioned techniques show promising results, they do not address a key aspect of the Web bot detection problem, which is the identification of Web bots which try to evade detection by, for example, presenting a humanlike fingerprint and, in some cases, behaviour. More specifically, such bots can simply [...] Therefore, Web bot detection needs to be treated as a multifaceted problem. To this end, we isolated the advanced Web bots so as to accurately assess the performance of our detection framework in detecting advanced Web bots separately from simple Web bots. To achieve that, we propose a novel annotation technique which combines the list of known Web bot agent names (proposed in literature) with an external honeypot. Specifically, instead of checking whether the visitor's IP is a known Web bot's IP, we check whether this IP has shown activity on a honeypot server. The interaction with the honeypot indicates bot activity, since no human user would have visited that server. We made this choice because several bots in our dataset were already labeled as bots in the honeypot, but were not yet in the list of known Web bots.

THE WEB BOT DETECTION FRAMEWORK
In this section we present the Web bot detection framework that we created to examine how well common Web bot detection algorithms perform in detecting simple and advanced Web bots under various configurations. This framework is based on and combines the most prevalent techniques that have been proposed in literature for Web bot detection using supervised machine learning [25,26].
The architecture of the Web bot detection framework is shown in Figure 1. The input of the framework is a directory path in which the HTTP logs from the Web server are stored. The framework uses a regular expression to extract the relevant content from HTTP logs. Thus, the process of applying different log files as input is trivial, since any new format of interest can be incorporated by only adapting this regular expression rule.
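As an illustration of the regular-expression step, the following minimal sketch parses one log line. It assumes the widely used "combined" access log format; the paper does not state which format the server uses, and the pattern, field names and sample line are assumptions made for this example.

```python
import re
from datetime import datetime

# Assumed pattern for the "combined" access log format; adapting the
# framework to another format would mean changing only this expression.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


# Hypothetical log line (documentation IP address, made-up request).
sample = ('203.0.113.7 - - [20/Mar/2016:10:15:32 +0000] '
          '"GET /index.html HTTP/1.1" 200 5120 '
          '"-" "Mozilla/5.0 (X11; Linux x86_64)"')
record = parse_line(sample)
```

Lines that do not match the pattern are returned as None so that the caller can decide whether to skip or log them.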
After the successful connection of the framework with the HTTP server log files, the session extraction procedure takes place, where HTTP log data are split into sessions (Section 3.1). For each session, a feature vector is created using a set of features proposed in literature (Section 3.2). After that, each session is annotated as Web bot or human using an automated way (Section 3.3). Furthermore, the importance and effectiveness of each feature is evaluated and a subset is selected (Section 3.4). Finally, the selected feature vectors are used to create the classification models (Section 3.5).
In the testing phase, the previously created classification models are used to identify Web bot sessions in new unseen data. When new data are available, their sessions and features are extracted accordingly (Sections 3.1 and 3.2). Each classifier uses a unique subset of the available features which consists of the ones that were deemed most important during their training stage (Section 3.4). The trained classifiers take the new data as input and determine whether each visitor is a bot or a human (Section 3.5).

Session extraction
The first step in identifying whether a visitor is a human or a Web bot is the extraction of the visitor's session(s) from the log files. Since several visitors (including bots and humans) might share the same IP, considering only the IP field to extract sessions initiated from different visitors is insufficient. For this reason, current research proposes the combination of the IP with the browser agent name for the creation of a unique identifier per visitor [5,13,15,22,25].
The aforementioned technique will not necessarily distinguish all users from each other, since there might be two users with the same IP and agent name, or one user rotating through several agent names. To this end, related research has proposed more advanced fingerprinting techniques, based on browser characteristics (e.g. ActiveX support, Flash enabled, language enumeration, etc.), OS and application features (e.g. OS and kernel version, Windows registry, etc.) and hardware features (e.g. screen resolution) [19]. However, since this information was not available in the log data that we used, we followed the default approach of identifying separate sessions by a unique IP-agent name pair [5,13,15,22,25].
To define when a user session has ended, current research uses a 30 min threshold [5,13,15,22,25]. More specifically, when a session id stays idle for more than 30 minutes, a new session is created upon a new request. Furthermore, sessions that have a total number of HTTP requests lower than a threshold k were not taken into consideration because it is not feasible to distinguish bots and humans based on only a few HTTP requests [22].
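The session extraction rules above (IP and agent name as visitor key, 30-minute idle threshold, minimum of k requests) can be sketched as follows. This is a minimal sketch rather than the framework's actual code; the record layout (dicts with "ip", "agent" and "time" keys) and the function name are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # idle threshold stated in the text


def extract_sessions(requests, min_requests):
    """Group requests into sessions keyed by the (IP, agent name) pair.

    A visitor's session is closed once the gap to the next request exceeds
    SESSION_TIMEOUT. Sessions with fewer than `min_requests` requests
    (the threshold k) are dropped.
    """
    by_visitor = defaultdict(list)
    for req in sorted(requests, key=lambda r: r["time"]):
        by_visitor[(req["ip"], req["agent"])].append(req)

    sessions = []
    for visitor_requests in by_visitor.values():
        current = [visitor_requests[0]]
        for req in visitor_requests[1:]:
            if req["time"] - current[-1]["time"] > SESSION_TIMEOUT:
                sessions.append(current)  # idle too long: close the session
                current = [req]
            else:
                current.append(req)
        sessions.append(current)
    return [s for s in sessions if len(s) >= min_requests]


# Toy data: visitor "A" pauses for 40 minutes, which starts a new session.
t0 = datetime(2016, 3, 20, 10, 0, 0)
toy_requests = [
    {"ip": "203.0.113.7", "agent": "A", "time": t0},
    {"ip": "203.0.113.7", "agent": "A", "time": t0 + timedelta(minutes=10)},
    {"ip": "203.0.113.7", "agent": "A", "time": t0 + timedelta(minutes=50)},
    {"ip": "203.0.113.7", "agent": "B", "time": t0 + timedelta(minutes=5)},
]
all_sessions = extract_sessions(toy_requests, min_requests=1)
long_sessions = extract_sessions(toy_requests, min_requests=2)
```

With the toy data, the same IP yields three sessions in total (two for agent "A", one for agent "B"), of which only one reaches two requests.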

Feature composition
The information included in each session is encoded into measurable values and used as input to train the classification models. To decide which measurable "properties" or "characteristics" (features) to consider, we accumulated the most promising features that have been proposed over the past 5 years. These features are presented in Table 1, along with a short description for each feature and relevant literature. In short, to distinguish Web bots from humans we can examine the method and response code of the HTTP request, the type of file(s) requested and the browsing behaviour.
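To make the encoding step concrete, the sketch below turns one session into a numeric feature vector. The five features are illustrative stand-ins for the kinds of properties mentioned above (request method, response code, file type, browsing behaviour); they are assumptions for this example and are not the feature set of Table 1.

```python
from datetime import datetime, timedelta

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico")


def session_features(session):
    """Encode a non-empty session (list of request dicts) as a list of floats.

    Illustrative features: request count, share of HEAD requests, share of
    4xx responses, share of image requests, mean gap between requests (s).
    Paths are assumed to carry no query string.
    """
    n = len(session)
    head = sum(1 for r in session if r["method"] == "HEAD")
    errors = sum(1 for r in session if 400 <= r["status"] < 500)
    images = sum(1 for r in session
                 if r["path"].lower().endswith(IMAGE_EXTENSIONS))
    times = sorted(r["time"] for r in session)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return [float(n), head / n, errors / n, images / n, mean_gap]


# Hypothetical four-request session.
start = datetime(2016, 3, 20, 10, 0, 0)
toy_session = [
    {"method": "GET", "path": "/index.html", "status": 200, "time": start},
    {"method": "GET", "path": "/logo.PNG", "status": 200,
     "time": start + timedelta(seconds=2)},
    {"method": "HEAD", "path": "/missing", "status": 404,
     "time": start + timedelta(seconds=4)},
    {"method": "GET", "path": "/a.css", "status": 200,
     "time": start + timedelta(seconds=10)},
]
toy_vector = session_features(toy_session)
```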

Automatic annotation
The extracted sessions that are used for training the classifiers are annotated as "bot visitor sessions" or as "human visitor sessions". Bot visitor sessions contain two different types of sessions: (i) those in which the Web bots are conspicuous, i.e. they are not trying to hide the fact that they are bots, and (ii) those in which the Web bots are inconspicuous, i.e. they replace one or more of their normal bot characteristics with those of a human visitor to remain undetected. The annotation process which we followed is depicted in Figure 2 and consists of two steps. The first step is to identify simple Web bots by examining whether their agent name is a browser one, while the second is to identify, to the best of our ability, Web bots that present a browser fingerprint and, in some cases, humanlike behaviour by using an external honeypot. Initially, we used the API provided by Useragentstring, a server that classifies agent names into several categories such as "browser", "crawler", "library" and "link checker". After that, we used the API provided by GreyNoise, a server that collects and analyzes untargeted, widespread and opportunistic scans, attacks and other malicious activities, to check whether the IPs have been found to perform any of the above.
The main idea behind our approach is that it is not common for a human visitor to change the agent name of their browser. Thus, if a session has a non-browser agent name, it can be safely annotated as Web bot. However, all sessions that have a browser agent name [...]
[Figure 2: Annotation process. A session without a browser agent name (checked via useragentstring.com) is annotated as simple bot. A session with a browser agent name is annotated as advanced bot if its IP has shown malicious activity (checked via greynoise.io), and as human otherwise.]
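The two-step annotation described in this section can be sketched as a small decision function. The two arguments stand in for the results of the Useragentstring and GreyNoise lookups; no real API call is made here, and the function and argument names are hypothetical.

```python
def annotate(agent_category, ip_flagged):
    """Label a session following the two-step annotation process.

    agent_category: category of the agent name (e.g. "browser", "crawler"),
                    assumed to come from an agent-name lookup service.
    ip_flagged:     True if the honeypot service has seen activity from
                    the session's IP (assumed to be looked up beforehand).
    """
    if agent_category != "browser":
        return "simple bot"    # step 1: non-browser agent name
    if ip_flagged:
        return "advanced bot"  # step 2: browser agent name, flagged IP
    return "human"             # everything else is assumed to be human


labels = [
    annotate("crawler", False),
    annotate("browser", True),
    annotate("browser", False),
]
```

Note that a non-browser agent name is labelled as a simple bot regardless of whether its IP is flagged, which mirrors the order of the two steps.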

Feature analysis and selection
Feature selection is described as a method whereby specific features are selected from the set of all available features. In machine learning, and more specifically, classification problems, some features might result in the decrease of the models' accuracy. For this reason, feature selection is widely used in machine learning based problems and has also been used in security related tasks that use machine learning techniques [9].
The framework supports the analysis of all the available features and the selection of the most important ones. Two selection modes are supported, one which selects the most promising features based on the feature analysis regardless of the classification algorithm employed and one that is classifier dependent. In both cases, each classifier is accompanied by a boolean array that indicates which features are to be used from all the available features for its training. The same features are used in the testing process.
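The per-classifier boolean array can be pictured as a mask applied to every feature vector before training and testing. The sketch below is a minimal illustration under that reading; the classifier names and mask values are made up.

```python
def apply_mask(feature_vector, mask):
    """Keep only the features whose mask entry is True."""
    if len(feature_vector) != len(mask):
        raise ValueError("mask and feature vector must have the same length")
    return [value for value, keep in zip(feature_vector, mask) if keep]


# Hypothetical masks: one boolean array per classifier, over 4 features.
feature_masks = {
    "svm": [True, False, True, True],
    "random_forest": [True, True, False, False],
}

vector = [0.5, 12.0, 0.25, 3.0]
svm_input = apply_mask(vector, feature_masks["svm"])
forest_input = apply_mask(vector, feature_masks["random_forest"])
```

Storing the mask next to each trained model makes it straightforward to apply exactly the same selection at test time.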

Classification
The classification process consists of two phases, the training phase and the testing phase. In the training phase we feed the framework with known Web bot and human sessions as input to create the classification models. These models are stored into the framework so that they can be used for the testing process.
The testing process is similar to the training one. The new (unseen by the classifiers) Web sessions are used as input to the framework and the classifiers generate the respective label (Web bot or human). To evaluate the framework, the labels of the sessions used for testing are known (but kept secret). However, in a real case scenario, the nature of these sessions would be unknown.

EVALUATION
To assess the effectiveness of the proposed framework in terms of identifying Web bot sessions from HTTP log files, a series of experiments were conducted using real HTTP traffic collected from a public Web server. This section describes the evaluation methodology (Section 4.1), the dataset (Section 4.2), the feature analysis and selection process (Section 4.3), the evaluation metrics considered (Section 4.4) and, finally, the classification algorithms tested and their configuration (Section 4.5).

Evaluation methodology
The purpose of this paper is to identify the unique challenges that arise when state-of-the-art Web bot detection techniques are utilised for detecting advanced Web bots as opposed to simple bots. To this end, we evaluated our framework in how well it can identify simple and advanced Web bots separately. Initially, we identified the most important features in the case of simple and advanced Web bots. We used these features to generate the respective classification models and evaluate our framework after the models' deployment.
Furthermore, we took into account the fact that, in a real world case scenario, it is imperative to have a low false positive rate in order to avoid miscategorising human visitors. Thus, we tested our framework's general performance in various working points (i.e. classification thresholds) and analysed its performance on the working point in which the false positive rate is relatively low.

Dataset
The framework was tested on HTTP log data collected from MK-Lab's public Web server. Instead of feeding the framework with data in real time, we used a year's worth of HTTP log data (from 20/3/2016 to 20/3/2017) as input in a simulated time mode. We only considered Web sessions with more than k=30 requests per session to ensure that the framework has adequate data to identify the nature of each visitor [22]. The value of k was chosen heuristically.
We annotated the dataset by examining the agent name of the visitor as well as whether its IP has shown malicious activity (see Section 3.3). In total, 2793 unique agent names were extracted from sessions with more than k=30 requests. Of these, 2723 were annotated by the useragentstring's API as "browser" (2628) or "bot" (95) and 70 were annotated as "unknown". We manually annotated the "unknown" agent names as browsers (66) or bots (4).
As we mentioned in Section 3.3, the IPs of the sessions that were annotated as "browsers" by the useragentstring's API (15452) were also checked for malicious activity using the GreyNoise's API. As a result, we changed the annotation of 299 unique IPs (554 sessions) which were originally annotated as "browser" but whose IPs have been marked as bots by GreyNoise. The total unique agent names and IPs per class (i.e. browser, simple bot, advanced bot) are shown in Table 2. To evaluate the framework, we split the dataset into two sets, one for training and one for testing. Our Web server by default splits the HTTP log data into files based on a log rotation technique. The total files that were generated over a year were 13. We grouped the files into two packages, (i) the training one using the first 8 files (from 20/3/2016 to 4/12/2016) and (ii) the testing one using the other 3 files (from 4/12/2016 to 20/3/2017). For each of these files we extracted all sessions with more than k=30 requests per session [22].
The final number of extracted sessions is shown in Table 3. To assess the framework's performance in identifying simple bots and advanced bots separately, we created two "sub-datasets", the D1 that contains the human and simple bot sessions and the D2 that contains the human and the advanced Web bot sessions.

Feature analysis and selection
To analyse the importance of the extracted features in the detection of simple as well as advanced Web bots we utilised the Principal Component Analysis (PCA) and the χ² (chi-square) feature selection techniques. For both of these techniques the data were scaled. When scaling the data for the PCA, we subtracted the mean values and then divided by standard deviation for each feature in the training set. In the case of χ², we divided by standard deviation without subtracting the mean to avoid negative values. We then used the mean and standard deviation values calculated from the training set to perform the same process in the testing set. PCA can be used to assess the importance of each feature by calculating its contribution to the generated principal components. To do that, we can measure the mean of each feature "contribution" to all the generated components of the PCA using the training set [17] (D1 for simple bots and D2 for advanced bots, see Section 4.2). Usually, the smaller principal components (i.e. with lower variance) are associated with noise and thus they can be omitted. Thus, the features with the lowest cumulative "contribution" to all the principal components can also be associated with noise [17]. However, the high variance principal components are not necessarily all useful, since they might not be correlated with the respective class (i.e. Web bot or human) or they may refer to noise existing within the data. Thus, we combined the results of the PCA technique with the χ² feature selection technique to see the most important features to the Web bot detection problem.
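The two scaling variants described above differ only in whether the mean is subtracted. The sketch below illustrates this with the standard library; the use of the population standard deviation and the fallback of 1.0 for constant features are assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Compute per-feature mean and standard deviation on the training set."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # Assumption: constant features get a divisor of 1.0 to avoid dividing by 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def scale(rows, means, stds, center):
    """Scale rows with training-set statistics.

    center=True  subtracts the mean first (scaling used before the PCA).
    center=False only divides by the standard deviation, so non-negative
    inputs stay non-negative (scaling used before the chi-square test).
    """
    return [
        [((x - m) if center else x) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


train = [[1.0, 10.0], [3.0, 30.0]]
test = [[5.0, 20.0]]
means, stds = fit_scaler(train)
train_pca = scale(train, means, stds, center=True)
train_chi2 = scale(train, means, stds, center=False)
test_pca = scale(test, means, stds, center=True)  # reuses training statistics
```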
To select the features that will give us the highest score for each classifier, we used the greedy Sequential Feature Selection (SFS) technique [20]. The SFS works as a wrapper on top of each classifier. It is an iterative process where, in each iteration, the feature that yields the highest metric (in our case balanced accuracy) on the training set is chosen and added to the features that are used for each classifier. In this way, the features that perform the worst are added last and can be omitted if they do not contribute to the performance. Thus, SFS provides different results for each classifier, which is useful because each feature may contribute differently depending on the classifier that is used.
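The greedy forward variant of SFS can be sketched independently of any classifier by passing the evaluation as a callback. In the sketch below, `score_fn` is a hypothetical function that would train the wrapped classifier on the given feature subset and return its balanced accuracy; here it is replaced by a toy additive score.

```python
def sequential_forward_selection(n_features, score_fn):
    """Greedy SFS: repeatedly add the feature that maximises score_fn.

    Returns a list of (selected_features, score) pairs, one per iteration,
    so the caller can decide how many of the ranked features to keep.
    """
    selected, history = [], []
    remaining = list(range(n_features))
    while remaining:
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score_fn(selected)))
    return history


# Toy stand-in for "train the classifier and measure balanced accuracy":
# each feature adds a fixed, made-up gain to a baseline of 0.5.
toy_gains = {0: 0.1, 1: 0.3, 2: 0.0}


def toy_score(features):
    return 0.5 + sum(toy_gains[f] for f in features)


history = sequential_forward_selection(3, toy_score)
ranking = history[-1][0]
```

In the toy run the least useful feature is added last, which is the property that allows low-ranked features to be dropped.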

Evaluation metrics
Several researchers used accuracy as the evaluation metric for Web bot detection [21,22,26,29]. Since it is possible for an algorithm to have high accuracy while maintaining low precision, other researchers use the precision and recall metrics as well to assess the performance of the proposed approaches more accurately [2,11,13,21,25,26]. Furthermore, researchers also calculated the harmonic mean of the precision and the recall which is called F-measure, F1 score or simply F-score [11,13,21,25,26].
Due to the unbalanced classes in our dataset, we decided to use balanced accuracy as opposed to accuracy. Furthermore, to evaluate the framework's performance in both the case of Web bot detection as well as human user detection we calculated the precision, recall and their harmonic mean, F-score, for both classes. Finally, to gain a more general understanding of the performance of the classifiers irrespective of the working point (i.e. classification threshold) we considered the Area Under Curve (AUC) evaluation metric that can be calculated by plotting the Receiver Operating Characteristic (ROC) curve for a classifier [14].
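The following sketch shows why balanced accuracy is preferred over accuracy on unbalanced classes, and how it relates to precision, recall and F-score. Labels follow the convention of class 1 for bots and class 0 for humans; the toy labels are made up for the example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, balanced accuracy, precision, recall and F-score."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    recall = tp / (tp + fn) if tp + fn else 0.0       # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Toy example: 3 bots (class 1) among 10 sessions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
scores = binary_metrics(y_true, y_pred)

# A classifier that labels every session as human still looks accurate.
all_human = binary_metrics(y_true, [0] * 10)
```

Labelling every session as human gives 70% accuracy on the toy data but only 50% balanced accuracy, which is the effect that motivates the choice of metric.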

Classification algorithms tested
Our framework is built to allow for the effortless incorporation of any machine learning algorithm. For our experiments, we incorporated 4 well known machine learning algorithms, all of which have been used by other researchers for the Web bot detection problem. More specifically, we incorporated the Support Vector Machine [21,26], the Random Forest [25], the Adaboost [25] and the MLP classifiers [7,21,26]. Furthermore, we added an ensemble classifier, which we call the Voting classifier, that performs a class probability averaging of all the available classifiers [25].
The parameters for each classifier are shown in Table 4. We performed an exhaustive search over specified parameter values for each classifier and chose the ones which have the highest balanced accuracy with a 2-fold cross validation on the training data. Furthermore, in the case of SVM and MLP Classifier, the data are scaled to avoid the problem of domination of some features over the others. To scale the data, we followed the same scaling technique that we used in the PCA (Section 4.3). For the implementation of these algorithms the scikit-learn Python library was used. Furthermore, all the experiments were performed on an Intel processor at 3.4GHz with 32GB of RAM for loading large datasets during the experiments.
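Class probability averaging can be illustrated without any particular library. The sketch below assumes that each base classifier outputs, for every session, a probability that the session belongs to the bot class; the probability values are made up.

```python
def average_probabilities(per_classifier_probs):
    """Average P(bot) over classifiers, session by session.

    per_classifier_probs: one list per classifier, each holding the
    predicted bot probability for the same sequence of sessions.
    """
    n_classifiers = len(per_classifier_probs)
    return [sum(column) / n_classifiers
            for column in zip(*per_classifier_probs)]


# Hypothetical outputs of three classifiers for two sessions.
probs = [
    [0.9, 0.2],  # classifier A
    [0.7, 0.4],  # classifier B
    [0.8, 0.0],  # classifier C
]
voted = average_probabilities(probs)
predicted = [1 if p > 0.5 else 0 for p in voted]  # 0.5 is an example threshold
```

Averaging probabilities rather than hard labels keeps a continuous score, which is what allows the classification threshold to be moved later.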

Feature selection.
Each feature might contribute differently in the performance of different classification algorithms. Thus, we performed the greedy SFS technique to pick the features that give the highest score (in our case balanced accuracy) for each classifier in the case of simple (D1) as well as advanced (D2) Web bots. We decided to keep as many features as possible as long as they do not noticeably decrease the balanced accuracy in training. The selected features for each classifier and dataset are shown in Table 6.
The SFS results show that each feature contributes differently in each classifier. Furthermore, the initial features selected by the PCA in combination with the χ² were selected and highly ranked in some classifiers and rejected in other classifiers. Such an example is feature 17, which was initially selected and is highly ranked in the case of SVM and MLP Classifier, but rejected in the case of the Random Forest and Adaboost.
For this reason, depending on the size of the dataset and the processing power we have, we can either select the most promising features according to the combination of the PCA with the χ² technique or perform the greedy SFS over all the features and pick the ones that perform better on the training set. In our case, since the dataset was relatively small, we followed the latter.

General Performance
To evaluate the general performance of our framework, we plotted the ROC curve of the Voting classifier when the framework was tested on simple and advanced Web bots (Figure 5). We also marked a few working points (i.e. classification thresholds) based on the respective False Positive Rate (FPR). We opted to do this because in a real-world scenario a Web bot detection framework must be false-positive intolerant to avoid affecting visitors' experience.
The performance of our classifiers shows that detecting simple Web bots is a trivial task. The framework is able to effectively [...] The framework performs poorly and, if a low FPR is required, the framework detects very few Web bots.
To further analyse the behaviour of our framework on the selected working points in the case of advanced Web bots (D2), we calculated the confusion matrix of the Voting classifier on the two working points selected in Figure 5 (Tables 7 and 8). The choice of a working point depends on how strict we want our detection framework to be in each case. For example, choosing a working point with FPR = 0.4, we would correctly identify 2 out of 3 advanced Web bots, but a large share of human visitors (about 40%, by the definition of the FPR) would be misclassified (Table 7). Choosing a higher threshold (lower FPR) results in fewer misclassified human visitors, but the framework's effectiveness in detecting advanced Web bots is decreased.
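The trade-off between FPR and detection rate can be sketched by choosing the classification threshold from the scores of human sessions alone. This is a minimal sketch under the assumption that a session is labelled as a bot when its score is strictly above the threshold; the scores are made up and the helper names are hypothetical.

```python
import math


def threshold_for_fpr(human_scores, max_fpr):
    """Return the lowest observed score usable as a threshold such that the
    share of human sessions scoring strictly above it is at most max_fpr."""
    ranked = sorted(human_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # tolerated number of false positives
    if allowed >= len(ranked):
        return -math.inf                  # every session may be flagged
    return ranked[allowed]


def rate_above(scores, threshold):
    """Share of scores strictly above the threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)


# Hypothetical bot-probability scores produced by a classifier.
human_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
bot_scores = [0.85, 0.99, 0.5]

threshold = threshold_for_fpr(human_scores, max_fpr=0.2)
fpr = rate_above(human_scores, threshold)   # share of humans flagged as bots
recall = rate_above(bot_scores, threshold)  # share of bots detected
strict = threshold_for_fpr(human_scores, max_fpr=0.0)
```

Lowering `max_fpr` raises the threshold, so fewer humans are misclassified at the cost of fewer detected bots.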

Performance on a false positive intolerant Web server
To assess the framework's performance on a false-positive intolerant Web server, we calculated the precision, recall and F-measure in the working point of FPR=0.01 for all employed classifiers. Furthermore, we calculated the balanced accuracy, which represents more accurately the performance of the framework in the case of unbalanced datasets. The results are shown in Figure 6.
The performance of the classifiers shows that identifying advanced Web bots is more challenging than identifying simple Web bots. When choosing a working point with a low FPR, simple bots are detected with very high precision (∼95%) and recall (∼97%), which yields an F-measure higher than 96%. Furthermore, in the case of detecting human visitors (class 0), we achieved a precision and recall of more than 99% each. However, the framework achieves low precision and recall in the case of advanced Web bots, which results in a low balanced accuracy (∼55%).
Moreover, the performance of different classifiers varies. To achieve a more balanced behaviour we chose the Voting classifier to be the main classifier. Voting classifiers are not guaranteed to perform better. However, they can be more "stable", since, if one of the employed classifiers underperforms, its behaviour will be masked by the other classifiers. For example, Random Forest achieves the highest balanced accuracy in the case of advanced Web bots (D2) but, at the same time, very low recall of the human class (Figure 6).

DISCUSSION
There is a huge incentive for individuals and companies alike to create Web bots that can bypass Web bot detection techniques. This has led to the introduction of advanced Web bots that try to evade detection. Our dataset, which is composed of the logs from a public Web server, contains several sessions from such bots. We used these logs to determine the effectiveness of state-of-the-art Web bot detection techniques against advanced Web bots. The results have shown that, although detecting simple bots is relatively easy, detecting advanced Web bots that present a browser fingerprint and possibly humanlike behaviour is much more difficult. Furthermore, if we try to detect such bots with current detection techniques, we will end up misclassifying benign visitors, which is undesirable in a real-world scenario.
Literature has focused on identifying all kinds of Web bots, treating Web bots as one group of visitors. However, since advanced Web bots will be considerably fewer than simple bots, this approach will present biased results, masking its ability or lack thereof to detect advanced bots. To this end, we performed a more in-depth study of the Web bot detection problem by separating the simple from the advanced Web bots and showed that there is an efficiency gap in the detection of advanced Web bots compared to simple ones. For this reason, we have concluded that the features proposed by literature were not suitable for the detection of advanced bots. Future work includes examining browsing behaviour holistically instead of relying exclusively on requested pages and time. For example, we could use features extracted from visitor mouse movements and keystrokes. By incorporating such features in our framework, we expect to identify advanced Web bots more accurately.
In summary, the question of whether existing detection mechanisms can be used as an effective solution comes down to the threat model. If we choose to only target simple Web bots (which are the majority of bots that will visit our Website), we can easily detect them using hard-coded rules. Even if such bots try to evade detection by presenting a browser fingerprint, they can easily be detected by their behaviour, using machine learning models built on features extracted from HTTP logs. However, if we are targeting advanced Web bots, we need to have a better understanding of their behaviour and use more advanced features drawn from more sources than the HTTP logs commonly used in research.

CONCLUSIONS
This work presented an in depth analysis of the Web bot detection problem by examining the performance of a machine learning based Web bot detection framework in terms of identifying simple Web bots separately from its performance in detecting advanced Web bots which try to hide their bot nature. To do that, we generated the ground truth to train our models by using a novel automatic annotation mechanism that examines (i) the fingerprint of the visitor (in our case the agent name) as well as (ii) whether its IP has shown malicious activity using an external honeypot.
The proposed framework was tested on real HTTP Web log data collected from a public Web server. The results of our evaluation experiments indicated that the Web bot detection problem is a multifaceted one, characterised by the coexistence of simple bots that can be detected easily and advanced Web bots that are considerably more difficult to detect. Furthermore, if the framework is applied on a false-positive intolerant Web server, its effectiveness regarding detecting advanced Web bots is significantly reduced.
Future work includes the introduction of more advanced features that cannot be easily simulated by bots to facilitate the identification of advanced Web bots. These features will be integrated into our framework. Furthermore, we are planning to examine and improve the framework's performance in adversarial settings, where adversaries try to create bots that adjust their behaviour dynamically to avoid detection.

ACKNOWLEDGMENTS
This work was supported by the TENSOR (H2020-700024) and Ideal-Cities (H2020-778229) projects, funded by the European Commission.