Anomaly Detection in the Elasticsearch Service
Creators
Description
The Elasticsearch Service is a distributed search and analytics engine widely used across CERN. Currently,
issues in the service are resolved manually after being detected through internal monitoring by service
managers. However, the large number of clusters and metrics makes them difficult to track, and
issues are often discovered and reported by users. This is time-consuming and disrupts the workflow of the
service users. In light of this, the main objective of this project is to develop a model capable of identifying
anomalies in the Elasticsearch Service clusters, in order to predict and eliminate service issues before they
cause problems. This is done by analyzing the history of cluster data using machine learning methods. In
this way, a single metric signaling service issues can be obtained and used to alert service managers to
upcoming issues. In 2017, a deep neural network model was developed for this purpose. However, several
issues were identified with the model, the most severe being convergence issues in the autoencoder. In this
project, a revised autoencoder based on long short-term memory neural networks (LSTMs) is developed,
tuned and evaluated. Finally, it is used on new Elasticsearch Service cluster data. The final model shows
improved convergence compared to the previous model, and is able to detect real service issues based on
the anomaly scores obtained. By combining the anomaly scores with those obtained by a model simply
predicting the cluster state as a moving average of preceding states, the rate of false positives is reduced.
The conclusion is that a combined model, reporting anomalies based on a combination of the anomaly
scores obtained by the LSTM-based model and the moving average model, is the most sensitive to real
service issues.
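
The combination described above can be illustrated with a minimal sketch. It is not the implementation from the report: the window size, the thresholds, the example values and the rule of raising an alarm only when both scores exceed their thresholds are all assumptions made for illustration, and the LSTM autoencoder scores are replaced by placeholder numbers.

```python
from statistics import mean


def moving_average_scores(series, window=3):
    """Score each point by its absolute deviation from the mean of the
    preceding `window` points (a simple moving average prediction)."""
    scores = []
    for t in range(len(series)):
        if t < window:
            scores.append(0.0)  # not enough history to predict yet
        else:
            predicted = mean(series[t - window:t])
            scores.append(abs(series[t] - predicted))
    return scores


def combined_alarms(model_scores, baseline_scores,
                    model_threshold, baseline_threshold):
    """Flag a time step only when both scores exceed their thresholds.
    This AND rule is an assumed way of combining the two models."""
    if len(model_scores) != len(baseline_scores):
        raise ValueError("score sequences must have the same length")
    return [
        m > model_threshold and b > baseline_threshold
        for m, b in zip(model_scores, baseline_scores)
    ]


# Hypothetical history of a single cluster metric with a spike at index 5.
metric = [1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0]
baseline = moving_average_scores(metric, window=3)

# Placeholder values standing in for LSTM autoencoder anomaly scores.
lstm_scores = [0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.1, 0.1]

alarms = combined_alarms(lstm_scores, baseline,
                         model_threshold=0.5, baseline_threshold=2.0)
```

In this toy example the placeholder LSTM score at index 2 exceeds its threshold but the moving average score does not, so only index 5 is flagged. This is the sense in which requiring agreement between the two models can reduce false positives.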
Files
Name | Size | Checksum |
---|---|---|
Report_Jennifer_Andersson.pdf | 1.9 MB | md5:af2f56ef873fc3aa32c84fae6b4044d7 |