Published November 4, 2019 | Version v1
Presentation Open

Anomaly detection using Unsupervised Machine Learning for Grid computing site operation

Description

A Grid computing site consists of various services, including Grid middleware such as the Computing Element and the Storage Element. Ensuring safe and stable operation of these services is a key responsibility of site administrators. Logs produced by the services provide useful information for understanding the status of the site. However, monitoring and analyzing the service logs every day is a time-consuming task for site administrators. Therefore, a support framework (gridalert), which detects anomalous logs and alerts site administrators, has been developed using Machine Learning techniques. Typical classification using Machine Learning requires pre-defined labels, and it is difficult to collect enough anomalous logs to build a Machine Learning model that covers all possible pre-defined anomalies. Therefore, gridalert uses Unsupervised Machine Learning based on clustering algorithms to detect anomalous logs. Several clustering algorithms, such as k-means, DBSCAN and IsolationForest, and their parameters have been compared in order to maximize the performance of anomaly detection for Grid computing site operations. gridalert has been deployed at the Tokyo Tier2 site, which is one of the Worldwide LHC Computing Grid sites, and is used in operation. In this presentation, studies of Machine Learning algorithms for anomaly detection and our operational experience with gridalert will be reported.
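
To illustrate the idea of label-free anomaly detection on logs, the following is a minimal sketch of a DBSCAN-style clustering in pure Python. It is not gridalert's actual implementation: the sample log lines, the bag-of-words features, the cosine distance, and the `eps` and `min_samples` values are all assumptions chosen for illustration. Log lines that do not fall into any dense cluster are treated as anomaly candidates.

```python
import math
import re
from collections import Counter


def vectorize(line):
    """Bag-of-words vector. Only alphabetic tokens are kept, so numeric
    IDs are dropped and do not dominate the distance (an assumption)."""
    return Counter(re.findall(r"[a-z]+", line.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps, min_samples):
    """Return one label per vector: a cluster index >= 0, or -1 for noise."""
    n = len(vectors)
    # Each point counts as its own neighbor.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


# Hypothetical log lines, not taken from a real Grid service.
logs = [
    "job 1001 submitted by user alice",
    "job 1002 submitted by user bob",
    "job 1003 submitted by user carol",
    "job 1001 finished with exit code 0",
    "job 1002 finished with exit code 0",
    "job 1003 finished with exit code 0",
    "disk quota exceeded on storage pool",
]

labels = dbscan([vectorize(line) for line in logs], eps=0.3, min_samples=3)
anomalies = [line for line, label in zip(logs, labels) if label == -1]
print(anomalies)
```

In this sketch the routine "submitted" and "finished" messages form two dense clusters, while the isolated storage message is left as noise. In practice the choice of features and of parameters such as `eps` strongly affects which logs are flagged, which is why comparing algorithms and parameters matters.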

Files

CHEP2019_252.pdf (821.6 kB, md5:bc29d7265c24cd87942c61dbef3c38f8)