Published March 26, 2020 | Version v1
Report Open

EOS Winston: Expert Systems for Automated Diagnosis and Remediation

Creators

Description

This report describes EOS Winston, an event-driven alerting and mitigation automation platform. 
Using expert rules and online anomaly detection algorithms, it catches events that 
represent a departure from normal behaviour and tries to take corrective action automatically, 
involving administrators when it encounters events that it has not seen before and that 
are not covered by the predefined rules. 
The report is structured as follows. In the first section we describe what EOS is and the scale at 
which it operates, emphasising that anomalies and irregularities inevitably creep in when operating 
at peta-scale. We also describe the commonly occurring issues that previously required manual 
intervention; mitigating these automatically was the main focus of this project. In the next 
section we discuss alternative automation engines and anomaly detection algorithms, describing 
the pros and cons of each and explaining the rationale behind our choices. 
The third section describes the main platform, EOS Winston. After a brief overview of its 
architecture, we describe the StackStorm pack created as part of this project, which contains 
various corrective actions and the rules that link them to a variety of triggers. We also describe 
how the anomaly detection models were trained, how they are accessed through a REST API, 
and the ChatOps integration that allows workflows to be executed from conversational channels. 
Finally, we discuss the challenges faced during the project and how they were overcome, and lay 
out future possibilities that could further extend the scope of the platform. 
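
The flow summarised above (predefined expert rules first, anomaly detection as a fallback, and administrators involved for unfamiliar events) can be illustrated with a minimal sketch. This is not the code from the report and does not use the StackStorm API. The Event and Rule types, the handle function, the host and metric names, and the z-score check standing in for the online anomaly detection models are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Callable, List


@dataclass
class Event:
    """Hypothetical event record; field names are assumptions."""
    source: str
    metric: str
    value: float


@dataclass
class Rule:
    """Hypothetical expert rule: a predicate linked to a corrective action."""
    name: str
    matches: Callable[[Event], bool]
    action: Callable[[Event], str]


def is_anomalous(history: List[float], value: float, threshold: float = 3.0) -> bool:
    """Z-score check used here as a stand-in for an anomaly detection model."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > threshold


def handle(event: Event, rules: List[Rule], history: List[float]) -> str:
    """Apply the first matching rule; otherwise escalate if the value looks anomalous."""
    for rule in rules:
        if rule.matches(event):
            return rule.action(event)
    if is_anomalous(history, event.value):
        return f"escalate to administrators: {event.metric}={event.value} on {event.source}"
    return "no action"


# Illustrative rule and data (all values are made up).
rules = [
    Rule(
        name="disk_nearly_full",
        matches=lambda e: e.metric == "disk_used_pct" and e.value > 95,
        action=lambda e: f"drain {e.source}",
    )
]
history = [10.0, 11.0, 9.0, 10.5, 10.2]

known = handle(Event("node01", "disk_used_pct", 97.0), rules, history)
unknown = handle(Event("node02", "io_latency_ms", 250.0), rules, history)
normal = handle(Event("node03", "io_latency_ms", 10.3), rules, history)
print(known, unknown, normal, sep="\n")
```

In this sketch a rule match always takes precedence, so the anomaly check only runs for events that no rule describes, mirroring the escalation path in the description.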

Files

Report_Ishank_Arora.pdf

Files (1.5 MB)

md5:75a88cbdb9af73761b3d5ed0e555009f