EOS Winston: Expert Systems for Automated Diagnosis and Remediation
Creators
Description
This report describes EOS Winston, an event-driven platform for automated alerting and mitigation.
Using expert rules and online anomaly detection algorithms, it catches events that depart from
normal behaviour and attempts to take corrective action automatically. When it encounters an event
that it has not seen before and that is not covered by the predefined rules, it involves the
administrators.
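As a rough illustration of the flow described above, the following minimal sketch shows one way rule-based dispatch, online anomaly detection and escalation could fit together. It is not the EOS Winston implementation: the `Event` type, the `RULES` table, the action names and the running z-score detector are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Event:
    kind: str     # hypothetical event type, e.g. "disk_full"
    value: float  # hypothetical metric reading carried by the event


# Hypothetical expert rules: event kind -> name of a corrective action.
RULES: Dict[str, Callable[[Event], str]] = {
    "disk_full": lambda e: "drain_filesystem",
}


class RunningZScore:
    """Online detector (an assumption, not the report's model):
    flags values far from the running mean, using Welford's update."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations
        self.threshold = threshold

    def is_anomalous(self, x: float) -> bool:
        anomalous = False
        if self.n >= 2:
            std = (self.m2 / (self.n - 1)) ** 0.5
            anomalous = std > 0 and abs(x - self.mean) > self.threshold * std
        # Update the running statistics after scoring the new value.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous


def handle(event: Event, detector: RunningZScore) -> str:
    """Apply a predefined rule if one matches; otherwise escalate
    events that the detector considers abnormal."""
    rule = RULES.get(event.kind)
    if rule is not None:
        return rule(event)
    if detector.is_anomalous(event.value):
        return "escalate_to_admin"
    return "no_action"
```

In this sketch a known event kind maps directly to a corrective action, while an unknown event is passed to the administrators only if its value deviates strongly from what the detector has seen so far.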
The report is structured as follows. The first section describes what EOS is and the scale at
which it operates, emphasising that anomalies and irregularities inevitably creep in when operating
at peta-scale. We also describe the commonly occurring issues that previously required manual
intervention; mitigating them automatically was the main focus of this project. The next section
discusses alternative automation engines and anomaly detection algorithms, describes the pros and
cons of each, and explains the rationale behind our choices.
The third section describes the main platform, EOS Winston. After a brief overview of its
architecture, we describe the StackStorm pack created as part of this project, which contains
various corrective actions and the rules that link them to a variety of triggers. We also describe
how the anomaly detection models were trained and how they are accessed through a REST API, as
well as the ChatOps integration, which allows workflows to be executed from conversational channels.
Finally, we discuss the challenges faced during the project and how they were overcome, and lay
out some future possibilities that could further extend the scope of the platform.
Files
Report_Ishank_Arora.pdf (1.5 MB)
md5:75a88cbdb9af73761b3d5ed0e555009f