Introduction
The CES package contains the new tool CHIP (Central Hint and Information Processor), which replaces the obsolete OnlineRecovery package. CHIP supervises ATLAS data taking,
takes operational decisions and handles abnormal conditions. It automates procedures and performs advanced recoveries. It can also
interact with the Test Management service, which allows it to base its decisions on the outcome of tests. CHIP is written in Java and is built on a third-party open-source complex event processing engine.
More information about CHIP, including a list of recoveries and automatic procedures, can be found on the CHIP twiki page.
Core functionality
CHIP decides what to do when an application in the Run Control tree goes into the error state, fails, crashes, etc.
Normally the action taken follows the configuration settings for the specific application (IF-FAILS, IF-DIES, IF-ERROR),
with the following exceptions:
- An application whose decision is set to RESTART will be restarted up to a maximum of 5 times within a 30-minute interval. When the maximum number of restarts is reached, the IF-FAILS action is taken instead.
- Through the configuration setting HANDLE, an application-specific recovery can be defined.
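The RESTART limit described above can be illustrated with a small sketch. This is not the CHIP implementation: the class name RestartLimiter, its methods and the Action enum are hypothetical, and the sketch assumes that the 30-minute interval is a sliding window counted back from the current failure, which these notes do not specify.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a "max N restarts per time window" rule.
 * Names and the sliding-window interpretation are assumptions,
 * not taken from the CHIP code base.
 */
class RestartLimiter {
    /** Outcome for one failure: restart again, or fall back to IF-FAILS. */
    enum Action { RESTART, IF_FAILS }

    private final int maxRestarts;
    private final long windowMillis;
    // Timestamps of restarts that are still inside the window, oldest first.
    private final Deque<Long> restartTimes = new ArrayDeque<>();

    RestartLimiter(int maxRestarts, long windowMillis) {
        this.maxRestarts = maxRestarts;
        this.windowMillis = windowMillis;
    }

    /** Decide what to do for a failure observed at the given time. */
    Action decide(long nowMillis) {
        // Forget restarts that have dropped out of the window.
        while (!restartTimes.isEmpty()
                && nowMillis - restartTimes.peekFirst() >= windowMillis) {
            restartTimes.pollFirst();
        }
        if (restartTimes.size() < maxRestarts) {
            restartTimes.addLast(nowMillis);
            return Action.RESTART;
        }
        // Limit reached: the IF-FAILS action is taken instead.
        return Action.IF_FAILS;
    }
}
```

With new RestartLimiter(5, 30L * 60 * 1000), the first five failures inside any 30-minute window yield RESTART and a sixth yields IF_FAILS, until earlier restarts age out of the window.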
Known bugs, problems and limitations
CHIP does not yet implement all of the automatic procedures and advanced recovery mechanisms that were available in the OnlineRecovery package and/or are listed on the CHIP twiki page.
The current release contains the following recoveries:
- Stopless Removal/Recovery
- Module Removal/Recovery
- Resynch
- TTC Restart
- Various HLT recoveries triggered either by crashed HLT applications or by ERS messages
The current release contains the following auto procedures:
CHIP functionality will be expanded in future patches (clock switching, etc.).