Introduction
The CES package contains the new tool CHIP (Central Hint and Information Processor), which in Run 2 replaces the obsolete Run 1 OnlineRecovery package. CHIP supervises ATLAS data taking, takes operational decisions and handles abnormal conditions. It automates procedures and performs advanced recoveries. It also interacts with the Test Management service, which allows it to make informed decisions based on test results. CHIP is written in Java and is based on a third-party open-source complex event processing engine.
More information about CHIP, including a list of recoveries and automatic procedures, can be found on the CHIP twiki page.
Core functionality
CHIP decides what to do when an application in the Run Control tree goes into the error state, fails, crashes, etc.
Normally the action taken follows the configuration settings for the specific application (IF-FAILS, IF-DIES, IF-ERROR),
with the following exceptions:
- An application with decision set to RESTART will be restarted up to a maximum of 5 times in a time interval of 30 minutes. When the maximum number of restarts is reached, the IF-FAILS action is taken instead.
- With the configuration setting HANDLE, an application-specific recovery can be defined.
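The RESTART limit described above can be illustrated with a minimal Java sketch. This is not CHIP's actual code: the class `RestartLimiter`, its `Action` enum and the method `onFailure` are hypothetical names chosen for illustration, and the sketch assumes the 30-minute interval is a sliding window, which the text above does not specify.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration of a restart limit; not part of CHIP. */
final class RestartLimiter {
    /** Outcome for one failure: restart again, or fall back to the IF-FAILS action. */
    enum Action { RESTART, IF_FAILS }

    private final int maxRestarts;
    private final Duration window;
    // Timestamps of restarts that still fall inside the window, oldest first.
    private final Deque<Instant> restarts = new ArrayDeque<>();

    RestartLimiter(int maxRestarts, Duration window) {
        this.maxRestarts = maxRestarts;
        this.window = window;
    }

    /** Decides what to do for a failure observed at time {@code now}. */
    Action onFailure(Instant now) {
        // Assumption: sliding window. Forget restarts that are at least `window` old.
        Instant cutoff = now.minus(window);
        while (!restarts.isEmpty() && !restarts.peekFirst().isAfter(cutoff)) {
            restarts.removeFirst();
        }
        if (restarts.size() >= maxRestarts) {
            // Limit reached: the IF-FAILS action is taken instead of a restart.
            return Action.IF_FAILS;
        }
        restarts.addLast(now);
        return Action.RESTART;
    }
}
```

With `new RestartLimiter(5, Duration.ofMinutes(30))`, the first five failures inside half an hour each yield a restart and the sixth yields the IF-FAILS fallback; once older restarts drop out of the window, restarts are allowed again.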
Automatic procedures and recoveries
CHIP implements various automatic procedures and advanced recovery mechanisms, which are detailed on the twiki page linked above.
The current release contains the following recoveries:
- Stopless Removal/Recovery
- Module Removal/Recovery
- Resynch
- TTC Restart
- Various HLT recoveries triggered either by crashed HLT applications or ERS messages
- L1Calo-specific recoveries
- RPC-specific recoveries
The current release contains the following automatic procedures:
- Switching of the ATLAS reference clock between the LHC clock and the internal one
- Warm Start/Stop
New procedures, either recoveries or automatic ones, can be added to CHIP at the request of the sub-detector communities.