Working paper Open Access

Redundancy and Reliability for an HPC Data Centre

Erhan Yılmaz

Ladina Gilly

Defining a level of redundancy is a strategic question when planning a new data centre, as it directly
affects the entire design of the building as well as the construction and operational costs. It also affects
how future extension plans can be integrated into the design. Redundancy is likewise a key strategic issue when
upgrading or retrofitting an existing facility.
Redundancy is a central strategic question for any business that relies on data centres for its operation. In
traditional data-centre-reliant industries such as Internet Service Providers (ISPs), banks, insurance companies, and
credit card services, redundancy is of paramount importance, as a loss of availability has an immediate and
sometimes drastic impact on, for example, revenue or legal due diligence. For this reason, the industry has
established a number of clear standards and guidelines that address the topic of redundancy and reliability.
Both topics are just as important for HPC centres, but not always in the same way: some of the trade-offs
may differ substantially, which makes it difficult for an HPC centre to
rely fully on the existing standards used by the traditional data centre industry.
This white paper discusses the key factors to be taken into account when selecting a level of
redundancy and reliability for an HPC centre, providing managers with a set of topics that need to be
considered when designing a new HPC centre or upgrading an existing one. These factors all have an impact
on the design and cost of construction as well as on the future operational costs of the centre.


