SoftRM: Self-Organized Fault-Tolerant Resource Management for Failure Detection and Recovery in NoC Based Many-Cores

Many-core systems are envisioned to leverage the ever-increasing demand for more powerful computing systems. To provide the necessary computing power, the number of Processing Elements integrated on-chip increases and NoC based infrastructures are adopted to address interconnection scalability. The advent of these new architectures surfaces the need for more sophisticated, distributed resource management paradigms, which, in addition to the extreme integration scaling, make the new systems more prone to errors manifested both in hardware and software. In this work, we highlight the need for Run-Time Resource management to be enhanced with fault tolerance features and propose SoftRM, a resource management framework which can dynamically adapt to permanent failures in a self-organized, workload-aware manner. Self-organization allows the resource management agents to recover from a failure in a coordinated way by electing a new agent to replace the failed one, while workload awareness optimizes this choice according to the status of each core. We evaluate the proposed framework on the Intel Single-chip Cloud Computer (SCC), a NoC based many-core system, and customize it to achieve minimum interference with the resource allocation process. We showcase that its workload-aware features manage to utilize free resources in more than 90% of the conducted experiments. Comparison with relevant state-of-the-art fault tolerant frameworks shows a decrease of up to 67% in the imposed overhead on application execution.

this decision. As a consequence, there is the need of a self-organizing process to take place amongst the cores.
This process should offer a unique outcome and guarantee that this outcome will be received and respected by all cores (as shown in Figure 1, t_2-stable). A simple replacement tactic, i.e. "the first core that detects the failed one will act as its replacement", is bound to fail due to the distributed nature of the system. There is no guarantee that only one core will detect the failure and volunteer to be the replacement (see '!' at t_1 in Figure 1). In this case, more than one core will have the same responsibilities, which will create collisions and lead to a new unstable system state (t_2-unstable).
In summary, the focus of this work is to provide a solution and design alternative to the following question. Given a many-core NoC based system managed in a hierarchical and distributed manner through assignment of distinct roles on different PEs (agents), is there a way to guarantee that a manifested, unrecoverable fault in any agent will be dynamically mitigated in a coordinated way, propagated to all and respected by all healthy PEs on the system?
Considering the above, our novel contributions are: i. We introduce SoftRM, a fault tolerant distributed resource management framework which is able to dynamically react to failures of PEs in a self-organized way. ii. We provide guarantees for our recovery strategy based on a consensus agreement algorithm enhanced with workload awareness to achieve an effective replacement policy. iii. We incorporate a failure detection algorithm which takes advantage of the resource allocation related communication of PEs in SoftRM in order to significantly reduce the inflicted communication overhead of detection. iv. We implement and evaluate SoftRM on a NoC based many-core system in order to capture the correlation of its design parameters to the efficiency of resource allocation and fault recovery actions. Comparison against state-of-the-art fault tolerance techniques highlights the advantages of self-optimization over spare core provisioning.
SoftRM is designed on a purely distributed concept, implementing a hierarchy of different roles of PEs to achieve resource allocation and negotiation between applications. Self-organization is preferred over spare PE provisioning in order to fully utilize system resources and avoid design-time limits on the number of failures that can be tolerated. The hierarchical design of SoftRM allows its high level fault tolerant features to be cooperatively combined with other techniques for increased dependability, such as hardened PEs [19] or customized fault tolerant schemes per application [20]. In addition, we show that SoftRM mitigates all manifested errors by effectively re-assigning the workload of failed cores to idle resources in over 90% of the examined scenarios, while its overhead on application execution is up to 67% less compared to other state-of-the-art fault tolerance techniques based on spare core provisioning.
The rest of the paper is organised as follows. Section 2 provides an in-depth discussion on related research. Section 3 introduces PAXOS, the utilized algorithm for consensus agreement. The design of SoftRM is presented in Section 4 by describing its key points, i.e. the employed system model (Section 4.1), its Resource Allocation scheme (Section 4.2) and its Fault Tolerance Infrastructure (Section 4.3). The experimental evaluation of SoftRM is presented in Section 5, while Section 6 concludes the paper.

RELATED WORK
A considerable amount of work is focused on extending the lifetime reliability (i.e. minimizing core failures) of many-core systems by appropriately mapping tasks using metrics such as Mean Time to Failure (MTTF). A runtime task mapping design, which extends the lifetime of the system by using a wear-based heuristic, is presented in [15]. However, this work does not examine component failures and assumes that such failures are detected by the operating system or another hardware resource. Authors in [11] present a task-mapping technique to maximize the MTTF of an MPSoC considering the ageing of NoC links. However, in case of a permanent core failure the system collapses and has to be restarted in order to work properly.
The works in [26,40], utilize Branch and Bound based algorithms to perform application mapping on reconfigurable many-core NoCs, where the configuration of PEs functions and interconnections is adjusted dynamically. Mapping decisions are centralized and the optimization objectives take into account reliability, energy and performance. The authors assume that spare cores can substitute any faulty PEs and thus focus on tolerating transient errors of the communication links between them. Application mapping is also targeted in [14], where a centralized feedback loop approach is employed to achieve maximum system performance under dark silicon thermal constraints and meet the reliability requirements of the applications under ageing and thermal effects.
Task mapping is combined with Dynamic Voltage and Frequency Scaling in [12] to increase the reliability of the system, while transient faults in hardware are tolerated via selective task replication. A genetic algorithm is used to perform design space exploration and determine the task to PE binding, as well as its VFS values. Triple or Double Modular Redundancy (TMR or DMR) is extended in [29] to provide a prolonged fault free system lifetime. In case of failures, instead of task remapping to sustain the number of redundant tasks, the authors propose to preserve the mapping and adjust the functionality of the included voter task. N Modular Redundancy (NMR) is the main fault tolerance mechanism in [28], targeting commercial many-core SoCs such as [3]. Transient faults in PEs are countered by comparing the outputs of the N-times replicated task and repairing any instance that exhibits output divergence. The required voter task is also doubled and the two voters constantly repair one another in a round robin way. Authors in [6] propose a self-adaptive engine for fault management in multi-core systems. Failure detection is achieved by task replication and comparison of the different outputs. Application activity is coordinated by a centralized core called the fabric controller, which executes the fault management layer and is considered hardened by design.
Authors in [18] and [32] opt for increased system dependability by examining the combination of reliability techniques in different layers of many-core systems. Regarding resource allocation, in [18] the run-time system layer performs reliability-driven core allocation and task mapping. Similarly in [32], the resource manager utilizes redundant multi-threading for task execution resilience accompanied by a variation-aware task mapping. Both approaches depend on centralized resource allocation and no recovery mechanism is presented in case of resource manager failure.
In [9,21] run-time fault-aware resource management is addressed using three types of cores, i.e. regular, spare and manager. Managers overview the system and monitor the status of the other cores, while regular and spare cores execute incoming applications. In case of a permanent or transient failure of a regular core, spare cores are used to replace it and the workload is migrated from the faulty to the spare core. The difference between the two works is that [9] employs static spare core provisioning while [21] employs dynamic provisioning, i.e. a subset of the cores provided to the application are provisioned for fault tolerance. However, a manager failure is not examined at all and in such a case the fault-aware nature of the proposed schemes disappears. Our work bridges this gap by focusing on the failure of similar managerial agents and provides a design alternative to both static and dynamic provisioning of spare cores. In [16] fault mitigation in networks consisting of nodes of re-configurable PEs (ReCoNodes) is proposed by leveraging both HW and SW techniques. An individual OS overviews each ReCoNode and lightweight "shadow" tasks replicate applications to provide error mitigation. The proposed system provides an efficient intra-application fault tolerance infrastructure under the assumption that the administrative part (the OS) is fault free, which is precisely where our focus lies.
A System-level and Hierarchical Fault-Aware (SHiFA) approach is presented in [13] to provide run-time tolerance of manifested faults both in PEs and links of the NoC. To achieve efficient fault tolerant mappings and resilient application execution, SHiFA requires two kinds of distributed agents in its hierarchy, i.e. a system mobile master (MM) and application managers (AM). Despite the fact that this hierarchy is critical for correct system operation, the authors provide no mechanism for recovery actions in case one or more agents fail. Likewise, in [38] resource management is based on a hierarchical scheme, where a Global Manager (GMP) supervises the system and Local Managers (LMP) supervise application execution. The authors employ both hardware and software mechanisms to tolerate faults both in PEs and links of the NoC. Specifically, in the software stack a fault is mitigated in a cooperative manner between an LMP and the GMP. Nevertheless, there is no provision for a dynamic scheme which will ensure system stability in case one of these dedicated PEs fails. A holistic infrastructure for fault tolerant resource management is also presented in [4]. A System Health Monitoring Unit has a global overview of the status of the links and PEs in a many-core NoC based system and collects and classifies information about manifested errors. This information is propagated to the Mapper-Scheduler unit, which produces appropriate run-time mappings of the applications.
In conclusion, reliability aware mapping techniques concentrate their effort on prolonging the lifetime of a system, often neglecting to provide fault tolerance. Even when faults are addressed, existing solutions mainly rely on redundancy techniques. On the other hand, our approach belongs to the category of fault tolerant frameworks which aim at mitigating all manifested errors, while providing the infrastructure for application deployment and execution. In this case, reliable application mapping can still be used as a key component of the resource management infrastructure. We strongly differentiate from the state-of-the-art [9,13,21,38] by providing fault tolerance without spare core provisioning, thus constantly exploiting the full potential of our target many-core system, as well as by providing fault tolerance guarantees for all involved agents of the system, at all levels of the resource management hierarchy.

PAXOS CONSENSUS PROTOCOL
The core component of the self-organized aspect of SoftRM is a consensus protocol. It ensures that in a situation where different values are proposed by different processes, only one of them is chosen. Paxos, proposed by Lamport [24], is an algorithm for consensus achievement in a network of unreliable processors. The requirements for consensus achievement are (i) only a value that has been proposed may be chosen (non-triviality), (ii) only a single value is chosen (consistency) and (iii) a process never learns that a value has been chosen unless it actually has been. Processes can have any of three different roles in Paxos: proposers, who propose values to be chosen, acceptors, who accept or reject the proposed values, and learners, who will eventually learn the chosen value once it has been decided. In a general scenario, a single process may have multiple roles simultaneously. We summarize below the three phases of the complete flow of the protocol.
Prepare Phase: Each proposer picks a unique proposal number n greater than any n_max previously sent by any proposer and sends a prepare request with number n to all acceptors. On the acceptor side, if a prepare request is received with number n greater than that of any prepare request to which he has already responded (n > n_max), then he responds with the highest numbered proposal that he has accepted (if any). Additionally, he makes a promise not to accept any more proposals numbered less than n (n_max ← n). A request is rejected if the acceptor receives a prepare request with n lower than the highest proposal number n_max ever received.
Accept Phase: After a proposer has received responses to his prepare requests from the majority of acceptors, he sends an accept request to those acceptors with a value v. This value is either the value of the highest numbered proposal among the responses received in the Prepare phase, or any value if the responses reported no other proposals. On the acceptor side, if an accept message is received with a proposal number n, he accepts the proposal by replying with an accepted message, unless a higher proposal number n_max has been received in the Prepare phase.
Learn Phase: Similarly to the Accept phase, once a proposer has received accepted messages from the majority of acceptors, he realizes that his proposed value has been accepted and broadcasts a learn(v) message to all learners.

Figure 2 illustrates an example of the Paxos protocol with two proposers P1, P2 (black nodes) and three acceptors A1, A2, A3 (grey nodes). At time t_0 (t_0 < t_1 < . . . < t_5) both P1 and P2 send prepare messages to all acceptors with proposal numbers n_1 and n_2 (grey and black arrows respectively), with n_1 < n_2. At times t_1 and t_4 acceptors A3 and A1 receive the prepare(n_1) message from P1, whereas at t_3 acceptor A2 receives the prepare(n_2) message from P2, and they promise not to accept any other proposals with a lower proposal number. Since this is the first time the acceptors receive a prepare(n) message, they automatically reply with an accept message. Acceptors A3 and A1 receive the prepare(n_2) message from P2 at t_2 and t_5 respectively. They reply with an accept message since n_2 > n_1. However, when A2 receives the prepare(n_1) message from P1, this message is rejected (red arrow), due to the promise to P2 not to accept any messages with proposal number lower than n_2. In the Learn phase, the value v_2 proposed by P2 has been accepted by the majority of acceptors, so P2 broadcasts this value to all learners. In this work, we propose a version of the Paxos algorithm which takes into account the workload status of the different processors when a consensus agreement effort is made.
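The three phases above can be sketched as a minimal, synchronous single-decree Paxos simulation. This is an illustration only, not the SoftRM implementation: message loss, retries and a separate learner role are omitted, and all class and function names are ours.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest prepare number promised so far
        self.accepted_n = -1      # number of the highest accepted proposal
        self.accepted_v = None    # value of the highest accepted proposal

    def on_prepare(self, n):
        # Promise not to accept proposals numbered < n; report any prior accept.
        if n > self.promised_n:
            self.promised_n = n
            return (self.accepted_n, self.accepted_v)
        return None               # reject: a higher-numbered prepare was seen

    def on_accept(self, n, v):
        if n >= self.promised_n:  # accept unless a higher prepare arrived meanwhile
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return True           # "accepted" message
        return False

def propose(acceptors, n, v):
    """Run one proposer round; return the chosen value or None on failure."""
    majority = len(acceptors) // 2 + 1
    # Prepare phase: collect promises from a majority.
    replies = [r for a in acceptors if (r := a.on_prepare(n)) is not None]
    if len(replies) < majority:
        return None
    # Safety rule: if any acceptor already accepted a value, re-propose it.
    prior = max(replies, key=lambda r: r[0])
    value = prior[1] if prior[0] != -1 else v
    # Accept phase: the value is chosen once a majority accepts it.
    accepted = sum(a.on_accept(n, value) for a in acceptors)
    return value if accepted >= majority else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, v="core-5"))   # "core-5" is chosen
print(propose(acceptors, n=2, v="core-7"))   # must re-propose "core-5"
```

The second call demonstrates the safety requirement: although proposer 2 wanted "core-7", it is forced to re-propose the already-accepted "core-5", so only a single value is ever chosen.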

SOFTRM: A FAULT TOLERANT DRTRM

System Model
Target HW model: We target homogeneous many-core systems with N Processing Elements (PEs) deployed on the same SoC. Inter-core communication is achieved in a synchronous message passing way via a Network-on-Chip. Each PE on the system is uniquely identified by an identifier i ∈ I, where I = {0, 1, . . . , N − 1} is a static, finite set of N identifiers known by all cores. PEs may exhibit failures but we assume that there is always a working communication link between any two functional PEs of the system.

Target Application model: We assume a number of K parallel applications which can dynamically adjust to their allocated PEs. Each application is characterized by a unique type T_a, a workload W and a function speedup(C_num) which estimates the gained speedup if the application is executed on C_num PEs, where 1 ≤ C_num ≤ max(C_num). max(C_num) is the highest number of worker cores that an application can run on without speedup saturation. Each application is managed by a designated PE which henceforth will be referred to as its Manager core. Its responsibility is to orchestrate workload allocation to the available C_num worker PEs of the application. In addition, it is involved in resource negotiation rounds with other Managers in order to optimize the total speedup of the running applications on the many-core platform.
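As an illustration of the application model, the following is a hedged, Amdahl-style instance of the speedup(C_num) function. The paper does not specify the actual function used per application; parallel_fraction and max_workers are hypothetical per-application parameters introduced here only to make the saturation behaviour concrete.

```python
def speedup(c_num, parallel_fraction=0.95, max_workers=8):
    """Estimated speedup on c_num worker PEs, saturating at max_workers.

    Amdahl-style sketch: beyond max(C_num) (= max_workers here) adding
    workers yields no further gain, matching the saturation described
    in the application model.
    """
    c = min(c_num, max_workers)          # no gain beyond max(C_num)
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / c)

print(round(speedup(1), 2))   # 1.0: single worker, no speedup
print(round(speedup(4), 2))   # 3.48: sub-linear gain
print(speedup(16) == speedup(8))  # True: saturation at max(C_num)
```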
Error Model: Exhibited errors in many-core systems are broadly categorized as permanent, intermittent and transient. Permanent faults result in sub-component malfunction which is not restored during the run-time of the system. On the contrary, intermittent faults are errors which occur repeatedly, interspersed with periods when the system is fault free, while transient faults are temporary.
Without loss of generality, we demonstrate the properties of our proposed design focusing on permanent faults. We focus on faults manifested on PEs and do not examine faults on the links of the target NoC based system. We model the probability of a permanent fault on a PE of the system using a Weibull distribution [23] with Probability Density Function (PDF) defined as:

f(t) = (β/η) · (t/η)^(β−1) · e^(−(t/η)^β)    (1)

where η and β are the scale and shape parameters respectively. According to Equation (1), the reliability of the system is defined as:

R(t) = e^(−(t/η)^β)    (2)

Additionally, the failure rate of an individual component is:

λ(t) = f(t)/R(t) = (β/η) · (t/η)^(β−1)    (3)

The correlation between the parameters of the PDF, associated with the Mean Time To Failure (MTTF) and the Γ function, is:

MTTF = η · Γ(1 + 1/β)    (4)

so that, assuming an MTTF of some years, the corresponding scale parameter η follows from Equation (4). To further enhance the utilized error model, we refrain from using a constant error rate but adjust the parameters of the error PDF according to the simulated lifetime of the target chip. In more detail, a chip exhibits different error rates in different periods of its lifetime. Without loss of generality, this variability has been modelled as a "bathtub" curve [39], as shown in Figure 3. In its infant period, there is a very high but decreasing failure rate, until a plateau of minimum, constant failure rate is reached in its grace period. After a prolonged period of execution, ageing and wear-out effects accumulate, leading to its breakdown period, where there is an ever increasing probability of error manifestation.
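A numerical sketch of this error model follows. The MTTF and β values below are illustrative, not the ones used in the paper; η is derived from the assumed MTTF via Equation (4).

```python
import math

def weibull_pdf(t, eta, beta):
    """Eq. (1): Weibull probability density of a permanent fault at time t."""
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-(t / eta) ** beta)

def reliability(t, eta, beta):
    """Eq. (2): probability that a PE is still fault free at time t."""
    return math.exp(-(t / eta) ** beta)

def failure_rate(t, eta, beta):
    """Eq. (3): instantaneous failure rate lambda(t) = f(t) / R(t)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def eta_from_mttf(mttf, beta):
    """Eq. (4): MTTF = eta * Gamma(1 + 1/beta)  =>  solve for eta."""
    return mttf / math.gamma(1.0 + 1.0 / beta)

mttf_hours = 5 * 365 * 24   # assume an MTTF of five years (illustrative)
beta = 2.0                  # beta > 1 models the wear-out (breakdown) region
eta = eta_from_mttf(mttf_hours, beta)

t = 10_000.0
print(round(reliability(t, eta, beta), 3))  # survival probability after t hours
```

With β > 1 the failure rate grows over time (breakdown period); β < 1 would model the decreasing rate of the infant period and β = 1 the constant rate of the grace period, which is how the parameters can be adjusted along the "bathtub" curve.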
Last but not least, we must note that the fault tolerance capabilities of the proposed framework are not bound by the employed system error characteristics. As presented in Section 4.3.2, our framework incorporates failure detection mechanisms and thus the only limitation of our design is that the manifested errors must affect the high level functionality of a PE. Therefore, intermittent and transient faults can also be mitigated as long as they are manifested long enough to be identified by our detection process. In the opposite case, they are silently masked and might affect only the outcome of the executed application. To mitigate such silent errors, the designer can take advantage of our proposed hierarchical design in order to employ fault resilient deployment of the application, e.g. redundant execution of its tasks [28,29].

Resource Allocation
SoftRM was designed on concepts introduced by state-of-the-art Distributed Run-Time Resource Management frameworks such as [2,22]. First, a number of PEs need to serve as the backbone of SoftRM. These cores are referred to as Controller cores and serve as the means of keeping track of the status of the system, as well as of providing this information to the rest of the PEs. This role is crucial, since there is no central point of system monitoring. Controllers are dedicated to their task and are not involved in application management or workload execution. We name Cluster the area of the system monitored by each Controller. These areas do not overlap and their number and topology are parametric; they can be defined at every system initialization but cannot change at run-time.
Application management is the task of Manager cores. The relation between a Manager core and an application is one to one, meaning that one Manager can handle only one application and the management of an application cannot be divided among several Managers. This enhances the ability of a Manager core to adopt any workload allocation scheme, without any resource sharing between applications. A PE that has been assigned to an application is identified as a Worker core. We define w_i^j = 1 if core i is a worker of core j and 0 otherwise. The design choice of having both Controllers and Managers is made in order to decouple system monitoring from application management. All cores that have not been appointed any of the above roles are considered Idle cores.
Regarding inter-core communication, all PEs maintain their own neighbourhood set Π_i with the ids of all cores inside their Cluster. In addition, each Manager core j maintains a workers set W_j = { i | w_i^j = 1 } which consists of all its Worker cores. Finally, each Controller with id i maintains a set with all the Managers that possess cores inside its Cluster. This is called the Distributed Directory Set (DDS).

Figure 4 shows an abstract overview of the platform at some point of execution. In this example, there are two Controller cores (0 and 9) and two Manager cores (6 and 10). The Cluster of Controller 0 consists of cores 0-7 whereas the Cluster of Controller 9 consists of cores 8-15. In addition, Workers 5, 7 and 14 belong to Manager 6 whereas Worker 11 belongs to Manager 10. The respective information is also included in the DDS sets of the two Controller cores.
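The bookkeeping sets for the Figure 4 example can be sketched as follows. Plain Python sets stand in for whatever structures the SoftRM implementation actually uses; the core ids reproduce the example above.

```python
# Pi_i: the neighbourhood set of each core (two Clusters of 8 cores each).
neighbourhood = {
    i: set(range(0, 8)) if i < 8 else set(range(8, 16))
    for i in range(16)
}

# W_j: the workers set of each Manager j (Managers 6 and 10 in Figure 4).
workers = {6: {5, 7, 14}, 10: {11}}

# Distributed Directory Set per Controller: Managers owning cores in its Cluster.
dds = {
    0: {6},        # Manager 6 owns workers 5 and 7 inside the Cluster of 0
    9: {6, 10},    # Manager 6 owns worker 14, Manager 10 owns worker 11
}

# e.g. which Managers must be notified when Controller 9 is replaced:
print(sorted(dds[9]))   # [6, 10]
```

Note how a Manager (core 6) can appear in the DDS of more than one Controller, since its workers may span Cluster boundaries.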
The resource allocation objective of SoftRM is to maximize the average speedup of applications and thus minimize the average application execution latency. For a new application, an initial mapping is performed with the help of the Controller cores and then the new Manager core gradually searches for new resources for its application. This search involves looking up free resources or asking for resource migration from other Manager cores with elevated resource utilization. In this way, resource allocation is cooperative, i.e. applications exchange resources in order for the system to reach a state of maximized average application speedup. The system reaches a stable state when all applications have maximized their speedup or when there are no free resources and the occupied resources have been fairly distributed to the running applications.
At any time t the overall state of the framework F is represented as a tuple F_t = ⟨S_t, A, S_{t+1}⟩, where S_t is a vector of size N with the current states of all cores, A is a set containing all the applications a_i currently executed on the system and S_{t+1} holds the next possible states of all cores, which depend on both S_t and A. We define s_i^t = 3 for an Idle core, s_i^t = 2 for a Worker, s_i^t = 1 for a Manager and s_i^t = 0 for a Controller.

Workload-Aware Paxos Algorithm.
When examining the case of a core failure at run-time, the Paxos algorithm can be employed in a straightforward way in order to designate a replacement core. Having identified the failure, the rest of the PEs can volunteer to take its place by proposing themselves as the replacement core (i.e. the proposed value is their id). This process, while effective, inherently lacks the ability to capture the state of the proposer PEs. For example, it is much more efficient to replace the failed PE with an Idle core instead of a Worker core, which is occupied with the execution of application workload.
Inspired by this inefficiency, we introduce the willingness factor wf_j to capture the suitability of a PE with id j to act as a replacement core under its current workload characteristics, as dictated by its role on the system. The willingness factor is defined as:

wf_j = s_j^t + c · Σ_{k∈I} w_j^k · Σ_{p∈I\{j}} w_p^k

where s_j^t is the current state of the core with id j as described in Section 4.2 and Σ_{k∈I} w_j^k · Σ_{p∈I\{j}} w_p^k is the number of co-workers of core j in an application. The constant c is calculated based on the maximum number of workers of an application, so that c · Σ_{k∈I} w_j^k · Σ_{p∈I\{j}} w_p^k < 1. In our case, the value of c was set equal to 0.1. The formulation of wf favours idle cores (higher s_j^t), which express higher willingness to act as replacement cores. Additionally, in the case of a worker, an increased number of co-workers leads to higher willingness, in order to prevent resource starvation in applications with few cores.
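A minimal sketch of the willingness factor, assuming the states of Section 4.2 and c = 0.1 as above. For simplicity the co-worker count is passed directly as a parameter instead of being derived from the w_i^j sets.

```python
C = 0.1   # keeps the co-worker term below 1, so it never outweighs the state

def willingness(state, coworkers=0):
    """wf_j = s_j^t + c * (#co-workers); higher means a better replacement."""
    assert C * coworkers < 1, "c must keep the co-worker term below 1"
    return state + C * coworkers

IDLE, WORKER, MANAGER, CONTROLLER = 3, 2, 1, 0

print(willingness(IDLE))                 # 3.0 -> idle cores are preferred
print(willingness(WORKER, coworkers=6))  # 2.6 -> large apps can spare a worker
print(willingness(WORKER, coworkers=1))  # 2.1 -> small apps are protected
```

Because c bounds the co-worker term below 1, a Worker can never outrank an Idle core, yet among Workers the one whose application has the most co-workers (and can therefore best absorb the loss) wins.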
Once a core with id i detects that a Controller or Manager has failed, it acts as a proposer and sends a prepare(n_i) message to the cores in Π_i, where n_i is unique and greater than any n_max previously sent by any proposer (Π_i is defined in Section 4.2).
Since each core has a unique identifier i, determining n_i can be achieved by picking the smallest sequence number n_i greater than any previously sent (n_max) such that n_i mod N = i. Initially, the value of n_max is −1. Note that n_max and the modulo operation serve two different purposes: n_max guarantees that the accepted proposal numbers increase monotonically, whereas the modulo guarantees that all proposal numbers are unique. Subsequently, when a core with id j receives a prepare(n_i) message (Section 3), if n_i > n_max, where n_max is the highest numbered proposal received, then it replies with its highest accepted proposal (v_acc) or −1 if no value has been accepted. The reply is paired with its willingness factor (wf_j); n_max becomes n_i and a promise is made not to accept any proposals numbered less than n_max:

∀ prepare(n_i) → send: response(v_acc, wf_j)

When the proposer receives a response from a majority of acceptors, it sends an accept(n_i, v_i), where v_i is the value of the highest accepted proposal replied in the Prepare phase. If no such value has been replied (i.e. all replies were negative), then it proposes the core with the highest willingness factor as the replacement core:

v_i = arg max_j wf_j

In case the maximum willingness factor is reported by more than one core, the choice is made in a First Come First Served manner. Afterwards, an accept(n_i, v_i) message is sent to all acceptors.
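The proposal-number scheme can be sketched as follows, assuming the modulus is the system size N (so that numbers from different cores never collide). The function name is ours, not from the SoftRM implementation.

```python
N = 16  # number of cores in the system

def next_proposal(core_id, n_max):
    """Smallest n > n_max such that n % N == core_id.

    Uniqueness: each core only ever emits numbers congruent to its own id
    modulo N. Monotonicity: the result is always strictly above n_max.
    """
    n = (n_max // N) * N + core_id   # candidate in the current "round"
    while n <= n_max:                # advance whole rounds until n > n_max
        n += N
    return n

print(next_proposal(3, -1))   # 3: first proposal of core 3 (n_max starts at -1)
print(next_proposal(3, 3))    # 19: next round for the same core
print(next_proposal(5, 19))   # 21: core 5 jumps above the highest seen number
```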
On the contrary, if a value v_acc is replied, it is a value previously chosen by another proposer based on willingness factors, since no other value had been accepted when that choice was made. This value has already been accepted by the majority of acceptors and is replied so that the safety requirement of Paxos is ensured. As mentioned in Section 3, when an acceptor receives an accept(n_i, v_i) message, it replies with accepted iff it has not seen a higher proposal number before.
∀ accept(n_i, v_i) → send: accepted()

Eventually, when the proposer receives accepted messages from a majority of acceptors, it realises that its proposed value v_i has been accepted and broadcasts the accepted value (i.e. the id of the replacement core) to the platform.
It should be noted that the willingness factor designation builds on top of the PAXOS algorithm without interfering with its functionality. More precisely, the requirements of uniqueness and monotonicity of the proposal numbers are not affected and only the proposed value v is customized according to the suitability of a PE to substitute a faulty one. Thus, our proposed technique does not affect the correctness of the employed consensus algorithm.
It must be highlighted that the described infrastructure is proposed in order to mitigate faults of the administrative tasks of a hierarchical resource management scheme, such as the presented Controllers and Managers. Regarding worker nodes executing the parallel workload, a failed worker is similar to a shrink operation, while the existence of a Manager per application allows the adoption of intra-application error resilient resource management techniques such as Double or Triple Modular Redundancy [28,29], which can be seamlessly integrated in SoftRM.

Failure Detection.
Failure detectors proposed in [8] are responsible for the detection of node failures or crashes in distributed systems. A failure detector D is classified by its completeness, i.e. the suspicion of faulty processes, and its accuracy, i.e. the suspicion of non-faulty processes. A perfect failure detector P satisfies both strong completeness and strong accuracy: every faulty process is eventually permanently suspected by every non-faulty process and no process is suspected by anybody before it has actually crashed. On a synchronous system with a known upper bound Δ on communication delay, the simplest Perfect Failure Detector (PFD) [7] can be implemented as follows (HB stands for Heart Beat): i. Every Δ seconds, each node i sends a HB_REQUEST message to every other node j. ii. Upon receiving a HB_REQUEST, node j replies with a HB_REPLY message. iii. If node j did not reply with a HB_REPLY message within Δ, then detect j as faulty.
Although this approach is tolerable for general purpose systems, it comes with limited scalability and excessive message traffic, since (N − 1)² messages are sent every Δ seconds, where N is the number of PEs. We propose a tweaked version of PFD (tPFD), tailored for on-chip communication, summarized as follows: Set a timer to expire in Δ seconds. If cores i and j from the same Cluster (Π_i = Π_j) have exchanged any message before the timer expires, then both are considered alive. If not, a heart beat request is sent from i to j and vice versa and the timer is reset to Δ seconds. If one of i or j has not replied by the time the timer expires, then it is considered a failed core.
The algorithm takes advantage of the frequent inter-core communication for resource negotiation and application management in SoftRM to establish that a core is alive. A HB_REQUEST is sent from one core to another only if they have not exchanged any messages during a Δ interval. If a HB_REPLY has not been received within Δ seconds, then the recipient core j is declared faulty. Liveness check signals are exchanged only inside a Cluster in order to provide a scalable detector.
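The tPFD logic for one monitored peer can be sketched as an event-driven state machine. This is an illustration under our own naming: the real detector runs over NoC messages, whereas here any exchanged message is reduced to a timestamp update.

```python
import time

DELTA = 1.0  # known upper bound on communication delay (seconds)

class PeerMonitor:
    """tPFD state for one peer inside the same Cluster."""

    def __init__(self):
        self.last_msg = time.monotonic()   # any app/management message counts
        self.hb_sent_at = None             # time an explicit HB_REQUEST went out

    def on_message(self):
        """Any message from the peer proves liveness; cancel any pending HB."""
        self.last_msg = time.monotonic()
        self.hb_sent_at = None

    def on_timer(self, send_hb_request):
        """Called every DELTA seconds; returns True if the peer is declared faulty."""
        now = time.monotonic()
        if now - self.last_msg < DELTA:
            return False                   # recent traffic: alive, no HB needed
        if self.hb_sent_at is None:
            send_hb_request()              # silent for DELTA: ask explicitly
            self.hb_sent_at = now
            return False
        return now - self.hb_sent_at >= DELTA   # no HB_REPLY within DELTA

m = PeerMonitor()
m.last_msg -= 2 * DELTA                    # simulate a peer silent for 2*DELTA
print(m.on_timer(lambda: None))            # False: HB_REQUEST has just been sent
m.hb_sent_at -= DELTA                      # simulate the reply deadline passing
print(m.on_timer(lambda: None))            # True: declared faulty
```

The two timer ticks before a faulty verdict reflect the at-most-2Δ detection latency discussed below, while the early return on recent traffic is what saves the heart beat messages.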
To quantify the overhead inflicted on SoftRM by the presented detectors, we examine the average number of exchanged messages per second for failure detection in scenarios of 16 incoming applications. Scenarios differ in the intensity of the average application workload W (see Sections 4.1, 5.1). Table 1 summarizes the results, which indicate that the proposed tPFD incurs less than half the overhead in exchanged messages compared to the original. Interestingly, as the application workload increases, fewer messages are sent by the tweaked detector, since inter-core communication becomes more frequent as applications stay active for longer. The trade-off of the proposed detector is that it requires at most 2Δ latency to detect a faulty PE. This behaviour is acceptable for the examined applications since they are not bound by time-critical recovery requirements.

Recovery.
The recovery phase involves all the necessary actions for the system to be restored to a stable state. It takes place after all PEs have received the learn signal, which communicates the id of the replacement core as the outcome of the execution of the Paxos algorithm. All sets presented in Section 4.2 are updated to reflect the new state of the system, omitting the failed core j. In a similar manner, a new Controller builds its Cluster and DDS sets using information gathered from the rest of the system and according to the Cluster region topology of the failed Controller, as this administrative region cannot change dynamically.
In case of a Manager core failure, the replacement Manager signals all its Worker cores to inquire about the portion of workload already executed and thus assess the remaining workload of the application. In addition, by inspecting the application's logged checkpoints, all workload portions that were under execution when the failure occurred are marked for re-execution to ensure a correct outcome. Workload is then redistributed to Worker cores and application execution is restarted. In case of a Worker core failure, application execution is paused and workload is redistributed to the remaining healthy workers. Erroneous results are not propagated to healthy workers due to the blocking, MPI-like send/receive style of data exchange between them. If a Worker core has been designated as the replacement core, it completes its workload execution up to the next checkpoint and then proceeds to its new duties. In parallel, its former Manager informs its Controller of the worker loss and redistributes the remaining workload to the rest of its workers.

SoftRM run-time example: Figure 5 illustrates an instance of the operation of SoftRM. Figure 5(a) presents a point in time where Controller 9 has failed. Core 14 identified the failure and triggered the workload-aware Paxos algorithm. The willingness factors of the various PEs are annotated in Figure 5(a) (all idle cores have an equal willingness factor). To designate the id of the replacement Controller core, an instance of workload-aware Paxos is executed inside Cluster 2. An idle core has been chosen as replacement, since it had the highest willingness factor, i.e., the highest suitability to act as a new Controller. Then, the necessary recovery actions are performed (Figure 5(b)) and Manager cores with workers inside Cluster 2 subscribe to the DDS set of the new Controller.
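The bookkeeping a replacement Manager performs after a failure can be sketched as follows. This is an illustrative model under our own assumptions: workload is represented as numbered chunks, `checkpointed` holds the chunk ids safely logged at checkpoints, and `workers_done` holds what each surviving worker reports; only checkpointed work is trusted, so everything else is marked for re-execution.

```python
def recover_manager(checkpointed, workers_done, total_chunks):
    """Sketch of Manager-recovery bookkeeping (illustrative, not SoftRM code).

    checkpointed: set of chunk ids whose results were safely checkpointed.
    workers_done: {worker_id: set of chunk ids the worker reports finished}.
    Returns the chunks to (re-)execute and, among them, the ones that were
    in flight when the failure hit (reported done but never checkpointed).
    """
    trusted = set(checkpointed)
    reported = set().union(*workers_done.values()) if workers_done else set()
    in_flight = reported - trusted            # must be redone for correctness
    to_redo = set(range(total_chunks)) - trusted
    return sorted(to_redo), sorted(in_flight)
```

Under this model, the remaining workload assessment mentioned above is simply the size of the first returned list, which the replacement Manager then redistributes across its workers.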

Experimental Setup
SoftRM was developed and deployed on the Intel Single-chip Cloud Computer (SCC), a NoC-based platform [27]. The SCC chip consists of 24 dual-IA-core tiles connected by a 2D-grid on-die network, where each tile contains two P54C cores. Inter-core communication is achieved via message passing, which accurately fits our targeted HW model (Section 4.1).
An evaluation scenario includes a number of applications requiring execution on the system. The applications involve stencil computations, i.e., matrix multiplication. We consider the execution kernel over a 4096 × 4096 data matrix as the baseline workload, i.e., W = 1. We model workload variability by considering matrices scaled up W times. Each application is characterised by its workload W, and the distribution of workload values for 128 incoming applications is presented in Figure 6 for three different scenarios of workload intensity (Low, Medium, High). The values on the X axis correspond to the range of possible workload values per application. In addition, the outcome of an examined scenario is affected by the arrival rate of applications on the system. In this experimental evaluation, this rate is generated randomly according to a Poisson distribution with λ equal to 48, as in [37]. Each of the examined scenarios explores a different design parameter of SoftRM and is considered complete when all input applications have performed their execution life-cycle.
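A scenario of this shape can be generated as sketched below. A Poisson arrival process is equivalent to exponential inter-arrival times with rate λ; the particular workload values and their weights here are our own placeholders standing in for one of the Low/Medium/High distributions of Figure 6, not values taken from the paper.

```python
import random

def generate_scenario(n_apps=128, lam=48, workload_choices=(1, 2, 4, 8),
                      weights=(0.5, 0.3, 0.15, 0.05), seed=0):
    """Illustrative scenario generator (assumed parameters, not SoftRM's code).

    Arrival times follow a Poisson process with rate lam, realized as
    exponentially distributed inter-arrival gaps. Each application draws a
    workload multiplier W from a weighted discrete distribution.
    """
    rng = random.Random(seed)
    t, scenario = 0.0, []
    for _ in range(n_apps):
        t += rng.expovariate(lam)                      # Poisson arrivals
        w = rng.choices(workload_choices, weights)[0]  # workload multiplier W
        scenario.append((t, w))
    return scenario
```

Each entry is an (arrival time, W) pair; a scenario is complete once every generated application has finished its execution life-cycle.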

Evaluation
Our first evaluation, in Figure 7, shows the correlation between the Δ interval of tPFD (Section 4.3.2) and the time required for SoftRM to recover from a Controller core failure, presented as mean value and standard deviation for a configuration of 4 Clusters. The moment of core failure is designated according to Equation (3) of the presented error model (Section 4.1). In all presented experiments, MTTF is set to 6.56 years, as in [10]. For this experiment, β is equal to 1, and all measurements have been repeated 10 times to filter out the randomness introduced by the parallel execution of framework instances on the target HW. We observe that as Δ increases, both the average value and its deviation rise in absolute numbers. However, in relative terms, the deviation is in all cases at most 15% of the mean value. The bars present the overhead induced on SoftRM for the detection of failed cores. Increased frequency of liveness check signals takes a high toll on exchanged messages. Considering this trade-off, Δ was set to 1000 msec for the rest of the experiments.
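Since Equation (3) is not reproduced in this excerpt, the sampling of failure moments can only be sketched under an assumption: we take the common choice of a Weibull lifetime distribution with shape parameter β (β = 1 reduces to the constant-failure-rate exponential case used in this experiment) and a scale chosen so that the mean time to failure matches the stated MTTF. Treat this as a hypothetical stand-in, not the paper's actual model.

```python
import math
import random

def sample_failure_time(mttf_years=6.56, beta=1.0, rng=random):
    """Draw a core-failure time (in years) from an assumed Weibull model.

    The scale is derived from the MTTF via E[T] = scale * Gamma(1 + 1/beta),
    so the sampled times average to mttf_years regardless of beta.
    """
    scale = mttf_years / math.gamma(1.0 + 1.0 / beta)
    return rng.weibullvariate(scale, beta)
```

Lower β front-loads failures earlier in the lifetime, which is the knob varied in the last experiment below (Figure 12).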
The next experiment aims at analyzing the correlation between the granularity of self-organisation and Resource Allocation (RA) effectiveness. The evaluated design parameter is the Cluster size, since the cores inside a Cluster need to self-organize to elect a replacement core. The bars in Figure 8 correspond to the percentage overhead of the Fault Tolerance (FT) infrastructure on the rate of messages exchanged per second for RA by SoftRM. A smaller Cluster size leads to overhead reduction, since a Paxos instance is executed amongst a reduced number of PEs. However, a smaller size means more Clusters, more Controller cores and thus a reduced number of available PEs for workload execution. This is reflected in the total execution latency of each scenario (triangles), where a significant increase is observed for small Cluster sizes due to the prolonged execution latency of applications caused by the limited resources assigned to them. For a Cluster of size 6, application slowdown is so severe that the overhead of the FT infrastructure skyrockets.

In Table 2, the workload-aware Paxos protocol is evaluated against two other strategies for replacement core designation. More precisely, in Basic Paxos, each proposer volunteers to be the replacement core in Phase 2a (Section 4.3). In the Fixed IDs strategy, the replacement core is the one with id min(Π_j) that resides in the same Cluster as the failed core j. The evaluation quantifies the state of the core before it was chosen as the replacement. The proposed workload-aware Paxos protocol significantly outperforms its rival implementations and manages to designate an idle core as the replacement in over 91% of cases. This greatly reduces the recovery overhead compared to choosing a Worker or Manager core, since it introduces no interference to the executed applications.
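The value a proposer puts forward in Phase 2a of the workload-aware variant can be illustrated as below. The concrete willingness weights are our own assumption for the example; the key property from the text is only the ordering, i.e., idle cores rank above Workers, which rank above Managers, so an idle core is elected whenever one exists.

```python
def choose_replacement(cluster_status):
    """Illustrative willingness-based pick for the replacement core.

    cluster_status: {core_id: role}, role in {"idle", "worker", "manager"}.
    Idle cores get the highest willingness factor, since electing one
    introduces no interference with running applications. Ties are broken
    by the lowest core id so that all proposers agree deterministically.
    """
    willingness = {"idle": 3, "worker": 2, "manager": 1}  # assumed weights
    return max(cluster_status,
               key=lambda c: (willingness[cluster_status[c]], -c))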
Figure 9 illustrates the average distribution of core roles the moment that the various strategies engaged in determining a replacement core. Different examined scenarios involved different numbers of Cluster areas on the system and therefore distributions are grouped according to this variable.
To quantify the efficiency of SoftRM versus state-of-the-art techniques at tolerating multiple errors, we present in Figure 10 an overview of system behaviour in a scenario of 64 incoming applications. Errors have been generated according to the utilized error model, taking into account all phases of the chip lifetime (Section 4.1) and affecting any PE of the system. Comparison is performed against two techniques based on spare core provisioning, either dynamically at runtime [21] or statically at design time [9]. Spare cores are distributed among all clusters (e.g., in a configuration with 4 clusters, 16 spare cores correspond to 4 spare cores per cluster).
The X-axis of Figure 10 represents time, showing the arrival of applications and the time points at which errors occurred (thunder symbols). The different fault tolerance schemes are compared in terms of the number of completed applications within a specific time frame. SoftRM facilitates the completion of a significantly higher number of applications compared to the rest of the techniques, while managing to mitigate all errors. This happens because no core is excluded from workload execution unless it is mandatory for fault recovery, and at the same time the overhead of self-organization is considerably low. The dynamic spare core provisioning scheme [21], as well as the static scheme with 24 spare cores, also manage to mitigate all errors, but they impose a heavy toll on application execution, since a high number of PEs remains inactive until an error occurs. Most importantly, all other static spare core allocation configurations fail at mitigating all errors, a fact that we consider a breaking point in the functionality of the system. The smaller the number of pre-allocated spare cores, the sooner this breaking point is reached.
To further quantify the reduced overhead of SoftRM, Figure 11 summarizes the system throughput, expressed as successfully completed applications per minute, achieved by SoftRM versus the spare core provisioning techniques, also taking into account different Cluster configurations. In all cases, our approach results in higher throughput, since no cores are excluded from application execution. As expected, the highest overhead, reaching 67%, is observed when 24 cores (half the available ones) are provisioned for fault tolerance. We should note that this high number does not offer higher error resilience, as an equal number of errors can be mitigated by SoftRM. Dynamic core provisioning leads to overhead of up to 25%; however, our implementation of this approach was conservative in the sense that only one spare core is dynamically provisioned per application. This implies that only one error can be tolerated per application; in case of more, its execution fails.
In the last conducted experiment, presented in Figure 12, we quantify the behavior of SoftRM under variable error manifestation time series. By adjusting the value of the β parameter of the utilized error model, we are able to customize the error manifestation rate on the system. For fair comparison, we keep the number of errors constant and equal in all scenarios. The influence of errors is quantified as the percentage of execution penalty on the input applications, using the fault-free scenario as a baseline. Results confirm that a high execution penalty is observed when more errors are manifested earlier in the lifetime of a scenario (lower β). This is because a high number of errors corresponds to fewer available PEs for workload execution, and the earlier these PEs are damaged, the higher the number of affected applications, leading to a higher execution latency penalty.

CONCLUSIONS
In this paper we introduced SoftRM, a self-organizing and fault-tolerant Distributed Run-Time Resource Management framework for NoC-based many-core systems. Its main aspects were presented, i.e., fault detection, replacement core determination, and the recovery actions required for the framework to be restored to a stable state. The replacement policy is based on the Paxos consensus algorithm, enhanced with workload-aware features for effective use of free resources on the system. To showcase the effectiveness of SoftRM, it was implemented and evaluated on Intel Single-chip