Resource usage templates and signatures for COTS multicore processors

Upper bounding the execution time of tasks running on multicore processors is a hard challenge. This is especially so with commercial-off-the-shelf (COTS) hardware that conceals its internal operation. The main difficulty stems from contention on access to hardware shared resources (e.g., buses), which causes a task's timing behavior to depend on the load that co-runner tasks place on those resources. This dependence reduces time composability and constrains incremental verification. In this paper we introduce the concepts of resource-usage signatures and templates, which abstract the potential contention caused and incurred by tasks running on a multicore. We propose an approach that employs resource-usage signatures and templates to enable the analysis of individual tasks largely in isolation, with low integration costs, producing per-task execution time estimates that are easily composable throughout the whole system integration process. We evaluate the proposal on a 4-core NGMP-like multicore architecture.


INTRODUCTION
Research on timing analysis for multicore processors is still in its infancy. This is especially so for COTS multicores, whose timing analysis is a complex challenge that must be solved before their adoption in the safety-critical real-time systems industry can become viable. Deriving an Execution Time Bound (ETB) for tasks running on multicores is challenged by contention, also known as inter-task interference, which occurs on access to hardware shared resources. Unless otherwise restrained, contention causes the execution time of any one task, and hence its ETB, to depend on its co-runners. This has a disastrous impact on system design and validation, as it conflicts with the incremental development and verification model that industry pursues to contain qualification costs and development risks. This industrial goal is sought by allowing individual subsystems to be developed in parallel against an agreed master specification, then qualified in isolation and incrementally integrated, with virtually no risk of functional regression at system level. In the time domain, incremental integration and qualification postulate composability in the timing behavior of individual parts, whereby the ETB determined for a task in isolation should not change on composition with other tasks.
Several approaches have been proposed to deal with contention for multicore on-chip resources. At one end of the conceptual spectrum in the state of the art, some authors propose computing ETBs so that they upper bound the effect of any possible inter-task interference a task may suffer on access to hardware shared resources. ETBs computed this way are fully time composable [9] [10]. They therefore enable incremental integration and qualification, but at the cost of pessimism that may cause untenable over-provisioning, as the timing behavior actually occurring in operation may fall far below the level determined by considering the worst-case interference possible in theory [22,17,11]. At the opposite end, other authors [5] propose (currently only for research platforms) to determine ETBs simultaneously for multiple tasks in specific configurations. Those ETBs are not time composable, as they only hold for the tasks being analyzed and for their specific configuration. If any such parameter changes, all ETBs become invalid and the entire analysis has to be repeated.
In this paper, we tackle resource contention in multicores by proposing the new concepts of resource usage signature (RUs or S) and template (RUl or L). RUs and RUl aim at making the ETB derived for an interfered task τ time composable with respect to a particular usage u of the hardware shared resources made by the interfering co-runner tasks. The tasks' ETBs are determined for a particular set of utilizations U such that the ETB derived for any u ∈ U upper bounds τ's execution time under any workload, so long as the co-runners of τ can be proven to make a resource usage smaller than u. We explain later what "smaller" means and how this can be determined. This abstraction allows deriving time-composable ETBs for individual tasks in isolation for each u ∈ U, so that the system integrator can safely pull those (interfering) tasks together as long as the resource usage made by their individual set of co-runners is upper-bounded by some u. All that the system integrator has to do in that regard is characterize the tasks' accesses to hardware shared resources (a low-cost abstraction of the task execution time), ignoring any finer-grained detail of that access behavior. In this paper we present an approach to produce ETBs in that manner, using measurement-based timing analysis techniques.
RUs and RUl are, on purpose, made agnostic to the particular timing distribution of the resource access requests to be considered. Hence, two tasks generating the same number of accesses to a resource, though with different patterns, have the same signature. The challenge in the proposed method is in determining an effect on the interfered task that upper bounds the interference caused by contending accesses, regardless of the time distribution of those accesses as made by the interfered and the interfering tasks. In this paper we make the following main contributions:
1) We develop the novel concepts of RUs and RUl for the timing analysis of COTS multicores and sketch an algebra of operators over RUs/RUl to enable their practical use.
2) We provide exemplary RUs and RUl for the cases when requests accessing shared resources incur either fixed or variable response latency.
3) We present an implementation of RUs and RUl for a 4-core NGMP-like [1] architecture, focusing on the bus and the memory controller as exemplars of on-chip shared resources. In our experiments we assume that the L2 cache is partitioned, as is the case in the NGMP.
Our results show that when RUs and RUl are tailored to upper bound the access load caused by a task's co-runners, the ETB of that task is 1.36 times bigger than its execution time in isolation. If templates upper bound the highest number of accesses that any workload could produce, the (fully time composable) ETB would instead be 2.57 times bigger. RUs and RUl thus provide an effective way of abstracting resource usage in the quest for tight and trustworthy ETBs.

FORMALIZATION OF RUS AND RUL
RUs and RUl allow analyzing, for the most part in isolation, the timing behavior of tasks, by abstracting the perturbation that they may incur from the contention for hardware shared resources occurring on a multicore caused by co-runner tasks.

Resource Usage signature (RUs)
A RUs abstracts the use of resources of a given interfered task, τA. Once computed, it will be used for τA's multicore timing analysis instead of τA itself.
We describe the use of a hardware shared resource through a set of features, each of which is a quantitative value. A RUs for task τA is a vector SA = (a1, a2, ..., an) that aggregates the relevant features characterizing the use of all the hardware shared resources, for the evaluation of contention effects. Since RUs are quantitative, the RUs of distinct tasks are comparable and can also be combined to form a joint RUs.
Consider the reference multicore architecture shown in Figure 1(a), where the bus and the memory are shared. Further consider two types of accesses to those shared resources, for read and write operations respectively. In this case, RUs have at most 4 features: bus reads ($n^{bus}_{rd}$) and writes ($n^{bus}_{wr}$); memory reads ($n^{mem}_{rd}$) and writes ($n^{mem}_{wr}$). RUs are thus defined as SA = ($n^{bus}_{rd}$, $n^{bus}_{wr}$, $n^{mem}_{rd}$, $n^{mem}_{wr}$) = (a1, a2, a3, a4). If the bus were the only shared resource, the RUs of a task τA would have two features: $n^{bus}_{rd}$ and $n^{bus}_{wr}$. If both types of requests hold the bus for the same duration, the RUs would consist of a single feature corresponding to the sum of $n^{bus}_{rd}$ and $n^{bus}_{wr}$, i.e., SA = ($n^{bus}_{rd}$ + $n^{bus}_{wr}$) = (a1 + a2). The addition of SB to SA is given by SA + SB = (a1 + a2 + b1 + b2). For comparison, instead, we say that SA dominates SB, written SA ⪰ SB, if the interference by the former is greater than or equal to that by the latter: a1 + a2 ≥ b1 + b2.
This reasoning easily extends to the more realistic scenario in which the bus holding times are asymmetric; for example, with reads holding the bus longer than writes. In that case, the RUs for τA could be either single-feature, considering all accesses as "long" accesses (counting writes as reads in the example), or multi-feature (two, in the example), i.e., SA = (a1, a2) = ($n^{bus}_{rd}$, $n^{bus}_{wr}$). In the latter formulation, addition and comparison change as follows: addition is defined as vector addition, i.e., SA + SB = (a1 + b1, a2 + b2); for comparison, SA dominates SB, SA ⪰ SB, if (a1 ≥ b1) ∧ (a2 ≥ b2).
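The two-feature algebra described above can be illustrated with a minimal Python sketch. The class name `Signature`, its field names, and the access counts are assumptions made for illustration only; they do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Two-feature bus signature (hypothetical names): read and write counts."""
    reads: int
    writes: int

    def __add__(self, other):
        # Multi-feature case: addition is plain vector addition.
        return Signature(self.reads + other.reads, self.writes + other.writes)

    def dominates(self, other):
        # Multi-feature case: every feature must be at least as large.
        return self.reads >= other.reads and self.writes >= other.writes

    def collapsed(self):
        # Single-feature case: reads and writes hold the bus equally long.
        return self.reads + self.writes


def dominates_symmetric(sa, sb):
    """Dominance when all accesses are equivalent: compare total counts."""
    return sa.collapsed() >= sb.collapsed()


# Illustrative (made-up) access counts.
s_a = Signature(reads=100, writes=40)
s_b = Signature(reads=60, writes=70)
joint = s_a + s_b
```

With these counts, SA dominates SB under the single-feature comparison (140 ≥ 130) but not under the component-wise one, since SB has more writes. This shows that the multi-feature order is only partial.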

Resource Usage template (RUl)
RUl have the same form as RUs, namely a vector of features LK = (k1, k2, ..., kn), but a different use. RUs abstract tasks according to their use of the shared resources, whereas a RUl abstracts the use of the shared resources so that LK can be used as an upper bound to the interference effects caused by any task τi whose RUs Si is such that LK ⪰ Si (i.e. Si is dominated by LK).
Tasks are made time composable against some RUl LK so that the ETB derived for a given task τA and for that RUl, denoted $ETB^K_A$, upper bounds τA's execution time inclusive of the interference that the contenders of τA, whose RUs do not exceed LK, may cause.
Returning to the example in which the bus is the sole shared resource with all accesses to it incurring the same contention effect: for a LK that captures a given number of accesses to the shared bus, we want to determine the highest impact by LK on $ETB_A$, so that $ETB^K_A$ can be regarded as a time-composable bound for τA in any workload in which $L_K \succeq \sum_i S_i$ for all co-runner tasks τi of interest.
A maximally time-composable template LTC exists, which is an upper bound for any workload. LTC corresponds to the case in which all accesses from the signature suffer the highest contention from the Nc − 1 contending cores. In that case, every access from SA contends with Nc − 1 other accesses, i.e., LTC = (Nc − 1) × SA. Any LK ⪰ LTC would produce exactly the same result as LTC, since τA cannot be interfered with beyond the accesses in its signature SA.
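As a rough illustration of the maximally time-composable template, the sketch below scales a signature by Nc − 1 and clamps any larger template to that cap. The function names are hypothetical, and the component-wise clamping for multi-feature vectors is an assumption; the text only states the single-resource case.

```python
def fully_composable_template(signature, n_cores):
    """L_TC = (Nc - 1) x S_A, applied to each feature of the signature."""
    return tuple((n_cores - 1) * a for a in signature)


def effective_template(template, signature, n_cores):
    """Clamp a template to L_TC: accesses beyond the cap add no interference.

    Component-wise clamping is an assumption made for this sketch.
    """
    l_tc = fully_composable_template(signature, n_cores)
    return tuple(min(k, cap) for k, cap in zip(template, l_tc))


# Illustrative values: a task with 30 bus accesses on a 4-core processor.
l_tc = fully_composable_template((30,), n_cores=4)
```

For a signature of 30 accesses on 4 cores, the cap is 90 contending accesses, so a template with 200 accesses behaves exactly like one with 90.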

RUs and RUl through an example
In this section we return to the case in which the bus is the sole shared resource and all accesses to it incur the same contention effect. For now we limit our attention to two cores. The task under analysis, τA, runs in one of the two cores. The contending requests from the two cores are arbitrated with the round-robin policy. Figure 1(b) depicts the process we follow when the proposed approach is applied to this case. First, we obtain the RUs of τA, denoted SA. In the example architecture, the RUs of tasks using the shared resource is the number of accesses they make, a for τA, hence SA = (a). Our approach treats contention such that the ETB of τA can be derived by upper bounding τA's execution time considering the interfering effect that it incurs when its co-runner task, whatever it is, makes up to k contending accesses to the shared resource. To this end we define a RUl LK , which is the system integration parameter that defines the inter-task interference to be considered in the determination of τA's ETB. The abstraction captured by LK with LK = (k) is a RUl.
Once SA and LK are defined, we determine $\Delta^K_A$, the increment to be applied to the execution time of τA to capture the contention effect from LK. This corresponds to step 3 in Figure 1(b). More precisely, $\Delta^K_A$ upper bounds the increment that the execution time of a task τA with at most a accesses to a shared resource may suffer from k contending requests. $ETB^K_A$ (i.e., τA's ETB determined under the RUl LK) is computed as the sum of $ET^{isol}_A$, the execution time of τA when running in isolation, without contention, and $\Delta^K_A$, the increment that upper bounds the contention effects from any k interfering accesses. This corresponds to step 4 in Figure 1(b). Overall, $ETB^K_A$ is time composable against any co-runner task τB with signature SB = (b), as long as the RUs of the co-runner does not exceed LK, which means that τB makes b ≤ k contending accesses. We denote this as $tc(ETB^K_A, \tau_B)$, which holds if b ≤ k.
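The two-core, single-resource case can be summarized in a few lines of Python. This is only a sketch: the function names and the cycle counts are invented for illustration, and the delta is taken as given rather than measured.

```python
def etb_under_template(et_isol, delta_k):
    """ETB of a task under template L_K: isolation time plus contention delta."""
    return et_isol + delta_k


def is_time_composable(k, b):
    """tc(ETB, tau_B) holds when the co-runner makes at most k accesses."""
    return b <= k


# Made-up numbers: 100,000 cycles in isolation, 12,000 cycles of contention.
etb_a = etb_under_template(et_isol=100_000, delta_k=12_000)
```

Under these assumptions the bound stays valid for any co-runner making up to k accesses, and must be re-derived (or a larger template chosen) as soon as a co-runner exceeds k.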
RUs abstract the distribution of requests over time. Taking into account the exact distribution of requests over time, for instance in the form of request arrival curves [20], would potentially enable deriving tighter ETBs. However, deriving such distributions is complex, as programs normally have multiple paths of execution, each with its own access pattern (distribution). And, paradoxically, considering these particular distributions would decrease timing composability. Instead, our approach only requires the tasks' access count for every individual shared resource, as well as $ET^{isol}_i$ (execution time in isolation) for each individual task τi. Notably, both can already be obtained with high accuracy using state-of-the-art technology, e.g., [23]. With our approach, not needing to know the exact points in time at which requests are made to shared resources releases the system integrator from the obligation of adopting rigid and inflexible scheduling decisions (which fare poorly with the development unknowns of novel systems) and from the labour-intensive cost of exact analysis.
Our approach requires the user to set the RUl to capture the potential co-runner tasks precisely. The spectrum of this capture has two ends. On one extreme we find the time-composable templates, LT C , which represent an upper bound for RUl. However, if RUl is close to that template, the ETB of tasks might be unnecessarily increased. On the opposite extreme, if RUl is too small, it constrains the choice of tasks that may be allowed to run in parallel. A simple solution consists in deriving for each task an ETB under different RUl, such that at integration time, the smallest RUl that upper bounds the signature of the actual co-runner tasks is used. With this, the residual part of the timing verification at system integration is small and simple. Selecting the proper number of RUl represents a trade-off between effort and accuracy: the higher the number of RUl the lower the over-estimation of ETB and the greater the analysis time, and vice-versa. Finding appropriate RUl is a standard optimization problem that is part of our future work.
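The integration-time step described above (pick the smallest pre-analysed RUl that upper bounds the actual co-runners) could look roughly as follows for a single-feature template. The table of templates and ETBs is entirely made up, and `select_etb` is a hypothetical helper name.

```python
def select_etb(etb_by_template, corunner_accesses):
    """Return (k, ETB) for the smallest template k covering all co-runners.

    etb_by_template maps a template size k (accesses) to the ETB derived
    for the task under that template; corunner_accesses lists the
    single-feature signatures of the actual co-runners.
    """
    total = sum(corunner_accesses)  # joint signature of the co-runners
    candidates = [k for k in etb_by_template if k >= total]
    if not candidates:
        raise ValueError("no pre-analysed template dominates the co-runners")
    k = min(candidates)
    return k, etb_by_template[k]


# Hypothetical ETBs (in cycles) derived offline for three templates.
table = {1_000: 1_200_000, 5_000: 1_500_000, 20_000: 2_100_000}
chosen = select_etb(table, corunner_accesses=[1_200, 800, 2_500])
```

Here the co-runners total 4,500 accesses, so the 5,000-access template is selected. Adding more templates to the table narrows the gap between the joint signature and the chosen template, at the cost of more analysis runs.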
In the example considered in this section we have made several simplifications to facilitate understanding: two cores, one single type of access, synchronous accesses (i.e. the core stalls when the access occurs until served) and a single shared resource. In real processors we have different types of accesses to the shared resource (synchronous and asynchronous), each with a distinct access latency. Hence, simply bounding the effect of contention by adding access counts is not enough.

RUS & RUL FOR MEASUREMENT-BASED TIMING ANALYSIS
Next we present one concrete realization of RUs and RUl for use with measurement-based timing analysis (MBTA), specifically for a NGMP-like processor architecture [1].

Methodology
Our approach uses micro-kernels [22,17,11], a set of single-phase user-level programs with a single execution behavior, designed so that all their operations access a given shared resource, e.g. the bus. Micro-kernels consist of a main loop whose body includes a substantial number (e.g. 256) of instructions designed to generate a steady stress load on target resources. Because the loop body repeatedly executes the same instruction, the target resource is continuously accessed. Moreover, placing a high number of identical instructions in the loop body drastically reduces the impact of control instructions (down to 2-4%) [11]. For the architecture in Figure 1(a), a loop body including load instructions that hit in the L2 cache stresses the bus. We consider two types of micro-kernels:
Resource stressing kernels, RStK, place a configurable load on a given shared resource, so that running a task against a RStK may represent contention scenarios of interest.
In theory, one could design a worst-contender kernel that generates the maximum contention that a task τi can suffer. However, such a kernel would be specific to the task to be interfered and to the target processor [22]. Consider, for example, a single shared resource arbitrated by a least-recently-used policy, where the task that accessed the resource last gets the least priority. In that case, the worst-contender kernel should generate a request in exactly the same cycle as the task of interest, so that every request from that task gets delayed by the contender, and for the next round of arbitration the task has the lowest priority again. The level of control required on the application behavior and the granularity of intervention are too fine-grained and laborious to be used in practice [22].
Resource sensitive kernels, RSeK, are designed to upper bound the execution time increase suffered by any other task with a smaller or equal signature, owing to the interference from a given template LK. Consider a scenario in which bus accesses hold the bus for a constant duration. Further assume that we want to determine $\Delta^K_A$ for τA, i.e. its ETB increment due to a template LK with k accesses. Intuitively, one could get an estimate of it by running τA several times against a RStK that makes k accesses. However, in order to gain confidence in the ETB obtained, the experiment should be repeated with different alignments of the RStK, so that the interleaving of accesses varies enough and the worst case can be observed in a measurement. In practice, this may require excessive experimentation effort. The need for repeating the experiments with different alignments stems from the uncertainty on the time distribution of accesses, which is hard, if at all possible, to measure and control with timing analysis technology. We can therefore conclude that studying the task under analysis against micro-kernels is not viable.
Instead, we use micro-kernels to model both the interfered and the (set of) interfering tasks. RStK and RSeK are designed to account for bad alignments of requests: RSeK is made of instructions that cause accesses to the shared resource and that continuously contend with RStK requests.
We define $\Delta^{RStK}_{RSeK} = ET^{RStK}_{RSeK} - ET^{isol}_{RSeK}$, where $ET^{RStK}_{RSeK}$ is the execution time when a given RSeK with the same signature as task τA runs against a RStK implementing a template LK with k accesses, and $ET^{isol}_{RSeK}$ is the execution time when the RSeK runs in isolation. For task τA, let $\Delta^K_A = ET^K_A - ET^{isol}_A$ be the execution time increase τA suffers when it runs against LK. RSeK and RStK are designed so that $\Delta^{RStK}_{RSeK} \ge \Delta^K_A$ holds for any request alignment of τA under LK contention. To that end, we run the RSeK in isolation and then against Nc − 1 copies of RStK so that all the RSeK's accesses to the shared resource suffer high contention, causing a measurable $\Delta^{RStK}_{RSeK}$ to emerge. In the next section we show how to derive the number of accesses of the RSeK and the RStK, based on the number of accesses of the template and signature under consideration. $\Delta^{RStK}_{RSeK}$ is used to compute the ETB estimate for τA as follows: $ETB^K_A = ET^{isol}_A + \Delta^{RStK}_{RSeK}$. $ETB^K_A$ is composable with any set of interfering tasks against which τA runs in parallel, if their total number of accesses is lower than or equal to k. That is, the sum of the signatures of the interfering tasks is dominated by LK: $(S_i + S_j + ... + S_l) \preceq L_K$. Interestingly, given a task τB whose signature is dominated by that of τA, i.e. SB ⪯ SA, the $\Delta^{RStK}_{RSeK}$ obtained for τA can be used to upper bound τB's execution time: $ETB^K_B = ET^{isol}_B + \Delta^{RStK}_{RSeK}$. Overall, RUs and RUl provide powerful abstractions for the interfered and the interfering tasks, which simplify the integration of multiple tasks by combining their signatures.
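A minimal sketch of how the kernel-based delta might be turned into ETBs, including its reuse for a task with a dominated signature. All names and cycle counts are hypothetical, signatures are reduced to a single access count, and the measurements are assumed to be given.

```python
def kernel_delta(et_rsek_contended, et_rsek_isolated):
    """Delta = ET of the RSeK against the RStK copies minus its isolation ET."""
    if et_rsek_contended < et_rsek_isolated:
        raise ValueError("contended run cannot be faster than the isolated run")
    return et_rsek_contended - et_rsek_isolated


def etb_for_task(et_isol_task, task_accesses, rsek_accesses, delta):
    """ETB = ET_isol + delta, valid only if the RSeK dominates the task."""
    if task_accesses > rsek_accesses:
        raise ValueError("task signature is not dominated by the RSeK")
    return et_isol_task + delta


# Made-up measurements (cycles) for a RSeK with 1,000 accesses.
delta = kernel_delta(et_rsek_contended=260_000, et_rsek_isolated=200_000)
etb_a = etb_for_task(150_000, task_accesses=1_000, rsek_accesses=1_000, delta=delta)
etb_b = etb_for_task(90_000, task_accesses=700, rsek_accesses=1_000, delta=delta)
```

The same delta serves both tasks because the second one makes fewer accesses than the RSeK models; reusing it trades some tightness for one fewer experiment.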

The case of a NGMP-like architecture
Our reference multicore architecture [1] comprises Nc = 4 symmetric cores, see Figure 1(a), each equipped with a private instruction cache (IC) and data cache (DC). The cores have an in-order time-anomaly-free design [16]. Load operations are blocking, whereby the pipeline is stalled until the load is resolved. Each core has one 2-entry write-buffer that holds store requests until they are resolved, without stalling the processor. The processor is stalled solely to preserve memory consistency, when a store finds the write-buffer full or a load operation finds the write-buffer non-empty.
Bus. Our example processor implements round-robin bus arbitration so that if, in a given round, core $c_i$, i ∈ {1, ..., Nc}, is granted access to the bus, the priority ordering in the next round is: $c_{i+1}, c_{i+2}, ..., c_{N_c}, c_1, c_2, ..., c_i$. A lower priority core can use the bus only when no higher priority core uses it. The bus access jitter that a task incurs on access to the bus depends not only on the number of co-runners but also on the way their requests interleave. The worst contention situation happens when a task τB assigned to core $c_i$ requests the bus in a given round of arbitration simultaneously with tasks in all other cores, and the previous round was granted to $c_i$.
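The round-robin priority rotation and its worst case can be sketched as below. This is a simplified model written for illustration (cores numbered 1..Nc, one grant per round, hypothetical function names), not a description of the actual arbiter hardware.

```python
def next_priority_order(last_granted, n_cores):
    """Priority order for the next round: c_{i+1}, ..., c_Nc, c_1, ..., c_i."""
    return [((last_granted - 1 + off) % n_cores) + 1
            for off in range(1, n_cores + 1)]


def grants_ahead(core, last_granted, requesting, n_cores):
    """Number of requesting cores served before `core` in this rotation."""
    order = next_priority_order(last_granted, n_cores)
    ahead = order[:order.index(core)]
    return sum(1 for c in ahead if c in requesting)


# Worst case on 4 cores: core 1 was granted last and everyone requests now.
worst = grants_ahead(core=1, last_granted=1, requesting={1, 2, 3, 4}, n_cores=4)
```

In this model the core that was granted last waits behind Nc − 1 = 3 other requests, which is the worst contention situation described above; with fewer simultaneous requesters the wait shrinks accordingly.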
L2 cache. The L2 cache processes up to one miss per core at a time and allows hit-under-miss and miss-under-miss, so that while a miss from one core is being processed, hit/miss requests from other cores can be served. The 4-way L2 is partitioned so that every core is allowed to use 1 way (the ARM A9 and the NGMP do implement this feature).
Memory controller. The L2 sends a request to the memory controller on every L2 miss. Requests are stored in a FIFO request queue, with one entry per core. The memory controller assumes a single DRAM device with close-page policy.

Bus
The bus handles three distinct request types, which differ in the contention they induce and suffer. Stores (st), whether they hit or miss in the L2, are served immediately by the L2 and hold the bus for 2 cycles. L2 load hits (l2h) hold the bus for 7 cycles because they are not split by the bus and insert wait states on the bus for the hit latency of the L2 (5 cycles). L2 load misses (l2m) are split by the L2 and perform a new arbitration whenever the L2 responds to the miss, holding the bus for 2 cycles in each arbitration. Figure 2 shows the contention that a source (interfered) request suffers from another (interfering) request, for all request types. l2h generate the highest contention and l2m are the most affected, since they go through two rounds of arbitration: l2m can therefore be interfered with twice by two concurrent contending requests, once per round of arbitration.
Our approach based on RUs and RUl does not require knowing the exact time at which requests are issued, but only whether request types are asymmetric in the impact they suffer from, and cause to, other request types, so that RStK and RSeK can be designed with the appropriate request types. The RStK and RSeK for the bus are called BStK and BSeK.
BSeK (abstracting interfered task bus usage). The signature of a task τA running in this architecture may take different forms, with different levels of tightness and experimentation effort. The canonical signature for the bus contains the number of accesses of each type made by the task. That is: $S^{bus}_A = (a_{st}, a_{l2h}, a_{l2m})$. This can be simplified by realizing that l2h and st access the bus once, whereas l2m do it twice with exactly the same timing as l2h and st. Moreover, the delay suffered by an access does not vary whether the access was generated by a l2h, st or l2m. Hence, signatures have the form: $S^{bus}_A = (a_{st} + a_{l2h} + 2 \times a_{l2m})$. BSeK can be implemented with either l2h or st. l2m are not appropriate, as it is not possible to place high pressure on the bus with l2m since they miss in cache and take long to be served from memory, leaving the bus idle in the meantime. l2h and st instead can place very high pressure on the bus. Our approach considers BSeK to only have st operations.
BStK (abstracting interfering task(s) bus usage). Templates can be mono- (L1D) or bi-dimensional (L2D).
L2D. st and l2h generate different impact on the bus (recall that l2m are equated to 2 st). In particular, l2h produces the highest impact and st the lowest. This allows generating bi-dimensional templates: $L_{2D} = (k_{l2h}, k_{2 \times l2m + st})$, whereby BStKs comprise load L2 hit accesses and store accesses to generate each respective type of interference.
L1D templates comprise only l2h, which generate the highest interference. A given $L_{1D} = (k_{l2h})$ with $k_{l2h}$ accesses upper bounds the impact that one or several tasks, whose bus access count is less than or equal to $k_{l2h}$, can generate on any other interfered task. L1D are easier to generate and simplify experimentation, but they increase the pessimism of ETBs, since st are considered to generate the same impact as l2h.
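To make the bus abstractions concrete, the sketch below derives a collapsed bus signature and both template flavours from per-type access counts. The function names, the `(st, l2h, l2m)` tuple layout and the counts are assumptions for illustration; building a template by summing the co-runners' accesses follows the aggregation described in the text.

```python
def bus_signature(n_st, n_l2h, n_l2m):
    """Collapsed bus signature: an l2m arbitrates twice, st and l2h once."""
    return n_st + n_l2h + 2 * n_l2m


def template_1d(corunners):
    """L1D: every co-runner bus access is pessimistically treated as an l2h."""
    return sum(bus_signature(st, l2h, l2m) for st, l2h, l2m in corunners)


def template_2d(corunners):
    """L2D = (k_l2h, k_st), with each l2m counted as two st-like accesses."""
    k_l2h = sum(l2h for _, l2h, _ in corunners)
    k_st = sum(st + 2 * l2m for st, _, l2m in corunners)
    return (k_l2h, k_st)


# Hypothetical co-runners, each given as (st, l2h, l2m) access counts.
corunners = [(100, 50, 10), (30, 20, 5)]
l1d = template_1d(corunners)
l2d = template_2d(corunners)
```

Both templates cover the same 230 accesses, but the 2D form records that only 70 of them are l2h, which is what lets it yield tighter bounds than the 1D form.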
Putting it all together. Deriving the access count for BSeK and BStK varies for L1D or L2D as we show next.
SA −L1D. Let a and k be the number of accesses in the signature SA and the template LK, respectively. Running BSeK and BStK concurrently, we derive an upper bound to the increase in execution time (the delta) that the k accesses of the template can have on the a accesses of the signature. If k ≥ (Nc − 1) × a, then each request of SA suffers the impact of Nc − 1 contending requests. Otherwise, only ⌈k/(Nc − 1)⌉ requests from SA suffer impact.
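The case split above determines how many accesses the BSeK needs. A small sketch follows, assuming the quotient is rounded up (the rounding brackets are not legible in the source; rounding up matches the 564,227/3 = 188,076 figure reported in the evaluation). The function name and the value used for a are hypothetical.

```python
def bsek_accesses_1d(a, k, n_cores):
    """Number of BSeK accesses for signature size a and 1D template size k.

    Each interfered access is paired with Nc - 1 template accesses, so at
    most ceil(k / (Nc - 1)) accesses of the signature can be interfered.
    """
    per_access = n_cores - 1
    interfered = -(-k // per_access)  # integer ceiling division
    return min(a, interfered)


# Template of 564,227 accesses on 4 cores; a is a made-up, large signature.
n = bsek_accesses_1d(a=1_000_000, k=564_227, n_cores=4)
```

When the template is large enough (k ≥ (Nc − 1) × a), the function simply returns a, meaning every access of the signature is assumed to suffer full contention.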
The number of accesses generated by the BSeK is given by $N = \min(a, \lceil k/(N_c - 1) \rceil)$. By running this BSeK against Nc − 1 BStK copies, each having a number of accesses largely above N, we derive an upper bound to the impact that LK has on SA. The impact that a task can suffer due to a template LK with $k_{l2h}$ accesses is upper bounded as: $\Delta^K_A \le \Delta^{BStK}_{BSeK}$. The ETB derived for a given task τA and template LK is: $ETB^K_A = ET^{isol}_A + \Delta^{BStK}_{BSeK}$.
SA −L2D. In this case we account for the fact that requests sent by the interfered task, τA, suffer different interference from the l2h and l2m/st sent by the interfering tasks, abstracted in L2D. In this approach we pair up every request in τA with the Nc − 1 requests in L2D causing the highest interference (l2h) on the former. If the number of those requests in L2D is exhausted, we pair up τA requests with those in L2D causing the second worst interference (st).
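The 2D pairing rule can be sketched as follows. The function name is hypothetical, the input values are illustrative, and two details are assumptions of this sketch rather than statements from the text: quotients are rounded up, and the second kernel is capped by the number of st-like accesses available in the template.

```python
def pair_2d(a, k_l2h, k_st, n_cores):
    """Split the a accesses of a signature between two BSeK kernels.

    Returns (n1, n2, unpaired_st): n1 accesses are paired with l2h
    contenders, n2 with st contenders, and unpaired_st template accesses
    are left over because they cannot add further impact.
    """
    per_access = n_cores - 1
    n1 = min(a, -(-k_l2h // per_access))       # paired with l2h first
    n2 = min(a - n1, -(-k_st // per_access))   # remainder paired with st
    st_used = min(k_st, n2 * per_access)
    return n1, n2, k_st - st_used


# Illustrative values: a = 30, k_l2h = 60, k_st = 80 on a 4-core processor.
n1, n2, unpaired = pair_2d(a=30, k_l2h=60, k_st=80, n_cores=4)
```

With these inputs, 20 accesses are paired with l2h contenders, the remaining 10 with 30 of the st contenders, and 50 st accesses of the template stay unpaired.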
We generate two BSeK and BStK pairs to capture the impact that accesses in SA suffer from l2h and l2m/st in L2D: BSeK1/BStKl and BSeK2/BStKs capture the interference on τA's accesses caused by the l2h and l2m/st in L2D, respectively. BSeK1 and BSeK2 have different numbers of st operations, N1 and N2. BStKl comprises l2h operations whereas BStKs comprises st operations. Let us assume, for example, a = 30, $k_{l2h}$ = 60, and $k_{st}$ = 80. In this case, BSeK1 has N1 = min(30, ⌈60/3⌉) = 20 st, which we pair up with 20 accesses in SA; and BSeK2 has the rest of the accesses in SA, N2 = 30 − 20 = 10 st, which we pair up with 3 × 10 requests out of the 80 accesses in $k_{st}$. The remaining 50 st in $k_{st}$ are not paired since they will not cause further impact on SA. Overall, an upper bound to the impact that an application can suffer due to L2D is given by: $\Delta^{L_{2D}}_A \le \Delta^{BStK_l}_{BSeK_1} + \Delta^{BStK_s}_{BSeK_2}$.
For the memory controller we follow the same principles as for the bus, with the particularity that the impact from/to the read/write request types is homogeneous. Hence we only need

Multi-resource signatures
In the presence of multiple shared resources, the signatures and templates must cover the hardware features so as to soundly upper bound contention in each of them. For the reference architecture considered in this work, signatures and templates are as follows: $S^{bus+mc}_A = (a_{st} + a_{l2h} + 2a_{l2m}, a_{mem})$ and $L^{bus+mc}_K = (k_{st} + 2k_{l2m}, k_{l2h}, k_{mem})$. It is possible that a task suffers contention in several shared resources simultaneously, so that the impact of the contention does not accumulate but rather overlaps. However, determining trustworthy bounds to the degree of overlap in the contention suffered on requests to different resources is complex. Signatures and templates are intentionally made agnostic to the distribution of requests over time. As we focus on the number of requests to each resource rather than on their timing, it is difficult to determine how contending requests overlap. Our current approach assumes no overlap in contention, which in our time-anomaly-free processor design is a safe assumption on the maximum impact of contention. Overall, in the presence of a template for the bus $L^{bus}$ and one for the memory $L^{mc}$ (a.k.a. $L^{bus+mc}$), a task is assumed to suffer the sum of the contention generated by both templates: $\Delta^{L^{bus+mc}}_A = \Delta^{L^{bus}}_A + \Delta^{L^{mc}}_A$.
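A minimal Python sketch of this no-overlap composition is given below. The function names and the cycle counts are invented for illustration; the per-resource deltas are assumed to have been obtained separately, one from the bus kernels and one from the memory controller kernels.

```python
def combined_delta(delta_bus, delta_mc):
    """No-overlap assumption: per-resource contention deltas simply add up."""
    return delta_bus + delta_mc


def etb_multi_resource(et_isol, delta_bus, delta_mc):
    """ETB under a bus+mc template: isolation time plus the summed deltas."""
    return et_isol + combined_delta(delta_bus, delta_mc)


# Made-up cycle counts for one task.
etb = etb_multi_resource(et_isol=100_000, delta_bus=30_000, delta_mc=6_000)
```

Summing is pessimistic whenever bus and memory stalls actually overlap, but it avoids having to reason about request timing, which the signatures deliberately abstract away.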

EVALUATION
For our evaluation, we model a 4-core NGMP-like symmetric multicore [1] at 150 MHz comprising a bus connecting cores to the L2 cache and an on-chip memory controller. This processor model is relevant as it constitutes a potential baseline for the space domain. To model the DRAM memory system we use DRAMsim [6], a well-known memory simulator with which we model a close-page DDR2-667 [15] memory. As part of a study carried out for the European Space Agency we evaluated the performance estimates provided by our simulator against a real NGMP implementation, the N2X [3] evaluation board, using a low-overhead kernel that allowed cycle-level validation. The results obtained for the EEMBC Automotive [21] benchmarks, the reference benchmarks used in this paper, showed an execution time deviation of less than 3% on average. For the NIR HAWAII benchmark [13], the inaccuracy was less than 1%.
Our RSeK/RStK approach works on the premise that the contention suffered by each request of the RSeK upper bounds the contention suffered in any other scenario. The authors of [25] show that round-robin arbitration can have anomalous cases when a higher number of contenders introduces less contention on the bus. In fact, we show in [8] that the RStK cannot necessarily generate the worst (maximum) contention on RSeK, due to the alignment of requests. To solve this, we applied a solution based on adding nop operations between RSeK requests to modify their alignment. For instance, in the case of the bus, since we use store requests for the RSeK (see Section 3.3), we prove in [8] that each of the RSeK's requests suffers the maximum contention. In our reference architecture, if load operations were used in the RSeK, each request would suffer exactly one cycle less than the maximum contention, as shown in [8], which can be compensated with a correction factor.

Experimental results
Our evaluation was carried out along 2 axes. First, we compared the tightness of 1D and 2D templates against fully time-composable ETBs, which can be obtained by software [11] [22] or hardware [19] methods. Secondly, we compared 2D templates, for which tighter results are obtained, to the case in which the task under analysis runs against RStK.
1D vs 2D signatures. Figure 3 compares the scenario with a fully-time composable template, LT C , valid for any workload (any workload template in the figure), with 1D (L1D) and 2D (L2D) templates fitting the potential interference in the corresponding workload. We analyze 10 randomly generated workloads and show results for the benchmark running on core 0. Similar results are obtained for the other cores.
For instance, for workload W8 <pntrch(PN), basefp(BA), a2time(A2), tblook(TB)>, we consider PN as the task under analysis and a template that corresponds to the aggregate of the signatures of the three other benchmarks. This causes L1D to have 564,227 bus accesses (the sum of the bus accesses of BA, A2 and TB). This is abstracted by RUs/RUl so that only ⌈564,227/3⌉ = 188,076 bus accesses from PN suffer high contention and the rest suffer no contention. To measure this effect, we run a BSeK with 188,076 accesses against 3 BStK with a large number of accesses. The same process is followed for the memory. L2D is generated analogously, but considering l2h and st bus accesses separately. Figure 3 shows the ETB for the first benchmark in the workload (under any workload, L1D and L2D), normalized to its execution time in isolation. We observe that fitting templates to the actual contention in the workload (L1D and L2D) tightens ETBs significantly. This effect is particularly noticeable for WL1 and WL4. Also, in all cases L2D provides tighter ETBs than L1D. This is so because with L1D all accesses to the bus are assumed to be l2h, which generate the highest contention, while L2D better captures the fact that there are two types of requests generating different contention (l2h and l2m-st). For instance, WL4 has a normalized ETB of 4.37 (more than 4x the execution time in isolation) when using a template valid for any workload. If we use L2D for this workload, the ETB is only 1.53. Overall, our approach allows reducing the ETB from 2.57 to 1.8 with L1D and to 1.36 with L2D templates, on average for the 10 workloads.
Owing to strict page limits we are unable to report the contention impact generated by the memory. Notably, however, in our processor set-up the bus has a higher impact than the memory, as the L2 cache filters out most memory accesses. Of the contention impact in L2D, 78% stems from the bus and only 22% from the memory.
RUs/RUl vs. EEMBC/RUl. In order to assess the pessimism incurred in the ETBs obtained with L2D, we compared them with the execution time of the task (i.e., the EEMBC benchmark), denoted ET, taken when the task runs as part of a workload comprising RStK [11] [22]. This workload represents a pessimistic yet possible contention scenario that the task can suffer. Figure 4 shows the ETBs obtained with L2D relative to ET. Notably, the incurred pessimism was always below 45%, and 20% on average. We contend that the benefits provided by RUs/RUl in simplifying timing analysis upon system integration pay off for the increase in WCET estimates.
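One way to read the pessimism metric is as the relative excess of the ETB over the observed execution time ET. The sketch below assumes that definition (ETB/ET - 1), which the text does not spell out. The function name and the sample values are hypothetical and are not measurements from the evaluation.

```python
def pessimism(etb, et):
    """Relative excess of the bound over the observed time (assumed definition)."""
    if et <= 0:
        raise ValueError("observed execution time must be positive")
    return etb / et - 1.0

# Hypothetical values, for illustration only: a bound of 1.2 time units
# against an observed execution time of 1.0 time units.
example = pessimism(1.2, 1.0)
print(f"{example:.0%}")
```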

RELATED WORK
Contention on access to hardware shared resources has been thoroughly studied in the state of the art. A taxonomic summary of the relevant works can be found in [7]. Authors in [4] propose a methodology to obtain the signature of tasks and replace them with kernels that mimic their shared resource usage pattern as a way to reduce the variability in measurement-based analysis. Instead, we use signatures and templates to abstract the contention that tasks cause and suffer, bounding the contention effect [8]. Works addressing off-chip contention assume no contention for on-chip resources, which are assumed replicated. Off-chip contention for the bus is handled with TDMA buses [2], whose analysis case is the worst possible alignment of the task requests to their TDMA slots. Works assuming dynamic arbiters [24] consider the particular pattern of accesses of each contender to the bus. For on-chip resources, two main approaches have been followed, both requiring some hardware support: isolation or bounded interference. The former uses TDMA arbitration and partitioned caches to prevent interaction among tasks [14]. The latter bounds the maximum impact that one task may generate on co-runners [19]. However, as far as we can tell, such specialized hardware support is not fully or readily available to industry: while cache partitioning has been implemented in hardware, e.g. in the Cobham Gaisler NGMP and the ARM A9, no such support is provided for the bus and the memory controller. When the shared cache is not partitioned, alternative solutions, built around the concept of partial time composability, have been proposed to approximate the time composability properties provided by templates and signatures [10].
In the absence of hardware support in COTS processors, contention effects can be analyzed, bounding the memory latency (for instance for Intel Core-i7 [12]), or even deriving WCET estimates (for Freescale P4080 [18]). In the latter research, authors use a static timing analysis approach with run-time monitoring of the resource usage that benefits from the knowledge of the workload to be able to derive tight WCET estimates. As a consequence of the limitations in the state of the art for COTS, the execution time of a task becomes dependent on its co-runners, which is a major impediment to incremental development and qualification. This is the challenge we have tackled with our approach based on resource signatures and templates.

CONCLUSIONS
We presented a novel approach to studying the contention on the bus and memory controller, building on the concept of RUs and RUl that abstract the resource usage made by the task under analysis and by its contenders. These notions help abstract the interference impact suffered by the task under analysis and the interference effects generated by its contenders. The notions embodied in our proposal provide a simple yet powerful mechanism to aid time-composable integration of multiple tasks in a multicore. A wise selection of RUl allows obtaining tight upper bounds to execution time, for modest cost and effort, thereby facilitating incremental development and qualification for systems targeting COTS multicore processors.