Towards State-Based RT Analysis of FSM-SADFGs on MPSoCs with Shared Memory Communication

Scenario-Aware Data-Flow Graphs (SADFGs) were introduced to capture the behavior of embedded applications while achieving a good trade-off between expressiveness and analyzability. On the one hand, they support the timing analysis of real-time applications, especially those running on MPSoCs, due to the clean separation of computation and communication phases in their executing nodes. On the other hand, SADFGs can express more dynamic behavior than Synchronous Dataflow Graphs by allowing the token rates of individual nodes to change depending on predefined typical scenarios, which leads to more efficiency and better throughput. In this paper, we describe the extension of a previous model-checking-based real-time analysis approach to allow the analysis of timing bounds for FSM-SADFGs mapped on a shared-memory multiprocessor architecture. We demonstrate our approach on an MPEG decoder application and show that it is viable for obtaining the worst-case end-to-end latency of its implementation under different scenarios on a 2-tile MPSoC.


INTRODUCTION
In the last decade, video processing applications have moved from being used merely in the infotainment domain to becoming a major topic in the embedded systems domain. For example, video processing applications are nowadays used in safety-critical automotive applications to detect pedestrians crossing the street or to recognize traffic signs. To guarantee the safety of such systems, real-time analysis methods must be used to validate the fulfillment of their hard real-time requirements. The behavior of data-driven applications such as video processing can be captured in the form of Synchronous Dataflow Graphs (SDFGs), introduced by Lee [10]. The advantage of SDFGs lies in the simplicity of the model, which makes them easy to analyze. Yet, due to this simplicity and their static behavior, SDFGs cannot capture applications with highly dynamic behavior.
Scenario-Aware Data-Flow graphs (SADFGs), first introduced in [15], achieve a good trade-off between expressiveness (allowing the expression of more dynamic behaviors than SDFGs) and analyzability. They extend SDF by the possibility of a dynamic data rate.
Finite-State-Machine Scenario-Aware Data-Flow (FSM-SADF) graphs [14] are a simplification of general SADF graphs. Both extend SDFGs with scenarios. In an FSM-SADF Model of Computation (MoC), a set of typical scenarios is predefined through a finite state machine for a specific SDF application. The SDF application reacts to every scenario in a different manner, leading to more efficiency and better throughput. Fig. 1 shows an example FSM-SADFG of an MPEG decoder as introduced by [15]. The vertices of the graph can be either kernels (VLD, MC, IDCT and RC) or detectors (FD). The solid edges are called data channels, while the dashed edges are called control channels. Every input port of a kernel or detector in an FSM-SADF has a consumption rate, and likewise each output port has a production rate. In each scenario, kernels can have different production and consumption rates. Ports whose rates vary between scenarios are denoted by variables, while ports with a fixed rate over all scenarios are denoted by the rate itself. While general SADFGs enable multiple scenario changes during one iteration by allowing multiple scenario detectors, FSM-SADFGs support only one detector per graph. Here, an iteration is the minimum non-zero execution (i.e. at least one kernel has been executed) such that the initial state of the graph is obtained [5]. This restriction simplifies the analysis of such models. Initial tokens are visualized by dots on the edges. For example, the data channel from the kernel RC to the detector FD is initialized with three tokens, so that the detector can be executed three times before the kernel RC must be executed again to generate new tokens.

Figure 1: FSM-SADF graph of an MPEG-4 decoder, inspired by [14], mapped on an MPSoC.

In this work, a mathematical (static) analysis is performed on a formal but abstract representation of both the software application and the hardware platform. This analysis takes into consideration all possible inputs and combinations of the running applications (abstracted as FSM-SADF actor activation, computation and communication) with all different hardware states of the platform (abstracted as resource access patterns). We use timed automata (TA) and the UPPAAL model-checker [3] to capture and verify the behavior of the considered MPSoC system.

We claim the following contributions:
1. We enable a state-based real-time analysis (based on [5]) of multiple FSM-SADF applications mapped to an MPSoC platform with shared communication resources. State-based RT analysis methods deliver very accurate results, especially when handling heavily state-dependent properties (e.g. the First-Come-First-Serve (FCFS) arbitration protocol).
2. We present a technique that simplifies the finite state machine representing the scenarios by limiting its nondeterminism. This technique reduces the state space of the resulting model under analysis.

RELATED WORK
Using purely analytical approaches to obtain upper and lower timing bounds on the execution time of dataflow applications is a widespread research topic. Closest to our research are [12,9,11,2]. While such approaches are fast and able to handle large systems, they deliver pessimistic results, especially when handling state-based bus arbitration protocols typically used in Multi-Processor Systems on a Chip (MPSoCs). Previous work [5] showed how far state-based RT analysis of SDFGs on MPSoCs can tighten the results compared to an analytical approach [12].
In the following, we give an excerpt of the main research related to our work mainly using state-based methods to analyze the performance of dataflow applications on MPSoCs.
Some previous work [6,8] used model-checking to optimize buffer sizes in SDF applications. Yang et al. [16] introduce a state-space exploration approach to verify the hard real-time performance of applications modeled with SDFGs that are mapped to a platform with shared resources. However, it does not consider a shared communication resource. A more recent work [1] followed the same path as ours, presenting a translation of single SDFGs to timed automata templates in order to analyze their behavior using model-checking. In contrast to our work, the authors focused on finding the maximal throughput on a given number of processors.
In [17,18] the authors (similar to the work in [5]) transform a system model which includes an SDFG and a multiprocessor platform to a (priced) timed automata network and utilize an extended model-checker (UPPAAL CORA) to obtain optimal schedules combining optimization goals with optimal throughput and energy consumption.
In this work, we extend a previous model-checking-based real-time analysis approach [5] to the analysis of timing bounds for FSM-SADFGs mapped on a shared-memory multicore architecture. In [5], only the analysis of SDFGs was supported. We use timed automata (TA) as a common semantic model to represent worst-case execution times (WCET) of kernels, detectors and the shared communication resource. The analysis model furthermore supports the analysis of access protocols for buses, DMA, private (local) and shared memories of the MPSoC. For a given FSM-SADFG and an MPSoC architecture, different mappings and scheduling strategies can be examined by analyzing the resulting network of TA. Using the UPPAAL model-checker, safe timing bounds of the FSM-SADFG implemented on an MPSoC can be provided. The closest work to ours was published recently in [13], where a translation from FSM-SADF graphs to TA was presented. In contrast to our approach, the authors concentrated in their translation and analysis only on the FSM-SADF MoC and did not consider the MPSoC mapping and the resulting resource sharing aspects (i.e. contention on communication resources).
To the best of our knowledge, no other approach uses model-checking for the timing validation of hard real-time FSM-SADFGs on an MPSoC platform while considering the contention on shared on-chip components. Fig. 2 shows the models used in this paper in the context of the synthesis process (as defined by [7]), which will be described in the following sections.

Model of Computation (MoC)
The formal syntax of an FSM-SADF graph (such as in Fig. 1) is defined as follows (inspired by [14,13]). Details of the semantics are described in [14].
2. An actor is a tuple A = (P, F) consisting of a finite set of ports P ⊆ 𝒫 and a label F representing the functionality of the actor. If K is the nonempty finite set of kernels and d ∉ K denotes the unique detector, then 𝒜 = K ∪ {d} is the total set of actors. The different execution phases of an SADF actor are shown in Fig. 3.
3. 𝒫 is the set of ports of all actors. A port P ∈ 𝒫 is defined as a tuple P = (Dir, Rate), where Dir ∈ {I, O} defines whether P is an input or output port, and Rate = [r1, r2, ..., rn] is an array of rates, one for each scenario s ∈ S. The rate ri ∈ N0 of a port specifies the number of tokens consumed (Rate_c) or produced (Rate_p) by the port when the corresponding actor a ∈ 𝒜 fires. If 𝒫c is the set of all input (consumer) ports, where ∀p ∈ 𝒫c : p.Dir = I, and 𝒫p is the set of all output (producer) ports, where ∀p ∈ 𝒫p : p.Dir = O, then 𝒫 = 𝒫c ∪ 𝒫p.
4. D ⊆ K × 𝒜 is the set of edges (called channels in the context of FSM-SADF). If D_ctrl is the set of control channels in the FSM-SADF and D_data is the set of data channels, then D = D_ctrl ∪ D_data. For each channel, Pp ∈ 𝒫p is the port of the producer, Pc ∈ 𝒫c is the port of the consumer, and i ∈ N0 is the number of initial tokens on the channel. All ports of all actors are connected to exactly one channel, and all channels are connected to ports.

[...] which has a configurable bus connection. In addition, every PE has a private memory. A bus is used to connect the tiles to shared memory. This enables actors mapped to different tiles to communicate via buffers mapped to shared memory using non-preemptive arbitration protocols. Only explicit communication (message passing) between actors will be visible on the interconnect and the shared memory.
Definition 2. (Tile) A tile is a tuple T = (PE, Mp) with processing element PE = (PE_type, f), where PE_type is the type of the processor and f is its clock frequency, and Mp is the size of the private memory.
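The tuples above can be mirrored in a small data model, which may help when building tooling around the formalism. The following is a minimal sketch, not the authors' implementation: the class names, field names, the scenario set `SCENARIOS` and the example rates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical scenario set S; the real scenarios come from the scenario FSM.
SCENARIOS = ("s0", "s1")


@dataclass(frozen=True)
class Port:
    """P = (Dir, Rate): a direction and one token rate per scenario."""
    direction: str            # "I" (input) or "O" (output)
    rates: Tuple[int, ...]    # rates[i] is the rate in SCENARIOS[i]

    def __post_init__(self):
        if self.direction not in ("I", "O"):
            raise ValueError("direction must be 'I' or 'O'")
        if len(self.rates) != len(SCENARIOS) or any(r < 0 for r in self.rates):
            raise ValueError("need one non-negative rate per scenario")

    def rate(self, scenario: str) -> int:
        return self.rates[SCENARIOS.index(scenario)]


@dataclass(frozen=True)
class Channel:
    """A channel connects a producer port to a consumer port."""
    producer_port: Port
    consumer_port: Port
    initial_tokens: int = 0
    is_control: bool = False   # True for channels in D_ctrl

    def __post_init__(self):
        if self.producer_port.direction != "O" or self.consumer_port.direction != "I":
            raise ValueError("channel must go from an output to an input port")
        if self.initial_tokens < 0:
            raise ValueError("initial tokens must be non-negative")


@dataclass(frozen=True)
class Tile:
    """T = (PE, Mp) with PE = (PE_type, f)."""
    pe_type: str
    frequency_hz: int
    private_memory_bytes: int


# Illustrative instance with made-up rates: inactive in s0, active in s1.
out_port = Port("O", (0, 2))
in_port = Port("I", (0, 1))
channel = Channel(out_port, in_port, initial_tokens=3)
```

Storing the rates as a tuple indexed by scenario follows the Rate = [r1, ..., rn] array of the definition; a dictionary keyed by scenario name would work equally well.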

Synthesis
The system synthesis (see Fig. 2) includes the processes of binding and scheduling the behavioral model onto the defined architecture. Mapping the FSM-SADFG onto our MoA is defined as follows: The channel mapped to a private or to a shared storage resource represents a consumer-producer FIFO buffer in an actual implementation.
The following definitions allow us to express the scheduling behavior of each SADFG of a scenario mapped to tiles on the platform:
Definition 5. (Self-timed (static-order) schedule) For an FSM-SADFG with a repetition vector γ for each scenario, a static-order schedule SO is an ordered list of the actors (to be executed on some tile), where every actor a is included γ(a) times.
A repetition vector of an FSM-SADFG is defined as the vector that specifies the number of times every actor has to be executed in a specific scenario such that the initial state of the graph is obtained.
Because the number of executions of an actor depends on the scenario, each scenario may have a different scheduling. Self-timed means that FSM-SADFGs are executed in a static cyclic order as soon as the input data is available.
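For a single scenario, the repetition vector can be obtained by solving the standard SDF balance equations (production rate times producer firings equals consumption rate times consumer firings on every channel). The sketch below illustrates this under stated assumptions: the graph is connected, all rates in the scenario are positive, and the three-actor example with its rates is hypothetical. The helper `has_static_order_counts` only checks that every actor appears γ(a) times in a schedule; it does not check token availability.

```python
from collections import Counter
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, channels):
    """Smallest positive integer solution of the balance equations.

    channels: list of (producer, prod_rate, consumer, cons_rate) tuples
    for ONE scenario; all rates are assumed to be positive.
    """
    ratio = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for prod, p_rate, cons, c_rate in channels:
            if prod in ratio and cons not in ratio:
                ratio[cons] = ratio[prod] * p_rate / c_rate
                changed = True
            elif cons in ratio and prod not in ratio:
                ratio[prod] = ratio[cons] * c_rate / p_rate
                changed = True
            elif prod in ratio and cons in ratio:
                if ratio[prod] * p_rate != ratio[cons] * c_rate:
                    raise ValueError("inconsistent rates: no repetition vector")
    if set(ratio) != set(actors):
        raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integer vector.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in ratio.values()), 1)
    ints = {a: int(f * scale) for a, f in ratio.items()}
    common = reduce(gcd, ints.values())
    return {a: n // common for a, n in ints.items()}


def has_static_order_counts(schedule, gamma):
    """True if every actor a occurs exactly gamma[a] times in the schedule."""
    return dict(Counter(schedule)) == gamma


# Hypothetical scenario: A --2:3--> B --1:2--> C
gamma = repetition_vector(
    ["A", "B", "C"],
    [("A", 2, "B", 3), ("B", 1, "C", 2)],
)
```

Since the rates differ per scenario, this computation would be repeated for each scenario, which is why each scenario may end up with its own schedule.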
Definition 6. (Scheduling Assignment) Let SO be the set of all static-order schedules for all FSM-SADFGs resulting from a scenario. A scheduling assignment is a function S : T → so, which assigns to every tile t ∈ T a subset so ⊆ SO.

Model of Performance (MoP) Extraction
In order to verify that the performance of the FSM-SADFG stays within the required bounds, we must keep track of all possible timing delays in all scenarios of all FSM-SADFGs mapped to the MPSoC platform. To achieve this, a MoP is extracted from the synthesis process, which includes all the SW/HW components with their properties influencing the timing in the considered system.
From the hardware abstraction point of view, we consider a Time-accurate Bus-Functional-Model (BFM) [4] abstraction. In this model, the application layer issues read/write transactions on the interconnect (see for example the FCFS-arbitrated bus in Fig. 1), and upper/lower latency bounds are calculated, abstracting details of the communication protocol. This model is appropriate when no accurate (constant) timings can be obtained for transferring data of a specific size through an interconnect with a specific communication protocol.
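To illustrate why the latency seen on an FCFS-arbitrated bus is state-dependent, the following sketch replays one concrete trace of transactions on a non-preemptive FCFS bus. It is a single-trace simulation for intuition only, not the timed-automata model, which covers all interleavings; the function name, the request format and the numbers are hypothetical.

```python
def fcfs_finish_times(requests):
    """Serve bus requests first-come-first-serve without preemption.

    requests: list of (arrival_time, transfer_time, requester) tuples,
    with one request per requester. Requests arriving at the same time
    are served in list order (sorted() is stable).
    Returns a dict mapping each requester to its completion time.
    """
    bus_free_at = 0
    finish = {}
    for arrival, duration, requester in sorted(requests, key=lambda r: r[0]):
        start = max(arrival, bus_free_at)   # wait while the bus is busy
        bus_free_at = start + duration      # transfer cannot be interrupted
        finish[requester] = bus_free_at
    return finish


# Hypothetical trace in clock cycles: tile2 arrives while tile1 holds the bus.
finish = fcfs_finish_times([
    (0, 4, "tile1"),
    (1, 3, "tile2"),
    (10, 2, "tile3"),
])
```

In this trace the request of `tile2` waits three cycles for `tile1`, while `tile3` finds the bus idle. The delay of a transaction therefore depends on what the other tiles are doing at that moment, which is the kind of behavior a state-based analysis can track precisely.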
After synthesis, the following system components are annotated with execution times and delays: communication drivers, schedulers, actors, interconnects, private and shared memory.
• ∆A : A × S × T → N>0 × N>0 provides an execution time interval [BCET, WCET] for each actor, representing the cycles needed to execute the actor behavior (compute phase) for every scenario on the corresponding tile. We assume this delay can be obtained through static timing analysis or an appropriate measurement approach.
• ∆H : H × T → N>0 × N>0 and ∆C : C × T → N>0 × N>0 assign (in analogy to ∆A) a delay interval to every scheduler and communication driver, which can be estimated in the same way as ∆A.
• ∆D : D × (MP ∪ (I, MS)) → N>0 × N>0 assigns to each communicating edge d ∈ D mapped to a communication primitive a delay ∆D which depends on: [...]
• ∆P : A × T → N>0 × N>0 provides a polling delay that an actor has to wait when blocking on a shared storage resource.
Regardless of whether the channel is a control or a data channel, the buffer size is fixed to the maximum size required among all scenarios by an actor within an iteration.

CAPTURING MOPS AS TA
In this section we present the steps that are necessary to obtain a representation of an FSM-SADF application mapped and scheduled on an MPSoC platform. Furthermore, we present a technique to abstract the scenario state machine into a deterministic scenario sequence in order to avoid state explosion when applying model-checking techniques. Figure 4 shows an abstract representation of the timed-automata templates used to capture the model of performance (MoP). It shows one actor of an FSM-SADFG running on one tile (compared to Fig. 1, where 5 actors are mapped on 2 tiles). The actor's channels are mapped to a shared memory. An actor can be either a kernel or a detector, as defined in Section 3.1, and the shared memory is connected to the tile via an interconnect.

Actors and Channels on MPSoC
For every actor, an actor-TA-template is instantiated capturing its three phases of execution (see Fig. 3). The first phase is the Read phase, in which all tokens are read from all channels. Next comes the Computation phase, in which the read data is processed. At the end of this phase, if the current actor is a detector, it triggers a scenario change (see Fig. 4). Finally, the Write phase is executed, in which new tokens are written to the output ports.
The scenario-state-machine TA is modeled as a part of the detector and is therefore implicitly mapped to the tile where the detector is running. In addition to the FSM-SADFG actors, a scheduler and an interconnect-driver are modeled as TA for each tile. The scheduler signals an actor to fire. The interconnect-driver models the behavior of the communication to shared memory using arbitration protocols. In addition, an interconnect-TA-template models the temporal behavior of the interconnect, the number of transferred tokens, the interconnect latency and the arbitration strategy used. The memory latency is captured by the timed automata of the channels (FIFO buffers), where a dedicated TA is instantiated for every channel. For more details about the implementation of the different TA templates, please refer to [5].
A scenario update can take place at every detector firing. The detector signals the scenario update to dependent actors through the control channels. During the Compute phase (see Fig. 4) of the detector, the scenario-FSM can be triggered to change to a successor scenario. The scenario becomes valid for other actors after the Write phase of the detector is completed, in which the control tokens are written to the corresponding control channels. When a kernel actor fires, it must first read the control token from the control channel in order to detect the scenario. This requires a prioritization of the channels such that the control channel is read before all other channels.
Since the change of scenarios takes place in the FSM, which is activated in the Compute phase of the detector, the timing delay of the scenario change is considered to be a part of the [BCET, WCET] interval of the Compute phase.
Kernels of the FSM-SADF that can work in different scenarios get notified of their scenario by control tokens produced by the detector.
Control tokens are propagated from the detector to other kernels using the same shared resources that data tokens use. Therefore control channels are modeled just like data channels.
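The firing order described above (control token first, then the scenario-dependent data reads, then compute, then write) can be sketched in untimed form as follows. This is only an illustration of the ordering, not of the TA templates from [5]; the function `fire_kernel`, the scenario names `s_low` and `s_high`, the port names and the rates are all hypothetical.

```python
from collections import deque


def fire_kernel(control, inputs, outputs, cons_rates, prod_rates, compute):
    """Fire a kernel once: Read (control first), Compute, Write.

    control: deque of scenario names written by the detector
    inputs/outputs: dict mapping port name to the deque of its channel
    cons_rates/prod_rates: dict scenario -> {port name: rate}
    compute: function (scenario, consumed_tokens) -> result token
    Returns the scenario that was used, or None if the kernel cannot fire.
    """
    if not control:
        return None
    scenario = control[0]                       # peek at the control token
    for port, fifo in inputs.items():
        if len(fifo) < cons_rates[scenario][port]:
            return None                         # not enough data tokens yet

    # Read phase: the control channel has priority over the data channels.
    control.popleft()
    consumed = []
    for port, fifo in inputs.items():
        for _ in range(cons_rates[scenario][port]):
            consumed.append(fifo.popleft())

    # Compute phase.
    result = compute(scenario, consumed)

    # Write phase: produce as many tokens as the scenario's rate demands.
    for port, fifo in outputs.items():
        for _ in range(prod_rates[scenario][port]):
            fifo.append(result)
    return scenario


# Hypothetical kernel with one input and one output port.
control = deque(["s_high"])
data_in = deque([1, 2, 3])
data_out = deque()
used = fire_kernel(
    control,
    {"in": data_in},
    {"out": data_out},
    cons_rates={"s_low": {"in": 0}, "s_high": {"in": 2}},
    prod_rates={"s_low": {"out": 0}, "s_high": {"out": 1}},
    compute=lambda scenario, tokens: sum(tokens),
)
```

The availability check is done before any token is removed, so a kernel that cannot fire leaves all channels untouched.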

Simplification of the Scenario-FSM
To handle non-determinism in the scenario-FSM of an FSM-SADF graph, multiple possible successor states of an initial state must be abstracted to a single state. The method to do this simplification is shown in Fig. 5.
First, all scenarios need to be analysed separately to obtain their worst-case and best-case execution times. These times become the WCET and BCET costs of the nodes of the graph that is to be simplified. To obtain a worst-case scenario sequence, the path between two scenarios with the highest WCET cost must be determined. The best-case scenario sequence is the path with the lowest BCET cost. For the worst-case scenario sequence, the new state sequence is not necessarily the longest path of the original graph, nor is it necessarily the shortest path for the best-case sequence, since the WCET and BCET of each node are taken into account.
The simplification for our MPEG example shown in Fig. 5 was done for the path from node I to node I of the next iteration, with the highest cost for the path I, P80, P99 and the lowest cost for the path I, P0, P30.
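The path selection can be sketched as a search over the node-weighted scenario graph. The example below is an illustration under loud assumptions: the transition structure `succ` and all BCET/WCET costs are invented for this sketch and are not the values of Fig. 5 or Tab. 2; only the scenario names are borrowed from the text. The search enumerates all simple paths from the start scenario back to itself, which is adequate for small FSMs but grows exponentially in general.

```python
def paths_back_to_start(succ, start):
    """Yield every simple path that leaves `start` and can return to it.

    The closing transition back to `start` is implied and not repeated
    in the yielded path.
    """
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in succ[node]:
            if nxt == start:
                yield path
            elif nxt not in path:
                stack.append((nxt, path + [nxt]))


def extreme_sequences(succ, cost, start):
    """Return (worst_case_path, best_case_path).

    cost: dict scenario -> (BCET, WCET) of one iteration in that scenario.
    """
    paths = list(paths_back_to_start(succ, start))
    worst = max(paths, key=lambda p: sum(cost[s][1] for s in p))
    best = min(paths, key=lambda p: sum(cost[s][0] for s in p))
    return worst, best


# ASSUMED transition structure and ASSUMED costs, for illustration only.
succ = {
    "I": ["P0", "P40", "P60", "P80"],
    "P0": ["P30", "P99"], "P40": ["P30", "P99"],
    "P60": ["P30", "P99"], "P80": ["P30", "P99"],
    "P30": ["I"], "P99": ["I"],
}
cost = {  # scenario: (BCET, WCET), made-up numbers
    "I": (10, 20), "P0": (1, 2), "P40": (4, 8), "P60": (6, 12),
    "P80": (8, 16), "P30": (3, 6), "P99": (9, 18),
}
worst, best = extreme_sequences(succ, cost, "I")
```

With these invented costs the search happens to select the same sequences as named in the text, but that outcome follows from the assumed numbers and is not a reproduction of the paper's analysis.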
This abstraction is valid since we are interested in the worst-case value of some timing constraint (such as end-to-end latency). That is why a complex state machine can be simplified into a deterministic sequence of states. The same procedure can be applied for the best-case scenario sequence. This leads to new deterministic FSMs which describe the token rates in a coarser way.

EVALUATION
For the evaluation, we analyze the worst-case end-to-end latency of an MPEG decoder from [14] running on a 2-tile MPSoC with a shared memory accessible via an FCFS-arbitrated bus (Fig. 1). The kernels MC and RC and the detector FD are mapped to Tile 1, the kernels VLD and IDCT to Tile 2. For the interprocess communication of actors running on the same tile, the private memory is used. Channels between actors on different tiles are mapped to shared memory.
The port rates of the worst-case scenario and the best-case scenario are shown in the right part of Tab. 1. For example, the port rate b (input of kernel MC) is either 80 or 0, depending on whether worst-case or best-case analysis is applied. Before the abstraction, it could take the rates 0, 40, 60 and 80 (see Fig. 5), all of which would have been considered in the analysis in a non-deterministic way. A previous analysis of each of the possible scenarios (see Tab. 2) showed that the WCET and BCET for the rates of scenarios P40 and P60 lie between the BCET of scenario P0 and the WCET of scenario P80. For this reason, the state P0/80 in Fig. 5 is equivalent to the original state P80 in the worst case (see Tab. 1), since in this scenario the kernels RC and MC (see Fig. 1) are executed most frequently and the highest number of data tokens needs to be transferred between kernels. For the best-case scenario, P0/80 becomes equivalent to the original state P0 (Tab. 1), where most data channels have zero rates and kernels are activated at most once per iteration.
The separate analysis of each of the possible scenarios (Tab. 2) confirmed this. The abstracted deterministic FSM constructed in Fig. 5 was implemented as a TA as shown in Fig. 6. Each time the detector FD is executed, the FSM is triggered through the updateScenario synchronization channel inside the TA model to change the scenario (see scenario in Fig. 6).
For the implementation of the TA model, we used the UPPAAL model checker [3]. In our experiments we used the 64-bit version 4.1.19 of UPPAAL for Linux. The hardware was a system with a 2.5 GHz AMD Opteron Processor 6282 SE and 512 GB of RAM. All experiments together took about two weeks and consumed less than 29 GB of RAM. The largest number of states explored was about 656 million. The results of our state-based RT analysis are shown in Tab. 3. The latency of the worst-case scenario sequence using the worst-case rates for the scenarios P0/80 and P30/99 (see Tab. 1) ranges between 8928 and 8969 cycles. For the best-case scenario sequence using the best-case rates (see Tab. 1), the latency ranges between 2770 and 6758 cycles.

CONCLUSION
In this work, we presented a state-based real-time analysis method which enables the real-time verification of FSM-SADF applications mapped to MPSoCs with shared communication resources. We showed that our method was able to analyze the worst-case latency and the best-case latency of both the worst-case and the best-case scenario sequence of an MPEG-4 decoder captured in the FSM-SADF MoC and run on an MPSoC of two tiles with shared memory. To avoid an unmanageable state space of the already complex model, we avoided nondeterminism by reducing the scenario state machine to a deterministic worst-case or best-case version of the original one.
To manage more complex applications executed on more complex MPSoCs, we will consider other methods, for instance the promising probabilistic RT approaches, in future work.