Metis: Optimal Task Allocation Framework for the Edge/Hub/Cloud Paradigm

Increased demands for real-time decision support and data analytics create the need to perform significant computing away from the cloud and onto the IoT devices. In this paper, we propose Metis, a mathematical-programming-based framework that delivers an optimal task allocation for a specified performance metric. Metis currently targets systems consisting of an edge node, an intermediate node and the cloud infrastructure, but it can be expanded to multi-Edge/Hub systems. Evaluation results on a real-life use-case scenario demonstrate that Metis provides the optimal task allocation by minimizing the overall latency of the system while taking into consideration the application's specific requirements and the resource constraints of each computational unit.


INTRODUCTION
Traditionally, power-hungry and compute-heavy tasks have been executed on powerful servers. As the number of connected and portable devices grows, and their application spectrum rapidly expands, the centralized computing model has been altered. Traditional servers have been replaced by cloud infrastructure and intermediate nodes, which perform most of the processing/analytics. In the past few years, the number of IoT (Internet of Things) devices has increased exponentially and is expected to surpass 50 billion by 2020 [4]. The overall infrastructure is now integrating these devices, yielding the edge/hub/cloud paradigm. In this paradigm, the edge devices are not just sensing devices; they also have computing capabilities. Moreover, as emerging applications require real-time decision support and data analytics, there is an increased need to perform significant computing on these edge devices. The paradigm is relatively new and poses significant challenges that must be overcome. One of these challenges is task allocation: deciding which tasks should be executed at the edge, the hub, or the cloud.
A number of studies [2,7,11,14] show that shifting tasks from a central cloud data center towards the edge nodes enables real-time data processing and reduces network traffic. However, as tasks move towards the edge of the network, the nodes become resource constrained, both in terms of computation capability and memory capacity. Hence, although allocating all of an application's tasks at the edge node might seem to minimize latency, the allocation must be decided based on each application's unique characteristics. Performance, and more specifically latency, is essential for time-critical applications, such as critical infrastructure monitoring and search-and-rescue, in which even a fractional delay in processing can trigger a disastrous event. Therefore, one may need to further analyze the allocation of each task among the different computational units to maximize performance.
To allocate an application's tasks among the different computational units of the paradigm, we need to consider several constraints, both in the application and in the infrastructure itself. However, an optimal allocation of tasks that require energy, memory and execution time remains a challenge, primarily because the edge/hub/cloud paradigm imposes significant constraints that must be taken into account.
We therefore propose Metis, a framework which uses a Mathematical Programming (MP) approach to optimally partition an application's tasks within the paradigm, to achieve maximum performance in terms of latency. In a nutshell, Metis yields the optimal task allocation such that the overall latency of the system is minimized under energy and resource constraints for each of the computational units. To do so, Metis takes into account the execution time, the energy consumption, the memory footprint and the storage requirements of the application's tasks. Furthermore, it considers the communication latency and the communication energy, which are computed from the load generated by each of the application's tasks and the bandwidth of the communication channel. Metis is suitable for a range of applications that involve remote decision support, machine learning inference, health care, etc., where optimization within the edge/hub/cloud paradigm is a necessity.
The remaining sections of the paper are organized as follows. In section 2, preliminaries and existing related work are discussed. In section 3, a detailed description of our framework and a problem formulation are presented. In section 4, the results and the evaluation of the implementation based on a real-life use-case are given, followed by concluding remarks in section 5.

BACKGROUND AND RELATED WORK
Although a large number of studies have approached the subject of task allocation and investigated computational offloading over the years, they dealt mostly with the edge-cloud and fog-cloud computing paradigms. Recently, Ghosh et al. [5] formulated an optimization problem for energy-aware placement of Complex Event Processing queries across a collection of edge and cloud resources, with the goal of minimizing the end-to-end latency. They proposed a brute-force approach and a genetic algorithm to solve the problem. Loke [10] examined the possibility of handing off jobs to other devices through communication interfaces. In addition, in [15] Wang et al. presented a cross-end analytic engine architecture for wearable computing systems. They proposed three design rules to optimize energy efficiency for a single functional cell and developed an automatic generator to find an optimal partitioning for the cross-end architecture.
In [12] Ouahouah et al. suggested a computation offloading scheme amongst unmanned aerial vehicles (UAVs) that handle IoT tasks, aiming to extend the UAVs' lifetimes and reduce their response times. In [8], Kovachev et al. introduced an adaptive computation offloading middleware that lets mobile devices offload their computation tasks to a cloud server; their results showed that the local execution time can be substantially reduced by computation offloading. In [3] Dinh et al. introduced an optimization framework for offloading from a single mobile device (MD) to multiple edge devices, minimizing a total cost that includes both the MD's energy consumption and the tasks' total execution latency. Additionally, in [13], Ryden et al. reduced the execution time by picking the best-performing server nodes for computation. To minimize the overall system's execution time, an optimal task allocation is required.
Most studies focusing on the allocation of computing resources consider a mobile edge computing architecture whose computing nodes have the same computing capability. Moreover, regarding the allocation of tasks and computing resources as well as computation offloading, recent research typically ignores the fact that the offloaded data has to be transmitted over the network. It is therefore necessary to take the communication resources into account for applications that involve data offloading.
The edge/hub/cloud paradigm is newly introduced; therefore, no standard architecture is available for the management and allocation of resources and tasks across it. Our work proposes a mathematical programming (MP) based scheme for achieving an optimal partitioning of an application across the edge/hub/cloud paradigm by minimizing the overall latency of the system, taking into account the latency and energy consumption of both nodes and communication links, as well as the memory needed for the tasks to be executed.

METIS FRAMEWORK

Extended Task Flow Graph
We propose an extended task flow graph as the input for our framework. The extended task flow graph (a sample is shown in Figure 1) consists of three components. 1) The first component is the group of candidate tasks. For each task specified in the application's description, there is a group of candidate tasks (i.e., a set of candidate tasks S_c ⊆ {1, . . . , n}, where n is the number of computational units), which indicates the possible computational units to which the task can be allocated. Hence, the extended task flow graph contains one node per candidate task. Each candidate task has its own unique characteristics in terms of computation and energy consumption, and eventually only one candidate task from the group will be selected for execution. The allocation of each task specified in the application's description is based on the application's constraints and the constraints imposed by each of the computational units.
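The grouping of candidate tasks can be sketched in a few lines of Python (an illustration using hypothetical task names and a string encoding of our choosing, not the framework's actual data structures):

```python
# One candidate task is created per (application task, computational unit)
# pair on which the task may run; the optimizer later selects exactly one
# candidate from each group for execution.
UNITS = ["edge", "hub", "cloud"]

def candidate_group(task_name, allowed_units=UNITS):
    """Return the group of candidate tasks for one application task."""
    return [f"{task_name}@{unit}" for unit in allowed_units]

# A task pinned to one device (e.g., it needs the on-board camera)
# yields a group with a single member; an unconstrained task may run
# on any of the n computational units.
groups = {
    "T1": candidate_group("T1", ["edge"]),
    "T2": candidate_group("T2"),
}
```

This mirrors how a fixed task contributes one node to the graph, while an unconstrained task contributes one node per computational unit.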
2) The second component involves the connectivity between the tasks, that is, the edges of the graph. Each edge indicates the dependency between two tasks, as well as the communication cost (communication latency and communication energy), which is calculated based on the data generated by each candidate task and the bandwidth of the communication channel. In our case we assume wireless communication between the nodes; however, our model is generic and it can easily be applied to wired communication protocols as well.
3) The third and final component is an extra communication cost (communication latency and energy) in case there is no direct communication between a pair of computational units. In this case, an intermediate node will act as a communication bridge between the two candidate tasks.
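The per-edge and bridged communication costs described above can be illustrated as follows (a minimal sketch; the function and parameter names are ours):

```python
def comm_latency(data_bits, bandwidth_bps):
    """Latency of one hop: data generated by the sender divided by the
    bandwidth of the communication channel."""
    return data_bits / bandwidth_bps

def bridged_latency(data_bits, bw_src_bridge, bw_bridge_dst):
    """Extra cost when two computational units have no direct link:
    the data crosses two hops through an intermediate (bridge) node."""
    return (comm_latency(data_bits, bw_src_bridge)
            + comm_latency(data_bits, bw_bridge_dst))
```

For example, with no direct edge-cloud link, sending 1000 bits over a 100 bps edge-hub channel and a 200 bps hub-cloud channel costs 10 + 5 time units rather than a single-hop cost.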

Properties of Candidate Tasks
Each vertex of the extended task flow graph denotes a candidate task T, which has the following six attributes: (1) the computation latency, i.e., the time required for the task to complete; (2) the energy consumption, i.e., the energy required for the task to complete; (3) the memory footprint, that is, the amount of RAM needed for the task to run; (4) the secondary memory requirements, i.e., the disk storage occupied by the task; (5) the data generated by the task; and (6) the set of predecessors of the task.
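As a sketch, the six attributes can be captured in a small record type (the field names and units are our own choices, not Metis internals):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateTask:
    """The six attributes of a candidate task in the extended task flow graph."""
    latency_s: float            # (1) computation latency
    energy_j: float             # (2) energy consumption
    mem_bytes: int              # (3) RAM memory footprint
    storage_bytes: int          # (4) secondary memory (disk) requirement
    data_out_bits: int          # (5) data generated by the task
    predecessors: tuple = ()    # (6) set of predecessors of the task

# A hypothetical candidate of task T2 allocated on the edge device.
t2e = CandidateTask(latency_s=0.12, energy_j=0.8,
                    mem_bytes=64_000_000, storage_bytes=10_000_000,
                    data_out_bits=2_000_000, predecessors=("T1e",))
```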

Energy Model
We use an energy model similar to the one in [15], where the system energy consumption E is split into two main parts: the energy needed to execute the allocated tasks, E_x, and the energy needed for transmitting data between the computational units, hence the communication energy, E_c:

E = E_x + E_c

The energy model for the execution of the tasks is

E_x = Σ_{i=1}^{s} P_i · t_i

where P_i and t_i are the power consumption and the execution time of candidate task T_i, respectively, and the constant s is the number of allocated candidate tasks, that is, the number of the application's original tasks.
We also model the energy consumed by transmitting and receiving data as the wireless communication energy

E_c = A_t · B_t + A_r · B_r

where B_t is the total number of bits transmitted and B_r the total number of bits received, while A_t and A_r are the average energies for one-bit data transmission and one-bit data reception, respectively.
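The energy model splits into an execution term and a communication term, which can be transcribed directly into code (a straightforward sketch; names are illustrative):

```python
def execution_energy(power_time_pairs):
    """E_x: sum of P_i * t_i over the s allocated candidate tasks."""
    return sum(p * t for p, t in power_time_pairs)

def communication_energy(bits_tx, bits_rx, a_tx, a_rx):
    """E_c: per-bit transmit/receive energies times the bit counts,
    i.e. A_t * B_t + A_r * B_r."""
    return a_tx * bits_tx + a_rx * bits_rx

def total_energy(power_time_pairs, bits_tx, bits_rx, a_tx, a_rx):
    """System energy E = E_x + E_c."""
    return (execution_energy(power_time_pairs)
            + communication_energy(bits_tx, bits_rx, a_tx, a_rx))
```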

Properties of graph edges
Each edge of the extended task flow graph denotes a communication edge between two candidate tasks, e.g., T_i → T_j. Its attributes are the communication latency and the communication energy. To calculate them, the bandwidth B of each communication channel must be defined, along with the output data D generated and transmitted by each candidate task. Let T_i → T_j, ∀i, j ∈ {1, . . . , s}, where s is the number of the application's tasks, denote that candidate task T_i transmits data to candidate task T_j, and let D_{T_i} be the data generated by candidate task T_i. It should be noted that each candidate task T_i encapsulates the information of which computational unit it corresponds to. Moreover, let U_x → U_y denote that computational unit U_x transmits data to computational unit U_y, and let B_{U_x U_y}, ∀x, y ∈ {1, . . . , n}, where n is the number of computational units, denote the bandwidth of the channel U_x → U_y. The communication latency is then

C_L(T_i → T_j) = D_{T_i} / B_{U_x U_y},

that is, the latency required for task T_i, allocated to computational unit U_x, to send its data to computational unit U_y. Similarly, following the energy model, the communication energies needed for the transmission and reception of the data are

C_E_t(T_i → T_j) = A_t · D_{T_i}  and  C_E_r(T_i → T_j) = A_r · D_{T_i}.

Variables in the MP Formulation
To formulate the problem, let i denote a candidate task T_i and j, k denote computational units U_j and U_k, respectively. Also, let E_ij, L_ij, M_ij and S_ij ∈ R≥0 be the energy, latency, memory footprint and secondary memory requirements of candidate task T_i when allocated for execution on computational unit U_j. Moreover, let C_L_ijk and C_E_ijk ∈ R≥0 be the communication latency and communication energy needed for T_i, allocated for execution on U_j, to send a specific amount of data to U_k. Given the extended task flow graph G, the values E_ij, L_ij, M_ij, S_ij, C_L_ijk and C_E_ijk, and the energy budgets E_j ∈ R≥0 together with the memory budgets M_j ∈ R≥0 and secondary memory budgets S_j ∈ R≥0, we want to find the optimal task allocation such that the overall latency L is minimized. The problem is modeled using MP with the following variables:
• A binary variable x_ij, with x_ij = 1 if task i is allocated for execution on unit j, and 0 otherwise.
• A binary variable x_ijk, with x_ijk = 1 if the outgoing link j → k of task i is activated, and 0 otherwise.
• A variable p, which indicates the difference between the ingoing and the outgoing links on a node.
In our formulation we define a super-source node, which communicates directly with all tasks that have no predecessors, and a super-sink node, which is reached directly by all nodes with no immediate successors. These are not shown in the extended task flow graph (Fig. 1).
Objective Function
The objective of the MP is to find a solution that minimizes the overall latency of the system. Let I be the set of candidate tasks of the extended task flow graph and J be the set of computational units.
The MP minimizes the overall latency subject to the following constraints. First, each task i is assigned to exactly one of the computational units j. Second, only one of the outgoing links j → k of task i is activated at a time. A further constraint correlates the two decision variables x_ij and x_ijk. A flow-balance constraint,

Σ_{k:(ij→k)} x_ijk − Σ_{k:(k→ij)} x_kij = p,

requires that if there is an ingoing link to a task i then it must also have an outgoing link; in this way, the path between the nodes is guaranteed to be connected. The remaining constraints are resource constraints: for the main and secondary memory usage as well as the energy consumption, we use a simple summation of the memory, physical storage, and computation and communication energy needed by each task i, respectively. At any given time, the total main and secondary memory usage along with the total energy consumption must not exceed the specified memory and energy budgets. If a constraint is violated, Metis identifies this and reports that there is no solution due to a constraint violation. It should be noted that there is no formulated constraint for the event that a task cannot be executed on a specific computational unit; this is encoded in the extended task flow graph instead.
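To make the allocation problem concrete, the following sketch solves a toy two-task instance by exhaustive search instead of MP. All latency numbers are hypothetical, and the real framework uses an MP solver such as Gurobi rather than enumeration:

```python
from itertools import product

UNITS = ("edge", "hub", "cloud")

# Hypothetical per-unit computation latencies L_ij for two dependent tasks.
COMP = {"T1": {"edge": 5.0, "hub": 2.0, "cloud": 1.0},
        "T2": {"edge": 4.0, "hub": 2.5, "cloud": 1.0}}

# Hypothetical communication latencies for moving T1's output to T2's unit.
# There is no direct edge-cloud link, so that entry is the bridged cost
# edge -> hub -> cloud (1.0 + 3.0).
COMM = {("edge", "edge"): 0.0, ("edge", "hub"): 1.0, ("edge", "cloud"): 4.0,
        ("hub", "edge"): 1.0, ("hub", "hub"): 0.0, ("hub", "cloud"): 3.0,
        ("cloud", "edge"): 4.0, ("cloud", "hub"): 3.0, ("cloud", "cloud"): 0.0}

# T1 is pinned to the edge device, mirroring the fixed-allocation tasks.
ALLOWED = {"T1": ("edge",), "T2": UNITS}

def overall_latency(alloc):
    """Computation latency of both tasks plus the link latency between them."""
    return (COMP["T1"][alloc["T1"]] + COMM[(alloc["T1"], alloc["T2"])]
            + COMP["T2"][alloc["T2"]])

def best_allocation():
    """Enumerate every feasible allocation and keep the minimum-latency one."""
    candidates = [dict(zip(("T1", "T2"), units))
                  for units in product(ALLOWED["T1"], ALLOWED["T2"])]
    return min(candidates, key=overall_latency)
```

Enumeration is exponential in the number of tasks, which is exactly why the formulation is handed to an MP solver; the sketch only illustrates what the binary variables x_ij encode.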

EVALUATION

Case Study
To evaluate Metis, we use a real-life problem that involves an autonomous UAV-based power line inspection. The UAV first detects a pylon and then follows the power lines in an attempt to detect transmission line anomalies, such as vegetation, faulty insulators, etc. The problem is very relevant to our targeted applications as it involves a dynamic decision support system at the UAV, which is also the edge device. The application consists of two main parts, as shown in Figure 2. The first part is the power transmission line detection process and the second is the pylon and pole detection process. The power transmission line detection process takes as input an image captured from the UAV's camera and then uses a Hough transform based approach to detect the power transmission lines. The main component of the pylon and pole detection process is a detector based on a convolutional neural network (CNN) [9]. Overall, the process involves several tasks which can run either on the UAV, on a hub node such as a ground station, or on a remote server (i.e., the cloud).
Metis receives the extended task flow graph as input and generates the optimal solution. Based on the application's constraints, the groups of candidate tasks are constructed to create the extended task flow graph. Next, using profiling tools and a digital energy monitoring device, we computed the various metric values for each candidate task, such as the latency, the energy consumption, the memory footprint, the storage requirements, as well as the communication latency and communication energy, depending on the targeted computation and communication platform. More details are provided in section 4.2. Using these values, we generated the extended task flow graph. It is worth mentioning that in our use case there is no direct communication between the edge node and the cloud unit; therefore, this connection can be established only through an intermediate node, in our case the hub node (ground station).

Figure 1: Extended Task Flow Graph of the considered real-life use-case.

Figure 2 shows the application flowchart, which also corresponds to the original task flow graph, and Figure 1 shows the extended task flow graph used by Metis. Observe that tasks T1 and T15 in Figure 2 can only be executed on the edge and hub devices, respectively. This is reflected in the extended task flow graph in Figure 1, where there is a single node T1e for T1 and a single node T15h for T15.
It should be noted that the UAV's flight-control computation has been excluded from our framework, as it concerns the UAV's flight capabilities and is not part of the application.

Experimental Setup
For our setup, we used four different computational devices: the Odroid XU4 and the Raspberry Pi 3 (RPi3) as the possible edge devices that can be attached as the UAV's payload to perform the computation, and the Mi Notebook Pro and the Samsung Tab S2 as the possible hub devices. These devices were selected because they belong to families of devices typically deployed in such use-case scenarios and they offer different computation and communication characteristics, enabling a qualitative and quantitative evaluation of our framework. For the cloud unit, we used a server equipped with an Intel Xeon E5-2670v2. To evaluate our system we used four different configurations, two for the edge node and two for the hub node, while the cloud unit remained the same for all configurations. Table 1 shows in detail the various configurations we used to test our framework as well as the parameters and constraints for the edge and hub devices and the cloud unit. Furthermore, we ran our framework twice for each configuration, to account for bandwidth variations that depend on where the application is deployed (e.g., rural or urban areas) and on the distance between the nodes [1]. During the first run, the bandwidths of the communication channels Hub → Cloud and Cloud → Hub were set to 2.0× those of the second run, as shown in Table 2. To solve the formulated problem and find the optimal solution, we ran the Gurobi Optimizer 8.1 [6] on an Intel Core i5-6500 CPU @ 3.20 GHz (7.7 GiB of RAM). The solution for the given extended task flow graph was extracted on the order of milliseconds (130 ms). Figure 3 and Figure 4 report the overall latency of the system for each of the four configurations per run.
More specifically, they illustrate the computation latency of each computational device (edge, hub, and cloud) as well as the communication latency of each communication channel (edge → hub, hub → edge, hub → cloud, and cloud → hub) incurred for a given task allocation. The computation (communication) latency components are shown in solid (patterned) color. It should be noted that for each configuration (A, B, C, D) we consider four different task-allocation scenarios: the three extreme cases where, if possible, all tasks are allocated to a single computational unit, and the allocation produced by Metis. More specifically, in each of the three extreme cases, all tasks, where possible, were allocated only at the edge, the hub, or the cloud, respectively. Only tasks T1 and T15 have a fixed allocation, as explained in Section 4.1.

Results and Discussion
For each configuration and run, the results show that Metis returns the optimal task allocation, that is, the one minimizing the overall latency of the system. In Run 1 (fig. 3), the second-best performance in terms of latency was the case in which all tasks were allocated at the hub device, while the worst performance was the case in which all tasks were allocated at the edge node; this was evident in all four configurations. In Run 2 (fig. 4), the second-best performance, for all configurations, was again the case in which all tasks were allocated at the hub device. The worst performance, though, was observed when all tasks were allocated at the edge node only for configurations C and D; for configurations A and B, the worst performance was observed when all tasks were allocated at the cloud, because the Odroid XU4 is faster than the RPi3 and the savings in computation latency cannot overcome the communication latency to the cloud. While allocating all tasks at the edge or on the cloud may seem like a sensible approach, our experiments show otherwise: computation latency is the bottleneck when all tasks are allocated at the edge, whereas communication latency is the main limitation when all tasks are allocated on the cloud. These observations call for a hybrid approach.
The bandwidth of the communication channels, as well as the data generated by each task, play a key role in the allocation solution and the overall latency of the system. Beyond that, the compute capability of each computational unit is a crucial factor that can also change the allocation. Table 3 shows how the tasks are allocated to the different computational units (represented by the letters E, H, and C for the edge node, the hub node and the cloud unit, respectively). Evidently, the allocation changes when both the bandwidth of the hub → cloud communication channel and the compute capability of the hub device are reduced.
These results show how Metis can assist a designer: the designer can determine whether the bandwidth of a specific communication channel suffices for adequate communication latency, and whether the compute capability of a specific computational unit is powerful enough to run the application's tasks with minimum latency. As such, Metis can serve as a tool that facilitates architectural choices for edge/hub/cloud computing platforms, both in terms of computation and communication capabilities.

CONCLUSION & FUTURE WORK
In this work, we proposed a novel framework, Metis, to optimally allocate an application's tasks to the different computational units of the edge/hub/cloud paradigm. Specifically, we introduced the extended task flow graph, a more suitable way to represent an application's tasks for this kind of problem. Furthermore, we presented an MP formulation of the task allocation problem and illustrated, on a real-life UAV-based power line inspection use case, that Metis yields the task allocation that minimizes the overall latency of the system under the resource constraints of each computational unit.