Task Scheduling in Heterogeneous Multiprocessor Environments–An Efficient ACO-Based Approach

ABSTRACT


INTRODUCTION
Technology and computer architecture have advanced considerably over the years, and heterogeneous multiprocessor systems are among these advances. Increasingly popular for their diverse capabilities [1], these high-performance environments offer several benefits, including increased throughput and the potential for faster scheduling through increased parallelism. As such, task scheduling continues to be actively studied in order to fully exploit the benefits that these systems offer. Task scheduling is defined as the assignment of the tasks of a parallel application to different processors in a manner that minimizes the overall completion time, or schedule length (SL), of the application while ensuring that all constraints are satisfied [2]. In a heterogeneous environment, scheduling these interdependent tasks becomes even more challenging because the processors run at different speeds, so the computational cost of each task varies from processor to processor [3].
A program or parallel application may be modeled by a task graph in the form of a weighted directed acyclic graph (DAG), $G = (V, E)$, where $V$ denotes the set of nodes ($n_i$), which represent the tasks of the application, and $E$ denotes the set of edges, which indicate the data dependencies between the tasks.
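As an illustration of this model, the following minimal sketch stores a small task graph and derives a precedence-respecting order of its tasks. The four-task graph, the edge weights (interpreted here as communication costs) and the helper name `topological_order` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical 4-task graph: (parent, child) -> assumed communication cost.
edges = {(0, 1): 4, (0, 2): 3, (1, 3): 2, (2, 3): 5}
num_tasks = 4

children = defaultdict(list)
parents = defaultdict(list)
for (u, v) in edges:
    children[u].append(v)
    parents[v].append(u)


def topological_order(num_tasks, parents, children):
    """Kahn's algorithm; raises ValueError if the graph has a cycle."""
    indegree = {n: len(parents[n]) for n in range(num_tasks)}
    ready = [n for n in range(num_tasks) if indegree[n] == 0]
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    if len(order) != num_tasks:
        raise ValueError("graph contains a cycle")
    return order


order = topological_order(num_tasks, parents, children)
```

Any valid schedule must process the tasks in an order of this kind, since a task cannot start before the tasks it depends on have finished.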

The ACO Metaheuristic
Ant Colony Optimization (ACO) algorithms were first introduced by Dorigo and his colleagues in the early 1990s and form part of a wider research area known as Swarm Intelligence, which models solutions to combinatorial and optimization problems based on the behavior and processes exhibited in nature [25]. ACO is inspired by the indirect communication of a foraging ant colony, where the survival of the entire colony, and not simply individual survival, governs the ants' behavior. This indirect communication, known as stigmergy, enables ants to find very short paths between food sources and their nest [26].
In the initial stages of foraging, the ants explore the area randomly, depositing chemical pheromone trails as they travel. When food is encountered, its quality and quantity are assessed and pheromone is deposited from the food source to the nest. Subsequent foraging ants use these pheromone trails to guide them to the food and are more likely to follow paths marked by strong pheromone concentrations, which reinforces the pheromone density on those paths and thus increases their attractiveness for later ants. This reinforcement leads to convergence on the most attractive path. Evaporation of pheromone on the trails provides the limiting mechanism for this positive feedback, so less frequented paths have decreased pheromone concentration.
The ACO metaheuristic (Fig. 1) applies the foraging behavior of natural ants in a computational environment and iteratively constructs candidate solutions, using artificial pheromone and local heuristics to guide the artificial agents (ants) through the investigated search space. The pheromone trails bias future agents toward high-quality solutions until a termination condition is satisfied.
In contrast to foraging ants in nature, which deposit a continuous trail of pheromone, ACO approaches have implemented various alternatives [27]. For example, in the original Ant System (AS) [28], ants deposit pheromone only on completed solutions. Alternatively, the Ant Colony System (ACS) [29] makes step-by-step online (local) pheromone deposits by every agent during the construction of solutions and introduces a further offline (global) update of pheromone on the best solution of the iteration. Additionally, some kind of evaporation mechanism is implemented, allowing the ants to consider new areas of the search space [27]. Furthermore, some ACO techniques employ local and global optimization strategies to further increase the quality of the solutions produced.
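The two kinds of ACS update can be sketched as follows. This is a minimal illustration of the commonly cited ACS rules, in which the local update pulls a used edge back toward the initial pheromone level and the global update reinforces only the best solution. The parameter names `xi`, `rho` and `tau0`, the default values and the toy edges are assumptions chosen for illustration. They are not the settings of [29] or of our algorithm.

```python
def local_update(tau, edge, tau0, xi=0.1):
    """Online (step-by-step) update: pull the used edge back toward tau0."""
    tau[edge] = (1.0 - xi) * tau[edge] + xi * tau0


def global_update(tau, best_edges, best_length, rho=0.1):
    """Offline update: evaporate and reinforce only the best solution's edges."""
    for edge in best_edges:
        tau[edge] = (1.0 - rho) * tau[edge] + rho * (1.0 / best_length)


# Toy pheromone table keyed by edge (assumed values).
tau0 = 0.01
tau = {("a", "b"): 0.05, ("b", "c"): 0.05, ("a", "c"): 0.05}

local_update(tau, ("a", "c"), tau0)                              # edge just used by an ant
global_update(tau, [("a", "b"), ("b", "c")], best_length=10.0)   # best solution so far
```

The local update lowers the pheromone on an edge that has just been used, which encourages later ants to diversify, while the global update raises it on the edges of the best solution, with shorter solutions receiving a larger deposit.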
The ACO technique has been applied to various optimization, classification and scheduling problems [25]. It has been combined with other random search algorithms, for example, the Genetic Algorithm and Tabu Search. ACO has also been combined with list scheduling, for instance, in the ANT-LS algorithm [27] and the ACO-TMS [30]. This combination of pheromone trails and list scheduling heuristics provides further guidance for the ants toward good-quality schedules. Given the versatility of ACO algorithms, we present an ACO-based algorithm built on the foundation of ACO. To produce efficient schedules, our proposed algorithm incorporates the upward ranking concept used in the HEFT algorithm [8] in its prioritization methodology, together with an insertion-based policy and pheromone aging. Our research investigates the application of an efficient solution to the static task scheduling problem in a heterogeneous environment where dependencies between the tasks are taken into consideration.
For our scheduling system model, the target computing environment consists of a set of processors $P$, where $P = \{p_1, p_2, p_3, \ldots, p_{|P|}\}$ and $|P|$ denotes the number of processors. Our model assumes heterogeneous, non-preemptive processors connected in a fully connected topology, with contention-free interprocessor communication. The main objective of the task scheduling problem is to determine a mapping of the tasks of a given application to processors that minimizes the schedule length.
The remainder of the paper is organized as follows. We describe our proposed algorithm in Section 2 and outline our methodology for performance evaluation in Section 3. Results obtained from a performance comparison of the ACS [29] and ACO-TMS [30] with our proposed work are presented and discussed in Section 4, and we summarize our conclusions and future work in Section 5.

THE PROPOSED ALGORITHM

Overview of Our Algorithm
Our proposed algorithm (Fig. 2) is called the ranking-Ant Colony System (rACS). It combines the foundation of the Ant Colony System (ACS) with a heuristic function inspired by the list scheduling algorithm HEFT. ACS gains flexibility from its offline pheromone update, and HEFT has yielded good performance as a list scheduling algorithm through its use of the upward rank value for prioritization.
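For readers unfamiliar with the upward rank, the sketch below computes it as defined for HEFT [8]: the rank of a task is its average computation cost plus the largest value, over its children, of the average communication cost to the child plus the child's rank. The cost tables and the four-task graph are hypothetical, and the sketch does not show how rACS combines the rank with pheromone in its heuristic function.

```python
from functools import lru_cache

# Hypothetical data: cost[i][p] is the computation cost of task i on processor p.
cost = [[4, 6], [3, 5], [8, 4], [2, 2]]
comm = {(0, 1): 2, (0, 2): 1, (1, 3): 3, (2, 3): 2}  # assumed average edge costs
children = {0: [1, 2], 1: [3], 2: [3], 3: []}


def mean_cost(i):
    """Average computation cost of task i over all processors."""
    return sum(cost[i]) / len(cost[i])


@lru_cache(maxsize=None)
def upward_rank(i):
    """rank_u(i) = mean cost of i + max over children (comm + rank_u(child))."""
    tail = max((comm[(i, j)] + upward_rank(j) for j in children[i]), default=0.0)
    return mean_cost(i) + tail


ranks = {i: upward_rank(i) for i in children}
# Tasks in decreasing rank order, which is also a valid precedence order.
priority = sorted(ranks, key=ranks.get, reverse=True)
```

In this example the entry task receives the highest rank and the exit task the lowest, so sorting by decreasing rank never places a task ahead of one of its parents.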
First, we initialize the two matrices that represent the pheromone: a $V \times P$ matrix, which we denote as $\tau$, and a $P \times V$ matrix, which we denote as $\tau_1$. The entry $\tau(i, p)$ indicates the pheromone on the edge between task $i$ and processor element $p$, whereas $\tau_1(p, j)$ indicates the pheromone on the edge between processor element $p$ and task $j$. Therefore, if $n_{entry} \to p_2 \to n_3 \to p_1 \to n_2 \to p_{|P|} \to \ldots \to n_{exit} \to p_1$ is a possible solution (a complete mapping of the task graph within the search space, in which an ant starts at the entry node $n_{entry}$ and moves from task to processor and from processor to task until a processor has been selected for the exit node $n_{exit}$), then $\tau(n_3, p_1) \in V \times P$ and $\tau_1(p_1, n_2) \in P \times V$. Initially, a small pheromone deposit is made to all elements of each matrix, and the ready list (RL) is initialized to contain the entry node.
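The initialization step can be sketched as follows. The function name `init_pheromone`, the value of the initial deposit and the problem size are hypothetical.

```python
def init_pheromone(num_tasks, num_procs, tau0):
    """Return (tau, tau1): task-to-processor and processor-to-task trails."""
    tau = [[tau0] * num_procs for _ in range(num_tasks)]    # V x P
    tau1 = [[tau0] * num_tasks for _ in range(num_procs)]   # P x V
    return tau, tau1


# Assumed sizes: 4 tasks, 2 processors, small initial deposit of 0.01.
tau, tau1 = init_pheromone(num_tasks=4, num_procs=2, tau0=0.01)
ready_list = [0]  # task 0 plays the role of the entry node here

# Pheromone seen by an ant on the step "task 3 -> processor 1 -> task 2".
step_trails = (tau[3][1], tau1[1][2])
```

Keeping two matrices lets the trail for choosing a processor for a task evolve independently of the trail for choosing the next task after a processor.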
Our iterative ant colony algorithm then executes as follows. For each ant, in each iteration, an ant list of length $|V|$ is created that stores each task together with its selected processor. To construct a schedule, the ant selects a task from the ready list using state transition (ST) rule (1) and a processor using state transition (ST) rule (2). The selected task is removed from the ready list and appended, along with the processor, to the ant list. The ready list is then updated to contain all the unscheduled children nodes of those parents that have already been scheduled. This process is repeated until all the tasks have been mapped. During the first iteration, rACS does not employ the state transition rules to select either the task or the processor; both are selected in a random manner, thus mimicking the ants' natural environment. Throughout the execution of the algorithm, an insertion-based policy is employed whereby the task is checked to see whether it can be scheduled earlier on the chosen processor, which gives our approach the opportunity to achieve shorter schedules. After each iteration, an online or local pheromone update is applied to the best $q$ ants ($q \le K$, where $K$ is the number of ants per iteration), according to the local pheromone updating (LPU) rule using (6), (7) and (8). Following this update, there is also an offline or global update to the best ant solution of the iteration according to the global pheromone updating (GPU) rule using (9), (10) and (11).
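The construction loop for the first iteration, where both choices are random, can be sketched as follows. The sketch interprets the ready-list update as admitting a child once all of its parents have been scheduled, which is an assumption about the precise rule. The function name and the four-task graph are hypothetical, and the state transition rules used in later iterations are not shown.

```python
import random


def construct_random_solution(parents, children, num_procs, rng):
    """Build one ant list of (task, processor) pairs by random choice."""
    scheduled = set()
    ready = [n for n in parents if not parents[n]]  # entry node(s)
    ant_list = []
    while ready:
        task = rng.choice(ready)           # random task from the ready list
        proc = rng.randrange(num_procs)    # random processor
        ready.remove(task)
        scheduled.add(task)
        ant_list.append((task, proc))
        # Assumed rule: a child becomes ready once all its parents are scheduled.
        for child in children[task]:
            if child not in scheduled and child not in ready and all(
                p in scheduled for p in parents[child]
            ):
                ready.append(child)
    return ant_list


parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}
children = {0: [1, 2], 1: [3], 2: [3], 3: []}
ant_list = construct_random_solution(parents, children, 2, random.Random(0))
```

Whatever the random choices, the resulting ant list contains every task exactly once and never places a task before one of its parents.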
Our algorithm also attempts to alleviate stagnation by employing a pheromone aging mechanism (represented as Φ). This mechanism monitors how the best solution changes over the course of the execution of the algorithm. If the value of the best schedule length remains unchanged after a predetermined number of iterations, a deliberate evaporation of the pheromone trail of that schedule is invoked. When invoked, a random value, which is always less than or equal to the initial pheromone, is generated for Φ and applied. Repetition of this condition depends on the random generation of a value that is less than or equal to the total number of iterations of the algorithm. This deliberate evaporation increases the probability of exploring new, possibly better-quality solutions.
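One way to picture the aging mechanism is the sketch below. The text does not state how Φ is applied to the trail, so the sketch assumes that it is subtracted from each edge of the stagnant best schedule and that the result is clamped at zero. The function names, the `patience` parameter and the sample values are hypothetical.

```python
import random


def is_stagnant(best_lengths, patience):
    """True if the best schedule length is unchanged over the last `patience` iterations."""
    recent = best_lengths[-(patience + 1):]
    return len(recent) == patience + 1 and len(set(recent)) == 1


def age_pheromone(tau, best_edges, tau0, rng):
    """Assumed rule: subtract a random phi <= tau0 from each edge of the best schedule."""
    phi = rng.uniform(0.0, tau0)
    for task, proc in best_edges:
        tau[task][proc] = max(0.0, tau[task][proc] - phi)
    return phi


history = [50, 42, 42, 42]            # best schedule length after each iteration
tau = [[0.5, 0.5], [0.5, 0.5]]        # toy task-to-processor trails
phi = None
if is_stagnant(history, patience=2):
    phi = age_pheromone(tau, [(0, 1), (1, 0)], tau0=0.01, rng=random.Random(1))
```

Only the edges of the stagnant best schedule are weakened, so the remaining trails are untouched and other regions of the search space become relatively more attractive.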

Criteria for Task and Processor Selection
With our proposed algorithm, the task and processor selections are governed by two state transition rules as follows: Task Selection Rule: Each ant selects a task (i) from the ready list (RL) according to the probability calculated by

Criteria for Pheromone Update
The

RESEARCH METHOD
We conducted a comprehensive performance evaluation of our algorithm using a two-pronged approach: (1) an evaluation of some of the attributes of our algorithm and (2) a comparison of our proposed work with two published ACO-based algorithms. During the analysis of our proposed algorithm, we investigated the efficiency of the following properties:
• Utilization of Idle Processor Time
• Randomness with First Iteration
• Efficiency of Pheromone Aging Mechanism
In the experiment on randomness in the first iteration, we investigated the influence of guidance versus randomness during the first iteration. To test this, we used DAG instances published in the literature [8], [22], [3]. This was done so as to avoid bias toward any specific DAG. Table 1 shows the published makespans of the selected DAGs.
In the second phase of our evaluation, we compared our proposed work with the ACS [29] and ACO-TMS [30] algorithms using randomly generated task graphs. For this comparison, a total of 13,500 random graphs with various characteristics were generated and then executed. The algorithms were then compared based on selected comparative metrics.

Attributes of Randomly Generated DAGs
In our experiment, the following input parameters, which were also used in [8], were used to generate the task graphs:
• Number of tasks in the DAG ($|V|$).
• Out-degree of a node ($O_{deg}$). This is the maximum number of children of a node.
• Shape parameter of the graph (α).
• Communication to computation ratio (CCR). This is the ratio between the average communication cost and the average computation cost.
• Range percentage of computation costs on processors (β). This is the heterogeneity factor for processors. A higher percentage value indicates a significant difference in the computation cost across the processors, while lower values are indicative of more subtle differences in computation costs.
For each experiment, the values of these parameters were assigned from the sets given below.

Comparative Metrics
a. Speedup: The speedup is defined as the ratio between the sequential time and the parallel execution time of a process. The sequential time is calculated by adding, sequentially, the computational cost of each task in the graph. This is done for each processor and the smallest value is used. The parallel execution time is the completion time of the graph, which is also referred to as the makespan or schedule length (SL). Therefore,
$$ Speedup = \frac{\min_{p_j \in P} \left\{ \sum_{n_i \in V} w_{i,j} \right\}}{SL} $$
where $w_{i,j}$ denotes the computation cost of task $n_i$ on processor $p_j$.
b. Schedule Length Ratio: The schedule length (SL) is the main performance measure of a scheduling algorithm. In our experiment, a large set of task graphs with varying properties is used, and it is therefore necessary to normalize the schedule length to the lower bound. This normalized value is called the Schedule Length Ratio (SLR). The SLR is defined as follows:
$$ SLR = \frac{SL}{\sum_{n_i \in CriP_{MIN}} \min_{p_j \in P} \{ w_{i,j} \}} $$
The denominator is the summation of the minimum computation costs of the tasks on the $CriP_{MIN}$ (minimum Critical Path). The $CriP_{MIN}$ is derived by first setting each task ($n_i$) to its minimum computational cost and calculating the length of the Critical Path ($|CP|$) using these values.
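Both metrics can be computed directly from a cost table and a schedule length, as in the minimal sketch below. The cost table, the graph and the schedule length of 12 are hypothetical, and the critical path is measured here on computation costs alone, which is an assumption about how communication costs are treated.

```python
from functools import lru_cache

# Hypothetical data: cost[i][p] is the computation cost of task i on processor p.
cost = [[4, 6], [3, 5], [8, 4], [2, 3]]
children = {0: [1, 2], 1: [3], 2: [3], 3: []}


def speedup(cost, schedule_length):
    """Best single-processor sequential time divided by the makespan."""
    num_procs = len(cost[0])
    sequential = min(sum(row[p] for row in cost) for p in range(num_procs))
    return sequential / schedule_length


def slr(cost, children, schedule_length):
    """Makespan divided by the minimum-cost critical path length."""
    @lru_cache(maxsize=None)
    def longest_from(i):
        # Longest chain of minimum computation costs starting at task i.
        tail = max((longest_from(j) for j in children[i]), default=0)
        return min(cost[i]) + tail

    entries = set(children) - {j for cs in children.values() for j in cs}
    return schedule_length / max(longest_from(i) for i in entries)


sp = speedup(cost, schedule_length=12)
ratio = slr(cost, children, schedule_length=12)
```

Since no schedule can be shorter than the minimum-cost critical path, the SLR is never below 1, and values closer to 1 indicate better schedules.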

Preliminary Analysis of Our Proposed Work
ACO-based algorithms can obtain shorter schedules when they (i) incorporate functionality that ensures processor idle time is kept to a minimum and (ii) allow ants to randomly select schedules, thereby mimicking their natural environment. ACO-based algorithms found in the literature generally apply a local optimization strategy after generating a solution. The idea behind this strategy is normally to make adjustments, where feasible, to improve the solution obtained. One such strategy is to effectively utilize idle processor time [21], [22], [26], thus reducing the overall schedule length. On this basis, we designed our algorithm such that, as each ant constructs a solution and a task is selected, the identified processor is searched for idle slots into which the task can be inserted, so that it can achieve the earliest possible finish time. While our utilization of idle slots is consistent with the literature, in our approach the use of idle processor slots is determined as the schedules are built, not afterward.
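The idle-slot search on a single processor can be sketched as follows. The function name `earliest_start`, the representation of the processor as a sorted list of busy slots and the sample values are hypothetical. Here `ready_time` stands for the time at which the task's input data are available on that processor.

```python
def earliest_start(busy, ready_time, duration):
    """Earliest start >= ready_time that fits in an idle gap of the processor.

    busy is a list of (start, end) slots sorted by start time.
    """
    start = ready_time
    for slot_start, slot_end in busy:
        if start + duration <= slot_start:
            break                      # task fits in the idle gap before this slot
        start = max(start, slot_end)   # otherwise try after this slot
    return start


busy = [(0, 4), (10, 15)]               # processor is idle during [4, 10)
fits_gap = earliest_start(busy, ready_time=2, duration=5)   # placed inside the gap
too_long = earliest_start(busy, ready_time=2, duration=7)   # appended after the last slot
```

A task of duration 5 fits into the idle gap and starts at time 4, whereas a task of duration 7 does not fit and has to start at time 15, after the last busy slot.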
We also experimented, in the first iteration, with the ants selecting both the task from the ready list and the processor in a random manner. Table 2 shows that shorter schedules can be produced when this approach is implemented. We also experimented with deliberate evaporation of pheromone so as to mitigate stagnation or escape local optima. The impact was not as significant as expected. We postulate that the evaporation had minimal impact because of the timing of its invocation and the amount of pheromone removed. However, because of the randomness of this activity, when invocation occurred during the early iterations, where the pheromone concentration was not high, newer opportunities were provided. We anticipate that a more impactful and useful approach would be to allow, at the beginning of each iteration, a random number of ants to create solutions randomly. These new schedules would be incorporated if they are worthy.
The performance of the proposed algorithm (rACS) was further evaluated by comparison with its progenitor, the ACS algorithm [29], and the ACO-TMS algorithm [30]. For this comparison, random directed acyclic graphs (DAGs) of varying attributes were generated and then executed by the algorithms.

Comparison of Proposed Work with Selected Algorithms
The rACS, ACS and ACO-TMS algorithms were first compared based on the average makespan attained with varying shape parameters. For our first experiment, DAGs of varying degrees of parallelism were generated. From the results in Fig. 3, it was found that rACS outperformed the other two algorithms for short graphs of high parallelism (larger α values). When α ≥ 1, it was 18 percent better than ACO-TMS and 48 percent better than ACS. With α < 1, where the graphs have greater depth and low parallelism, rACS performed on par with ACO-TMS and was 52 percent better than ACS.
The next experiment examined the variation of the average SLR of the algorithms as the number of nodes of the DAGs was increased. Fig. 4 shows that, as the number of nodes increases, the gap between the average SLR of our proposed algorithm and that of ACO-TMS steadily widens. This indicates that the advantage of our proposed algorithm is greater for large applications with more tasks than for smaller applications. rACS is better than ACO-TMS by 6 percent and better than ACS by 24 percent.
Fig. 5 illustrates the behavior of the algorithms in our next experiment, which investigated the average speedup as the DAG size was increased. Our proposed algorithm experienced a steady increase in the average speedup, outperforming both the ACS and the ACO-TMS algorithms. The ACS experienced minimal increase in speedup throughout this experiment. The average speedup of the ACO-TMS increased steadily, but not as markedly as that of rACS. Further, as the number of nodes of the DAG was increased from 80 to 100, our proposed algorithm outperformed the other two algorithms by its widest margin. Overall, our proposed algorithm was better than ACS and ACO-TMS by 25 and 7.5 percent, respectively. A larger speedup value is indicative of a smaller execution time in a parallel environment.
Our results suggest that, generally, our parallel execution times were consistently smaller than the sequential execution times, even as the number of nodes increased.

CONCLUSION
In order to fully exploit the high performance of heterogeneous multiprocessor environments, versatile and robust scheduling strategies that yield efficient results are required. Our proposed algorithm, rACS, is an ACO-based algorithm that utilizes an upward rank value along with an insertion-based policy to further guide the ants toward quality solutions. In our experimental study, we compared rACS with the ACS algorithm and the ACO-TMS algorithm using a set of randomly generated task graphs. rACS yielded better results, outperforming both algorithms in the various experiments, such as average speedup and average SLR for increasing DAG size, as well as for varying DAG shape. Our planned future work is to investigate and add local optimization strategies to rACS to further increase its efficiency as an algorithm for the static task scheduling problem.