Job scheduler for streaming applications in heterogeneous distributed processing systems

In this study, we investigated the problem of scheduling streaming applications in a heterogeneous cluster environment and, based on our previous work, developed the maximum throughput scheduler algorithm (MT-Scheduler) for streaming applications. The proposed algorithm uses a dynamic programming technique to efficiently map the application topology onto the heterogeneous distributed system based on computing and data transfer requirements, while also taking into account the capacity of the underlying cluster resources. The proposed approach maximizes the system throughput by identifying and minimizing the time incurred at the computing/transfer bottleneck. The MT-Scheduler supports scheduling applications structured as a directed acyclic graph. We conducted experiments using three Storm microbenchmark topologies in both simulation and real Apache Storm environments. For the performance evaluation, we compared the proposed MT-Scheduler with the simulated round robin and the default Storm scheduler algorithms. The results indicate that the MT-Scheduler outperforms both baseline schedulers in terms of the average system latency and throughput.


Introduction
At present, we live in the big data era, in which a variety of applications such as stock trading, banking systems, healthcare databases, IoT sensors, and social media networks [1] generate colossal amounts of real-time data. Distributed data stream processing systems (DDSPSs) usually compute unbounded streams of such data in real time, and they are dynamic in terms of their resource capacities [2,3]. To keep pace with this continuous data generation by streaming applications, the underlying distributed processing systems must perform prompt yet efficient management and analysis, especially in the case of heterogeneous systems [4,5]. One of the key objectives of scheduling streaming applications is to maximize the frame rate, which corresponds to the number of data instances that can be processed per unit time. To achieve this goal, the scheduling algorithm must consider the data locality, resource heterogeneity, and the communicational and computational latencies.
Data locality and location-awareness factors arise due to the high data transfer latency in cases in which the data sources reside far from the DDSPS [6], which can negatively impact the system performance [7]. Researchers have addressed this problem by performing the computing as close as possible to the data source [8]. An efficient mapping strategy should thus confine heavy data traffic to the same machine or to nearby machines to minimize the communication time and mitigate the data transfer latencies.
Furthermore, the presence of cluster heterogeneity in a distributed environment results in different capabilities for task execution and data transmission, which makes the related scheduling problem NP-complete [9–11]. Heterogeneous DDSPSs and job applications can vary in their resource capacities and task complexities, respectively [12]. Consequently, a scheduling approach that does not consider resource heterogeneity and task complexity variation in communication and computation might degrade the performance and reduce the system frame rate [13–16].
In this work, we aim to minimize the bottleneck time for the transfer time and node computing time along the execution path to achieve the maximal frame rate for streaming applications. The main contributions can be summarized as follows:

I. We propose a maximum throughput scheduling algorithm named MT-Scheduler that maximizes the system throughput by using dynamic programming. The maximization is performed by strategically assigning the task components to the appropriate nodes based on their computational and communicational requirements, building on our previous work [10]. The MT-Scheduler supports scheduling applications that are structured as a directed acyclic graph (DAG), as used in systems such as Amazon Timestream, Google MillWheel, Yahoo S4, and Twitter Heron [17–23].

II. We implement the MT-Scheduler algorithm in a simulation environment. The testing results show that the MT-Scheduler can significantly improve the system throughput compared with the corresponding performance of the simulated round robin algorithm.

III. Furthermore, we implement the MT-Scheduler in Apache Storm 0.9.7 [24] on a cluster of eight heterogeneous physical machines. For the evaluation, we test three well-known microbenchmark topologies [25–28], specifically, the linear, star, and diamond topologies. The results are compared with those for the default Apache Storm scheduler and an adaptive online scheduler [29]. The test results indicate that the MT-Scheduler outperforms both schedulers in terms of the system latency and frame rate.

IV. We propose a polynomial-time heuristic solution to a known NP-complete problem [9–11] by utilizing the dynamic programming technique in our MT-Scheduler algorithm.

V. The proposed scheduling algorithm covers a knowledge gap in the existing literature by considering both the cluster and topology characteristics as scheduling parameters, in addition to transparently allowing the user to control the data locality aspect.
The remainder of this paper is organized as follows. Section 2 provides a review of the related work. Section 3 presents the mathematical model for the system and the scheduling problem formulation. Section 4 describes the MT-Scheduler algorithm. Sections 5 and 6 present the evaluation results obtained using the simulation and real-environment experiments, respectively. Finally, Section 7 concludes the paper and discusses future work.

Related work
Extensive research on scheduling strategies for distributed streaming processing systems has been performed [7, 16, 30–34]. Most of the proposed algorithms aim to improve the system performance by reducing the time and cost incurred by scheduling. In Apache Storm [24], a simple round robin (RR) algorithm is used as the default scheduler [35]; however, it does not ensure satisfactory performance. In addition, several Storm scheduler algorithms have been proposed to optimize the system performance.
Aniello et al. [29] proposed two types of scheduling algorithms for Storm, namely offline and online schedulers, using which the tuple transfer latency between the components could be reduced. The offline scheduler identified the most connected components from the job DAG topology and mapped them to the same node. During runtime, the online scheduler monitored the tuple transfer latency and adjusted the mapping schema accordingly by using a best-fit greedy approach to minimize the interslot and internode traffic. In this approach, each component task pair was examined separately from the other topology components, likely resulting in two extensively communicating components being mapped to different nodes.
Peng et al. [25] proposed an offline resource-aware scheduler, namely R-Storm, to achieve the maximum throughput and resource utilization within a user-predetermined resource budget. This algorithm conducts topological sorting by using the breadth-first search (BFS) principle to minimize the internode traffic latency. The user-specified resource constraints are then passed as parameters to a quadratic multiple 3D knapsack problem. R-Storm can outperform the default scheduler; however, it requires extensive user involvement.
Likewise, a traffic-aware scheduler named T-Storm [26] was proposed to minimize the internode and interprocess traffic. This solution, in contrast to R-Storm, was transparent to users; however, the intercommunication between the tasks was ignored.
Cardellini et al. [36,37] and Nardelli et al. [38,39] performed task scheduling over geographically distributed heterogeneous clusters under QoS constraints. The network-aware scheduling algorithm proposed by these researchers minimized the network traffic and improved the system efficiency in terms of the communication latency, cluster resource utilization, and application availability.
Li et al. [40] proposed a scheduling strategy by implementing dynamic topology adjustment for Apache Storm. The topology optimization enabled the identification of performance bottlenecks by examining the bolt capacity and the incoming/outgoing tuple transfer queue.
Zhang et al. [41] developed a latency-aware edge computing platform built on Apache Storm. This approach could be used to minimize the end-to-end latency in the case of heterogeneous network and node resources (GPUs and CPUs).
Liu et al. [42] presented a heuristic scheduling algorithm for Apache Storm, in which the historical traffic latencies and the task topology were used to predict the system performance. The tuple processing latency and tuple failure rate were reduced by identifying the overloaded node for task migration. However, this algorithm could only function in a homogeneous cluster.
Shukla and Simmhan [43] proposed a heuristic algorithm that used a model-driven approach from queueing theory for resource allocation prediction and task mapping to maximize the throughput. The same task threads were allocated and scheduled on the same machine or adjacent nodes to reduce the intercommunication and achieve the peak data rate.
Kombi et al. [44] introduced a holistic approach (DABS-Storm) that adapts to the task requirements by dynamically controlling the resource usage as a latency-aware load balancing strategy for stream processing systems.
Eskandari et al. [28] presented an online scheduler based on the topology DAG partition as an extension to their P-Scheduler [27]. The algorithm aimed to minimize the data transfer and maximize the resource utilization by considering the network and task characteristics. In addition, Liu et al. [45] proposed a dynamic resource-aware scheduler named D-Storm by using a greedy algorithm to solve the bin packing problem.
Among the aforementioned scheduling strategies, most of the algorithms consider the topology structure, intercommunication traffic, or computing node load. However, the heterogeneity of the task, network, and computing resources is not always considered. The proposed scheduling algorithm overcomes these limitations of the algorithms reported in the existing literature. Unlike the existing approaches, the MT-Scheduler maximizes the throughput of a heterogeneous DDSPS by considering both the cluster and application characteristics as scheduling parameters. In addition, the algorithm identifies and minimizes the potential computational or communicational bottlenecks by utilizing the dynamic programming technique. Furthermore, the proposed algorithm allows the users to transparently select the sites and configure the data locality.

Problem formulation

Problem definition
As in our previous work [10], the underlying node cluster is modeled as a graph $G_C = (V_C, E_C)$, where $V_C$ denotes the cluster set of geographically distributed heterogeneous nodes (vertices) $n_i$, with $i = 1, 2, \ldots, m$. Each node $n_i$ has a processing power attribute $p_i$. $E_C$ denotes the set of cluster network links (edges), where node $n_i$ is connected to its neighbor node $n_{succ}$ by a network link of bandwidth $b_{i,succ}$. The transport network may or may not be a complete graph, depending on whether the node deployment environment is the Internet or a network in single or multiple distributed sites.

An application in a distributed data stream processing system such as Apache Storm [24], Apache Flink [46], Apache Spark [47], the S4 platform [19], or Twitter Heron [20] can be represented as a DAG. Let the topology be represented by the components $c_1, c_2, \ldots, c_n$. Component $c_1$ is the data source, namely the Spout, which reads data from an external source and transmits it as data tuples to the successor application components. Component $c_j$, termed a Bolt, where $j = 2, 3, \ldots, n$, performs a computational task of complexity $x_j$ on the incoming data of size $d_{j-1}$ sent from its preceding task $c_{j-1}$. The computational components process the data tuples received from either a source or another computational component before transmitting the processed stream to another component. $E_T$ denotes the set of links (edges) that represents the dependency of the topological components and the data transfer between them.

Based on the user preferences, all the cluster nodes and topological components are divided into geographical site tags $S_{tag}$, where $tag = 1, 2, \ldots, tags_{total}$. After the metadata are configured, each cluster node and component is tagged with a metadata ID $S_{tag}$. For $tags_{total}$ unique metadata IDs, $tags_{total}$ groups exist, namely $S_1, S_2, \ldots, S_{tags_{total}}$. Each group $S_{tag}$ consists of the user-predetermined tasks and nodes that share the same metadata value.
Figure 1 shows one of the possible undesirable scenarios that can be caused by implementing the round robin algorithm. The application has different components to be mapped to a heterogeneous cluster. Due to the even distribution strategy, the RR scheduler may assign C2, which is a CPU-intensive task component, to machine N1, although other machines with a higher processing power are available. Furthermore, assigning an I/O-intensive task between C4 and C5 to nodes from different sites (N3 and N4) might incur a larger networking delay.

Objective function
Based on our previous work [10], the system performance optimization and throughput rate maximization can be realized by identifying and minimizing the potential performance bottleneck, in terms of both the computational and communicational latencies.
The computational task complexity $x_j$ is a parameter that characterizes a CPU-bound job and the complexity of its computational logic as a data operator. Correspondingly, this parameter indicates the processing power required to compute the function of a task $c_j$ on its incoming data of size $d_{j-1}$. The output data of size $d_j$ are in turn transferred to the incoming message queue of the succeeding component $c_{j+1}$ for further processing. The processing power $p_i$ of a node in a heterogeneous cluster represents the capability of the assigned executors to process $c_j$. Therefore, we can estimate the average computing time $T_{compute}$ for task $c_j$ on a node $n_i$ as follows:

$$T_{compute}(j, i) = \frac{x_j \, d_{j-1}}{p_i} \qquad (1)$$

The estimated computing time $T_{compute}(j, i)$ is the average time required to compute a task with a computational complexity of $x_j$ for tuple data of size $d_{j-1}$, executed on a supervisor node with an executor of processing power $p_i$, to produce a fully processed data unit. Practically, in the Storm environment, this time refers to the period that starts as soon as the Storm execute() method is called, which executes the required job of the task, and ends when the tuple is fully processed and ready to be transferred to the next subscribed components. In a heterogeneous cluster, the execution latency varies from high, as a potential bottleneck, to low; it depends on the task complexity and its tuple size, as well as the processing power of the assigned executor.

In DDSPSs, the tasks communicate through the transfer of messages over the underlying network links. Let $b_{i,succ}$ denote the bandwidth of the link that transfers a data tuple of size $d_j$ between node $n_i$ and its successor node $n_{succ}$. We can compute the estimated average transfer time $T_{transfer}$ as

$$T_{transfer}(j, i) = \frac{d_j}{b_{i,succ}} \qquad (2)$$

The estimated tuple transfer latency $T_{transfer}$ is the average time required to transfer an already processed tuple from the outgoing buffer of one component to the incoming queue of its successor.
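To make the two latency terms concrete, consider a toy example with assumed values (none of these numbers come from the experiments reported later): a bolt of complexity $x_j = 2$ receives 1 MB tuples ($d_{j-1} = 1$ MB) on an executor of processing power $p_i = 8$ MB/s, and forwards 1 MB output tuples ($d_j = 1$ MB) over a 10 MB/s link:

$$T_{compute} = \frac{x_j \, d_{j-1}}{p_i} = \frac{2 \times 1}{8} = 0.25\ \mathrm{s}, \qquad T_{transfer} = \frac{d_j}{b_{i,succ}} = \frac{1}{10} = 0.1\ \mathrm{s}$$

For this hop, the computation dominates, so the node rather than the link is the candidate bottleneck, and the achievable frame rate through this hop is at most $1/0.25 = 4$ tuples per second.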
The proposed mapping scheme divides the cluster nodes and topological components into the user-defined geographical site tags $S_{tag}$. Next, for each $S_{tag}$, the topological components of the group are combined into groups of tasks denoted by $g_1, g_2, \ldots, g_q$. These groups are mapped onto a selected network path $P$ of supervisors within $S_{tag}$, from a source node $s$ to a destination node $d$ in the Storm cluster network. The potential scheduling path $P$ consists of a series of nodes, which are not necessarily distinct supervisors, based on the metadata configuration. The bottleneck time for each site tag, $T_{bottleneck}(S_{tag})$, is the maximum time required by the distributed data stream processing system to compute and transfer one fully processed data unit. The objective function of identifying and minimizing the bottleneck $T_{bottleneck}(S_{tag})$ can be defined as in Eq. (3):

$$T_{bottleneck}(S_{tag}) = \max_{c_j \mapsto n_i \in P} \max\big( T_{compute}(j, i),\; T_{transfer}(j, i) \big), \qquad \text{objective:}\ \min_{P}\; T_{bottleneck}(S_{tag}) \qquad (3)$$
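The following minimal Java sketch evaluates Eq. (3) for one candidate mapping of a linear component chain; the array-based encoding and all names are our illustrative assumptions rather than the paper's implementation. The scheduler's job is then to search for the mapping that minimizes the returned value.

```java
// Sketch: evaluate the bottleneck time (Eq. 3) of one candidate mapping.
// Encodings and names are illustrative assumptions.
final class BottleneckEvaluator {

    /**
     * @param complexity x_j of each component j
     * @param inSize     d_{j-1}, incoming tuple size of component j
     * @param power      p_i of the node that component j is mapped to
     * @param bandwidth  b_{i,succ} of the link toward component j+1's node
     *                   (Double.POSITIVE_INFINITY when both are co-located)
     * @return the bottleneck time; 1/bottleneck is the maximal frame rate
     */
    static double bottleneck(double[] complexity, double[] inSize,
                             double[] power, double[] bandwidth) {
        double worst = 0.0;
        for (int j = 0; j < complexity.length; j++) {
            double tCompute = complexity[j] * inSize[j] / power[j]; // Eq. (1)
            worst = Math.max(worst, tCompute);
            if (j + 1 < complexity.length) {
                double tTransfer = inSize[j + 1] / bandwidth[j];    // Eq. (2)
                worst = Math.max(worst, tTransfer);
            }
        }
        return worst;
    }
}
```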

Proposed MT-Scheduler
Achieving an optimal solution to the considered scheduling problem by maximizing the tuple processing rate can be difficult and computationally infeasible; thus, a simple yet effective algorithm is required. The scheduling of DAG jobs in a distributed stream processing system with different job requirements corresponds to an NP-complete problem [9–11]. Therefore, we propose a high-throughput scheduler for distributed data stream processing systems, based on our previous work [10]. The MT-Scheduler algorithm considers, in addition to the metadata groups, the topology job and node attributes, including the computational complexity, data size, node processing power, and link transfer bandwidth. The proposed algorithm achieves the maximum tuple processing rate by utilizing a dynamic programming technique for job mapping, which recursively minimizes the time incurred at the bottleneck and provides a polynomial-time solution. The maximal frame rate that a system can achieve is limited by the slowest element (bottleneck), whether a transport link or a computing node, along the execution path in the cluster. This work proposes two algorithms: Algorithm 1 (the mapper), which is the main algorithm, and Algorithm 2, the MT-Scheduler (for linear critical path mapping).
The mapper algorithm, expressed as Algorithm 1, takes as input the details of the submitted topology (ID, name, and submitting user) and of the underlying cluster (nodes, worker slots, and executors). First, the directed acyclic graph topology is linearized by a topological sorting process. Next, the critical path is identified by using the well-known polynomial longest path (LP) algorithm. The linear critical path represents the most time-consuming sequence of topological components that the system must execute sequentially. Note that we assume a homogeneous network when identifying the critical path using this method; although this assumption is not realistic, we adopt it for simplification. Next, the mapper algorithm calls Algorithm 2 to determine the mapping schema for the topological components on the critical path (CP). The topological components not on the critical path are mapped using a simple layer-oriented greedy method. We apply a topological sort to order the non-CP components into layers and sort these components in descending order of their $T_{compute}$ and $T_{transfer}$. The components that require more computation and communication are assigned higher priorities. Subsequently, we map the components linearly layer by layer; a component with a higher priority is mapped to a node with more resources. In the Storm cluster, two types of nodes exist: the master, which runs a daemon named Nimbus, and the worker nodes, which run a daemon named Supervisor. Nimbus periodically calls the scheduler to update the mapping process. The mapper algorithm verifies whether the topology requires scheduling, to avoid repetitive scheduling and system overloading. Finally, the mapper algorithm utilizes the pluggable scheduler feature in Storm and, via the Nimbus node, implements the final mapping schema by assigning all the critical and noncritical path components of the submitted topology to the underlying cluster; a sketch of the critical-path step is given below.

In Algorithm 2, the input for the MT-Scheduler is the underlying cluster data, along with the critical path list and MetaKeys{Stag}, the user-defined list of site IDs/tags. First, the algorithm generates Tags_Pairs, a list of critical path pairs, with each pair consisting of a node and a task belonging to the critical path set $\{(node, task) \in CP\}$. Through the dynamic programming technique, the MT-Scheduler recursively extends the mapping of the critical topology path based on the previous round of calculation. At each step of the recursion, the algorithm maps the partial components pipeline to the underlying network nodes and calculates the new potential mapping cost.
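As a rough sketch of the critical-path step, the following Java fragment topologically sorts the DAG and then runs the polynomial longest-path DP over it. The node-cost encoding (an estimated per-component load, reflecting the homogeneous-network simplification above) and all names are our assumptions, not the paper's code.

```java
import java.util.*;

// Sketch: extract the critical path of a DAG via topological sort + longest-path DP.
final class CriticalPath {

    // adj[u] lists the successors of component u; cost[u] is its estimated load.
    static List<Integer> longestPath(int n, List<Integer>[] adj, double[] cost) {
        // Kahn topological sort
        int[] indeg = new int[n];
        for (int u = 0; u < n; u++) for (int v : adj[u]) indeg[v]++;
        Deque<Integer> queue = new ArrayDeque<>();
        for (int u = 0; u < n; u++) if (indeg[u] == 0) queue.add(u);
        List<Integer> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            int u = queue.poll();
            order.add(u);
            for (int v : adj[u]) if (--indeg[v] == 0) queue.add(v);
        }
        // Longest-path DP over the topological order (node-weighted)
        double[] dist = new double[n];
        int[] pred = new int[n];
        Arrays.fill(pred, -1);
        for (int u = 0; u < n; u++) dist[u] = cost[u]; // single-node path base case
        for (int u : order)
            for (int v : adj[u])
                if (dist[u] + cost[v] > dist[v]) {
                    dist[v] = dist[u] + cost[v];
                    pred[v] = u;
                }
        // Backtrack from the most expensive endpoint to recover the critical path
        int end = 0;
        for (int v = 1; v < n; v++) if (dist[v] > dist[end]) end = v;
        LinkedList<Integer> cp = new LinkedList<>();
        for (int v = end; v != -1; v = pred[v]) cp.addFirst(v);
        return cp;
    }
}
```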
The recursion process in the MT-Scheduler algorithm continues until the mapping results converge to a mapping scheme that achieves the objective and minimizes the system bottleneck for the critical path components in the submitted application.
Equation (4) presents the dynamic-programming recursion that leads to a potential mapping for the critical path components in the MT-Scheduler algorithm. Let $1/\mathrm{Pmap}_j(v)$ denote the maximal tuple rate with the first $j$ topology components mapped to a path from a source node $s$ to a node $v$ in an arbitrary computer network. Let $S_j(v)$ represent the sum of the tuple sizes of all the components on node $v$ with the first $j$ tasks mapped from node $s$ to $v$ in metadata group $S_{tag}$. The base conditions, which map the first $j$ components onto the source node $s$ itself, are computed as $S_j(s) = \sum_{i=1}^{j} d_{i-1}$ and $\mathrm{Pmap}_j(s) = \sum_{i=1}^{j} x_i d_{i-1} / p_s$, where $j = 1, 2, \ldots, n$. Every link, node, or task is a potential bottleneck and needs to be checked. The recursion can then be expressed as

$$\mathrm{Pmap}_j(v) = \min\left( \mathrm{Pmap}_{j-1}(v) + T_{compute}(j, v),\;\; \min_{u \in \mathrm{adj}(v)} \max\big( \mathrm{Pmap}_{j-1}(u),\, T_{transfer}(j-1, u),\, T_{compute}(j, v) \big) \right) \qquad (4)$$

The recursive dynamic programming process expressed in Eq. (4) generates a 2D matrix [10]. As shown in Algorithm 2, after calculating the recursion base conditions, the bottleneck times are calculated for all potential mapping schemas at each step of the recursion process, and the minimum time is selected to achieve the maximum frame rate.
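A compact sketch of this recursion, filling the 2D matrix column by column for the critical-path tasks, could look as follows; the matrix encoding and all names are illustrative assumptions, and the site tags are assumed to have already restricted the candidate node set.

```java
import java.util.*;

// Sketch of the Eq. (4) recursion: fill the 2D Pmap matrix, one column per
// critical-path task. Encodings and names are illustrative assumptions.
final class MtDp {

    /**
     * @param tCompute  tCompute[j][i]: Eq. (1) time of CP task j on node i
     * @param tTransfer tTransfer[u][i]: Eq. (2) time on link u -> i
     *                  (Double.POSITIVE_INFINITY if no direct link)
     * @param adj       adj[i]: neighbors of node i within the same site tag
     * @return Pmap[j][i]: minimal bottleneck time after mapping tasks 0..j
     *         to a path ending at node i
     */
    static double[][] fill(double[][] tCompute, double[][] tTransfer,
                           List<Integer>[] adj) {
        int tasks = tCompute.length, nodes = adj.length;
        double[][] pmap = new double[tasks][nodes];
        pmap[0] = tCompute[0].clone(); // base: first task alone on each node
        for (int j = 1; j < tasks; j++) {
            for (int i = 0; i < nodes; i++) {
                // Case I: co-locate task j with task j-1 on node i
                // (their computing times on the shared node accumulate)
                double best = pmap[j - 1][i] + tCompute[j][i];
                // Case II: task j-1 ran on a neighbor u; tuple crosses link u -> i
                for (int u : adj[i]) {
                    double viaU = Math.max(pmap[j - 1][u],
                                  Math.max(tTransfer[u][i], tCompute[j][i]));
                    best = Math.min(best, viaU);
                }
                pmap[j][i] = best;
            }
        }
        return pmap; // min over i of pmap[tasks-1][i] is the schedule's bottleneck
    }
}
```

Backtracking from the minimum cell in the last column then yields the converged mapping scheme described above.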
In a deployment over multiple sites, it may be essential to allow users to assign a particular topology component to a specific supervisor located at a specific site. However, with the default scheduler, Storm users cannot predict the mapping of the topological components in the Storm cluster. The MT-Scheduler allows the users to configure and regulate the data locality by utilizing the metadata configurations of the Storm nodes, so that tasks execute as close to the data as possible, which minimizes the transfer cost. In Apache Storm, the users can transparently establish the metadata configuration by setting the supervisor.scheduler.meta field in each supervisor's configuration file to specify the custom site tags. After tagging the supervisors, the users tag the components accordingly to ensure that the scheduler can correctly associate the spouts/bolts with the supervisors. The Storm method addConfiguration processes the tagging configuration at the topology building stage. The metadata for each supervisor can be obtained by calling the Storm method getSchedulerMeta, which returns the metadata as key-value pairs. By default, if no site configuration is specified by the user, the MT-Scheduler treats all the tasks and nodes as a single tagged group.

In Fig. 2, each cell $\mathrm{Pmap}_j(n_i)$ in the matrix represents a partial mapping solution that maps the first $j$ tasks to a path between $s$ and $n_i$, where both nodes have the same $S_{tag}$. Each iteration step involves calculating the bottleneck value to fill in a new cell $\mathrm{Pmap}_j(n_i)$ and adding a new task to the partial scheduling schema. In the 2D matrix process, we consider two subcases, the minimum value of which is chosen as the minimum $T_{bottleneck}(S_{tag})$. These cases can be described as follows.

Case I: The new task is mapped to the same node that executed the previous task. We directly place component $c_j$ at supervisor $n_i$, on which the last task $c_{j-1}$ was executed in the previous mapping subproblem $\mathrm{Pmap}_{j-1}(n_i)$. In other words, the last two or more components are scheduled to the same node $n_i$ to minimize the internode communication latency. Therefore, we only need to add the computing time $T_{compute}$ of $c_j$ on node $n_i$ to the $\mathrm{Pmap}_{j-1}(n_i)$ time.

Case II: The new task is mapped to one of the neighbor nodes $u$, where $u \in \mathrm{adj}(n_i)$ and has a direct link to $n_i$, which is represented by a dotted line from a neighbor shaded cell in the left column to supervisor $n_i$. We recursively calculate $T_{bottleneck}$ for all possible mappings to the neighbor nodes and choose the minimal value. This minimal value is further compared with the value calculated in Case I, and the minimum of the two is selected as the minimum $T_{bottleneck}$ for the partial mapping to a path between $s$ and $n_i$ with the same $S_{tag}$.

For further clarification, we explain both cases using the scheduling scenario in the matrix shown in Fig. 2. For scheduling the component $c_y$, the MT-Scheduler algorithm first calculates the bottleneck time if the component is assigned to the same node to which the previous component was mapped; in this scenario, that node is $f$, corresponding to Case I.
Second, the bottleneck time is calculated for the cases in which the component is assigned to one of the adjacent/neighbor nodes (shaded cells); in this scenario, these are adj1, adj2, and adj3 (nodes $e$, $g$, and $b$, respectively), corresponding to Case II. Finally, the MT algorithm chooses the minimum bottleneck time and assigns the task to the corresponding node, namely $f$. Another example shown in Fig. 2 corresponds to the scheduling of the last component. In contrast to the previous example, instead of assigning this task to the same node that executed the previous task ($g$), the algorithm chooses one of $g$'s adjacent nodes adj1 and adj2 (nodes $e$ and $b$, respectively), as in Case II. The MT algorithm determines that the minimum bottleneck time is achieved when assigning the task to the adjacent node ($b$).

Figure 3 shows the architecture and dataflow of the proposed scheduler. The MT-Scheduler algorithm takes the user-defined list of site tags and generates a Tags_Pairs list from the Storm metadata configurations. Next, according to the input critical path topological components and the cluster characteristic data, the MT-Scheduler algorithm uses dynamic programming to generate MTPRHashMapping<node, component> for the critical path topological components. The main mapper algorithm builds the final mapping schema by calling the MT-Scheduler for the critical path components and integrating the result with the mapping schema for the noncritical path components. Finally, the pluggable scheduler feature in Storm is utilized to implement the final mapping schema via Nimbus over the underlying cluster; a minimal skeleton of such a pluggable scheduler is sketched below.

The MT-Scheduler, as shown in Fig. 4, solves the performance bottleneck problem arising with the default scheduler in Fig. 1. The proposed algorithm minimizes the computational bottlenecks by assigning C2 to node N3, which has sufficient processing power, and by allowing the user to assign the GPU node N4 in Site 2 to execute the GPU-required tasks of C3. Furthermore, the algorithm minimizes the communicational bottlenecks by assigning both C4 and C5 to nodes located at the same site to minimize the internode transfer latency.
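For reference, a minimal skeleton of a pluggable Storm 0.9.x scheduler is shown below. The prepare/schedule methods, getSchedulerMeta, and cluster.assign are standard Storm scheduler APIs, whereas mtMapping is a hypothetical placeholder for Algorithms 1 and 2. Such a class is activated by pointing the storm.scheduler entry in Nimbus's storm.yaml at it.

```java
import java.util.*;
import backtype.storm.scheduler.*;

// Minimal pluggable-scheduler skeleton; mtMapping() is a hypothetical
// placeholder for the mapper (Algorithm 1) and the DP (Algorithm 2).
public class MTScheduler implements IScheduler {

    @Override
    public void prepare(Map conf) { }

    @Override
    public void schedule(Topologies topologies, Cluster cluster) {
        for (TopologyDetails topology : topologies.getTopologies()) {
            if (!cluster.needsScheduling(topology)) continue; // avoid re-scheduling
            // Site tags configured via supervisor.scheduler.meta in each
            // supervisor's storm.yaml are visible here as key-value metadata:
            for (SupervisorDetails sup : cluster.getSupervisors().values()) {
                Object siteTag = sup.getSchedulerMeta();
            }
            // Compute the final <slot -> executors> schema (placeholder) and
            // bind the executors to concrete worker slots via Nimbus.
            Map<WorkerSlot, List<ExecutorDetails>> schema = mtMapping(topology, cluster);
            for (Map.Entry<WorkerSlot, List<ExecutorDetails>> e : schema.entrySet()) {
                cluster.assign(e.getKey(), topology.getId(), e.getValue());
            }
        }
    }

    private Map<WorkerSlot, List<ExecutorDetails>> mtMapping(
            TopologyDetails topology, Cluster cluster) {
        // Placeholder for the critical-path extraction and Eq. (4) DP.
        return new HashMap<>();
    }
}
```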

Simulation results
The proposed MT-Scheduler is implemented in a simulation program, as described in our previous work [48], by using C++; it runs on a Windows 10 machine featuring an Intel(R) Core(TM) i7-8565U CPU @ 1.80 GHz, 16 GB of RAM, and a 1 TB SATA disk. For comparison, we implement the RR algorithm used by the Storm default scheduler.

We conduct a simple experiment to illustrate the influence of the task parameters on the scheduling decision and performance. Scenario 1, which involves a task with low computing and networking load, and scenario 2, which involves a task with high computing and transfer load, are tested on a cluster of eight nodes. As shown in Fig. 5, scenario 1, which has lower loads, generally achieves a higher system performance than scenario 2, which has higher loads. The highest frame rates in scenario 1, as obtained using the default RR and the MT-Scheduler, are 35 and 45 frames per second, respectively. In contrast, in scenario 2, the highest frame rates obtained using the default RR and the MT-Scheduler are 39 and 60 frames per second, respectively. The proposed MT-Scheduler algorithm scales better than the RR.

Furthermore, we test the system throughput when the underlying cluster size scales up. The same three task topologies are used, as shown in Fig. 6 in three different colors, with the number of cluster nodes ranging from 4 to 200 on the x-axis. Figure 6 demonstrates that the MT-Scheduler, indicated by the bars, maintains higher frame rates than the default RR, represented by the connected curves, as the cluster scales up. Of the linear, diamond, and star topologies, the star topology scales the best.

Real Storm environment results
In addition, we conduct experiments using an Apache Storm cluster of eight physical machines with the hardware configurations presented in Table 1.
Each machine runs Storm 0.9.7 on top of Ubuntu 10.04 with Java JDK 8u221, ZooKeeper 3.3.6, ZeroMQ 4.1.3, and the Java binding JZMQ, in addition to the other required Storm-dependent libraries. The heterogeneous Storm cluster has one node, with a relatively high storage capacity for log-saving purposes, running the Nimbus daemon and ZooKeeper [21]. The other machines run the Supervisor daemon, each with a specific number of worker processes. Each worker process executes a subset of the topology, and each supervisor node has as many worker processes as CPU cores.

Fig. 5 Impact of the computational complexity and data transfer rate on scheduling in a distributed heterogeneous cluster

Fig. 6 Simulated system average throughput, scalability, and throughput improvement percentage
We collect all the test results regarding the throughput and latency from the Storm user interface (the UI daemon). It is worth mentioning that the Storm system does not record every transaction and instead samples only 0.05% of the total transactions to avoid overburdening the system. However, this aspect does not affect the average throughput, because we run each test for 600 s, which provides adequate time for system stabilization and for collecting sufficient samples to calculate the average throughput rates. In all the tests, the proposed algorithm assumes that a user preference exists regarding the site location and that the cluster is distributed over at least two sites.
For our evaluation, the throughput of the overall topology (processed tuples per unit time) is limited by the performance bottleneck identified and minimized using the MT-Scheduler algorithm.
We use three commonly used microbenchmarks [25–28], namely the linear, diamond, and star topologies from [35], as shown in Fig. 7. The linear topology, as shown in Fig. 7a, is the simplest structure and consists of six linear components. The diamond topology, as shown in Fig. 7b, includes five components, in which the spout feeds the middle three components, and the last bolt receives all the outgoing data. The star application, as shown in Fig. 7c, is a multiple-spout topology that transmits data tuples to the central bolt, which in turn transfers its processed tuples to the remaining components. For comparison, we evaluate our MT-Scheduler against the RR default scheduler and the state-of-the-art adaptive scheduler [29].
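As an illustration of how such a benchmark topology is declared and tagged with the site metadata described in Sect. 4, the following sketch builds the first stages of the linear topology. TopologyBuilder, setSpout/setBolt, shuffleGrouping, and addConfiguration are standard Storm APIs, while SourceSpout, WorkerBolt, the "site" key, and the tag values are hypothetical placeholders.

```java
import backtype.storm.topology.TopologyBuilder;

// Sketch: declare the first stages of the linear microbenchmark and tag each
// component with a site. SourceSpout and WorkerBolt are hypothetical classes.
public class LinearTopologyExample {

    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SourceSpout(), 1)
               .addConfiguration("site", "S1");   // pin the source to site S1
        builder.setBolt("bolt1", new WorkerBolt(), 2)
               .shuffleGrouping("spout")
               .addConfiguration("site", "S1");   // keep the first hop local
        builder.setBolt("bolt2", new WorkerBolt(), 2)
               .shuffleGrouping("bolt1")
               .addConfiguration("site", "S2");   // downstream stages on site S2
        // ...the remaining linear stages follow the same pattern
        return builder;
    }
}
```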
The main goal of the proposed algorithm is to minimize the computational/communicational bottleneck time to achieve the maximum system throughput. Figure 8 shows that the proposed algorithm outperforms both the default RR and the adaptive scheduler in terms of the latency (the time elapsed until a tuple is acked after it is emitted) under all three topologies. Similarly, the average system throughput of the MT-Scheduler is higher than that of both algorithms under all three topologies, as shown in Fig. 9. The star topology, which has a complicated dependency structure, achieves the best performance compared with the linear and diamond topologies.

Conclusions and future work
The proposed MT-Scheduler algorithm aims to maximize the system throughput for streaming applications in a Storm environment. The simulation evaluation results show the impact of the task complexity and data transfer rates on the scheduling decisions and the resulting system performance.

Fig. 1 Default Storm scheduler that does not take into account the data locality or performance bottleneck

Fig. 3 Architecture and dataflow of the MT-Scheduler

Fig. 4 MT-Scheduler minimizes the system performance bottlenecks

Table 1 Experimental cluster specification